Topic model
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
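The cats-and-dogs intuition can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the two toy topics, their word probabilities, and the function name `generate_document` are all hypothetical and chosen for illustration. Each word is produced by first picking a topic according to the document's topic proportions and then picking a word from that topic's word distribution.

```python
import random
from collections import Counter

# Hypothetical toy topics (assumed values): each topic is a probability
# distribution over words. "the" and "is" are equally likely in both.
TOPICS = {
    "dogs": {"dog": 0.4, "bone": 0.3, "the": 0.15, "is": 0.15},
    "cats": {"cat": 0.4, "meow": 0.3, "the": 0.15, "is": 0.15},
}


def generate_document(topic_mix, n_words, rng):
    """Draw n_words words: choose a topic per word, then a word from it."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[t] for t in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(TOPICS[topic])
        probs = [TOPICS[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
# A document that is 90% about dogs and 10% about cats.
doc = generate_document({"dogs": 0.9, "cats": 0.1}, 20000, rng)
counts = Counter(doc)
dog_words = counts["dog"] + counts["bone"]
cat_words = counts["cat"] + counts["meow"]
ratio = dog_words / cat_words
print(f"dog words: {dog_words}, cat words: {cat_words}, ratio: {ratio:.1f}")
```

With a long enough document the ratio of dog words to cat words should come out close to 9, while "the" and "is" carry no information about the topic mix. Fitting a topic model is the reverse problem: recovering the topics and the mix from the word counts alone.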
Topic models are also referred to as probabilistic topic models, a term for statistical algorithms that discover the latent semantic structures of an extensive text body. In the age of information, the amount of written material we encounter each day is simply beyond our processing capacity. Topic models can help organize large collections of unstructured text and offer insights into them. Originally developed as a text-mining tool, topic models have been used to detect instructive structures in data such as genetic information, images, and networks. They also have applications in other fields such as bioinformatics and computer vision.
History
An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. Another one, called probabilistic latent semantic analysis (PLSA), was created by Thomas Hofmann in 1999. Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSA. Developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002, LDA introduces sparse Dirichlet prior distributions over document-topic and topic-word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words. Other topic models are generally extensions of LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Hierarchical latent tree analysis (HLTA) is an alternative to LDA that models word co-occurrence using a tree of latent variables; the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.
Topic models for context information
Approaches for temporal information include Block and Newman's determination of the temporal dynamics of topics in the Pennsylvania Gazette during 1728–1800. Griffiths & Steyvers used topic modeling on abstracts from the journal PNAS to identify topics that rose or fell in popularity from 1991 to 2001, whereas Lamba & Madhusudhan used topic modeling on full-text research articles retrieved from the journal DJLIT from 1981 to 2018. In the field of library and information science, Lamba & Madhusudhan applied topic modeling to different Indian resources such as journal articles and electronic theses and resources. Nelson has been analyzing change in topics over time in the Richmond Times-Dispatch to understand social and political changes and continuities in Richmond during the American Civil War. Yang, Torget and Mihalcea applied topic modeling methods to newspapers from 1829–2008. Mimno used topic modeling with 24 journals on classical philology and archaeology spanning 150 years to look at how topics in the journals change over time and how the journals become more different or similar over time.
Yin et al. introduced a topic model for geographically distributed documents, where document positions are explained by latent regions which are detected during inference.
Chang and Blei included network information between linked documents in the relational topic model, to model the links between websites.
The author-topic model by Rosen-Zvi et al. models the topics associated with authors of documents to improve the topic detection for documents with authorship information.
HLTA was applied to a collection of recent research papers published at major AI and machine learning venues. The resulting topics are used to index the papers as an aid to researchers, conference organizers and journal editors.
Algorithms
In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for maximum likelihood fit. A recent survey by Blei describes this suite of algorithms.
Several groups of researchers, starting with Papadimitriou et al., have attempted to design algorithms with provable guarantees. Assuming that the data were actually generated by the model in question, they try to design algorithms that provably find the model that was used to create the data. Techniques used here include singular value decomposition and the method of moments. In 2012 an algorithm based upon non-negative matrix factorization was introduced that also generalizes to topic models with correlations among topics.
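To give a feel for the matrix-factorization view of topic modeling, the following is a minimal sketch of plain non-negative matrix factorization using the standard multiplicative update rules for squared error. It is not the 2012 algorithm mentioned above and carries no provable recovery guarantee. The toy word-by-document matrix, the vocabulary, and all function names are assumptions made for illustration. The count matrix V is approximated by the product of W (words by topics) and H (topics by documents), with every entry kept non-negative.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Sum of squared differences between v and the product w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor v (words x docs) into w (words x topics) and h (topics x docs)."""
    rng = random.Random(seed)
    n_words, n_docs = len(v), len(v[0])
    # Strictly positive random start, so every entry stays non-negative.
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_words)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(n_topics)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_docs)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_words)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors


VOCAB = ["dog", "bone", "cat", "meow", "the"]
# Toy word-by-document counts (assumed data): one row per word, 4 documents.
V = [
    [5, 4, 0, 1],  # dog
    [3, 4, 0, 0],  # bone
    [0, 0, 5, 4],  # cat
    [0, 1, 4, 3],  # meow
    [2, 2, 2, 2],  # the
]

W, H, errors = nmf(V, n_topics=2)
for k in range(2):
    top = sorted(range(len(VOCAB)), key=lambda i: W[i][k], reverse=True)[:2]
    print(f"topic {k}:", [VOCAB[i] for i in top])
print(f"squared error: {errors[0]:.2f} -> {errors[-1]:.2f}")
```

Each column of W can be read as a topic (a weighting over words) and each column of H as a document's mix of topics. Like the likelihood-based heuristics, this procedure only finds a local optimum that depends on the random start.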
Software/libraries
- BigARTM
- Mallet
- Stanford Topic Modeling Toolkit
- Gensim – Topic Modeling for Humans
- topicmodels R package
- Lettier's LDA Topic Modeling - a PureScript, browser-based implementation of LDA topic modeling.
- jLDADMM - a Java package for topic modeling on normal or short texts. jLDADMM includes implementations of the LDA topic model and the one-topic-per-document Dirichlet Multinomial Mixture model. jLDADMM also provides an implementation for document clustering evaluation to compare topic models.
- TopicModelsVB.jl Julia package
- STTM - a Java package for short text topic modeling. STTM includes the following algorithms: Dirichlet Multinomial Mixture in conference KDD2014, Biterm Topic Model in journal TKDE2016, Word Network Topic Model in journal KAIS2018, Pseudo-Document-Based Topic Model in conference KDD2016, Self-Aggregation-Based Topic Model in conference IJCAI2015, a further model in conference PAKDD2017, Generalized Pólya Urn based Dirichlet Multinomial Mixture model in conference SIGIR2016, Generalized Pólya Urn based Poisson-based Dirichlet Multinomial Mixture model in journal TIS2017, and Latent Feature Model with DMM in journal TACL2015. STTM also includes six short text corpora for evaluation and covers three aspects of evaluating the performance of the algorithms.
- A Java implementation of HLTA: https://github.com/kmpoon/hlta