Explicit semantic analysis

In natural language processing and information retrieval, explicit semantic analysis is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a means of improving text categorization
and has been used by this pair of researchers to compute what they refer to as "semantic relatedness" by means of cosine similarity between the aforementioned vectors, collectively interpreted as a space of "concepts explicitly defined and described by humans", where Wikipedia articles are equated with concepts.
The name "explicit semantic analysis" contrasts with latent semantic analysis, because the use of a knowledge base makes it possible to assign human-readable labels to the concepts that make up the vector space.

Model

To perform the basic variant of ESA, one starts with a collection of texts, say, all Wikipedia articles; let the number of documents in the collection be. These are all turned into "bags of words", i.e., term frequency histograms, stored in an inverted index. Using this inverted index, one can find for any word the set of Wikipedia articles containing this word; in the vocabulary of Egozi, Markovitch and Gabrilovitch, "each word appearing in the Wikipedia corpus can be seen as triggering each of the concepts it points to in the inverted index."
The output of the inverted index for a single word query is a list of indexed documents, each given a score depending on how often the word in question occurred in them. Mathematically, this list is an -dimensional vector of word-document scores, where a document not containing the query word has score zero. To compute the relatedness of two words, one compares the vectors by computing the cosine similarity,
and this gives numeric estimate of the semantic relatedness of the words. The scheme is extended from single words to multi-word texts by simply summing the vectors of all words in the text.

Analysis

ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that the knowledge base contains topically orthogonal concepts. However, it was later shown by Anderka and Stein that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property; in their experiments, Anderka and Stein used newswire stories as "concepts".
To explain this observation, links have been shown between ESA and the generalized vector space model.
Gabrilovich and Markovitch replied to Anderka and Stein by pointing out that their experimental result was achieved using "a single application of ESA " and "just a single, extremely small and homogenous test collection of 50 news documents".

Applications

Word relatedness

ESA is considered by its authors a measure of semantic relatedness. On datasets used to benchmark relatedness of words, ESA outperforms other algorithms, including WordNet semantic similarity measures and skip-gram Neural Network Language Model.

Document relatedness

ESA is used in commercial software packages for computing relatedness of documents. Domain-specific restrictions on the ESA model are sometimes used to provide more robust document matching.

Extensions

Cross-language explicit semantic analysis is a multilingual generalization of ESA.
CL-ESA exploits a document-aligned multilingual reference collection to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...