To perform the basic variant of ESA, one starts with a collection of texts, say, all Wikipedia articles; let the number of documents in the collection be N. These are all turned into "bags of words", i.e., term frequency histograms, stored in an inverted index. Using this inverted index, one can find for any word the set of Wikipedia articles containing this word; in the vocabulary of Egozi, Markovitch and Gabrilovich, "each word appearing in the Wikipedia corpus can be seen as triggering each of the concepts it points to in the inverted index." The output of the inverted index for a single-word query is a list of indexed documents, each given a score depending on how often the word in question occurred in it. Mathematically, this list is an N-dimensional vector of word-document scores, in which a document not containing the query word has score zero. To compute the relatedness of two words, one compares their vectors by computing the cosine similarity, which gives a numeric estimate of the semantic relatedness of the words. The scheme is extended from single words to multi-word texts by simply summing the vectors of all words in the text.
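The procedure described above can be illustrated with a minimal Python sketch. It is not the reference implementation: the three-document toy corpus, the function names, and the use of raw term frequency as the word-document score are all assumptions made for illustration (the text only states that the score depends on how often the word occurs).

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each word to {doc_id: score}. Assumption: score = raw term frequency."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for word, count in Counter(text.lower().split()).items():
            index[word][doc_id] = count
    return index


def text_vector(text, index, n_docs):
    """Sum the N-dimensional word vectors of every word in the text."""
    vec = [0.0] * n_docs
    for word in text.lower().split():
        # Documents that do not contain the word keep a score of zero.
        for doc_id, score in index.get(word, {}).items():
            vec[doc_id] += score
    return vec


def cosine(u, v):
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(text_a, text_b, index, n_docs):
    """Estimate relatedness of two words or texts from their concept vectors."""
    return cosine(text_vector(text_a, index, n_docs),
                  text_vector(text_b, index, n_docs))


# Hypothetical toy corpus standing in for the N documents ("concepts").
docs = [
    "the cat chased the mouse",
    "a cat and a dog played",
    "stocks fell as the bank raised rates",
]
index = build_inverted_index(docs)
n_docs = len(docs)
```

In this toy corpus, `relatedness("cat", "dog", index, n_docs)` is positive because both words occur in the second document, while `relatedness("cat", "bank", index, n_docs)` is zero because the two words share no document.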
Analysis
ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that the knowledge base contains topically orthogonal concepts. However, it was later shown by Anderka and Stein that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property; in their experiments, Anderka and Stein used newswire stories as "concepts". To explain this observation, links have been shown between ESA and the generalized vector space model. Gabrilovich and Markovitch replied to Anderka and Stein by pointing out that their experimental result was achieved using "a single application of ESA" and "just a single, extremely small and homogenous test collection of 50 news documents".
Applications
Word relatedness
ESA is considered by its authors a measure of semantic relatedness. On datasets used to benchmark relatedness of words, ESA outperforms other algorithms, including WordNet semantic similarity measures and the skip-gram neural network language model.
Document relatedness
ESA is used in commercial software packages for computing relatedness of documents. Domain-specific restrictions on the ESA model are sometimes used to provide more robust document matching.
Extensions
Cross-language explicit semantic analysis (CL-ESA) is a multilingual generalization of ESA. CL-ESA exploits a document-aligned multilingual reference collection to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations.
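The cross-language idea can be sketched in the same way. The following Python fragment is only an illustration under stated assumptions: the two-concept English/German reference collection is invented, the function names are hypothetical, and raw term frequency again stands in for whatever scoring a real system would use. The point it shows is that position i of the vector refers to the same concept in every language, so vectors built from different languages are directly comparable.

```python
import math
from collections import Counter


def concept_vector(text, aligned_docs):
    """Score a text against each concept using one language's side of the collection."""
    words = Counter(text.lower().split())
    vec = []
    for doc in aligned_docs:
        tf = Counter(doc.lower().split())
        # Sum of term frequencies, weighted by how often each word occurs in the text.
        vec.append(float(sum(n * tf[w] for w, n in words.items())))
    return vec


def cosine(u, v):
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical document-aligned collection: entry i describes the same concept
# in both languages.
reference = {
    "en": ["the cat is a small pet animal",
           "the bank lends money to customers"],
    "de": ["die katze ist ein kleines haustier",
           "die bank verleiht geld an kunden"],
}


def cross_language_relatedness(text_a, lang_a, text_b, lang_b):
    """Compare two texts in (possibly) different languages via concept vectors."""
    return cosine(concept_vector(text_a, reference[lang_a]),
                  concept_vector(text_b, reference[lang_b]))
```

With this toy collection, an English text about cats and a German text about cats both activate the first concept only, so their vectors point in the same direction even though the texts share no words.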