Normalized compression distance

Normalized compression distance is a way of measuring the similarity between two objects, be it two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, to name a few. Such a measurement should not be application dependent or arbitrary. A reasonable definition for the similarity between two objects is how difficult it is to transform them into each other.
It can be used in information retrieval and data mining for cluster analysis.

Information distance

We assume that the objects one talks about are finite strings of 0s and 1s. Thus we mean string similarity. Every computer file is of this form, that is, if an object is a file in a computer it is of this form. One can define the information distance between strings and as the length of the shortest program that computes from and vice versa. This shortest program is in a fixed programming language. For technical reasons one uses the theoretical notion of Turing machines. Moreover, to express the length of one uses the notion of Kolmogorov complexity. Then, it has been shown
up to logarithmic additive terms which can be ignored. This information distance is shown to be a metric
, is universal.

Normalized information distance (similarity metric)

The information distance is absolute, but if we want to express similarity, then we are more interested in relative ones. For example, if two strings of length 1,000,000 differ by 1000 bits, then we are inclined to think that those strings are relatively more similar than two strings of 1000 bits that have that distance. Hence we need to normalize to obtain a similarity metric. This way one obtains the normalized information distance,
where is algorithmic information of given as input. The NID is called `the similarity metric.' since the function has been shown to satisfy the basic requirements for a metric distance measure. However, it is not computable or even semicomputable.

Normalized compression distance

While the NID metric is not computable, it has an abundance of applications. Simply approximating by real-world compressors, with is the binary length of the file compressed with compressor Z in order to make NID easy to apply. Vitanyi and Cilibrasi rewrote the NID to obtain the Normalized Compression Distance
The NCD is actually a family of distances parametrized with the compressor Z. The better Z is, the closer the NCD approaches the NID, and the better the results are.

Applications

The normalized compression distance has been used to fully automatically reconstruct language and phylogenetic trees. It can also be used for new applications of general clustering and classification of natural data in arbitrary domains, for clustering of heterogeneous data, and for anomaly detection across domains. The NID and NCD have been applied to numerous subjects, including music classification, to analyze network traffic and cluster computer worms and viruses, authorship attribution, gene expression dynamics, predicting useful versus useless stem cells, critical networks, image registration, question-answer systems.

Performance

Researchers from the datamining community use NCD and variants as "parameter-free, feature-free" data-mining tools. One group have experimentally tested a closely related metric on a large variety of sequence benchmarks. Comparing their compression method with 51 major methods found in 7 major data-mining conferences over the past decade, they established superiority of the compression method for clustering heterogeneous data, and for anomaly detection, and competitiveness in clustering domain data.
NCD has an advantage of being robust to noise. However, although NCD appears "parameter-free", practical questions include which compressor to use in computing the NCD and other possible problems.

Comparison with the Normalized Relative Compression (NRC)

In order to measure the information of a string relative to another there is the need to rely on relative semi-distances. These are measures that do not need to respect symmetry and triangle inequality distance properties. Although the NCD and the NRC seem very similar, they address different questions. The NCD measures how similar both strings are, mostly using the information content, while the NRC indicates the fraction of a target string that cannot be constructed using information from another string. For a comparison, with application to the evolution of primate genomes, see.

Normalized Google distance

Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like "the four-letter genome of a mouse," or "the text of `War and Peace' by Tolstoy." There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in background common knowledge in humankind, like "home" or "red."
We are interested in semantic similarity. Using code-word lengths obtained from the page-hit counts returned by Google from the web, we obtain a semantic distance using the NCD formula and viewing Google as a compressor useful for data mining, text comprehension, classification, and translation. The associated NCD, called the normalized Google distance can be rewritten as
where denotes the number of pages containing the search term, and denotes the number of pages containing both and,) as returned by Google or any search engine capable of returning an aggregate page count. The number can be set to the number of pages indexed although it is more proper to count each page according to the number of search terms or phrases it contains. As rule of the thumb one can multiply the number of pages by, say, a thousand...

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...