Jaccard index

The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:
An alternative interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference to the union.
Jaccard distance is commonly used to calculate an n × n matrix for clustering and multidimensional scaling of n sample sets.
This distance is a metric on the collection of all finite sets.
There is also a version of the Jaccard distance for measures, including probability measures. If is a measure on a measurable space, then we define the Jaccard coefficient by
and the Jaccard distance by
Care must be taken if or, since these formulas are not well defined in these cases.
The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function.

Similarity of asymmetric binary attributes

Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows:
Each attribute must fall into one of these four categories, meaning that
The Jaccard similarity coefficient, J, is given as
The Jaccard distance, d_J, is given as

Difference with the simple matching coefficient (SMC)

When used for binary attributes, the Jaccard index is very similar to the simple matching coefficient. The main difference is that the SMC has the term in its numerator and denominator, whereas the Jaccard index does not. Thus, the SMC counts both mutual presences and mutual absence as matches and compares it to the total number of attributes in the universe, whereas the Jaccard index only counts mutual presence as matches and compares it to the number of attributes that have been chosen by at least one of the two sets.
In market basket analysis, for example, the basket of two consumers who we wish to compare might only contain a small fraction of all the available products in the store, so the SMC will usually return very high values of similarities even when the baskets bear very little resemblance, thus making the Jaccard index a more appropriate measure of similarity in that context. For example, consider a supermarket with 1000 products and two customers. The basket of the first customer contains salt and pepper and the basket of the second contains salt and sugar. In this scenario, the similarity between the two baskets as measured by the Jaccard index would be 1/3, but the similarity becomes 0.998 using the SMC.
In other contexts, where 0 and 1 carry equivalent information, the SMC is a better measure of similarity. For example, vectors of demographic variables stored in dummy variables, such as gender, would be better compared with the SMC than with the Jaccard index since the impact of gender on similarity should be equal, independently of whether male is defined as a 0 and female as a 1 or the other way around. However, when we have symmetric dummy variables, one could replicate the behaviour of the SMC by splitting the dummies into two binary attributes, thus transforming them into asymmetric attributes, allowing the use of the Jaccard index without introducing any bias. The SMC remains, however, more computationally efficient in the case of symmetric dummy variables since it does not require adding extra dimensions.

Weighted Jaccard similarity and distance

If and are two vectors with all real, then their Jaccard similarity coefficient is defined as
and Jaccard distance
With even more generality, if and are two non-negative measurable functions on a measurable space with measure, then we can define
where and are pointwise operators. Then Jaccard distance is
Then, for example, for two measurable sets, we have where and are the characteristic functions of the corresponding set.

Probability Jaccard similarity and distance

The weighted Jaccard similarity described above generalizes the Jaccard Index to positive vectors, where a set corresponds to a binary vector given by the indicator function, i.e.. However, it does not generalize the Jaccard Index to probability distributions, where a set corresponds to a uniform probability distribution, i.e.
It is always less if the sets differ in size. If, and then
Instead, a generalization that is continuous between probability distributions and their corresponding support sets is
which is called the "Probability" Jaccard. It has the following bounds against the Weighted Jaccard on probability vectors.
Here the upper bound is the Sørensen–Dice coefficient.
The corresponding distance,, is a metric over probability distributions, and a pseudo-metric over non-negative vectors.
The Probability Jaccard Index has a geometric interpretation as the area of an intersection of simplices. Every point on a unit -simplex corresponds to a probability distribution on elements, because the unit -simplex is the set of points in dimensions that sum to 1. To derive the Probability Jaccard Index geometrically, represent a probability distribution as the unit simplex divided into sub simplices according to the mass of each item. If you overlay two distributions represented in this way on top of each other, and intersect the simplices corresponding to each item, the area that remains is equal to the Probability Jaccard Index of the distributions.

Optimality of the Probability Jaccard Index

Consider the problem of constructing random variables such that they collide with each other as much as possible. That is, if and, we would like to construct and to maximize. If we look at just two distributions in isolation, the highest we can achieve is given by where is the Total Variation distance. However, suppose were weren't just concerned with maximizing that particular pair, suppose we would like to maximize the collision probability of any arbitrary pair. One could construct an infinite number of random variables one for each distribution, and seek to maximize for all pairs. In a fairly strong sense described below, the Probability Jaccard Index is an optimal way to align these random variables.
For any sampling method and discrete distributions, if then for some where and, either or.
That is, no sampling method can achieve more collisions than on one pair without achieving fewer collisions than on another pair, where the reduced pair is more similar under than the increased pair. This theorem is true for the Jaccard Index of sets and the probability Jaccard, but not of the weighted Jaccard.
This theorem has a visual proof on three element distributions using the simplex representation.

Tanimoto similarity and distance

Various forms of functions described as Tanimoto similarity and Tanimoto distance occur in the literature and on the Internet. Most of these are synonyms for Jaccard similarity and Jaccard distance, but some are mathematically different. Many sources cite an IBM Technical Report as the seminal reference. The report is available from .
In "A Computer Program for Classifying Plants", published in October 1960, a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard distance.

Tanimoto's definitions of similarity and distance

In that paper, a "similarity ratio" is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modelled. The definition of the ratio is the number of common bits, divided by the number of bits set in either sample.
Presented in mathematical terms, if samples X and Y are bitmaps, is the ith bit of X, and are bitwise and, or operators respectively, then the similarity ratio is
If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard coefficient of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.
Tanimoto goes on to define a "distance coefficient" based on this ratio, defined for bitmaps with non-zero similarity:
This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.

Other definitions of Tanimoto distance

Tanimoto distance is often referred to, erroneously, as a synonym for Jaccard distance. This function is a proper distance metric. "Tanimoto Distance" is often stated as being a proper distance metric, probably because of its confusion with Jaccard distance.
If Jaccard or Tanimoto similarity is expressed over a bit vector, then it can be written as
where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector then
and
This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of do not necessarily extend to. In particular, the difference function does not preserve triangle inequality, and is not therefore a proper distance metric, whereas is.
There is a real danger that the combination of "Tanimoto Distance" being defined using this formula, along with the statement "Tanimoto Distance is a proper distance metric" will lead to the false conclusion that the function is in fact a distance metric over vectors or multisets in general, whereas its use in similarity search or clustering algorithms may fail to produce correct results.
Lipkus uses a definition of Tanimoto similarity which is equivalent to, and refers to Tanimoto distance as the function. It is, however, made clear within the paper that the context is restricted by the use of a weighting vector such that, for any vector A being considered, Under these circumstances, the function is a proper distance metric, and so a set of vectors governed by such a weighting vector forms a metric space under this function.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...