The Sørensen–Dice coefficient is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.
Name
The index is known by several other names, especially Sørensen–Dice index, Sørensen index and Dice's coefficient. Other variations include the "similarity coefficient" or "index", such as Dice similarity coefficient. Common alternate spellings for Sørensen are Sorenson, Soerenson and Sörenson, and all three can also be seen with the –sen ending. Other names include:
Zijdenbos similarity index, referring to a 1994 paper of Zijdenbos et al.
Formula
Sørensen's original formula was intended to be applied to discrete data. Given two sets, X and Y, it is defined as

$$DSC = \frac{2|X \cap Y|}{|X| + |Y|}$$

where |X| and |Y| are the cardinalities of the two sets, i.e. the number of elements in each set. The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

When applied to Boolean data, using the definition of true positive (TP), false positive (FP), and false negative (FN), it can be written as

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}$$

It is different from the Jaccard index, which only counts true positives once in both the numerator and denominator. DSC is the quotient of similarity and ranges between 0 and 1. It can be viewed as a similarity measure over sets.

Similarly to the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors a and b:

$$s_v = \frac{2\,|\mathbf{a} \cdot \mathbf{b}|}{|\mathbf{a}|^2 + |\mathbf{b}|^2}$$

which gives the same outcome over binary vectors and also gives a more general similarity metric over vectors in general terms.

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (the intersection) over the sum of cardinalities, which is the same expression $2|X \cap Y| / (|X| + |Y|)$.

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:

$$s = \frac{2 n_t}{n_x + n_y}$$

where $n_t$ is the number of character bigrams found in both strings, $n_x$ is the number of bigrams in string x and $n_y$ is the number of bigrams in string y. For example, to calculate the similarity between

night
nacht

we would find the set of bigrams in each word:

{ni, ig, gh, ht}
{na, ac, ch, ht}

Each set has four elements, and the intersection of these two sets has only one element: ht. Inserting these numbers into the formula, we calculate s = (2 · 1) / (4 + 4) = 0.25.
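The set, Boolean and bigram forms described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the value returned for two empty inputs is an assumption (the formula itself gives 0/0), and bigrams are collected into a set, so a bigram repeated within one string is counted once. The words night and nacht are used as input because their bigram sets match the counts described above (four bigrams each, sharing only ht).

```python
def dice_sets(x, y):
    """Sørensen–Dice coefficient of two finite sets: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = set(x), set(y)
    if not x and not y:
        # Assumption: two empty sets are treated as identical (formula is 0/0).
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def dice_counts(tp, fp, fn):
    """Boolean form: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # Assumption: no positives on either side counts as perfect agreement.
    return 1.0 if denom == 0 else 2 * tp / denom


def bigrams(s):
    """Set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_strings(a, b):
    """Bigram-based string similarity: 2 n_t / (n_x + n_y)."""
    return dice_sets(bigrams(a), bigrams(b))


print(sorted(bigrams("night")))        # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))        # ['ac', 'ch', 'ht', 'na']
print(dice_strings("night", "nacht"))  # 0.25

# The same value from the Boolean form: one shared bigram (TP = 1),
# three bigrams only in the first word (FP = 3), three only in the second (FN = 3).
print(dice_counts(tp=1, fp=3, fn=3))   # 0.25
```

The two printed coefficients agree because |X| + |Y| = 2TP + FP + FN when X is read as the predicted positives and Y as the actual positives.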
Difference from Jaccard
This coefficient is not very different in form from the Jaccard index. In fact, both are equivalent in the sense that given a value S for the Sørensen–Dice coefficient, one can calculate the respective Jaccard index value J and vice versa, using the equations

$$J = \frac{S}{2 - S} \qquad \text{and} \qquad S = \frac{2J}{1 + J}$$

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

$$d = 1 - \frac{2|X \cap Y|}{|X| + |Y|}$$

is not a proper distance metric, as it does not satisfy the triangle inequality. In this sense the Sørensen–Dice coefficient can be considered a semimetric version of the Jaccard index.

The simplest counterexample is given by the three sets {a}, {b}, and {a, b}: the distance between the first two is 1, and the distance between the third and each of the others is one-third. To satisfy the triangle inequality, the sum of any two of these three sides must be greater than or equal to the remaining side. However, the distance between {a} and {a, b} plus the distance between {b} and {a, b} equals 2/3 and is therefore less than the distance between {a} and {b}, which is 1.
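Both claims in this section can be checked numerically. The sketch below uses the standard conversions J = S / (2 − S) and S = 2J / (1 + J) and the three sets {a}, {b} and {a, b}. The helper names are illustrative, exact fractions are used to avoid rounding, and the inputs are assumed to be non-empty so that no denominator is zero.

```python
from fractions import Fraction


def dice(x, y):
    """Sørensen–Dice coefficient of two non-empty sets, as an exact fraction."""
    return Fraction(2 * len(x & y), len(x) + len(y))


def jaccard(x, y):
    """Jaccard index of two non-empty sets, as an exact fraction."""
    return Fraction(len(x & y), len(x | y))


def dice_to_jaccard(s):
    return s / (2 - s)


def jaccard_to_dice(j):
    return 2 * j / (1 + j)


A, B, C = {"a"}, {"b"}, {"a", "b"}

# The two indices are interconvertible.
assert dice_to_jaccard(dice(A, C)) == jaccard(A, C)
assert jaccard_to_dice(jaccard(A, C)) == dice(A, C)

# Dice distances: the detour through C is shorter than the direct path.
d_ab = 1 - dice(A, B)  # 1
d_ac = 1 - dice(A, C)  # 1/3
d_bc = 1 - dice(B, C)  # 1/3
print(d_ac + d_bc, "<", d_ab)  # 2/3 < 1, so the triangle inequality fails

# Jaccard distances on the same sets do satisfy it.
j_ab = 1 - jaccard(A, B)  # 1
j_ac = 1 - jaccard(A, C)  # 1/2
j_bc = 1 - jaccard(B, C)  # 1/2
print(j_ac + j_bc, ">=", j_ab)  # 1 >= 1
```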
Applications
The Sørensen–Dice coefficient is useful for ecological community data. Justification for its use is primarily empirical rather than theoretical. Compared with Euclidean distance, the Sørensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers. The Dice score has also become popular in computer lexicography for measuring the lexical association score of two given words. It is also commonly used in image segmentation, in particular for comparing algorithm output against reference masks in medical applications.
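For the segmentation use case, a predicted mask and a reference mask can be flattened into binary vectors a and b and compared with the vector form of the coefficient, 2(a · b) / (|a|² + |b|²). The following is a small sketch in plain Python: the function name, the nested-list mask representation and the value returned for two empty masks are assumptions made for illustration, and practical pipelines typically operate on array types instead.

```python
def dice_masks(pred, ref):
    """Dice score of two equally sized binary masks given as nested lists of 0/1."""
    a = [v for row in pred for v in row]  # flatten prediction
    b = [v for row in ref for v in row]   # flatten reference
    if len(a) != len(b):
        raise ValueError("masks must have the same number of pixels")
    dot = sum(p * q for p, q in zip(a, b))                # a · b = overlap size
    norm = sum(p * p for p in a) + sum(q * q for q in b)  # |a|^2 + |b|^2
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if norm == 0 else 2 * dot / norm


pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
ref = [[0, 1, 1],
       [0, 1, 1],
       [0, 0, 0]]

# 3 predicted pixels, 4 reference pixels, 3 shared: 2*3 / (3 + 4) = 6/7
print(dice_masks(pred, ref))
```

The prediction misses one reference pixel, so the score falls below 1, to 6/7 (about 0.857).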
Abundance version
The expression is easily extended to abundance instead of presence/absence of species. This quantitative version is known by several names: