Damerau–Levenshtein distance

In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations required to change one word into the other.
The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations.
In his seminal paper, Damerau stated that more than 80% of all human misspellings can be expressed by a single error of one of the four types. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between protein sequences.

Definition

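To express the Damerau–Levenshtein distance between two strings $a$ and $b$, a function $d_{a,b}(i, j)$ is defined, whose value is the distance between an $i$-symbol prefix (initial substring) of string $a$ and a $j$-symbol prefix of $b$.
The restricted distance function is defined recursively as:
$$
d_{a,b}(i, j) = \min
\begin{cases}
0 & \text{if } i = j = 0, \\
d_{a,b}(i-1, j) + 1 & \text{if } i > 0, \\
d_{a,b}(i, j-1) + 1 & \text{if } j > 0, \\
d_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} & \text{if } i, j > 0, \\
d_{a,b}(i-2, j-2) + 1 & \text{if } i, j > 1 \text{ and } a_i = b_{j-1} \text{ and } a_{i-1} = b_j,
\end{cases}
$$
where $1_{(a_i \neq b_j)}$ is the indicator function equal to 0 when $a_i = b_j$ and equal to 1 otherwise.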
Each recursive call matches one of the cases covered by the Damerau–Levenshtein distance:

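$d_{a,b}(i-1, j) + 1$ corresponds to a deletion (from $a$ to $b$).
$d_{a,b}(i, j-1) + 1$ corresponds to an insertion (from $a$ to $b$).
$d_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)}$ corresponds to a match or mismatch, depending on whether the respective symbols are the same.
$d_{a,b}(i-2, j-2) + 1$ corresponds to a transposition between two successive symbols.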

The Damerau–Levenshtein distance between $a$ and $b$ is then given by the function value for the full strings: $d_{a,b}(|a|, |b|)$, where $|a|$ denotes the length of string $a$ and $|b|$ is the length of $b$.
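
The recurrence for the restricted distance can be evaluated directly, top-down, with memoization. The following Python sketch is illustrative only: the function name `restricted_distance` is an assumption, and because the recursion depth grows with the string lengths it is suitable only for short inputs.

```python
from functools import lru_cache


def restricted_distance(a: str, b: str) -> int:
    """Evaluate the restricted-distance recurrence top-down (illustrative sketch)."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the i-symbol prefix of a
        # and the j-symbol prefix of b.
        if i == 0 and j == 0:
            return 0
        candidates = []
        if i > 0:
            candidates.append(d(i - 1, j) + 1)  # deletion
        if j > 0:
            candidates.append(d(i, j - 1) + 1)  # insertion
        if i > 0 and j > 0:
            # Python strings are zero-indexed, so the i-th symbol is a[i - 1].
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            candidates.append(d(i - 1, j - 1) + mismatch)  # match or mismatch
        if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
            candidates.append(d(i - 2, j - 2) + 1)  # transposition
        return min(candidates)

    return d(len(a), len(b))
```

For example, `restricted_distance("ab", "ba")` evaluates to 1, because a single transposition suffices.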

Algorithm

Presented here are two algorithms: the first, simpler one computes what is known as the optimal string alignment distance or restricted edit distance, while the second computes the Damerau–Levenshtein distance with adjacent transpositions. Supporting transpositions adds significant complexity. The two algorithms differ in that the optimal string alignment algorithm computes the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once, whereas the second imposes no such restriction.
Take for example the edit distance between CA and ABC. The Damerau–Levenshtein distance LD(CA, ABC) = 2 because CA → AC → ABC, but the optimal string alignment distance OSA(CA, ABC) = 3: if the operation CA → AC is used, it is not possible to use AC → ABC, because that would edit the same substring more than once, which OSA does not allow. The shortest sequence of operations is therefore CA → A → AB → ABC. Note that for the optimal string alignment distance, the triangle inequality does not hold: OSA(CA, AC) + OSA(AC, ABC) < OSA(CA, ABC), and so it is not a true metric.

Optimal string alignment distance

Optimal string alignment distance can be computed using a straightforward extension of the Wagner–Fischer dynamic programming algorithm that computes Levenshtein distance. In pseudocode:
algorithm OSA-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer

    let d[0..length(a), 0..length(b)] be a 2-d array of integers, dimensions length(a)+1, length(b)+1
    // note that d is zero-indexed, while a and b are one-indexed.

    for i := 0 to length(a) inclusive do
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[0, j] := j

    for i := 1 to length(a) inclusive do
        for j := 1 to length(b) inclusive do
            if a[i] = b[j] then
                cost := 0
            else
                cost := 1
            d[i, j] := minimum(d[i-1, j] + 1,       // deletion
                               d[i, j-1] + 1,       // insertion
                               d[i-1, j-1] + cost)  // substitution
            if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then
                d[i, j] := minimum(d[i, j],
                                   d[i-2, j-2] + 1)  // transposition
    return d[length(a), length(b)]
The difference from the algorithm for Levenshtein distance is the addition of one recurrence:
    if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then
        d[i, j] := minimum(d[i, j],
                           d[i-2, j-2] + 1)  // transposition
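
As a rough illustration, the optimal string alignment procedure can be transcribed into Python as follows. This is a sketch rather than a reference implementation: the name `osa_distance` is an assumption, and the string indices are shifted by one because Python strings are zero-indexed.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted edit) distance, illustrative sketch."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the first i symbols of a
    # and the first j symbols of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

With this sketch, `osa_distance("CA", "AC")` and `osa_distance("AC", "ABC")` are both 1 while `osa_distance("CA", "ABC")` is 3, which reproduces the triangle inequality violation for the strings CA, AC and ABC.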

Distance with adjacent transpositions

The following algorithm computes the true Damerau–Levenshtein distance with adjacent transpositions; it requires as an additional parameter the size of the alphabet Σ, so that all entries of the arrays are in [0, |Σ|):
algorithm DL-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer

    da := new array of |Σ| integers
    for i := 1 to |Σ| inclusive do
        da[i] := 0

    let d[-1..length(a), -1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2
    // note that d has indices starting at -1, while a, b and da are one-indexed.

    maxdist := length(a) + length(b)
    d[-1, -1] := maxdist
    for i := 0 to length(a) inclusive do
        d[i, -1] := maxdist
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[-1, j] := maxdist
        d[0, j] := j

    for i := 1 to length(a) inclusive do
        db := 0
        for j := 1 to length(b) inclusive do
            k := da[b[j]]
            ℓ := db
            if a[i] = b[j] then
                cost := 0
                db := j
            else
                cost := 1
            d[i, j] := minimum(d[i-1, j-1] + cost,  // substitution
                               d[i,   j-1] + 1,     // insertion
                               d[i-1, j  ] + 1,     // deletion
                               d[k-1, ℓ-1] + (i-k-1) + 1 + (j-ℓ-1))  // transposition
        da[a[i]] := i
    return d[length(a), length(b)]
To devise a proper algorithm to calculate the unrestricted Damerau–Levenshtein distance, note that there always exists an optimal sequence of edit operations in which once-transposed letters are never modified afterwards. Thus, only two symmetric ways of modifying a substring more than once need to be considered: transpose letters and insert an arbitrary number of characters between them, or delete a sequence of characters and transpose letters that become adjacent after the deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity, $O(M \cdot N \cdot \max(M, N))$, where M and N are the string lengths. Using the ideas of Lowrance and Wagner, this naive algorithm can be improved to $O(M \cdot N)$ in the worst case, which is what the above pseudocode does.
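
The quadratic-time procedure can be sketched in Python as shown below. Two choices here are assumptions made for illustration and are not part of the pseudocode: the function name `dl_distance`, and the use of a dictionary in place of the alphabet-sized array `da`, which removes the need to know the alphabet size in advance. The table `d` is shifted by one in each dimension so that the pseudocode index -1 maps to Python index 0.

```python
def dl_distance(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance, illustrative sketch."""
    len_a, len_b = len(a), len(b)
    maxdist = len_a + len_b

    # Pseudocode entry d[x, y] is stored at d[x + 1][y + 1].
    d = [[0] * (len_b + 2) for _ in range(len_a + 2)]
    d[0][0] = maxdist
    for i in range(len_a + 1):
        d[i + 1][0] = maxdist
        d[i + 1][1] = i
    for j in range(len_b + 1):
        d[0][j + 1] = maxdist
        d[1][j + 1] = j

    # last_row[c]: last row (1-based position in a) where symbol c occurred,
    # playing the role of the array da; missing symbols count as 0.
    last_row = {}
    for i in range(1, len_a + 1):
        db = 0  # last column in this row where a[i] matched b[j]
        for j in range(1, len_b + 1):
            k = last_row.get(b[j - 1], 0)
            ell = db
            if a[i - 1] == b[j - 1]:
                cost = 0
                db = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # substitution
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                d[k][ell] + (i - k - 1) + 1 + (j - ell - 1),  # transposition
            )
        last_row[a[i - 1]] = i
    return d[len_a + 1][len_b + 1]
```

Under these assumptions, `dl_distance("CA", "ABC")` evaluates to 2, corresponding to the sequence CA → AC → ABC.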
It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of for an example of such an adaptation.

Applications

Damerau–Levenshtein distance plays an important role in natural language processing. In natural languages, strings are short and the number of errors rarely exceeds 2. In such circumstances, the restricted and unrestricted edit distances differ very rarely. Oommen and Loke even mitigated the limitation of the restricted edit distance by introducing generalized transpositions. Nevertheless, the restricted edit distance does not in general satisfy the triangle inequality and thus cannot be used with metric trees.

DNA

Since DNA frequently undergoes insertions, deletions, substitutions, and transpositions, and each of these operations occurs on approximately the same timescale, the Damerau–Levenshtein distance is an appropriate metric of the variation between two strands of DNA. More common in DNA, protein, and other bioinformatics related alignment tasks is the use of closely related algorithms such as Needleman–Wunsch algorithm or Smith–Waterman algorithm.

Fraud detection

The algorithm can be used with any set of words, such as vendor names. Because vendor entry is manual by nature, there is a risk that a false vendor is entered. A fraudulent employee may enter a real vendor such as "Rich Heir Estate Services" alongside a false vendor "Rich Hier State Services". The fraudster would then create a false bank account and have the company route checks to both the real vendor and the false vendor. The Damerau–Levenshtein algorithm detects the transposed and dropped letters and brings the items to the attention of a fraud examiner.
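
A screening step of this kind might look like the following Python sketch. Everything except the two vendor names quoted above is hypothetical: the function names, the third vendor "Acme Paper Supply", the case-insensitive comparison, and the threshold of 2 edits are assumptions chosen for illustration. The sketch uses the restricted (optimal string alignment) variant for brevity.

```python
from itertools import combinations


def osa_distance(a: str, b: str) -> int:
    """Restricted edit distance with adjacent transpositions (sketch)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def flag_similar_vendors(names, max_distance=2):
    """Return pairs of distinct names within max_distance edits of each other."""
    flagged = []
    for first, second in combinations(names, 2):
        # Compare case-insensitively so capitalization is not counted as an edit.
        dist = osa_distance(first.lower(), second.lower())
        if 0 < dist <= max_distance:
            flagged.append((first, second, dist))
    return flagged


vendors = [
    "Rich Heir Estate Services",
    "Rich Hier State Services",
    "Acme Paper Supply",  # hypothetical unrelated vendor
]
suspicious = flag_similar_vendors(vendors)
```

Here `suspicious` contains the single pair of "Rich Heir Estate Services" and "Rich Hier State Services" at distance 2: one transposition ("ei" to "ie") and one dropped letter.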

Export control

The U.S. Government uses the Damerau–Levenshtein distance with its Consolidated Screening List API.
