Suffix tree
In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.
The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern, etc. Suffix trees also provide one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.
Definition
The suffix tree for the string S of length n is defined as a tree such that:
- The tree has exactly n leaves numbered from 1 to n.
- Except for the root, every internal node has at least two children.
- Each edge is labelled with a non-empty substring of S.
- No two edges starting out of a node can have string-labels beginning with the same character.
- The string obtained by concatenating all the string-labels found on the path from the root to leaf i spells out suffix S[i..n], for i from 1 to n.
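As a concrete illustration of this definition, the following sketch builds such a tree naively by inserting every suffix of a terminated string into a compressed trie, splitting edges on mismatches. The names (Node, build_naive_suffix_tree) are illustrative rather than a standard API, and this construction takes quadratic time; the linear-time algorithms discussed below are considerably more involved.

```python
class Node:
    def __init__(self):
        self.children = {}        # first character of an edge label -> (edge label, child node)
        self.suffix_start = None  # starting index of the suffix, set for leaves only

def build_naive_suffix_tree(text):
    """Naive O(n^2) construction. `text` is assumed to end with a terminator
    (e.g. '$') that occurs nowhere else, so no suffix is a prefix of another
    and every suffix receives its own leaf."""
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]       # insert the suffix starting at i
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()             # no edge starts with this character: new leaf
                leaf.suffix_start = i
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0                         # length of the common prefix of label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):           # the edge is fully matched: descend
                node, rest = child, rest[k:]
            else:                         # mismatch inside the edge: split it
                mid = Node()
                node.children[first] = (label[:k], mid)
                mid.children[label[k]] = (label[k:], child)
                leaf = Node()
                leaf.suffix_start = i
                mid.children[rest[k]] = (rest[k:], leaf)
                break
    return root

tree = build_naive_suffix_tree("banana$")   # 7 leaves, one per suffix of "banana$"
```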
Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a (possibly empty) string, then the node has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

A generalized suffix tree is a suffix tree made for a set of strings instead of a single string. It represents all suffixes from this set of strings. Each string must be terminated by a different termination symbol.
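As a small sketch of that terminator convention (the helper names below are hypothetical), the input strings can be joined with distinct termination symbols and fed to any suffix tree builder, such as the naive one sketched above; positions in the concatenated text are then mapped back to (string, offset) pairs. A full implementation would additionally trim edge labels that run past a terminator.

```python
def generalized_text(strings, terminators="#$%&"):
    """Concatenate the strings, each followed by its own unique terminator."""
    assert len(strings) <= len(terminators), "need one distinct terminator per string"
    return "".join(s + t for s, t in zip(strings, terminators))

def locate(position, strings):
    """Map a position in the concatenated text back to (string index, offset)."""
    for index, s in enumerate(strings):
        if position <= len(s):            # a terminator position maps to the empty suffix
            return index, position
        position -= len(s) + 1            # skip this string and its terminator
    raise IndexError("position lies past the end of the concatenated text")

strings = ["banana", "ananas"]
gst = build_naive_suffix_tree(generalized_text(strings))   # generalized suffix tree sketch
print(locate(7, strings))                                  # (1, 0): the suffix "ananas"
```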
History
The concept was first introduced by Weiner (1973). Rather than the suffix S[i..n], Weiner stored in his trie the prefix identifier for each position, that is, the shortest string starting at i and occurring only once in S. His Algorithm D takes an uncompressed trie for S[k+1..n] and extends it into a trie for S[k..n]. This way, starting from the trivial trie for S[n..n], a trie for S[1..n] can be built by n-1 successive calls to Algorithm D; however, the overall run time is O(n²). Weiner's Algorithm B maintains several auxiliary data structures to achieve an overall run time linear in the size of the constructed trie. The latter can still be O(n²) nodes, e.g. for S = aⁿbⁿaⁿbⁿ$. Weiner's Algorithm C finally uses compressed tries to achieve linear overall storage size and run time.
Donald Knuth subsequently characterized the latter as "Algorithm of the Year 1973".
The textbook by Aho, Hopcroft and Ullman (1974) reproduced Weiner's results in a simplified and more elegant form, introducing the term position tree.
McCreight (1976) was the first to build a trie of all suffixes of S. Although the suffix starting at i is usually longer than the prefix identifier, their path representations in a compressed trie do not differ in size. On the other hand, McCreight could dispense with most of Weiner's auxiliary data structures; only suffix links remained.
Ukkonen (1995) further simplified the construction. He provided the first online construction of suffix trees, now known as Ukkonen's algorithm, with running time that matched the then fastest algorithms.
These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of O(n log n) in general.
Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach's algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc.
Functionality
A suffix tree for a string S of length n can be built in Θ(n) time, if the letters come from an alphabet of integers in a polynomial range. For larger alphabets, the running time is dominated by first sorting the letters to bring them into a range of size O(n); in general, this takes O(n log n) time.
The costs below are given under the assumption that the alphabet is constant.
Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S_1, ..., S_K} of total length n = n_1 + ... + n_K.
You can:
- Search for strings:
- * Check if a string P of length m is a substring in O(m) time (a sketch of this walk is given after the list).
- * Find the first occurrence of the patterns P_1, ..., P_q of total length m as substrings in O(m) time.
- * Find all z occurrences of the patterns P_1, ..., P_q of total length m as substrings in O(m + z) time.
- * Search for a regular expression P in time expected sublinear in n.
- * Find, for each suffix of a pattern P, the length of the longest match between a prefix of P[i..m] and a substring in D in Θ(m) time. This is termed the matching statistics for P.
- Find properties of the strings:
- * Find the longest common substrings of the strings S_i and S_j in Θ(n_i + n_j) time.
- * Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time.
- * Find the Lempel–Ziv decomposition in Θ(n) time.
- * Find the longest repeated substrings in Θ(n) time.
- * Find the most frequently occurring substrings of a minimum length in Θ(n) time.
- * Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
- * Find the shortest substrings occurring only once in Θ(n) time.
- * Find, for each i, the shortest substrings of S_i not occurring elsewhere in D in Θ(n) time.
- Find the longest common prefix between the suffixes S_i[p..n_i] and S_j[q..n_j] in Θ(1) time, after the tree has been preprocessed in Θ(n) time for constant-time lowest common ancestor queries.
- Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits.
- Find all maximal palindromes in Θ(n) time, or Θ(gn) time if gaps of length g are allowed, or Θ(kn) time if k mismatches are allowed.
- Find all tandem repeats in O(n log n + z) time, and k-mismatch tandem repeats in O(kn log(n/k) + z) time.
- Find the longest common substrings to at least k strings in D, for each k from 2 to K, in Θ(n) time.
- Find the longest palindromic substring of a given string in linear time.
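For example, the substring check listed first above amounts to walking down from the root while comparing the pattern against edge labels, taking O(m) character comparisons. The sketch below assumes the naive Node representation from the Definition section; it is an illustration, not a reference implementation.

```python
def is_substring(root, pattern):
    """Return True if `pattern` occurs in the indexed text: O(m) time."""
    node, i = root, 0
    while i < len(pattern):
        edge = node.children.get(pattern[i])
        if edge is None:                     # no edge starts with the next pattern character
            return False
        label, child = edge
        j = 0
        while j < len(label) and i < len(pattern):
            if pattern[i] != label[j]:       # mismatch inside the edge
                return False
            i += 1
            j += 1
        node = child                         # edge consumed (or pattern exhausted): continue below
    return True                              # the whole pattern lies on a root-to-node path

tree = build_naive_suffix_tree("banana$")    # builder sketched in the Definition section
assert is_substring(tree, "nan") and not is_substring(tree, "nab")
```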
Applications
- String search, in O(m) complexity, where m is the length of the sub-string
- Finding the longest repeated substring (a sketch is given after this list)
- Finding the longest common substring
- Finding the longest palindrome in a string
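The longest repeated substring, for instance, is the string spelled by the deepest internal node, because an internal node exists exactly where at least two suffixes share that path as a prefix. The sketch below again assumes the naive representation from the Definition section and concatenates path strings for simplicity, which makes it quadratic; the usual linear-time version tracks string depth instead.

```python
def longest_repeated_substring(root):
    """Return the string spelled by the deepest internal node of the tree."""
    best = ""
    stack = [(root, "")]                     # depth-first traversal carrying the path label
    while stack:
        node, path = stack.pop()
        if node.children:                    # internal node: its path occurs at least twice
            if len(path) > len(best):
                best = path
            for label, child in node.children.values():
                stack.append((child, path + label))
    return best

print(longest_repeated_substring(build_naive_suffix_tree("banana$")))   # "ana"
```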
Implementation
If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of all the strings on all of the edges in the tree is O(n²), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci word, giving the full 2n nodes.

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running time properties use hash maps, sorted or unsorted arrays, or balanced search trees. We are interested in:
- The cost of finding the child on a given character.
- The cost of inserting a child.
- The cost of enlisting all children of a node.
Note that the insertion cost given is amortised, and that the costs for hashing are given for perfect hashing.
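As a hypothetical sketch of the sibling-list representation described above, each node below stores only a first-child pointer, a next-sibling pointer, and the (position, length) encoding of its edge label in the text. Finding a child on a given character is then a scan over at most σ siblings, while inserting a child at the front of the list takes constant time.

```python
class SiblingListNode:
    __slots__ = ("edge_start", "edge_length", "first_child", "next_sibling")

    def __init__(self, edge_start=0, edge_length=0):
        self.edge_start = edge_start      # edge label stored as a (position, length) slice of the text
        self.edge_length = edge_length
        self.first_child = None           # head of this node's child list
        self.next_sibling = None          # next node in the parent's child list

def find_child(text, node, c):
    """Scan the sibling list for the child whose edge label starts with c: O(sigma)."""
    child = node.first_child
    while child is not None:
        if text[child.edge_start] == c:
            return child
        child = child.next_sibling
    return None

def insert_child(node, child):
    """Prepend to the sibling list: O(1) per insertion."""
    child.next_sibling = node.first_child
    node.first_child = child
```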
The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of 8. This factor depends on the implementation's properties and may reach 2 with the use of 4-byte wide characters on 32-bit systems. Researchers have continued to find smaller indexing structures.
Parallel construction
Various parallel algorithms to speed up suffix tree construction have been proposed. Recently, a practical parallel algorithm for suffix tree construction with O(n) work and O(log² n) span has been developed. The algorithm achieves good parallel scalability on shared-memory multicore machines and can index the human genome (approximately 3 GB) in under 3 minutes using a 40-core machine.
External construction
Though linear, the memory usage of a suffix tree is significantly higher than the actual size of the sequence collection. For a large text, construction may require external memory approaches.

There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach-Colton, Ferragina and Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting. However, the overall intricacy of this algorithm has so far prevented its practical implementation.
On the other hand, there have been practical works for constructing disk-based suffix trees which scale to gigabytes per hour. The state-of-the-art methods are TDD, TRELLIS, DiGeST, and B2ST.
TDD and TRELLIS scale up to the entire human genome, resulting in a disk-based suffix tree of a size in the tens of gigabytes. However, these methods cannot efficiently handle collections of sequences exceeding 3 GB. DiGeST performs significantly better and is able to handle collections of sequences on the order of 6 GB in about 6 hours.
All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B2ST, scales to handle inputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16 GB RAM. On a simple Linux cluster with 16 nodes, ERA can index the entire human genome in less than 9 minutes.