Suffix automaton


In computer science, a suffix automaton is the smallest partial deterministic finite automaton that recognizes the set of suffixes of a given string. The state graph of the suffix automaton is called the directed acyclic word graph. The term DAWG is also sometimes used for any deterministic acyclic finite state automaton. The term suffix automaton is also sometimes used for any automaton recognizing suffixes of a text.
For example, the suffix automaton of the string "suffix" accepts the strings "suffix", "uffix", "ffix", "fix", "ix", "x" and the empty string. The automaton can be thought of as a compressed form of the trie of all suffixes of a text. It can be constructed in linear time in the length of the string. For a string of length at least 2, the suffix automaton has at most states and at most transitions.
Suffix automata have applications in approximate string matching.

Properties

The states of the suffix automaton correspond to classes of end-position equivalent strings, defined as follows:
Suppose we have a string. We say that substrings and of are end-position equivalent if the sets of all end positions of occurrences of and in are the same. This relation partitions the substrings of into equivalence classes which are in a one-to-one correspondence with the states of the suffix automaton. The strings in the class of a node are spelled out by paths from the source to.
We call the longest string in an equivalence class of a state the representative of that class, and it is spelled out by the longest path from the source to. The automaton has one sink state, which represents the whole string.
There is an edge labelled with from the state represented by to the state represented by if and only if is end-position equivalent with. Each edge of the suffix automaton is either primary or secondary. An edge labelled with character from the state represented by is primary if and only if is the representative of an equivalence class. The distinction between primary and secondary edges is important in the construction of the automaton and the analysis of the size of the automaton. For practical applications the automaton is also usually augmented with suffix pointers, which point from a state represented by to the state represented by the longest proper suffix of that represents a class.

Applications

The suffix automaton of can be used to efficiently decide whether a query string is a substring of. This is true if and only if there exists a path starting from the source of the automaton that spells. This works because every substring of is a prefix of a suffix of.
When combined with dynamic programming, the automaton can be used to answer many interesting questions about. For example, the number of distinct substrings of is equal to the number of distinct paths in the automaton starting from the source, which can be counted using dynamic programming on the automaton. The number of occurrences of any substring of can be counted by finding the state such that a path from the source to spells and then counting the number of paths from to the sink state of the automaton.

Relationships with other suffix structures

The directed acyclic word graph of can be obtained by taking the trie of suffixes of and merging all isomorphic subtrees. The DAWG also has another deep connection with the suffix tree of : The suffix pointers of the DAWG of form a tree which is identical to the suffix tree of the reverse of.
Replacing all non-branching paths of the DAWG with a single edge gives a data structure that is called the compact DAWG, or the CDAWG of. The CDAWG is identical to the DAG obtained by merging all isomorphic subtrees of the suffix tree of.