(appeared in GInfo 15/7, November 2005)
Suffix arrays – a programming contest approach
Adrian Vladu and Cosmin Negru
ş
eri
Summary
An important algorithmic area often used in practical tasks is that of algorithms on character
strings. Many programming contests have used problems that could have been easily solved if one
efficiently managed to determine if one string was a subsequence of another string, or had found an order
relation within a string’s suffixes. We shall present a versatile structure that allows this along with other
useful operations on a given string.
Keywords
:
suffix sorting, suffix arrays, suffix trees
1
1 Introduction
What are suffix arrays?
In order to let the reader gain a better vista on suffix arrays, we shall make a short presentation of
two
data structures called
trie
, respectively suffix tree [1] – which is a special case of a trie. A trie is a tree
meant to store strings. Each of its nodes will have a number of sons equal to the size of the alphabet used
by the strings that are needed to be stored. In our case, with strings containing only small letters of the
English alphabet, each node will have at most 26 sons. Every edge going from the
father toward its sons is
labeled with a different letter of the alphabet. The labels on a path starting from the root and ending in a
leaf will form a string stored in that tree. As it can be easily seen, finding whether a string is contained in
this data structure is very efficient and is done in O(M) time, where M is the string’s length. Therefore, the
searching time does not depend on the number of words stored in the structure,
this making it an ideal
structure for implementing dictionaries.
Let’s see what a suffix trie is:
Given a string A = a
0
a
1
…a
n – 1,
denote by A
i
= a
i
a
i + 1
…a
n – 1
the suffix of A that begins at position i. Let n =
length of A. The suffix trie is made by compressing all the suffixes of A
1
…A
n – 1
into a trie, as in the
figure below.
The suffix trie corresponding to the string “abac” is:
2
Operations on this structure are very easily done:
-
checking whether a string W is a substring of A – it is enough to traverse the nodes starting from
the root and going through the edges labeled correspondingly to the characters in W (complexity
O(|W|))
-
searching the longest common prefix for two suffixes of A – choose nodes u and v in the trie,
corresponding to the ends of the two suffixes, then, with a LCA algorithm (least
common
ancestor), find the node corresponding to the end of the searched prefix. For example, for “abac”
and “ac”, the corresponding nodes are 5 and 6. Their least common ancestor is 2, that gives the
prefix “a”. The authors are strongly recommending [2] for an O(
√
n) solution, [3] for an accessible
presentation of a solution in O(lg n) or O(1), and [4] for a state of the art algorithm.
-
finding the k-th suffix in lexicographic order - (complexity O(n), with a corresponding
preprocessing). For example, the 3
rd
suffix of “abac” is represented in the trie by the 3
rd
leaf.
Even if the idea of a suffix trie would be
very pleasing at first sight, the simplist implementation,
where at every step one of the strings suffixes is inserted into the structure leads to an O(n
2
) complexity
algorithm.There is a structure called suffix tree[1] that can be built in linear time, which is a suffix trie
where the chains containing only nodes with the out-degree equal to 1 were compressed into a single edge
(in the example above, these are represented by the chains 2 -3 – 4 – 5 and 1 – 7 – 8 – 9).
Implementing
the linear algorithm is scarcely possible in a short time, such as during a contest, this determining us to
search another structure, easier to implement.
Let’s see which are the suffixes of A, by a depth first traversal of the trie. Noting
that during the depth
first search we have to consider the nodes in the ascending lexicographic order of the edges linking them
to their father, we gain the following suffix array:
: