2016-12-28 23:54:51 +01:00
|
|
|
|
\chapter{String algorithms}
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
This chapter deals with efficient algorithms
|
2017-02-11 18:12:29 +01:00
|
|
|
|
for string processing.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
Many string problems can be easily solved
|
|
|
|
|
in $O(n^2)$ time, but the challenge is to
|
|
|
|
|
find algorithms that work in $O(n)$ or $O(n \log n)$
|
2017-02-18 17:27:38 +01:00
|
|
|
|
time.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
|
|
|
|
|
\index{pattern matching}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
For example, a fundamental string processing
|
|
|
|
|
problem is the \key{pattern matching} problem:
|
2017-02-11 18:11:50 +01:00
|
|
|
|
given a string of length $n$ and a pattern of length $m$,
|
2017-05-12 20:15:25 +02:00
|
|
|
|
our task is to find the occurrences of the pattern
|
|
|
|
|
in the string.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
For example, the pattern \texttt{ABC} occurs two
|
|
|
|
|
times in the string \texttt{ABABCBABC}.
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
The pattern matching problem can be easily solved
|
2017-02-11 18:11:50 +01:00
|
|
|
|
in $O(nm)$ time by a brute force algorithm that
|
2017-05-12 20:15:25 +02:00
|
|
|
|
tests all positions where the pattern may
|
2017-02-11 18:11:50 +01:00
|
|
|
|
occur in the string.
|
2017-02-18 17:27:38 +01:00
|
|
|
|
However, in this chapter, we will see that there
|
2017-02-11 18:11:50 +01:00
|
|
|
|
are more efficient algorithms that require only
|
|
|
|
|
$O(n+m)$ time.
|
|
|
|
|
|
2017-01-22 12:15:41 +01:00
|
|
|
|
\index{string}
|
|
|
|
|
|
2017-02-18 17:27:38 +01:00
|
|
|
|
\section{String terminology}
|
2017-02-11 18:11:50 +01:00
|
|
|
|
|
|
|
|
|
\index{alphabet}
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
Throughout the chapter, we assume that
|
|
|
|
|
zero-based indexing is used in strings.
|
|
|
|
|
Thus, a string \texttt{s} of length $n$
|
|
|
|
|
consists of characters
|
|
|
|
|
$\texttt{s}[0],\texttt{s}[1],\ldots,\texttt{s}[n-1]$.
|
|
|
|
|
The set of characters that may appear
|
|
|
|
|
in strings is called an \key{alphabet}.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
For example, the alphabet
|
|
|
|
|
$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
|
|
|
|
|
consists of the capital letters of English.
|
|
|
|
|
|
|
|
|
|
\index{substring}
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
A \key{substring} is a sequence of consecutive
|
2017-05-12 20:15:25 +02:00
|
|
|
|
characters in a string.
|
|
|
|
|
We use the notation $\texttt{s}[a \ldots b]$
|
|
|
|
|
to refer to a substring of \texttt{s}
|
|
|
|
|
that begins at position $a$ and ends at position $b$.
|
|
|
|
|
A string of length $n$ has $n(n+1)/2$ substrings.
|
|
|
|
|
For example, the substrings of
|
2017-02-11 18:11:50 +01:00
|
|
|
|
\texttt{ABCD} are
|
|
|
|
|
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
|
|
|
|
|
\texttt{AB}, \texttt{BC}, \texttt{CD},
|
|
|
|
|
\texttt{ABC}, \texttt{BCD} and \texttt{ABCD}.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
|
|
|
|
\index{subsequence}
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
A \key{subsequence} is a sequence of
|
|
|
|
|
(not necessarily consecutive) characters
|
2017-05-12 20:15:25 +02:00
|
|
|
|
in a string in their original order.
|
|
|
|
|
A string of length $n$ has $2^n-1$ subsequences.
|
|
|
|
|
For example, the subsequences of
|
2017-02-11 18:11:50 +01:00
|
|
|
|
\texttt{ABCD} are
|
|
|
|
|
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
|
|
|
|
|
\texttt{AB}, \texttt{AC}, \texttt{AD},
|
|
|
|
|
\texttt{BC}, \texttt{BD}, \texttt{CD},
|
|
|
|
|
\texttt{ABC}, \texttt{ABD}, \texttt{ACD},
|
|
|
|
|
\texttt{BCD} and \texttt{ABCD}.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
|
|
|
|
\index{prefix}
|
|
|
|
|
\index{suffix}
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
A \key{prefix} is a subtring that starts at the beginning
|
|
|
|
|
of a string,
|
|
|
|
|
and a \key{suffix} is a substring that ends at the end
|
|
|
|
|
of a string.
|
2017-05-12 20:15:25 +02:00
|
|
|
|
For example,
|
|
|
|
|
the prefixes of \texttt{ABCD} are
|
|
|
|
|
\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD},
|
|
|
|
|
and the suffixes of \texttt{ABCD} are
|
2017-02-11 18:11:50 +01:00
|
|
|
|
\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
|
|
|
|
\index{rotation}
|
|
|
|
|
|
|
|
|
|
A \key{rotation} can be generated by moving
|
2017-05-12 20:15:25 +02:00
|
|
|
|
the characters of a string one by one from the beginning
|
|
|
|
|
to the end (or vice versa).
|
|
|
|
|
For example, the rotations of \texttt{ABCD} are
|
2017-02-11 18:11:50 +01:00
|
|
|
|
\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
|
|
|
|
\index{period}
|
|
|
|
|
|
|
|
|
|
A \key{period} is a prefix of a string such that
|
2017-02-11 18:11:50 +01:00
|
|
|
|
the string can be constructed by repeating the period.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
The last repetition may be partial and contain
|
|
|
|
|
only a prefix of the period.
|
|
|
|
|
For example, the shortest period of
|
|
|
|
|
\texttt{ABCABCA} is \texttt{ABC}.
|
|
|
|
|
|
|
|
|
|
\index{border}
|
|
|
|
|
|
|
|
|
|
A \key{border} is a string that is both
|
|
|
|
|
a prefix and a suffix of a string.
|
2017-05-12 20:15:25 +02:00
|
|
|
|
For example, the borders of \texttt{ABACABA}
|
2017-02-11 18:11:50 +01:00
|
|
|
|
are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
|
|
|
|
\index{lexicographical order}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
Strings are compared using the \key{lexicographical order}
|
|
|
|
|
(which corresponds to the alphabetical order).
|
2017-02-11 18:11:50 +01:00
|
|
|
|
It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
|
|
|
|
|
or there is a position $k$ such that
|
2017-01-22 12:15:41 +01:00
|
|
|
|
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
|
|
|
|
|
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\section{Trie structure}
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
|
|
|
|
\index{trie}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
A \key{trie} is a rooted tree that
|
2017-01-22 12:15:41 +01:00
|
|
|
|
maintains a set of strings.
|
2017-05-12 20:15:25 +02:00
|
|
|
|
Each string in the set is stored as
|
|
|
|
|
a chain of characters that starts at the root.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
If two strings have a common prefix,
|
2017-02-11 18:11:50 +01:00
|
|
|
|
they also have a common chain in the tree.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
For example, consider the following trie:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.9]
|
|
|
|
|
\node[draw, circle] (1) at (0,20) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (2) at (-1.5,19) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (3) at (1.5,19) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (4) at (-1.5,17.5) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (5) at (-1.5,16) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (6) at (-2.5,14.5) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (7) at (-0.5,14.5) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (8) at (-2.5,13) {*};
|
|
|
|
|
\node[draw, circle] (9) at (-0.5,13) {*};
|
|
|
|
|
\node[draw, circle] (10) at (1.5,17.5) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (11) at (1.5,16) {*};
|
|
|
|
|
\node[draw, circle] (12) at (1.5,14.5) {$\phantom{1}$};
|
|
|
|
|
\node[draw, circle] (13) at (1.5,13) {*};
|
|
|
|
|
|
2017-01-22 12:15:41 +01:00
|
|
|
|
\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{C}] {} (2);
|
|
|
|
|
\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{T}] {} (3);
|
|
|
|
|
\path[draw,thick,->] (2) -- node[font=\small,label=left:\texttt{A}] {} (4);
|
|
|
|
|
\path[draw,thick,->] (4) -- node[font=\small,label=left:\texttt{N}] {} (5);
|
|
|
|
|
\path[draw,thick,->] (5) -- node[font=\small,label=left:\texttt{A}] {} (6);
|
|
|
|
|
\path[draw,thick,->] (5) -- node[font=\small,label=right:\texttt{D}] {} (7);
|
|
|
|
|
\path[draw,thick,->] (6) -- node[font=\small,label=left:\texttt{L}] {}(8);
|
|
|
|
|
\path[draw,thick,->] (7) -- node[font=\small,label=right:\texttt{Y}] {} (9);
|
|
|
|
|
\path[draw,thick,->] (3) -- node[font=\small,label=right:\texttt{H}] {} (10);
|
|
|
|
|
\path[draw,thick,->] (10) -- node[font=\small,label=right:\texttt{E}] {} (11);
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\path[draw,thick,->] (11) -- node[font=\small,label=right:\texttt{R}] {} (12);
|
2017-01-22 12:15:41 +01:00
|
|
|
|
\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13);
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
2017-02-11 18:11:50 +01:00
|
|
|
|
|
|
|
|
|
This trie corresponds to the set
|
|
|
|
|
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
The character * in a node means that
|
2017-05-12 20:15:25 +02:00
|
|
|
|
a string in the set ends at the node.
|
2017-04-21 22:19:29 +02:00
|
|
|
|
Such a character is needed, because a string
|
2017-01-22 12:15:41 +01:00
|
|
|
|
may be a prefix of another string.
|
2017-04-21 22:19:29 +02:00
|
|
|
|
For example, in the above trie, \texttt{THE}
|
2017-02-11 18:11:50 +01:00
|
|
|
|
is a prefix of \texttt{THERE}.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
We can check in $O(n)$ time whether a trie
|
|
|
|
|
contains a string of length $n$,
|
2017-02-11 18:11:50 +01:00
|
|
|
|
because we can follow the chain that starts at the root node.
|
2017-05-12 20:15:25 +02:00
|
|
|
|
We can also add a string of length $n$ to the trie
|
|
|
|
|
in $O(n)$ time by first following the chain
|
|
|
|
|
and then adding new nodes to the trie if necessary.
|
|
|
|
|
|
|
|
|
|
Using a trie, we can find
|
|
|
|
|
the longest prefix of a given string
|
|
|
|
|
such that the prefix belongs to the set.
|
|
|
|
|
Moreover, by storing additional information
|
2017-02-11 18:11:50 +01:00
|
|
|
|
in each node,
|
2017-05-12 20:15:25 +02:00
|
|
|
|
we can calculate the number of
|
|
|
|
|
strings that belong to the set and have a
|
|
|
|
|
given string as a prefix.
|
2017-01-22 12:15:41 +01:00
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
A trie can be stored in an array
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\begin{lstlisting}
|
2017-05-12 20:15:25 +02:00
|
|
|
|
int trie[N][A];
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\end{lstlisting}
|
2017-01-22 12:15:41 +01:00
|
|
|
|
where $N$ is the maximum number of nodes
|
2017-02-11 18:11:50 +01:00
|
|
|
|
(the maximum total length of the strings in the set)
|
2017-01-22 12:15:41 +01:00
|
|
|
|
and $A$ is the size of the alphabet.
|
|
|
|
|
The nodes of a trie are numbered
|
2017-05-12 20:15:25 +02:00
|
|
|
|
$0,1,2,\ldots$ so that the number of the root is 0,
|
|
|
|
|
and $\texttt{trie}[s][c]$ is the next node in the chain
|
|
|
|
|
when we move from node $s$ using character $c$.
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\section{String hashing}
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\index{hashing}
|
|
|
|
|
\index{string hashing}
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\key{String hashing} is a technique that
|
|
|
|
|
allows us to efficiently check whether two
|
2017-04-21 22:19:29 +02:00
|
|
|
|
strings are equal\footnote{The technique
|
2017-02-21 17:10:08 +01:00
|
|
|
|
was popularized by the Karp–Rabin pattern matching
|
|
|
|
|
algorithm \cite{kar87}.}.
|
2017-05-12 20:15:25 +02:00
|
|
|
|
The idea in string hashing is to compare hash values of
|
2017-04-21 22:19:29 +02:00
|
|
|
|
strings instead of their individual characters.
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\subsubsection*{Calculating hash values}
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\index{hash value}
|
|
|
|
|
\index{polynomial hashing}
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
A \key{hash value} of a string is
|
|
|
|
|
a number that is calculated from the characters
|
|
|
|
|
of the string.
|
|
|
|
|
If two strings are the same,
|
|
|
|
|
their hash values are also the same,
|
|
|
|
|
which makes it possible to compare strings
|
|
|
|
|
based on their hash values.
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
A usual way to implement string hashing
|
2017-04-21 22:19:29 +02:00
|
|
|
|
is \key{polynomial hashing}, which means
|
2017-05-12 20:15:25 +02:00
|
|
|
|
that the hash value of a string \texttt{s}
|
|
|
|
|
of length $n$ is
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B ,\]
|
2017-05-12 20:15:25 +02:00
|
|
|
|
where $s[0],s[1],\ldots,s[n-1]$
|
|
|
|
|
are interpreted as the codes of the characters of \texttt{s},
|
2017-01-24 20:59:20 +01:00
|
|
|
|
and $A$ and $B$ are pre-chosen constants.
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
For example, the codes of the characters
|
2017-05-12 20:15:25 +02:00
|
|
|
|
of \texttt{ALLEY} are:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\draw (0,0) grid (5,2);
|
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\node at (0.5, 1.5) {\texttt{A}};
|
|
|
|
|
\node at (1.5, 1.5) {\texttt{L}};
|
|
|
|
|
\node at (2.5, 1.5) {\texttt{L}};
|
|
|
|
|
\node at (3.5, 1.5) {\texttt{E}};
|
|
|
|
|
\node at (4.5, 1.5) {\texttt{Y}};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\node at (0.5, 0.5) {65};
|
|
|
|
|
\node at (1.5, 0.5) {76};
|
|
|
|
|
\node at (2.5, 0.5) {76};
|
|
|
|
|
\node at (3.5, 0.5) {69};
|
|
|
|
|
\node at (4.5, 0.5) {89};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
Thus, if $A=3$ and $B=97$, the hash value
|
2017-05-12 20:15:25 +02:00
|
|
|
|
of \texttt{ALLEY} is
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\subsubsection*{Preprocessing}
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
Using polynomial hashing, we can calculate the hash value of any substring
|
|
|
|
|
of a string \texttt{s} in $O(1)$ time after an $O(n)$ time preprocessing.
|
|
|
|
|
The idea is to construct an array \texttt{h} such that
|
|
|
|
|
$\texttt{h}[k]$ contains the hash value of the prefix $\texttt{s}[0 \ldots k]$.
|
2017-01-24 20:59:20 +01:00
|
|
|
|
The array values can be recursively calculated as follows:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\[
|
|
|
|
|
\begin{array}{lcl}
|
2017-05-12 20:15:25 +02:00
|
|
|
|
\texttt{h}[0] & = & \texttt{s}[0] \\
|
|
|
|
|
\texttt{h}[k] & = & (\texttt{h}[k-1] A + \texttt{s}[k]) \bmod B \\
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\end{array}
|
|
|
|
|
\]
|
2017-05-12 20:15:25 +02:00
|
|
|
|
In addition, we construct an array $\texttt{p}$
|
|
|
|
|
where $\texttt{p}[k]=A^k \bmod B$:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\[
|
|
|
|
|
\begin{array}{lcl}
|
2017-05-12 20:15:25 +02:00
|
|
|
|
\texttt{p}[0] & = & 1 \\
|
|
|
|
|
\texttt{p}[k] & = & (\texttt{p}[k-1] A) \bmod B. \\
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\end{array}
|
|
|
|
|
\]
|
2017-01-24 20:59:20 +01:00
|
|
|
|
Constructing these arrays takes $O(n)$ time.
|
2017-05-12 20:15:25 +02:00
|
|
|
|
After this, the hash value of any substring
|
|
|
|
|
$\texttt{s}[a \ldots b]$
|
2017-01-24 20:59:20 +01:00
|
|
|
|
can be calculated in $O(1)$ time using the formula
|
2017-05-12 20:15:25 +02:00
|
|
|
|
\[(\texttt{h}[b]-\texttt{h}[a-1] \texttt{p}[b-a+1]) \bmod B\]
|
2017-04-17 17:20:20 +02:00
|
|
|
|
assuming that $a>0$.
|
2017-05-12 20:15:25 +02:00
|
|
|
|
If $a=0$, the hash value is simply $\texttt{h}[b]$.
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
\subsubsection*{Using hash values}
|
|
|
|
|
|
|
|
|
|
We can efficiently compare strings using hash values.
|
2017-02-18 17:27:38 +01:00
|
|
|
|
Instead of comparing the individual characters of the strings,
|
2017-01-24 20:59:20 +01:00
|
|
|
|
the idea is to compare their hash values.
|
|
|
|
|
If the hash values are equal,
|
|
|
|
|
the strings are \emph{probably} equal,
|
|
|
|
|
and if the hash values are different,
|
|
|
|
|
the strings are \emph{certainly} different.
|
|
|
|
|
|
|
|
|
|
Using hashing, we can often make a brute force
|
|
|
|
|
algorithm efficient.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
As an example, consider the pattern matching problem:
|
|
|
|
|
given a string $s$ and a pattern $p$,
|
|
|
|
|
find the positions where $p$ occurs in $s$.
|
|
|
|
|
A brute force algorithm goes through all positions
|
2017-02-18 17:27:38 +01:00
|
|
|
|
where $p$ may occur and compares the strings
|
2017-01-24 20:59:20 +01:00
|
|
|
|
character by character.
|
|
|
|
|
The time complexity of such an algorithm is $O(n^2)$.
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
We can make the brute force algorithm more efficient
|
2017-04-21 22:19:29 +02:00
|
|
|
|
by using hashing, because the algorithm compares
|
2017-01-24 20:59:20 +01:00
|
|
|
|
substrings of strings.
|
|
|
|
|
Using hashing, each comparison only takes $O(1)$ time,
|
2017-04-21 22:19:29 +02:00
|
|
|
|
because only hash values of substrings are compared.
|
2017-01-24 20:59:20 +01:00
|
|
|
|
This results in an algorithm with time complexity $O(n)$,
|
|
|
|
|
which is the best possible time complexity for this problem.
|
|
|
|
|
|
|
|
|
|
By combining hashing and \emph{binary search},
|
2017-02-11 18:11:50 +01:00
|
|
|
|
it is also possible to find out the lexicographic order of
|
2017-01-24 20:59:20 +01:00
|
|
|
|
two strings in logarithmic time.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
This can be done by calculating the length
|
2017-01-24 20:59:20 +01:00
|
|
|
|
of the common prefix of the strings using binary search.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
Once we know the length of the common prefix,
|
|
|
|
|
we can just check the next character after the prefix,
|
|
|
|
|
because this determines the order of the strings.
|
2017-01-24 20:59:20 +01:00
|
|
|
|
|
|
|
|
|
\subsubsection*{Collisions and parameters}
|
|
|
|
|
|
|
|
|
|
\index{collision}
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
An evident risk when comparing hash values is
|
|
|
|
|
a \key{collision}, which means that two strings have
|
2017-01-24 20:59:20 +01:00
|
|
|
|
different contents but equal hash values.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
In this case, an algorithm that relies on
|
|
|
|
|
the hash values concludes that the strings are equal,
|
|
|
|
|
but in reality they are not,
|
2017-01-24 20:59:20 +01:00
|
|
|
|
and the algorithm may give incorrect results.
|
|
|
|
|
|
|
|
|
|
Collisions are always possible,
|
|
|
|
|
because the number of different strings is larger
|
|
|
|
|
than the number of different hash values.
|
|
|
|
|
However, the probability of a collision is small
|
|
|
|
|
if the constants $A$ and $B$ are carefully chosen.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
A usual way is to choose random constants
|
|
|
|
|
near $10^9$, for example as follows:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\[
|
|
|
|
|
\begin{array}{lcl}
|
|
|
|
|
A & = & 911382323 \\
|
|
|
|
|
B & = & 972663749 \\
|
|
|
|
|
\end{array}
|
|
|
|
|
\]
|
2017-02-11 18:11:50 +01:00
|
|
|
|
|
|
|
|
|
Using such constants,
|
2017-01-24 20:59:20 +01:00
|
|
|
|
the \texttt{long long} type can be used
|
2017-04-21 22:19:29 +02:00
|
|
|
|
when calculating hash values,
|
2017-02-11 18:11:50 +01:00
|
|
|
|
because the products $AB$ and $BB$ will fit in \texttt{long long}.
|
|
|
|
|
But is it enough to have about $10^9$ different hash values?
|
2017-01-24 20:59:20 +01:00
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
Let us consider three scenarios where hashing can be used:
|
2017-01-24 20:59:20 +01:00
|
|
|
|
|
|
|
|
|
\textit{Scenario 1:} Strings $x$ and $y$ are compared with
|
|
|
|
|
each other.
|
|
|
|
|
The probability of a collision is $1/B$ assuming that
|
|
|
|
|
all hash values are equally probable.
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
\textit{Scenario 2:} A string $x$ is compared with strings
|
2016-12-28 23:54:51 +01:00
|
|
|
|
$y_1,y_2,\ldots,y_n$.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
The probability of one or more collisions is
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
\[1-(1-\frac{1}{B})^n.\]
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
\textit{Scenario 3:} All pairs of strings $x_1,x_2,\ldots,x_n$
|
2017-01-24 20:59:20 +01:00
|
|
|
|
are compared with each other.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
The probability of one or more collisions is
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
|
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
The following table shows the collision probabilities
|
2017-02-11 18:11:50 +01:00
|
|
|
|
when $n=10^6$ and the value of $B$ varies:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tabular}{rrrr}
|
2017-01-24 20:59:20 +01:00
|
|
|
|
constant $B$ & scenario 1 & scenario 2 & scenario 3 \\
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\hline
|
|
|
|
|
$10^3$ & $0.001000$ & $1.000000$ & $1.000000$ \\
|
|
|
|
|
$10^6$ & $0.000001$ & $0.632121$ & $1.000000$ \\
|
|
|
|
|
$10^9$ & $0.000000$ & $0.001000$ & $1.000000$ \\
|
|
|
|
|
$10^{12}$ & $0.000000$ & $0.000000$ & $0.393469$ \\
|
|
|
|
|
$10^{15}$ & $0.000000$ & $0.000000$ & $0.000500$ \\
|
|
|
|
|
$10^{18}$ & $0.000000$ & $0.000000$ & $0.000001$ \\
|
|
|
|
|
\end{tabular}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-01-24 20:59:20 +01:00
|
|
|
|
The table shows that in scenario 1,
|
|
|
|
|
the probability of a collision is negligible
|
|
|
|
|
when $B \approx 10^9$.
|
|
|
|
|
In scenario 2, a collision is possible but the
|
|
|
|
|
probability is still quite small.
|
|
|
|
|
However, in scenario 3 the situation is very different:
|
|
|
|
|
a collision will almost always happen when
|
|
|
|
|
$B \approx 10^9$.
|
|
|
|
|
|
|
|
|
|
\index{birthday paradox}
|
|
|
|
|
|
|
|
|
|
The phenomenon in scenario 3 is known as the
|
|
|
|
|
\key{birthday paradox}: if there are $n$ people
|
2017-05-12 20:15:25 +02:00
|
|
|
|
in a room, the probability that \emph{some} two people
|
2017-01-24 20:59:20 +01:00
|
|
|
|
have the same birthday is large even if $n$ is quite small.
|
|
|
|
|
In hashing, correspondingly, when all hash values are compared
|
|
|
|
|
with each other, the probability that some two
|
2017-02-11 18:11:50 +01:00
|
|
|
|
hash values are equal is large.
|
2017-01-24 20:59:20 +01:00
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
We can make the probability of a collision
|
|
|
|
|
smaller by calculating \emph{multiple} hash values
|
2017-01-24 20:59:20 +01:00
|
|
|
|
using different parameters.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
It is unlikely that a collision would occur
|
2017-01-24 20:59:20 +01:00
|
|
|
|
in all hash values at the same time.
|
|
|
|
|
For example, two hash values with parameter
|
2017-01-24 21:45:47 +01:00
|
|
|
|
$B \approx 10^9$ correspond to one hash
|
2017-01-24 20:59:20 +01:00
|
|
|
|
value with parameter $B \approx 10^{18}$,
|
|
|
|
|
which makes the probability of a collision very small.
|
|
|
|
|
|
|
|
|
|
Some people use constants $B=2^{32}$ and $B=2^{64}$,
|
|
|
|
|
which is convenient, because operations with 32 and 64
|
|
|
|
|
bit integers are calculated modulo $2^{32}$ and $2^{64}$.
|
2017-05-12 20:15:25 +02:00
|
|
|
|
However, this is \emph{not} a good choice, because it is possible
|
2017-01-24 20:59:20 +01:00
|
|
|
|
to construct inputs that always generate collisions when
|
2017-02-21 00:17:36 +01:00
|
|
|
|
constants of the form $2^x$ are used \cite{pac13}.
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 21:45:47 +01:00
|
|
|
|
\section{Z-algorithm}
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 21:45:47 +01:00
|
|
|
|
\index{Z-algorithm}
|
|
|
|
|
\index{Z-array}
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
The \key{Z-array} \texttt{z} of a string \texttt{s}
|
|
|
|
|
of length $n$ contains for each $k=0,1,\ldots,n-1$
|
|
|
|
|
the length of the longest substring of \texttt{s}
|
|
|
|
|
that begins at position $k$ and is a prefix of \texttt{s}.
|
|
|
|
|
Thus, $\texttt{z}[k]=p$ tells us that
|
|
|
|
|
$\texttt{s}[0 \ldots p-1]$ equals $\texttt{s}[k \ldots k+p-1]$.
|
|
|
|
|
Many string processing problems can be efficiently solved
|
|
|
|
|
using the Z-array.
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
For example, the Z-array of
|
2017-01-24 21:45:47 +01:00
|
|
|
|
\texttt{ACBACDACBACBACDA} is as follows:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\draw (0,0) grid (16,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {\texttt{A}};
|
|
|
|
|
\node at (1.5, 1.5) {\texttt{C}};
|
|
|
|
|
\node at (2.5, 1.5) {\texttt{B}};
|
|
|
|
|
\node at (3.5, 1.5) {\texttt{A}};
|
|
|
|
|
\node at (4.5, 1.5) {\texttt{C}};
|
|
|
|
|
\node at (5.5, 1.5) {\texttt{D}};
|
|
|
|
|
\node at (6.5, 1.5) {\texttt{A}};
|
|
|
|
|
\node at (7.5, 1.5) {\texttt{C}};
|
|
|
|
|
\node at (8.5, 1.5) {\texttt{B}};
|
|
|
|
|
\node at (9.5, 1.5) {\texttt{A}};
|
|
|
|
|
\node at (10.5, 1.5) {\texttt{C}};
|
|
|
|
|
\node at (11.5, 1.5) {\texttt{B}};
|
|
|
|
|
\node at (12.5, 1.5) {\texttt{A}};
|
|
|
|
|
\node at (13.5, 1.5) {\texttt{C}};
|
|
|
|
|
\node at (14.5, 1.5) {\texttt{D}};
|
|
|
|
|
\node at (15.5, 1.5) {\texttt{A}};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {0};
|
|
|
|
|
\node at (2.5, 0.5) {0};
|
|
|
|
|
\node at (3.5, 0.5) {2};
|
|
|
|
|
\node at (4.5, 0.5) {0};
|
|
|
|
|
\node at (5.5, 0.5) {0};
|
|
|
|
|
\node at (6.5, 0.5) {5};
|
|
|
|
|
\node at (7.5, 0.5) {0};
|
|
|
|
|
\node at (8.5, 0.5) {0};
|
|
|
|
|
\node at (9.5, 0.5) {7};
|
|
|
|
|
\node at (10.5, 0.5) {0};
|
|
|
|
|
\node at (11.5, 0.5) {0};
|
|
|
|
|
\node at (12.5, 0.5) {2};
|
|
|
|
|
\node at (13.5, 0.5) {0};
|
|
|
|
|
\node at (14.5, 0.5) {0};
|
|
|
|
|
\node at (15.5, 0.5) {1};
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
|
|
|
|
\node at (14.5, 2.5) {14};
|
|
|
|
|
\node at (15.5, 2.5) {15};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
In this case, for example, $\texttt{z}[6]=5$,
|
2017-01-24 21:45:47 +01:00
|
|
|
|
because the substring \texttt{ACBAC} of length 5
|
2017-05-12 20:15:25 +02:00
|
|
|
|
is a prefix of \texttt{s},
|
2017-01-24 21:45:47 +01:00
|
|
|
|
but the substring \texttt{ACBACB} of length 6
|
2017-05-12 20:15:25 +02:00
|
|
|
|
is not a prefix of \texttt{s}.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
|
|
|
|
|
\subsubsection*{Algorithm description}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
Next we describe an algorithm,
|
|
|
|
|
called the \key{Z-algorithm}\footnote{The Z-algorithm
|
|
|
|
|
was presented in \cite{gus97} as the simplest known
|
|
|
|
|
method for linear-time pattern matching, and the original idea
|
|
|
|
|
was attributed to \cite{mai84}.},
|
|
|
|
|
that efficiently constructs the Z-array in $O(n)$ time.
|
|
|
|
|
The algorithm calculates the Z-array values
|
|
|
|
|
from left to right by both using information
|
|
|
|
|
already stored in the Z-array and comparing substrings
|
|
|
|
|
character by character.
|
|
|
|
|
|
|
|
|
|
To efficiently calculate the Z-array values,
|
|
|
|
|
the algorithm maintains a range $[x,y]$ such that
|
|
|
|
|
$\texttt{s}[x \ldots y]$ is a prefix of \texttt{s}
|
|
|
|
|
and $y$ is as large as possible.
|
|
|
|
|
Since we know that $\texttt{s}[0 \ldots y-x]$
|
|
|
|
|
and $\texttt{s}[x \ldots y]$ are equal,
|
|
|
|
|
we can use this information when calculating
|
|
|
|
|
Z-values for positions $x+1,x+2,\ldots,y$.
|
|
|
|
|
|
|
|
|
|
At each position $k$, we first
|
|
|
|
|
check the value of $\texttt{z}[k-x]$.
|
|
|
|
|
If $k+\texttt{z}[k-x]<y$, we know that $\texttt{z}[k]=\texttt{z}[k-x]$.
|
|
|
|
|
However, if $k+\texttt{z}[k-x] \ge y$,
|
|
|
|
|
$\texttt{s}[0 \ldots y-k]$ equals
|
|
|
|
|
$\texttt{s}[k \ldots y]$, and to determine the
|
|
|
|
|
value of $\texttt{z}[k]$ we need to compare
|
|
|
|
|
the substrings character by character.
|
|
|
|
|
Still, the algorithm works in $O(n)$ time,
|
|
|
|
|
because we start comparing at positions
|
|
|
|
|
$y-k+1$ and $y+1$.
|
2017-01-24 21:45:47 +01:00
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
For example, let us construct the following Z-array:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\draw (0,0) grid (16,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {A};
|
|
|
|
|
\node at (1.5, 1.5) {C};
|
|
|
|
|
\node at (2.5, 1.5) {B};
|
|
|
|
|
\node at (3.5, 1.5) {A};
|
|
|
|
|
\node at (4.5, 1.5) {C};
|
|
|
|
|
\node at (5.5, 1.5) {D};
|
|
|
|
|
\node at (6.5, 1.5) {A};
|
|
|
|
|
\node at (7.5, 1.5) {C};
|
|
|
|
|
\node at (8.5, 1.5) {B};
|
|
|
|
|
\node at (9.5, 1.5) {A};
|
|
|
|
|
\node at (10.5, 1.5) {C};
|
|
|
|
|
\node at (11.5, 1.5) {B};
|
|
|
|
|
\node at (12.5, 1.5) {A};
|
|
|
|
|
\node at (13.5, 1.5) {C};
|
|
|
|
|
\node at (14.5, 1.5) {D};
|
|
|
|
|
\node at (15.5, 1.5) {A};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {?};
|
|
|
|
|
\node at (2.5, 0.5) {?};
|
|
|
|
|
\node at (3.5, 0.5) {?};
|
|
|
|
|
\node at (4.5, 0.5) {?};
|
|
|
|
|
\node at (5.5, 0.5) {?};
|
|
|
|
|
\node at (6.5, 0.5) {?};
|
|
|
|
|
\node at (7.5, 0.5) {?};
|
|
|
|
|
\node at (8.5, 0.5) {?};
|
|
|
|
|
\node at (9.5, 0.5) {?};
|
|
|
|
|
\node at (10.5, 0.5) {?};
|
|
|
|
|
\node at (11.5, 0.5) {?};
|
|
|
|
|
\node at (12.5, 0.5) {?};
|
|
|
|
|
\node at (13.5, 0.5) {?};
|
|
|
|
|
\node at (14.5, 0.5) {?};
|
|
|
|
|
\node at (15.5, 0.5) {?};
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
|
|
|
|
\node at (14.5, 2.5) {14};
|
|
|
|
|
\node at (15.5, 2.5) {15};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
After calculating the value $\texttt{z}[6]=5$,
|
|
|
|
|
the current $[x,y]$ range is $[6,10]$:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\fill[color=lightgray] (6,0) rectangle (7,1);
|
|
|
|
|
\draw (0,0) grid (16,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {A};
|
|
|
|
|
\node at (1.5, 1.5) {C};
|
|
|
|
|
\node at (2.5, 1.5) {B};
|
|
|
|
|
\node at (3.5, 1.5) {A};
|
|
|
|
|
\node at (4.5, 1.5) {C};
|
|
|
|
|
\node at (5.5, 1.5) {D};
|
|
|
|
|
\node at (6.5, 1.5) {A};
|
|
|
|
|
\node at (7.5, 1.5) {C};
|
|
|
|
|
\node at (8.5, 1.5) {B};
|
|
|
|
|
\node at (9.5, 1.5) {A};
|
|
|
|
|
\node at (10.5, 1.5) {C};
|
|
|
|
|
\node at (11.5, 1.5) {B};
|
|
|
|
|
\node at (12.5, 1.5) {A};
|
|
|
|
|
\node at (13.5, 1.5) {C};
|
|
|
|
|
\node at (14.5, 1.5) {D};
|
|
|
|
|
\node at (15.5, 1.5) {A};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {0};
|
|
|
|
|
\node at (2.5, 0.5) {0};
|
|
|
|
|
\node at (3.5, 0.5) {2};
|
|
|
|
|
\node at (4.5, 0.5) {0};
|
|
|
|
|
\node at (5.5, 0.5) {0};
|
|
|
|
|
\node at (6.5, 0.5) {5};
|
|
|
|
|
\node at (7.5, 0.5) {?};
|
|
|
|
|
\node at (8.5, 0.5) {?};
|
|
|
|
|
\node at (9.5, 0.5) {?};
|
|
|
|
|
\node at (10.5, 0.5) {?};
|
|
|
|
|
\node at (11.5, 0.5) {?};
|
|
|
|
|
\node at (12.5, 0.5) {?};
|
|
|
|
|
\node at (13.5, 0.5) {?};
|
|
|
|
|
\node at (14.5, 0.5) {?};
|
|
|
|
|
\node at (15.5, 0.5) {?};
|
|
|
|
|
|
|
|
|
|
\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
|
|
|
|
|
|
|
|
|
|
\node at (6.5,3.50) {$x$};
|
|
|
|
|
\node at (10.5,3.50) {$y$};
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
|
|
|
|
\node at (14.5, 2.5) {14};
|
|
|
|
|
\node at (15.5, 2.5) {15};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
Now we can calculate
|
|
|
|
|
subsequent Z-array values
|
|
|
|
|
efficiently,
|
2017-02-11 18:11:50 +01:00
|
|
|
|
because we know that
|
2017-05-12 20:15:25 +02:00
|
|
|
|
$\texttt{s}[0 \ldots 4]$ and
|
|
|
|
|
$\texttt{s}[6 \ldots 10]$ are equal.
|
|
|
|
|
First, since $\texttt{z}[1] = \texttt{z}[2] = 0$,
|
|
|
|
|
we immediately know that also
|
|
|
|
|
$\texttt{z}[7] = \texttt{z}[8] = 0$:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\fill[color=lightgray] (7,0) rectangle (9,1);
|
|
|
|
|
\draw (0,0) grid (16,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {A};
|
|
|
|
|
\node at (1.5, 1.5) {C};
|
|
|
|
|
\node at (2.5, 1.5) {B};
|
|
|
|
|
\node at (3.5, 1.5) {A};
|
|
|
|
|
\node at (4.5, 1.5) {C};
|
|
|
|
|
\node at (5.5, 1.5) {D};
|
|
|
|
|
\node at (6.5, 1.5) {A};
|
|
|
|
|
\node at (7.5, 1.5) {C};
|
|
|
|
|
\node at (8.5, 1.5) {B};
|
|
|
|
|
\node at (9.5, 1.5) {A};
|
|
|
|
|
\node at (10.5, 1.5) {C};
|
|
|
|
|
\node at (11.5, 1.5) {B};
|
|
|
|
|
\node at (12.5, 1.5) {A};
|
|
|
|
|
\node at (13.5, 1.5) {C};
|
|
|
|
|
\node at (14.5, 1.5) {D};
|
|
|
|
|
\node at (15.5, 1.5) {A};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {0};
|
|
|
|
|
\node at (2.5, 0.5) {0};
|
|
|
|
|
\node at (3.5, 0.5) {2};
|
|
|
|
|
\node at (4.5, 0.5) {0};
|
|
|
|
|
\node at (5.5, 0.5) {0};
|
|
|
|
|
\node at (6.5, 0.5) {5};
|
|
|
|
|
\node at (7.5, 0.5) {0};
|
|
|
|
|
\node at (8.5, 0.5) {0};
|
|
|
|
|
\node at (9.5, 0.5) {?};
|
|
|
|
|
\node at (10.5, 0.5) {?};
|
|
|
|
|
\node at (11.5, 0.5) {?};
|
|
|
|
|
\node at (12.5, 0.5) {?};
|
|
|
|
|
\node at (13.5, 0.5) {?};
|
|
|
|
|
\node at (14.5, 0.5) {?};
|
|
|
|
|
\node at (15.5, 0.5) {?};
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
|
|
|
|
|
|
|
|
|
|
\node at (6.5,3.50) {$x$};
|
|
|
|
|
\node at (10.5,3.50) {$y$};
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
|
|
|
|
\node at (14.5, 2.5) {14};
|
|
|
|
|
\node at (15.5, 2.5) {15};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\draw[thick,<->] (7.5,-0.25) .. controls (7,-1.25) and (2,-1.25) .. (1.5,-0.25);
|
|
|
|
|
\draw[thick,<->] (8.5,-0.25) .. controls (8,-1.25) and (3,-1.25) .. (2.5,-0.25);
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
Then, since $\texttt{z}[3]=2$, we know that $\texttt{z}[9] \ge 2$:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\fill[color=lightgray] (9,0) rectangle (10,1);
|
|
|
|
|
\draw (0,0) grid (16,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {A};
|
|
|
|
|
\node at (1.5, 1.5) {C};
|
|
|
|
|
\node at (2.5, 1.5) {B};
|
|
|
|
|
\node at (3.5, 1.5) {A};
|
|
|
|
|
\node at (4.5, 1.5) {C};
|
|
|
|
|
\node at (5.5, 1.5) {D};
|
|
|
|
|
\node at (6.5, 1.5) {A};
|
|
|
|
|
\node at (7.5, 1.5) {C};
|
|
|
|
|
\node at (8.5, 1.5) {B};
|
|
|
|
|
\node at (9.5, 1.5) {A};
|
|
|
|
|
\node at (10.5, 1.5) {C};
|
|
|
|
|
\node at (11.5, 1.5) {B};
|
|
|
|
|
\node at (12.5, 1.5) {A};
|
|
|
|
|
\node at (13.5, 1.5) {C};
|
|
|
|
|
\node at (14.5, 1.5) {D};
|
|
|
|
|
\node at (15.5, 1.5) {A};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {0};
|
|
|
|
|
\node at (2.5, 0.5) {0};
|
|
|
|
|
\node at (3.5, 0.5) {2};
|
|
|
|
|
\node at (4.5, 0.5) {0};
|
|
|
|
|
\node at (5.5, 0.5) {0};
|
|
|
|
|
\node at (6.5, 0.5) {5};
|
|
|
|
|
\node at (7.5, 0.5) {0};
|
|
|
|
|
\node at (8.5, 0.5) {0};
|
|
|
|
|
\node at (9.5, 0.5) {?};
|
|
|
|
|
\node at (10.5, 0.5) {?};
|
|
|
|
|
\node at (11.5, 0.5) {?};
|
|
|
|
|
\node at (12.5, 0.5) {?};
|
|
|
|
|
\node at (13.5, 0.5) {?};
|
|
|
|
|
\node at (14.5, 0.5) {?};
|
|
|
|
|
\node at (15.5, 0.5) {?};
|
|
|
|
|
|
|
|
|
|
\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
|
|
|
|
|
|
|
|
|
|
\node at (6.5,3.50) {$x$};
|
|
|
|
|
\node at (10.5,3.50) {$y$};
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
|
|
|
|
\node at (14.5, 2.5) {14};
|
|
|
|
|
\node at (15.5, 2.5) {15};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\draw[thick,<->] (9.5,-0.25) .. controls (9,-1.25) and (4,-1.25) .. (3.5,-0.25);
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
However, we have no information about the string
|
|
|
|
|
after position 10, so we need to compare the substrings
|
2017-02-11 18:11:50 +01:00
|
|
|
|
character by character:
|
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\fill[color=lightgray] (9,0) rectangle (10,1);
|
|
|
|
|
\fill[color=lightgray] (2,1) rectangle (7,2);
|
|
|
|
|
\fill[color=lightgray] (11,1) rectangle (16,2);
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\draw (0,0) grid (16,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {A};
|
|
|
|
|
\node at (1.5, 1.5) {C};
|
|
|
|
|
\node at (2.5, 1.5) {B};
|
|
|
|
|
\node at (3.5, 1.5) {A};
|
|
|
|
|
\node at (4.5, 1.5) {C};
|
|
|
|
|
\node at (5.5, 1.5) {D};
|
|
|
|
|
\node at (6.5, 1.5) {A};
|
|
|
|
|
\node at (7.5, 1.5) {C};
|
|
|
|
|
\node at (8.5, 1.5) {B};
|
|
|
|
|
\node at (9.5, 1.5) {A};
|
|
|
|
|
\node at (10.5, 1.5) {C};
|
|
|
|
|
\node at (11.5, 1.5) {B};
|
|
|
|
|
\node at (12.5, 1.5) {A};
|
|
|
|
|
\node at (13.5, 1.5) {C};
|
|
|
|
|
\node at (14.5, 1.5) {D};
|
|
|
|
|
\node at (15.5, 1.5) {A};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {0};
|
|
|
|
|
\node at (2.5, 0.5) {0};
|
|
|
|
|
\node at (3.5, 0.5) {2};
|
|
|
|
|
\node at (4.5, 0.5) {0};
|
|
|
|
|
\node at (5.5, 0.5) {0};
|
|
|
|
|
\node at (6.5, 0.5) {5};
|
|
|
|
|
\node at (7.5, 0.5) {0};
|
|
|
|
|
\node at (8.5, 0.5) {0};
|
|
|
|
|
\node at (9.5, 0.5) {?};
|
|
|
|
|
\node at (10.5, 0.5) {?};
|
|
|
|
|
\node at (11.5, 0.5) {?};
|
|
|
|
|
\node at (12.5, 0.5) {?};
|
|
|
|
|
\node at (13.5, 0.5) {?};
|
|
|
|
|
\node at (14.5, 0.5) {?};
|
|
|
|
|
\node at (15.5, 0.5) {?};
|
|
|
|
|
|
|
|
|
|
\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
|
|
|
|
|
|
|
|
|
|
\node at (6.5,3.50) {$x$};
|
|
|
|
|
\node at (10.5,3.50) {$y$};
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
|
|
|
|
\node at (14.5, 2.5) {14};
|
|
|
|
|
\node at (15.5, 2.5) {15};
|
2017-02-11 18:11:50 +01:00
|
|
|
|
|
|
|
|
|
%\draw[thick,<->] (11.5,-0.25) .. controls (11,-1.25) and (3,-1.25) .. (2.5,-0.25);
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
It turns out that $\texttt{z}[9]=7$,
|
|
|
|
|
so the new $[x,y]$ range is $[9,15]$:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\fill[color=lightgray] (9,0) rectangle (10,1);
|
|
|
|
|
\draw (0,0) grid (16,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {A};
|
|
|
|
|
\node at (1.5, 1.5) {C};
|
|
|
|
|
\node at (2.5, 1.5) {B};
|
|
|
|
|
\node at (3.5, 1.5) {A};
|
|
|
|
|
\node at (4.5, 1.5) {C};
|
|
|
|
|
\node at (5.5, 1.5) {D};
|
|
|
|
|
\node at (6.5, 1.5) {A};
|
|
|
|
|
\node at (7.5, 1.5) {C};
|
|
|
|
|
\node at (8.5, 1.5) {B};
|
|
|
|
|
\node at (9.5, 1.5) {A};
|
|
|
|
|
\node at (10.5, 1.5) {C};
|
|
|
|
|
\node at (11.5, 1.5) {B};
|
|
|
|
|
\node at (12.5, 1.5) {A};
|
|
|
|
|
\node at (13.5, 1.5) {C};
|
|
|
|
|
\node at (14.5, 1.5) {D};
|
|
|
|
|
\node at (15.5, 1.5) {A};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {0};
|
|
|
|
|
\node at (2.5, 0.5) {0};
|
|
|
|
|
\node at (3.5, 0.5) {2};
|
|
|
|
|
\node at (4.5, 0.5) {0};
|
|
|
|
|
\node at (5.5, 0.5) {0};
|
|
|
|
|
\node at (6.5, 0.5) {5};
|
|
|
|
|
\node at (7.5, 0.5) {0};
|
|
|
|
|
\node at (8.5, 0.5) {0};
|
|
|
|
|
\node at (9.5, 0.5) {7};
|
|
|
|
|
\node at (10.5, 0.5) {?};
|
|
|
|
|
\node at (11.5, 0.5) {?};
|
|
|
|
|
\node at (12.5, 0.5) {?};
|
|
|
|
|
\node at (13.5, 0.5) {?};
|
|
|
|
|
\node at (14.5, 0.5) {?};
|
|
|
|
|
\node at (15.5, 0.5) {?};
|
|
|
|
|
|
|
|
|
|
\draw [decoration={brace}, decorate, line width=0.5mm] (9,3.00) -- (16,3.00);
|
|
|
|
|
|
|
|
|
|
\node at (9.5,3.50) {$x$};
|
|
|
|
|
\node at (15.5,3.50) {$y$};
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
|
|
|
|
\node at (14.5, 2.5) {14};
|
|
|
|
|
\node at (15.5, 2.5) {15};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
% \draw[thick,<->] (9.5,-0.25) .. controls (9,-1.25) and (4,-1.25) .. (3.5,-0.25);
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
After this, all the remaining Z-array values
|
|
|
|
|
can be determined by using the information
|
|
|
|
|
already stored in the Z-array:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\draw (0,0) grid (16,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {A};
|
|
|
|
|
\node at (1.5, 1.5) {C};
|
|
|
|
|
\node at (2.5, 1.5) {B};
|
|
|
|
|
\node at (3.5, 1.5) {A};
|
|
|
|
|
\node at (4.5, 1.5) {C};
|
|
|
|
|
\node at (5.5, 1.5) {D};
|
|
|
|
|
\node at (6.5, 1.5) {A};
|
|
|
|
|
\node at (7.5, 1.5) {C};
|
|
|
|
|
\node at (8.5, 1.5) {B};
|
|
|
|
|
\node at (9.5, 1.5) {A};
|
|
|
|
|
\node at (10.5, 1.5) {C};
|
|
|
|
|
\node at (11.5, 1.5) {B};
|
|
|
|
|
\node at (12.5, 1.5) {A};
|
|
|
|
|
\node at (13.5, 1.5) {C};
|
|
|
|
|
\node at (14.5, 1.5) {D};
|
|
|
|
|
\node at (15.5, 1.5) {A};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {0};
|
|
|
|
|
\node at (2.5, 0.5) {0};
|
|
|
|
|
\node at (3.5, 0.5) {2};
|
|
|
|
|
\node at (4.5, 0.5) {0};
|
|
|
|
|
\node at (5.5, 0.5) {0};
|
|
|
|
|
\node at (6.5, 0.5) {5};
|
|
|
|
|
\node at (7.5, 0.5) {0};
|
|
|
|
|
\node at (8.5, 0.5) {0};
|
|
|
|
|
\node at (9.5, 0.5) {7};
|
|
|
|
|
\node at (10.5, 0.5) {0};
|
|
|
|
|
\node at (11.5, 0.5) {0};
|
|
|
|
|
\node at (12.5, 0.5) {2};
|
|
|
|
|
\node at (13.5, 0.5) {0};
|
|
|
|
|
\node at (14.5, 0.5) {0};
|
|
|
|
|
\node at (15.5, 0.5) {1};
|
|
|
|
|
|
|
|
|
|
\draw [decoration={brace}, decorate, line width=0.5mm] (9,3.00) -- (16,3.00);
|
|
|
|
|
|
|
|
|
|
\node at (9.5,3.50) {$x$};
|
|
|
|
|
\node at (15.5,3.50) {$y$};
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
|
|
|
|
\node at (14.5, 2.5) {14};
|
|
|
|
|
\node at (15.5, 2.5) {15};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
|
|
|
|
|
2017-01-24 21:45:47 +01:00
|
|
|
|
\subsubsection{Using the Z-array}
|
|
|
|
|
|
2017-05-12 20:15:25 +02:00
|
|
|
|
It is often a matter of taste whether to use
|
|
|
|
|
string hashing or the Z-algorithm.
|
|
|
|
|
Unlike hashing, the Z-algorithm always works
|
|
|
|
|
and there is no risk for collisions.
|
|
|
|
|
On the other hand, the Z-algorithm is more difficult
|
|
|
|
|
to implement and some problems can only be solved
|
|
|
|
|
using hashing.
|
|
|
|
|
|
|
|
|
|
As an example, consider again
|
2017-02-11 18:11:50 +01:00
|
|
|
|
the pattern matching problem,
|
2017-05-12 20:15:25 +02:00
|
|
|
|
where our task is to find the occurrences
|
|
|
|
|
of a pattern $p$ in a string $s$.
|
2017-02-11 18:11:50 +01:00
|
|
|
|
We already solved this problem efficiently
|
2017-01-24 21:45:47 +01:00
|
|
|
|
using string hashing, but the Z-algorithm
|
|
|
|
|
provides another way to solve the problem.
|
|
|
|
|
|
2017-02-11 18:11:50 +01:00
|
|
|
|
A usual idea in string processing is to
|
|
|
|
|
construct a string that consists of
|
|
|
|
|
multiple strings separated by special characters.
|
2017-01-24 21:45:47 +01:00
|
|
|
|
In this problem, we can construct a string
|
2016-12-28 23:54:51 +01:00
|
|
|
|
$p$\texttt{\#}$s$,
|
2017-01-24 21:45:47 +01:00
|
|
|
|
where $p$ and $s$ are separated by a special
|
2017-02-11 18:11:50 +01:00
|
|
|
|
character \texttt{\#} that does not occur
|
2017-01-24 21:45:47 +01:00
|
|
|
|
in the strings.
|
2017-02-18 17:27:38 +01:00
|
|
|
|
The Z-array of $p$\texttt{\#}$s$ tells us the positions
|
2017-02-11 18:11:50 +01:00
|
|
|
|
where $p$ occurs in $s$,
|
2017-04-21 22:19:29 +02:00
|
|
|
|
because such positions contain the length of $p$.
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
2017-01-24 21:45:47 +01:00
|
|
|
|
For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT},
|
|
|
|
|
the Z-array is as follows:
|
2016-12-28 23:54:51 +01:00
|
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
|
\begin{tikzpicture}[scale=0.7]
|
|
|
|
|
\draw (0,0) grid (14,2);
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 1.5) {A};
|
|
|
|
|
\node at (1.5, 1.5) {T};
|
|
|
|
|
\node at (2.5, 1.5) {T};
|
|
|
|
|
\node at (3.5, 1.5) {\#};
|
|
|
|
|
\node at (4.5, 1.5) {H};
|
|
|
|
|
\node at (5.5, 1.5) {A};
|
|
|
|
|
\node at (6.5, 1.5) {T};
|
|
|
|
|
\node at (7.5, 1.5) {T};
|
|
|
|
|
\node at (8.5, 1.5) {I};
|
|
|
|
|
\node at (9.5, 1.5) {V};
|
|
|
|
|
\node at (10.5, 1.5) {A};
|
|
|
|
|
\node at (11.5, 1.5) {T};
|
|
|
|
|
\node at (12.5, 1.5) {T};
|
|
|
|
|
\node at (13.5, 1.5) {I};
|
|
|
|
|
|
|
|
|
|
\node at (0.5, 0.5) {--};
|
|
|
|
|
\node at (1.5, 0.5) {0};
|
|
|
|
|
\node at (2.5, 0.5) {0};
|
|
|
|
|
\node at (3.5, 0.5) {0};
|
|
|
|
|
\node at (4.5, 0.5) {0};
|
|
|
|
|
\node at (5.5, 0.5) {3};
|
|
|
|
|
\node at (6.5, 0.5) {0};
|
|
|
|
|
\node at (7.5, 0.5) {0};
|
|
|
|
|
\node at (8.5, 0.5) {0};
|
|
|
|
|
\node at (9.5, 0.5) {0};
|
|
|
|
|
\node at (10.5, 0.5) {3};
|
|
|
|
|
\node at (11.5, 0.5) {0};
|
|
|
|
|
\node at (12.5, 0.5) {0};
|
|
|
|
|
\node at (13.5, 0.5) {0};
|
|
|
|
|
|
|
|
|
|
\footnotesize
|
2017-04-17 17:20:20 +02:00
|
|
|
|
\node at (0.5, 2.5) {0};
|
|
|
|
|
\node at (1.5, 2.5) {1};
|
|
|
|
|
\node at (2.5, 2.5) {2};
|
|
|
|
|
\node at (3.5, 2.5) {3};
|
|
|
|
|
\node at (4.5, 2.5) {4};
|
|
|
|
|
\node at (5.5, 2.5) {5};
|
|
|
|
|
\node at (6.5, 2.5) {6};
|
|
|
|
|
\node at (7.5, 2.5) {7};
|
|
|
|
|
\node at (8.5, 2.5) {8};
|
|
|
|
|
\node at (9.5, 2.5) {9};
|
|
|
|
|
\node at (10.5, 2.5) {10};
|
|
|
|
|
\node at (11.5, 2.5) {11};
|
|
|
|
|
\node at (12.5, 2.5) {12};
|
|
|
|
|
\node at (13.5, 2.5) {13};
|
2016-12-28 23:54:51 +01:00
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\end{center}
|
2017-02-11 18:11:50 +01:00
|
|
|
|
|
2017-04-17 17:20:20 +02:00
|
|
|
|
The positions 5 and 10 contain the value 3,
|
2017-02-11 18:11:50 +01:00
|
|
|
|
which means that the pattern \texttt{ATT}
|
2017-01-24 21:45:47 +01:00
|
|
|
|
occurs in the corresponding positions
|
2017-05-12 20:15:25 +02:00
|
|
|
|
of \texttt{HATTIVATTI}.
|
2017-01-24 21:45:47 +01:00
|
|
|
|
|
|
|
|
|
The time complexity of the resulting algorithm
|
2017-04-21 22:19:29 +02:00
|
|
|
|
is linear, because it suffices to construct
|
2017-04-17 17:26:00 +02:00
|
|
|
|
the Z-array and go through its values.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Implementation}
|
|
|
|
|
|
|
|
|
|
Here is a short implementation of the Z-algorithm
|
|
|
|
|
that returns a vector that corresponds to the Z-array.
|
|
|
|
|
|
|
|
|
|
\begin{lstlisting}
|
|
|
|
|
vector<int> z(string s) {
|
|
|
|
|
int n = s.size();
|
|
|
|
|
vector<int> z(n);
|
|
|
|
|
int x = 0, y = 0;
|
|
|
|
|
for (int i = 1; i < n; i++) {
|
|
|
|
|
z[i] = max(0,min(z[i-x],y-i+1));
|
|
|
|
|
while (i+z[i] < n && s[z[i]] == s[i+z[i]]) {
|
|
|
|
|
x = i; y = i+z[i]; z[i]++;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
return z;
|
|
|
|
|
}
|
|
|
|
|
\end{lstlisting}
|