cphb/chapter26.tex

\chapter{String algorithms}

This chapter deals with efficient algorithms
for string processing.
Many string problems can be easily solved
in $O(n^2)$ time, but the challenge is to
find algorithms that work in $O(n)$ or $O(n \log n)$
time.

\index{pattern matching}

For example, a fundamental string processing
problem is the \key{pattern matching} problem:
given a string of length $n$ and a pattern of length $m$,
our task is to find the occurrences of the pattern
in the string.
For example, the pattern \texttt{ABC} occurs two
times in the string \texttt{ABABCBABC}.

The pattern matching problem can be easily solved
in $O(nm)$ time by a brute force algorithm that
tests all positions where the pattern may
occur in the string.
However, in this chapter, we will see that there
are more efficient algorithms that require only
$O(n+m)$ time.

\index{string}

\section{String terminology}

\index{alphabet}

Throughout the chapter, we assume that
zero-based indexing is used in strings.
Thus, a string \texttt{s} of length $n$
consists of characters
$\texttt{s}[0],\texttt{s}[1],\ldots,\texttt{s}[n-1]$.
The set of characters that may appear
in strings is called an \key{alphabet}.
For example, the alphabet
$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
consists of the capital letters of English.

\index{substring}

A \key{substring} is a sequence of consecutive
characters in a string.
We use the notation $\texttt{s}[a \ldots b]$
to refer to a substring of \texttt{s}
that begins at position $a$ and ends at position $b$.
A string of length $n$ has $n(n+1)/2$ substrings.
For example, the substrings of
\texttt{ABCD} are
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
\texttt{AB}, \texttt{BC}, \texttt{CD},
\texttt{ABC}, \texttt{BCD} and \texttt{ABCD}.

\index{subsequence}

A \key{subsequence} is a sequence of
(not necessarily consecutive) characters
in a string in their original order.
A string of length $n$ has $2^n-1$ subsequences.
For example, the subsequences of
\texttt{ABCD} are
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
\texttt{AB}, \texttt{AC}, \texttt{AD},
\texttt{BC}, \texttt{BD}, \texttt{CD},
\texttt{ABC}, \texttt{ABD}, \texttt{ACD},
\texttt{BCD} and \texttt{ABCD}.

\index{prefix}
\index{suffix}

A \key{prefix} is a substring that starts at the beginning
of a string,
and a \key{suffix} is a substring that ends at the end
of a string.
For example,
the prefixes of \texttt{ABCD} are
\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD},
and the suffixes of \texttt{ABCD} are
\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.

\index{rotation}

A \key{rotation} can be generated by moving
the characters of a string one by one from the beginning
to the end (or vice versa).
For example, the rotations of \texttt{ABCD} are
\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.

\index{period}

A \key{period} is a prefix of a string such that
the string can be constructed by repeating the period.
The last repetition may be partial and contain
only a prefix of the period.
For example, the shortest period of
\texttt{ABCABCA} is \texttt{ABC}.

\index{border}

A \key{border} is a string that is both
a prefix and a suffix of a string.
For example, the borders of \texttt{ABACABA}
are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.

\index{lexicographical order}

Strings are compared using the \key{lexicographical order}
(which corresponds to the alphabetical order).
It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
or there is a position $k$ such that
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.

\section{Trie structure}

\index{trie}

A \key{trie} is a rooted tree that
maintains a set of strings.
Each string in the set is stored as
a chain of characters that starts at the root.
If two strings have a common prefix,
they also have a common chain in the tree.

For example, consider the following trie:

\begin{center}
\begin{tikzpicture}[scale=0.9]
\node[draw, circle] (1) at (0,20) {$\phantom{1}$};
\node[draw, circle] (2) at (-1.5,19) {$\phantom{1}$};
\node[draw, circle] (3) at (1.5,19) {$\phantom{1}$};
\node[draw, circle] (4) at (-1.5,17.5) {$\phantom{1}$};
\node[draw, circle] (5) at (-1.5,16) {$\phantom{1}$};
\node[draw, circle] (6) at (-2.5,14.5) {$\phantom{1}$};
\node[draw, circle] (7) at (-0.5,14.5) {$\phantom{1}$};
\node[draw, circle] (8) at (-2.5,13) {*};
\node[draw, circle] (9) at (-0.5,13) {*};
\node[draw, circle] (10) at (1.5,17.5) {$\phantom{1}$};
\node[draw, circle] (11) at (1.5,16) {*};
\node[draw, circle] (12) at (1.5,14.5) {$\phantom{1}$};
\node[draw, circle] (13) at (1.5,13) {*};

\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{C}] {} (2);
\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{T}] {} (3);
\path[draw,thick,->] (2) -- node[font=\small,label=left:\texttt{A}] {} (4);
\path[draw,thick,->] (4) -- node[font=\small,label=left:\texttt{N}] {} (5);
\path[draw,thick,->] (5) -- node[font=\small,label=left:\texttt{A}] {} (6);
\path[draw,thick,->] (5) -- node[font=\small,label=right:\texttt{D}] {} (7);
\path[draw,thick,->] (6) -- node[font=\small,label=left:\texttt{L}] {}(8);
\path[draw,thick,->] (7) -- node[font=\small,label=right:\texttt{Y}] {} (9);
\path[draw,thick,->] (3) -- node[font=\small,label=right:\texttt{H}] {} (10);
\path[draw,thick,->] (10) -- node[font=\small,label=right:\texttt{E}] {} (11);
\path[draw,thick,->] (11) -- node[font=\small,label=right:\texttt{R}] {} (12);
\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13);
\end{tikzpicture}
\end{center}

This trie corresponds to the set
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
The character * in a node means that
a string in the set ends at the node.
Such a character is needed, because a string
may be a prefix of another string.
For example, in the above trie, \texttt{THE}
is a prefix of \texttt{THERE}.

We can check in $O(n)$ time whether a trie
contains a string of length $n$,
because we can follow the chain that starts at the root node.
We can also add a string of length $n$ to the trie
in $O(n)$ time by first following the chain
and then adding new nodes to the trie if necessary.

Using a trie, we can find
the longest prefix of a given string
such that the prefix belongs to the set.
Moreover, by storing additional information
in each node,
we can calculate the number of
strings that belong to the set and have a
given string as a prefix.

A trie can be stored in an array
\begin{lstlisting}
int trie[N][A];
\end{lstlisting}
where $N$ is the maximum number of nodes
(the maximum total length of the strings in the set)
and $A$ is the size of the alphabet.
The nodes of a trie are numbered
$0,1,2,\ldots$ so that the number of the root is 0,
and $\texttt{trie}[s][c]$ is the next node in the chain
when we move from node $s$ using character $c$.

\section{String hashing}

\index{hashing}
\index{string hashing}

\key{String hashing} is a technique that
allows us to efficiently check whether two
strings are equal\footnote{The technique
was popularized by the Karp–Rabin pattern matching
algorithm \cite{kar87}.}.
The idea in string hashing is to compare hash values of
strings instead of their individual characters.

\subsubsection*{Calculating hash values}

\index{hash value}
\index{polynomial hashing}

A \key{hash value} of a string is
a number that is calculated from the characters
of the string.
If two strings are the same,
their hash values are also the same,
which makes it possible to compare strings
based on their hash values.

A usual way to implement string hashing
is \key{polynomial hashing}, which means
that the hash value of a string \texttt{s}
of length $n$ is
\[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B  ,\]
where $s[0],s[1],\ldots,s[n-1]$
are interpreted as the codes of the characters of \texttt{s},
and $A$ and $B$ are pre-chosen constants.

For example, the codes of the characters
of \texttt{ALLEY} are:
\begin{center}
\begin{tikzpicture}[scale=0.7]
\draw (0,0) grid (5,2);

\node at (0.5, 1.5) {\texttt{A}};
\node at (1.5, 1.5) {\texttt{L}};
\node at (2.5, 1.5) {\texttt{L}};
\node at (3.5, 1.5) {\texttt{E}};
\node at (4.5, 1.5) {\texttt{Y}};

\node at (0.5, 0.5) {65};
\node at (1.5, 0.5) {76};
\node at (2.5, 0.5) {76};
\node at (3.5, 0.5) {69};
\node at (4.5, 0.5) {89};

\end{tikzpicture}
\end{center}

Thus, if $A=3$ and $B=97$, the hash value
of \texttt{ALLEY} is
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]

\subsubsection*{Preprocessing}

Using polynomial hashing, we can calculate the hash value of any substring
of a string \texttt{s} in $O(1)$ time after an $O(n)$ time preprocessing.
The idea is to construct an array \texttt{h} such that
$\texttt{h}[k]$ contains the hash value of the prefix $\texttt{s}[0 \ldots k]$.
The array values can be recursively calculated as follows:
\[
\begin{array}{lcl}
\texttt{h}[0] & = & \texttt{s}[0] \\
\texttt{h}[k] & = & (\texttt{h}[k-1] A + \texttt{s}[k]) \bmod B \\
\end{array}
\]
In addition, we construct an array $\texttt{p}$
where $\texttt{p}[k]=A^k \bmod B$:
\[
\begin{array}{lcl}
\texttt{p}[0] & = & 1 \\
\texttt{p}[k] & = & (\texttt{p}[k-1] A) \bmod B. \\
\end{array}
\]
Constructing these arrays takes $O(n)$ time.
After this, the hash value of any substring
$\texttt{s}[a \ldots b]$
can be calculated in $O(1)$ time using the formula
\[(\texttt{h}[b]-\texttt{h}[a-1] \texttt{p}[b-a+1]) \bmod B\]
assuming that $a>0$.
If $a=0$, the hash value is simply $\texttt{h}[b]$.

\subsubsection*{Using hash values}

We can efficiently compare strings using hash values.
Instead of comparing the individual characters of the strings,
the idea is to compare their hash values.
If the hash values are equal,
the strings are \emph{probably} equal,
and if the hash values are different,
the strings are \emph{certainly} different.

Using hashing, we can often make a brute force
algorithm efficient.
As an example, consider the pattern matching problem:
given a string $s$ and a pattern $p$,
find the positions where $p$ occurs in $s$.
A brute force algorithm goes through all positions
where $p$ may occur and compares the strings
character by character.
The time complexity of such an algorithm is $O(n^2)$.

We can make the brute force algorithm more efficient
by using hashing, because the algorithm compares
substrings of strings.
Using hashing, each comparison only takes $O(1)$ time,
because only hash values of substrings are compared.
This results in an algorithm with time complexity $O(n)$,
which is the best possible time complexity for this problem.

By combining hashing and \emph{binary search},
it is also possible to find out the lexicographic order of
two strings in logarithmic time.
This can be done by calculating the length
of the common prefix of the strings using binary search.
Once we know the length of the common prefix,
we can just check the next character after the prefix,
because this determines the order of the strings.

\subsubsection*{Collisions and parameters}

\index{collision}

An evident risk when comparing hash values is
a \key{collision}, which means that two strings have
different contents but equal hash values.
In this case, an algorithm that relies on
the hash values concludes that the strings are equal,
but in reality they are not,
and the algorithm may give incorrect results.

Collisions are always possible,
because the number of different strings is larger
than the number of different hash values.
However, the probability of a collision is small
if the constants $A$ and $B$ are carefully chosen.
A usual way is to choose random constants
near $10^9$, for example as follows:
\[
\begin{array}{lcl}
A & = & 911382323 \\
B & = & 972663749 \\
\end{array}
\]

Using such constants,
the \texttt{long long} type can be used
when calculating hash values,
because the products $AB$ and $BB$ will fit in \texttt{long long}.
But is it enough to have about $10^9$ different hash values?

Let us consider three scenarios where hashing can be used:

\textit{Scenario 1:} Strings $x$ and $y$ are compared with
each other.
The probability of a collision is $1/B$ assuming that
all hash values are equally probable.

\textit{Scenario 2:} A string $x$ is compared with strings
$y_1,y_2,\ldots,y_n$.
The probability of one or more collisions is

\[1-(1-\frac{1}{B})^n.\]

\textit{Scenario 3:} All pairs of strings $x_1,x_2,\ldots,x_n$
are compared with each other.
The probability of one or more collisions is
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]

The following table shows the collision probabilities
when $n=10^6$ and the value of $B$ varies:

\begin{center}
\begin{tabular}{rrrr}
constant $B$ & scenario 1 & scenario 2 & scenario 3 \\
\hline
$10^3$ & $0.001000$ & $1.000000$ & $1.000000$ \\
$10^6$ & $0.000001$ & $0.632121$ & $1.000000$ \\
$10^9$ & $0.000000$ & $0.001000$ & $1.000000$ \\
$10^{12}$ & $0.000000$ & $0.000000$ & $0.393469$ \\
$10^{15}$ & $0.000000$ & $0.000000$ & $0.000500$ \\
$10^{18}$ & $0.000000$ & $0.000000$ & $0.000001$ \\
\end{tabular}
\end{center}

The table shows that in scenario 1,
the probability of a collision is negligible
when $B \approx 10^9$.
In scenario 2, a collision is possible but the
probability is still quite small.
However, in scenario 3 the situation is very different:
a collision will almost always happen when
$B \approx 10^9$.

\index{birthday paradox}

The phenomenon in scenario 3 is known as the
\key{birthday paradox}: if there are $n$ people
in a room, the probability that \emph{some} two people
have the same birthday is large even if $n$ is quite small.
In hashing, correspondingly, when all hash values are compared
with each other, the probability that some two
hash values are equal is large.

We can make the probability of a collision
smaller by calculating \emph{multiple} hash values
using different parameters.
It is unlikely that a collision would occur
in all hash values at the same time.
For example, two hash values with parameter
$B \approx 10^9$ correspond to one hash
value with parameter $B \approx 10^{18}$,
which makes the probability of a collision very small.

Some people use constants $B=2^{32}$ and $B=2^{64}$,
which is convenient, because operations with 32 and 64
bit integers are calculated modulo $2^{32}$ and $2^{64}$.
However, this is \emph{not} a good choice, because it is possible
to construct inputs that always generate collisions when
constants of the form $2^x$ are used \cite{pac13}.

\section{Z-algorithm}

\index{Z-algorithm}
\index{Z-array}

The \key{Z-array} \texttt{z} of a string \texttt{s}
of length $n$ contains for each $k=0,1,\ldots,n-1$
the length of the longest substring of \texttt{s}
that begins at position $k$ and is a prefix of \texttt{s}.
Thus, $\texttt{z}[k]=p$ tells us that
$\texttt{s}[0 \ldots p-1]$ equals $\texttt{s}[k \ldots k+p-1]$.
Many string processing problems can be efficiently solved
using the Z-array.

For example, the Z-array of
\texttt{ACBACDACBACBACDA} is as follows:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\draw (0,0) grid (16,2);

\node at (0.5, 1.5) {\texttt{A}};
\node at (1.5, 1.5) {\texttt{C}};
\node at (2.5, 1.5) {\texttt{B}};
\node at (3.5, 1.5) {\texttt{A}};
\node at (4.5, 1.5) {\texttt{C}};
\node at (5.5, 1.5) {\texttt{D}};
\node at (6.5, 1.5) {\texttt{A}};
\node at (7.5, 1.5) {\texttt{C}};
\node at (8.5, 1.5) {\texttt{B}};
\node at (9.5, 1.5) {\texttt{A}};
\node at (10.5, 1.5) {\texttt{C}};
\node at (11.5, 1.5) {\texttt{B}};
\node at (12.5, 1.5) {\texttt{A}};
\node at (13.5, 1.5) {\texttt{C}};
\node at (14.5, 1.5) {\texttt{D}};
\node at (15.5, 1.5) {\texttt{A}};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {2};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {0};
\node at (6.5, 0.5) {5};
\node at (7.5, 0.5) {0};
\node at (8.5, 0.5) {0};
\node at (9.5, 0.5) {7};
\node at (10.5, 0.5) {0};
\node at (11.5, 0.5) {0};
\node at (12.5, 0.5) {2};
\node at (13.5, 0.5) {0};
\node at (14.5, 0.5) {0};
\node at (15.5, 0.5) {1};

\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\node at (14.5, 2.5) {14};
\node at (15.5, 2.5) {15};

\end{tikzpicture}
\end{center}

In this case, for example, $\texttt{z}[6]=5$,
because the substring \texttt{ACBAC} of length 5
is a prefix of \texttt{s},
but the substring \texttt{ACBACB} of length 6
is not a prefix of \texttt{s}.

\subsubsection*{Algorithm description}

Next we describe an algorithm,
called the \key{Z-algorithm}\footnote{The Z-algorithm
was presented in \cite{gus97} as the simplest known
method for linear-time pattern matching, and the original idea
was attributed to \cite{mai84}.},
that efficiently constructs the Z-array in $O(n)$ time.
The algorithm calculates the Z-array values
from left to right by both using information
already stored in the Z-array and comparing substrings
character by character.

To efficiently calculate the Z-array values,
the algorithm maintains a range $[x,y]$ such that
$\texttt{s}[x \ldots y]$ is a prefix of \texttt{s}
and $y$ is as large as possible.
Since we know that $\texttt{s}[0 \ldots y-x]$
and $\texttt{s}[x \ldots y]$ are equal,
we can use this information when calculating
Z-values for positions $x+1,x+2,\ldots,y$.

At each position $k$, we first
check the value of $\texttt{z}[k-x]$.
If $k+\texttt{z}[k-x]<y$, we know that $\texttt{z}[k]=\texttt{z}[k-x]$.
However, if $k+\texttt{z}[k-x] \ge y$,
$\texttt{s}[0 \ldots y-k]$ equals
$\texttt{s}[k \ldots y]$, and to determine the
value of $\texttt{z}[k]$ we need to compare
the substrings character by character.
Still, the algorithm works in $O(n)$ time,
because we start comparing at positions
$y-k+1$ and $y+1$.

For example, let us construct the following Z-array:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\draw (0,0) grid (16,2);

\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {C};
\node at (2.5, 1.5) {B};
\node at (3.5, 1.5) {A};
\node at (4.5, 1.5) {C};
\node at (5.5, 1.5) {D};
\node at (6.5, 1.5) {A};
\node at (7.5, 1.5) {C};
\node at (8.5, 1.5) {B};
\node at (9.5, 1.5) {A};
\node at (10.5, 1.5) {C};
\node at (11.5, 1.5) {B};
\node at (12.5, 1.5) {A};
\node at (13.5, 1.5) {C};
\node at (14.5, 1.5) {D};
\node at (15.5, 1.5) {A};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {?};
\node at (2.5, 0.5) {?};
\node at (3.5, 0.5) {?};
\node at (4.5, 0.5) {?};
\node at (5.5, 0.5) {?};
\node at (6.5, 0.5) {?};
\node at (7.5, 0.5) {?};
\node at (8.5, 0.5) {?};
\node at (9.5, 0.5) {?};
\node at (10.5, 0.5) {?};
\node at (11.5, 0.5) {?};
\node at (12.5, 0.5) {?};
\node at (13.5, 0.5) {?};
\node at (14.5, 0.5) {?};
\node at (15.5, 0.5) {?};

\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\node at (14.5, 2.5) {14};
\node at (15.5, 2.5) {15};

\end{tikzpicture}
\end{center}

After calculating the value $\texttt{z}[6]=5$,
the current $[x,y]$ range is $[6,10]$:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\fill[color=lightgray] (6,0) rectangle (7,1);
\draw (0,0) grid (16,2);

\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {C};
\node at (2.5, 1.5) {B};
\node at (3.5, 1.5) {A};
\node at (4.5, 1.5) {C};
\node at (5.5, 1.5) {D};
\node at (6.5, 1.5) {A};
\node at (7.5, 1.5) {C};
\node at (8.5, 1.5) {B};
\node at (9.5, 1.5) {A};
\node at (10.5, 1.5) {C};
\node at (11.5, 1.5) {B};
\node at (12.5, 1.5) {A};
\node at (13.5, 1.5) {C};
\node at (14.5, 1.5) {D};
\node at (15.5, 1.5) {A};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {2};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {0};
\node at (6.5, 0.5) {5};
\node at (7.5, 0.5) {?};
\node at (8.5, 0.5) {?};
\node at (9.5, 0.5) {?};
\node at (10.5, 0.5) {?};
\node at (11.5, 0.5) {?};
\node at (12.5, 0.5) {?};
\node at (13.5, 0.5) {?};
\node at (14.5, 0.5) {?};
\node at (15.5, 0.5) {?};

\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);

\node at (6.5,3.50) {$x$};
\node at (10.5,3.50) {$y$};


\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\node at (14.5, 2.5) {14};
\node at (15.5, 2.5) {15};

\end{tikzpicture}
\end{center}

Now we can calculate
subsequent Z-array values
efficiently,
because we know that
$\texttt{s}[0 \ldots 4]$ and
$\texttt{s}[6 \ldots 10]$ are equal.
First, since $\texttt{z}[1] = \texttt{z}[2] = 0$,
we immediately know that also
$\texttt{z}[7] = \texttt{z}[8] = 0$:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\fill[color=lightgray] (7,0) rectangle (9,1);
\draw (0,0) grid (16,2);

\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {C};
\node at (2.5, 1.5) {B};
\node at (3.5, 1.5) {A};
\node at (4.5, 1.5) {C};
\node at (5.5, 1.5) {D};
\node at (6.5, 1.5) {A};
\node at (7.5, 1.5) {C};
\node at (8.5, 1.5) {B};
\node at (9.5, 1.5) {A};
\node at (10.5, 1.5) {C};
\node at (11.5, 1.5) {B};
\node at (12.5, 1.5) {A};
\node at (13.5, 1.5) {C};
\node at (14.5, 1.5) {D};
\node at (15.5, 1.5) {A};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {2};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {0};
\node at (6.5, 0.5) {5};
\node at (7.5, 0.5) {0};
\node at (8.5, 0.5) {0};
\node at (9.5, 0.5) {?};
\node at (10.5, 0.5) {?};
\node at (11.5, 0.5) {?};
\node at (12.5, 0.5) {?};
\node at (13.5, 0.5) {?};
\node at (14.5, 0.5) {?};
\node at (15.5, 0.5) {?};


\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);

\node at (6.5,3.50) {$x$};
\node at (10.5,3.50) {$y$};


\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\node at (14.5, 2.5) {14};
\node at (15.5, 2.5) {15};


\draw[thick,<->] (7.5,-0.25) .. controls (7,-1.25) and (2,-1.25) .. (1.5,-0.25);
\draw[thick,<->] (8.5,-0.25) .. controls (8,-1.25) and (3,-1.25) .. (2.5,-0.25);
\end{tikzpicture}
\end{center}

Then, since $\texttt{z}[3]=2$, we know that $\texttt{z}[9] \ge 2$:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\fill[color=lightgray] (9,0) rectangle (10,1);
\draw (0,0) grid (16,2);

\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {C};
\node at (2.5, 1.5) {B};
\node at (3.5, 1.5) {A};
\node at (4.5, 1.5) {C};
\node at (5.5, 1.5) {D};
\node at (6.5, 1.5) {A};
\node at (7.5, 1.5) {C};
\node at (8.5, 1.5) {B};
\node at (9.5, 1.5) {A};
\node at (10.5, 1.5) {C};
\node at (11.5, 1.5) {B};
\node at (12.5, 1.5) {A};
\node at (13.5, 1.5) {C};
\node at (14.5, 1.5) {D};
\node at (15.5, 1.5) {A};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {2};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {0};
\node at (6.5, 0.5) {5};
\node at (7.5, 0.5) {0};
\node at (8.5, 0.5) {0};
\node at (9.5, 0.5) {?};
\node at (10.5, 0.5) {?};
\node at (11.5, 0.5) {?};
\node at (12.5, 0.5) {?};
\node at (13.5, 0.5) {?};
\node at (14.5, 0.5) {?};
\node at (15.5, 0.5) {?};

\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);

\node at (6.5,3.50) {$x$};
\node at (10.5,3.50) {$y$};


\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\node at (14.5, 2.5) {14};
\node at (15.5, 2.5) {15};

\draw[thick,<->] (9.5,-0.25) .. controls (9,-1.25) and (4,-1.25) .. (3.5,-0.25);
\end{tikzpicture}
\end{center}

However, we have no information about the string
after position 10, so we need to compare the substrings
character by character:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\fill[color=lightgray] (9,0) rectangle (10,1);
\fill[color=lightgray] (2,1) rectangle (7,2);
\fill[color=lightgray] (11,1) rectangle (16,2);


\draw (0,0) grid (16,2);

\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {C};
\node at (2.5, 1.5) {B};
\node at (3.5, 1.5) {A};
\node at (4.5, 1.5) {C};
\node at (5.5, 1.5) {D};
\node at (6.5, 1.5) {A};
\node at (7.5, 1.5) {C};
\node at (8.5, 1.5) {B};
\node at (9.5, 1.5) {A};
\node at (10.5, 1.5) {C};
\node at (11.5, 1.5) {B};
\node at (12.5, 1.5) {A};
\node at (13.5, 1.5) {C};
\node at (14.5, 1.5) {D};
\node at (15.5, 1.5) {A};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {2};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {0};
\node at (6.5, 0.5) {5};
\node at (7.5, 0.5) {0};
\node at (8.5, 0.5) {0};
\node at (9.5, 0.5) {?};
\node at (10.5, 0.5) {?};
\node at (11.5, 0.5) {?};
\node at (12.5, 0.5) {?};
\node at (13.5, 0.5) {?};
\node at (14.5, 0.5) {?};
\node at (15.5, 0.5) {?};

\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);

\node at (6.5,3.50) {$x$};
\node at (10.5,3.50) {$y$};


\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\node at (14.5, 2.5) {14};
\node at (15.5, 2.5) {15};

%\draw[thick,<->] (11.5,-0.25) .. controls (11,-1.25) and (3,-1.25) .. (2.5,-0.25);
\end{tikzpicture}
\end{center}

It turns out that $\texttt{z}[9]=7$,
so the new $[x,y]$ range is $[9,15]$:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\fill[color=lightgray] (9,0) rectangle (10,1);
\draw (0,0) grid (16,2);

\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {C};
\node at (2.5, 1.5) {B};
\node at (3.5, 1.5) {A};
\node at (4.5, 1.5) {C};
\node at (5.5, 1.5) {D};
\node at (6.5, 1.5) {A};
\node at (7.5, 1.5) {C};
\node at (8.5, 1.5) {B};
\node at (9.5, 1.5) {A};
\node at (10.5, 1.5) {C};
\node at (11.5, 1.5) {B};
\node at (12.5, 1.5) {A};
\node at (13.5, 1.5) {C};
\node at (14.5, 1.5) {D};
\node at (15.5, 1.5) {A};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {2};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {0};
\node at (6.5, 0.5) {5};
\node at (7.5, 0.5) {0};
\node at (8.5, 0.5) {0};
\node at (9.5, 0.5) {7};
\node at (10.5, 0.5) {?};
\node at (11.5, 0.5) {?};
\node at (12.5, 0.5) {?};
\node at (13.5, 0.5) {?};
\node at (14.5, 0.5) {?};
\node at (15.5, 0.5) {?};

\draw [decoration={brace}, decorate, line width=0.5mm] (9,3.00) -- (16,3.00);

\node at (9.5,3.50) {$x$};
\node at (15.5,3.50) {$y$};


\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\node at (14.5, 2.5) {14};
\node at (15.5, 2.5) {15};

% \draw[thick,<->] (9.5,-0.25) .. controls (9,-1.25) and (4,-1.25) .. (3.5,-0.25);
\end{tikzpicture}
\end{center}

After this, all the remaining Z-array values
can be determined by using the information
already stored in the Z-array:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\draw (0,0) grid (16,2);

\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {C};
\node at (2.5, 1.5) {B};
\node at (3.5, 1.5) {A};
\node at (4.5, 1.5) {C};
\node at (5.5, 1.5) {D};
\node at (6.5, 1.5) {A};
\node at (7.5, 1.5) {C};
\node at (8.5, 1.5) {B};
\node at (9.5, 1.5) {A};
\node at (10.5, 1.5) {C};
\node at (11.5, 1.5) {B};
\node at (12.5, 1.5) {A};
\node at (13.5, 1.5) {C};
\node at (14.5, 1.5) {D};
\node at (15.5, 1.5) {A};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {2};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {0};
\node at (6.5, 0.5) {5};
\node at (7.5, 0.5) {0};
\node at (8.5, 0.5) {0};
\node at (9.5, 0.5) {7};
\node at (10.5, 0.5) {0};
\node at (11.5, 0.5) {0};
\node at (12.5, 0.5) {2};
\node at (13.5, 0.5) {0};
\node at (14.5, 0.5) {0};
\node at (15.5, 0.5) {1};

\draw [decoration={brace}, decorate, line width=0.5mm] (9,3.00) -- (16,3.00);

\node at (9.5,3.50) {$x$};
\node at (15.5,3.50) {$y$};


\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\node at (14.5, 2.5) {14};
\node at (15.5, 2.5) {15};

\end{tikzpicture}
\end{center}

\subsubsection{Using the Z-array}

It is often a matter of taste whether to use
string hashing or the Z-algorithm.
Unlike hashing, the Z-algorithm always works
and there is no risk for collisions.
On the other hand, the Z-algorithm is more difficult
to implement and some problems can only be solved
using hashing.

As an example, consider again
the pattern matching problem,
where our task is to find the occurrences
of a pattern $p$ in a string $s$.
We already solved this problem efficiently
using string hashing, but the Z-algorithm
provides another way to solve the problem.

A usual idea in string processing is to
construct a string that consists of
multiple strings separated by special characters.
In this problem, we can construct a string
$p$\texttt{\#}$s$,
where $p$ and $s$ are separated by a special
character \texttt{\#} that does not occur
in the strings.
The Z-array of $p$\texttt{\#}$s$ tells us the positions
where $p$ occurs in $s$,
because such positions contain the length of $p$.

For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT},
the Z-array is as follows:

\begin{center}
\begin{tikzpicture}[scale=0.7]
\draw (0,0) grid (14,2);

\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {T};
\node at (2.5, 1.5) {T};
\node at (3.5, 1.5) {\#};
\node at (4.5, 1.5) {H};
\node at (5.5, 1.5) {A};
\node at (6.5, 1.5) {T};
\node at (7.5, 1.5) {T};
\node at (8.5, 1.5) {I};
\node at (9.5, 1.5) {V};
\node at (10.5, 1.5) {A};
\node at (11.5, 1.5) {T};
\node at (12.5, 1.5) {T};
\node at (13.5, 1.5) {I};

\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {0};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {3};
\node at (6.5, 0.5) {0};
\node at (7.5, 0.5) {0};
\node at (8.5, 0.5) {0};
\node at (9.5, 0.5) {0};
\node at (10.5, 0.5) {3};
\node at (11.5, 0.5) {0};
\node at (12.5, 0.5) {0};
\node at (13.5, 0.5) {0};

\footnotesize
\node at (0.5, 2.5) {0};
\node at (1.5, 2.5) {1};
\node at (2.5, 2.5) {2};
\node at (3.5, 2.5) {3};
\node at (4.5, 2.5) {4};
\node at (5.5, 2.5) {5};
\node at (6.5, 2.5) {6};
\node at (7.5, 2.5) {7};
\node at (8.5, 2.5) {8};
\node at (9.5, 2.5) {9};
\node at (10.5, 2.5) {10};
\node at (11.5, 2.5) {11};
\node at (12.5, 2.5) {12};
\node at (13.5, 2.5) {13};
\end{tikzpicture}
\end{center}

The positions 5 and 10 contain the value 3,
which means that the pattern \texttt{ATT}
occurs in the corresponding positions
of \texttt{HATTIVATTI}.

The time complexity of the resulting algorithm
is linear, because it suffices to construct
the Z-array and go through its values.

\subsubsection{Implementation}

Here is a short implementation of the Z-algorithm
that returns a vector that corresponds to the Z-array.

\begin{lstlisting}
vector<int> z(string s) {
    int n = s.size();
    vector<int> z(n);
    int x = 0, y = 0;
    for (int i = 1; i < n; i++) {
        z[i] = max(0,min(z[i-x],y-i+1));
        while (i+z[i] < n && s[z[i]] == s[i+z[i]]) {
            x = i; y = i+z[i]; z[i]++;
        }
    }
    return z;
}
\end{lstlisting}
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\chapter{String algorithms}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								This chapter deals with efficient algorithms
-												Word order

											
										
										
											2017-02-11 18:12:29 +01:00
+								for string processing.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								Many string problems can be easily solved
 								in $O(n^2)$ time, but the challenge is to
 								find algorithms that work in $O(n)$ or $O(n \log n)$
-												Corrections

											
										
										
											2017-02-18 17:27:38 +01:00
+								time.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
 								\index{pattern matching}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								For example, a fundamental string processing
 								problem is the \key{pattern matching} problem:
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								given a string of length $n$ and a pattern of length $m$,
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								our task is to find the occurrences of the pattern
 								in the string.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								For example, the pattern \texttt{ABC} occurs two
 								times in the string \texttt{ABABCBABC}.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								The pattern matching problem can be easily solved
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								in $O(nm)$ time by a brute force algorithm that
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								tests all positions where the pattern may
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								occur in the string.
-												Corrections

											
										
										
											2017-02-18 17:27:38 +01:00
+								However, in this chapter, we will see that there
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								are more efficient algorithms that require only
 								$O(n+m)$ time.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								\index{string}
-												Corrections

											
										
										
											2017-02-18 17:27:38 +01:00
+								\section{String terminology}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
 								\index{alphabet}
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								Throughout the chapter, we assume that
 								zero-based indexing is used in strings.
 								Thus, a string \texttt{s} of length $n$
 								consists of characters
 								$\texttt{s}[0],\texttt{s}[1],\ldots,\texttt{s}[n-1]$.
 								The set of characters that may appear
 								in strings is called an \key{alphabet}.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								For example, the alphabet
 								$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
 								consists of the capital letters of English.
 								\index{substring}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								A \key{substring} is a sequence of consecutive
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								characters in a string.
 								We use the notation $\texttt{s}[a \ldots b]$
 								to refer to a substring of \texttt{s}
 								that begins at position $a$ and ends at position $b$.
 								A string of length $n$ has $n(n+1)/2$ substrings.
 								For example, the substrings of
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								\texttt{ABCD} are
 								\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
 								\texttt{AB}, \texttt{BC}, \texttt{CD},
 								\texttt{ABC}, \texttt{BCD} and \texttt{ABCD}.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
 								\index{subsequence}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								A \key{subsequence} is a sequence of
 								(not necessarily consecutive) characters
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								in a string in their original order.
 								A string of length $n$ has $2^n-1$ subsequences.
 								For example, the subsequences of
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								\texttt{ABCD} are
 								\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
 								\texttt{AB}, \texttt{AC}, \texttt{AD},
 								\texttt{BC}, \texttt{BD}, \texttt{CD},
 								\texttt{ABC}, \texttt{ABD}, \texttt{ACD},
 								\texttt{BCD} and \texttt{ABCD}.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
 								\index{prefix}
 								\index{suffix}
-												Fix typo

											
										
										
											2017-05-25 14:58:45 +02:00
+								A \key{prefix} is a substring that starts at the beginning
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								of a string,
 								and a \key{suffix} is a substring that ends at the end
 								of a string.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								For example,
 								the prefixes of \texttt{ABCD} are
 								\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD},
 								and the suffixes of \texttt{ABCD} are
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
 								\index{rotation}
 								A \key{rotation} can be generated by moving
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								the characters of a string one by one from the beginning
 								to the end (or vice versa).
 								For example, the rotations of \texttt{ABCD} are
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
 								\index{period}
 								A \key{period} is a prefix of a string such that
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								the string can be constructed by repeating the period.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								The last repetition may be partial and contain
 								only a prefix of the period.
 								For example, the shortest period of
 								\texttt{ABCABCA} is \texttt{ABC}.
 								\index{border}
 								A \key{border} is a string that is both
 								a prefix and a suffix of a string.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								For example, the borders of \texttt{ABACABA}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
 								\index{lexicographical order}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								Strings are compared using the \key{lexicographical order}
 								(which corresponds to the alphabetical order).
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
 								or there is a position $k$ such that
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\section{Trie structure}
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
 								\index{trie}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								A \key{trie} is a rooted tree that
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								maintains a set of strings.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								Each string in the set is stored as
 								a chain of characters that starts at the root.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								If two strings have a common prefix,
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								they also have a common chain in the tree.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								For example, consider the following trie:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.9]
 								\node[draw, circle] (1) at (0,20) {$\phantom{1}$};
 								\node[draw, circle] (2) at (-1.5,19) {$\phantom{1}$};
 								\node[draw, circle] (3) at (1.5,19) {$\phantom{1}$};
 								\node[draw, circle] (4) at (-1.5,17.5) {$\phantom{1}$};
 								\node[draw, circle] (5) at (-1.5,16) {$\phantom{1}$};
 								\node[draw, circle] (6) at (-2.5,14.5) {$\phantom{1}$};
 								\node[draw, circle] (7) at (-0.5,14.5) {$\phantom{1}$};
 								\node[draw, circle] (8) at (-2.5,13) {*};
 								\node[draw, circle] (9) at (-0.5,13) {*};
 								\node[draw, circle] (10) at (1.5,17.5) {$\phantom{1}$};
 								\node[draw, circle] (11) at (1.5,16) {*};
 								\node[draw, circle] (12) at (1.5,14.5) {$\phantom{1}$};
 								\node[draw, circle] (13) at (1.5,13) {*};
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{C}] {} (2);
 								\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{T}] {} (3);
 								\path[draw,thick,->] (2) -- node[font=\small,label=left:\texttt{A}] {} (4);
 								\path[draw,thick,->] (4) -- node[font=\small,label=left:\texttt{N}] {} (5);
 								\path[draw,thick,->] (5) -- node[font=\small,label=left:\texttt{A}] {} (6);
 								\path[draw,thick,->] (5) -- node[font=\small,label=right:\texttt{D}] {} (7);
 								\path[draw,thick,->] (6) -- node[font=\small,label=left:\texttt{L}] {}(8);
 								\path[draw,thick,->] (7) -- node[font=\small,label=right:\texttt{Y}] {} (9);
 								\path[draw,thick,->] (3) -- node[font=\small,label=right:\texttt{H}] {} (10);
 								\path[draw,thick,->] (10) -- node[font=\small,label=right:\texttt{E}] {} (11);
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\path[draw,thick,->] (11) -- node[font=\small,label=right:\texttt{R}] {} (12);
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13);
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\end{tikzpicture}
 								\end{center}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
 								This trie corresponds to the set
 								$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								The character * in a node means that
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								a string in the set ends at the node.
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								Such a character is needed, because a string
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								may be a prefix of another string.
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								For example, in the above trie, \texttt{THE}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								is a prefix of \texttt{THERE}.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								We can check in $O(n)$ time whether a trie
 								contains a string of length $n$,
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								because we can follow the chain that starts at the root node.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								We can also add a string of length $n$ to the trie
 								in $O(n)$ time by first following the chain
 								and then adding new nodes to the trie if necessary.
 								Using a trie, we can find
 								the longest prefix of a given string
 								such that the prefix belongs to the set.
 								Moreover, by storing additional information
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								in each node,
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								we can calculate the number of
 								strings that belong to the set and have a
 								given string as a prefix.
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								A trie can be stored in an array
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\begin{lstlisting}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								int trie[N][A];
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\end{lstlisting}
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								where $N$ is the maximum number of nodes
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								(the maximum total length of the strings in the set)
-												Definitions and trie

											
										
										
											2017-01-22 12:15:41 +01:00
+								and $A$ is the size of the alphabet.
 								The nodes of a trie are numbered
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								$0,1,2,\ldots$ so that the number of the root is 0,
 								and $\texttt{trie}[s][c]$ is the next node in the chain
 								when we move from node $s$ using character $c$.
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\section{String hashing}
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\index{hashing}
 								\index{string hashing}
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\key{String hashing} is a technique that
 								allows us to efficiently check whether two
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								strings are equal\footnote{The technique
-												References etc.

											
										
										
											2017-02-21 17:10:08 +01:00
+								was popularized by the Karp–Rabin pattern matching
 								algorithm \cite{kar87}.}.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								The idea in string hashing is to compare hash values of
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								strings instead of their individual characters.
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\subsubsection*{Calculating hash values}
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\index{hash value}
 								\index{polynomial hashing}
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								A \key{hash value} of a string is
 								a number that is calculated from the characters
 								of the string.
 								If two strings are the same,
 								their hash values are also the same,
 								which makes it possible to compare strings
 								based on their hash values.
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								A usual way to implement string hashing
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								is \key{polynomial hashing}, which means
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								that the hash value of a string \texttt{s}
 								of length $n$ is
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B  ,\]
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								where $s[0],s[1],\ldots,s[n-1]$
 								are interpreted as the codes of the characters of \texttt{s},
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								and $A$ and $B$ are pre-chosen constants.
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								For example, the codes of the characters
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								of \texttt{ALLEY} are:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\draw (0,0) grid (5,2);
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\node at (0.5, 1.5) {\texttt{A}};
 								\node at (1.5, 1.5) {\texttt{L}};
 								\node at (2.5, 1.5) {\texttt{L}};
 								\node at (3.5, 1.5) {\texttt{E}};
 								\node at (4.5, 1.5) {\texttt{Y}};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\node at (0.5, 0.5) {65};
 								\node at (1.5, 0.5) {76};
 								\node at (2.5, 0.5) {76};
 								\node at (3.5, 0.5) {69};
 								\node at (4.5, 0.5) {89};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\end{tikzpicture}
 								\end{center}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								Thus, if $A=3$ and $B=97$, the hash value
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								of \texttt{ALLEY} is
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\subsubsection*{Preprocessing}
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								Using polynomial hashing, we can calculate the hash value of any substring
 								of a string \texttt{s} in $O(1)$ time after an $O(n)$ time preprocessing.
 								The idea is to construct an array \texttt{h} such that
 								$\texttt{h}[k]$ contains the hash value of the prefix $\texttt{s}[0 \ldots k]$.
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								The array values can be recursively calculated as follows:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\[
 								\begin{array}{lcl}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								\texttt{h}[0] & = & \texttt{s}[0] \\
 								\texttt{h}[k] & = & (\texttt{h}[k-1] A + \texttt{s}[k]) \bmod B \\
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\end{array}
 								\]
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								In addition, we construct an array $\texttt{p}$
 								where $\texttt{p}[k]=A^k \bmod B$:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\[
 								\begin{array}{lcl}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								\texttt{p}[0] & = & 1 \\
 								\texttt{p}[k] & = & (\texttt{p}[k-1] A) \bmod B. \\
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\end{array}
 								\]
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								Constructing these arrays takes $O(n)$ time.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								After this, the hash value of any substring
 								$\texttt{s}[a \ldots b]$
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								can be calculated in $O(1)$ time using the formula
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								\[(\texttt{h}[b]-\texttt{h}[a-1] \texttt{p}[b-a+1]) \bmod B\]
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								assuming that $a>0$.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								If $a=0$, the hash value is simply $\texttt{h}[b]$.
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								\subsubsection*{Using hash values}
 								We can efficiently compare strings using hash values.
-												Corrections

											
										
										
											2017-02-18 17:27:38 +01:00
+								Instead of comparing the individual characters of the strings,
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								the idea is to compare their hash values.
 								If the hash values are equal,
 								the strings are \emph{probably} equal,
 								and if the hash values are different,
 								the strings are \emph{certainly} different.
 								Using hashing, we can often make a brute force
 								algorithm efficient.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								As an example, consider the pattern matching problem:
 								given a string $s$ and a pattern $p$,
 								find the positions where $p$ occurs in $s$.
 								A brute force algorithm goes through all positions
-												Corrections

											
										
										
											2017-02-18 17:27:38 +01:00
+								where $p$ may occur and compares the strings
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								character by character.
 								The time complexity of such an algorithm is $O(n^2)$.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								We can make the brute force algorithm more efficient
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								by using hashing, because the algorithm compares
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								substrings of strings.
 								Using hashing, each comparison only takes $O(1)$ time,
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								because only hash values of substrings are compared.
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								This results in an algorithm with time complexity $O(n)$,
 								which is the best possible time complexity for this problem.
 								By combining hashing and \emph{binary search},
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								it is also possible to find out the lexicographic order of
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								two strings in logarithmic time.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								This can be done by calculating the length
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								of the common prefix of the strings using binary search.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								Once we know the length of the common prefix,
 								we can just check the next character after the prefix,
 								because this determines the order of the strings.
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
 								\subsubsection*{Collisions and parameters}
 								\index{collision}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								An evident risk when comparing hash values is
 								a \key{collision}, which means that two strings have
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								different contents but equal hash values.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								In this case, an algorithm that relies on
 								the hash values concludes that the strings are equal,
 								but in reality they are not,
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								and the algorithm may give incorrect results.
 								Collisions are always possible,
 								because the number of different strings is larger
 								than the number of different hash values.
 								However, the probability of a collision is small
 								if the constants $A$ and $B$ are carefully chosen.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								A usual way is to choose random constants
 								near $10^9$, for example as follows:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\[
 								\begin{array}{lcl}
 								A & = & 911382323 \\
 								B & = & 972663749 \\
 								\end{array}
 								\]
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
 								Using such constants,
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								the \texttt{long long} type can be used
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								when calculating hash values,
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								because the products $AB$ and $BB$ will fit in \texttt{long long}.
 								But is it enough to have about $10^9$ different hash values?
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								Let us consider three scenarios where hashing can be used:
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
 								\textit{Scenario 1:} Strings $x$ and $y$ are compared with
 								each other.
 								The probability of a collision is $1/B$ assuming that
 								all hash values are equally probable.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								\textit{Scenario 2:} A string $x$ is compared with strings
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								$y_1,y_2,\ldots,y_n$.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								The probability of one or more collisions is
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								\[1-(1-\frac{1}{B})^n.\]
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								\textit{Scenario 3:} All pairs of strings $x_1,x_2,\ldots,x_n$
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								are compared with each other.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								The probability of one or more collisions is
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								The following table shows the collision probabilities
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								when $n=10^6$ and the value of $B$ varies:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tabular}{rrrr}
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								constant $B$ & scenario 1 & scenario 2 & scenario 3 \\
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\hline
 								$10^3$ & $0.001000$ & $1.000000$ & $1.000000$ \\
 								$10^6$ & $0.000001$ & $0.632121$ & $1.000000$ \\
 								$10^9$ & $0.000000$ & $0.001000$ & $1.000000$ \\
 								$10^{12}$ & $0.000000$ & $0.000000$ & $0.393469$ \\
 								$10^{15}$ & $0.000000$ & $0.000000$ & $0.000500$ \\
 								$10^{18}$ & $0.000000$ & $0.000000$ & $0.000001$ \\
 								\end{tabular}
 								\end{center}
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								The table shows that in scenario 1,
 								the probability of a collision is negligible
 								when $B \approx 10^9$.
 								In scenario 2, a collision is possible but the
 								probability is still quite small.
 								However, in scenario 3 the situation is very different:
 								a collision will almost always happen when
 								$B \approx 10^9$.
 								\index{birthday paradox}
 								The phenomenon in scenario 3 is known as the
 								\key{birthday paradox}: if there are $n$ people
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								in a room, the probability that \emph{some} two people
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								have the same birthday is large even if $n$ is quite small.
 								In hashing, correspondingly, when all hash values are compared
 								with each other, the probability that some two
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								hash values are equal is large.
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								We can make the probability of a collision
 								smaller by calculating \emph{multiple} hash values
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								using different parameters.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								It is unlikely that a collision would occur
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								in all hash values at the same time.
 								For example, two hash values with parameter
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								$B \approx 10^9$ correspond to one hash
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								value with parameter $B \approx 10^{18}$,
 								which makes the probability of a collision very small.
 								Some people use constants $B=2^{32}$ and $B=2^{64}$,
 								which is convenient, because operations with 32 and 64
 								bit integers are calculated modulo $2^{32}$ and $2^{64}$.
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								However, this is \emph{not} a good choice, because it is possible
-												String hashing

											
										
										
											2017-01-24 20:59:20 +01:00
+								to construct inputs that always generate collisions when
-												References

											
										
										
											2017-02-21 00:17:36 +01:00
+								constants of the form $2^x$ are used \cite{pac13}.
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								\section{Z-algorithm}
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								\index{Z-algorithm}
 								\index{Z-array}
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								The \key{Z-array} \texttt{z} of a string \texttt{s}
 								of length $n$ contains for each $k=0,1,\ldots,n-1$
 								the length of the longest substring of \texttt{s}
 								that begins at position $k$ and is a prefix of \texttt{s}.
 								Thus, $\texttt{z}[k]=p$ tells us that
 								$\texttt{s}[0 \ldots p-1]$ equals $\texttt{s}[k \ldots k+p-1]$.
 								Many string processing problems can be efficiently solved
 								using the Z-array.
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								For example, the Z-array of
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								\texttt{ACBACDACBACBACDA} is as follows:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\draw (0,0) grid (16,2);
 								\node at (0.5, 1.5) {\texttt{A}};
 								\node at (1.5, 1.5) {\texttt{C}};
 								\node at (2.5, 1.5) {\texttt{B}};
 								\node at (3.5, 1.5) {\texttt{A}};
 								\node at (4.5, 1.5) {\texttt{C}};
 								\node at (5.5, 1.5) {\texttt{D}};
 								\node at (6.5, 1.5) {\texttt{A}};
 								\node at (7.5, 1.5) {\texttt{C}};
 								\node at (8.5, 1.5) {\texttt{B}};
 								\node at (9.5, 1.5) {\texttt{A}};
 								\node at (10.5, 1.5) {\texttt{C}};
 								\node at (11.5, 1.5) {\texttt{B}};
 								\node at (12.5, 1.5) {\texttt{A}};
 								\node at (13.5, 1.5) {\texttt{C}};
 								\node at (14.5, 1.5) {\texttt{D}};
 								\node at (15.5, 1.5) {\texttt{A}};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {0};
 								\node at (2.5, 0.5) {0};
 								\node at (3.5, 0.5) {2};
 								\node at (4.5, 0.5) {0};
 								\node at (5.5, 0.5) {0};
 								\node at (6.5, 0.5) {5};
 								\node at (7.5, 0.5) {0};
 								\node at (8.5, 0.5) {0};
 								\node at (9.5, 0.5) {7};
 								\node at (10.5, 0.5) {0};
 								\node at (11.5, 0.5) {0};
 								\node at (12.5, 0.5) {2};
 								\node at (13.5, 0.5) {0};
 								\node at (14.5, 0.5) {0};
 								\node at (15.5, 0.5) {1};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
 								\node at (14.5, 2.5) {14};
 								\node at (15.5, 2.5) {15};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\end{tikzpicture}
 								\end{center}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								In this case, for example, $\texttt{z}[6]=5$,
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								because the substring \texttt{ACBAC} of length 5
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								is a prefix of \texttt{s},
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								but the substring \texttt{ACBACB} of length 6
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								is not a prefix of \texttt{s}.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
 								\subsubsection*{Algorithm description}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								Next we describe an algorithm,
 								called the \key{Z-algorithm}\footnote{The Z-algorithm
 								was presented in \cite{gus97} as the simplest known
 								method for linear-time pattern matching, and the original idea
 								was attributed to \cite{mai84}.},
 								that efficiently constructs the Z-array in $O(n)$ time.
 								The algorithm calculates the Z-array values
 								from left to right by both using information
 								already stored in the Z-array and comparing substrings
 								character by character.
 								To efficiently calculate the Z-array values,
 								the algorithm maintains a range $[x,y]$ such that
 								$\texttt{s}[x \ldots y]$ is a prefix of \texttt{s}
 								and $y$ is as large as possible.
 								Since we know that $\texttt{s}[0 \ldots y-x]$
 								and $\texttt{s}[x \ldots y]$ are equal,
 								we can use this information when calculating
 								Z-values for positions $x+1,x+2,\ldots,y$.
 								At each position $k$, we first
 								check the value of $\texttt{z}[k-x]$.
 								If $k+\texttt{z}[k-x]<y$, we know that $\texttt{z}[k]=\texttt{z}[k-x]$.
 								However, if $k+\texttt{z}[k-x] \ge y$,
 								$\texttt{s}[0 \ldots y-k]$ equals
 								$\texttt{s}[k \ldots y]$, and to determine the
 								value of $\texttt{z}[k]$ we need to compare
 								the substrings character by character.
 								Still, the algorithm works in $O(n)$ time,
 								because we start comparing at positions
 								$y-k+1$ and $y+1$.
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								For example, let us construct the following Z-array:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\draw (0,0) grid (16,2);
 								\node at (0.5, 1.5) {A};
 								\node at (1.5, 1.5) {C};
 								\node at (2.5, 1.5) {B};
 								\node at (3.5, 1.5) {A};
 								\node at (4.5, 1.5) {C};
 								\node at (5.5, 1.5) {D};
 								\node at (6.5, 1.5) {A};
 								\node at (7.5, 1.5) {C};
 								\node at (8.5, 1.5) {B};
 								\node at (9.5, 1.5) {A};
 								\node at (10.5, 1.5) {C};
 								\node at (11.5, 1.5) {B};
 								\node at (12.5, 1.5) {A};
 								\node at (13.5, 1.5) {C};
 								\node at (14.5, 1.5) {D};
 								\node at (15.5, 1.5) {A};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {?};
 								\node at (2.5, 0.5) {?};
 								\node at (3.5, 0.5) {?};
 								\node at (4.5, 0.5) {?};
 								\node at (5.5, 0.5) {?};
 								\node at (6.5, 0.5) {?};
 								\node at (7.5, 0.5) {?};
 								\node at (8.5, 0.5) {?};
 								\node at (9.5, 0.5) {?};
 								\node at (10.5, 0.5) {?};
 								\node at (11.5, 0.5) {?};
 								\node at (12.5, 0.5) {?};
 								\node at (13.5, 0.5) {?};
 								\node at (14.5, 0.5) {?};
 								\node at (15.5, 0.5) {?};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
 								\node at (14.5, 2.5) {14};
 								\node at (15.5, 2.5) {15};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\end{tikzpicture}
 								\end{center}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								After calculating the value $\texttt{z}[6]=5$,
 								the current $[x,y]$ range is $[6,10]$:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\fill[color=lightgray] (6,0) rectangle (7,1);
 								\draw (0,0) grid (16,2);
 								\node at (0.5, 1.5) {A};
 								\node at (1.5, 1.5) {C};
 								\node at (2.5, 1.5) {B};
 								\node at (3.5, 1.5) {A};
 								\node at (4.5, 1.5) {C};
 								\node at (5.5, 1.5) {D};
 								\node at (6.5, 1.5) {A};
 								\node at (7.5, 1.5) {C};
 								\node at (8.5, 1.5) {B};
 								\node at (9.5, 1.5) {A};
 								\node at (10.5, 1.5) {C};
 								\node at (11.5, 1.5) {B};
 								\node at (12.5, 1.5) {A};
 								\node at (13.5, 1.5) {C};
 								\node at (14.5, 1.5) {D};
 								\node at (15.5, 1.5) {A};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {0};
 								\node at (2.5, 0.5) {0};
 								\node at (3.5, 0.5) {2};
 								\node at (4.5, 0.5) {0};
 								\node at (5.5, 0.5) {0};
 								\node at (6.5, 0.5) {5};
 								\node at (7.5, 0.5) {?};
 								\node at (8.5, 0.5) {?};
 								\node at (9.5, 0.5) {?};
 								\node at (10.5, 0.5) {?};
 								\node at (11.5, 0.5) {?};
 								\node at (12.5, 0.5) {?};
 								\node at (13.5, 0.5) {?};
 								\node at (14.5, 0.5) {?};
 								\node at (15.5, 0.5) {?};
 								\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
 								\node at (6.5,3.50) {$x$};
 								\node at (10.5,3.50) {$y$};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
 								\node at (14.5, 2.5) {14};
 								\node at (15.5, 2.5) {15};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\end{tikzpicture}
 								\end{center}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								Now we can calculate
 								subsequent Z-array values
 								efficiently,
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								because we know that
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								$\texttt{s}[0 \ldots 4]$ and
 								$\texttt{s}[6 \ldots 10]$ are equal.
 								First, since $\texttt{z}[1] = \texttt{z}[2] = 0$,
 								we immediately know that also
 								$\texttt{z}[7] = \texttt{z}[8] = 0$:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\fill[color=lightgray] (7,0) rectangle (9,1);
 								\draw (0,0) grid (16,2);
 								\node at (0.5, 1.5) {A};
 								\node at (1.5, 1.5) {C};
 								\node at (2.5, 1.5) {B};
 								\node at (3.5, 1.5) {A};
 								\node at (4.5, 1.5) {C};
 								\node at (5.5, 1.5) {D};
 								\node at (6.5, 1.5) {A};
 								\node at (7.5, 1.5) {C};
 								\node at (8.5, 1.5) {B};
 								\node at (9.5, 1.5) {A};
 								\node at (10.5, 1.5) {C};
 								\node at (11.5, 1.5) {B};
 								\node at (12.5, 1.5) {A};
 								\node at (13.5, 1.5) {C};
 								\node at (14.5, 1.5) {D};
 								\node at (15.5, 1.5) {A};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {0};
 								\node at (2.5, 0.5) {0};
 								\node at (3.5, 0.5) {2};
 								\node at (4.5, 0.5) {0};
 								\node at (5.5, 0.5) {0};
 								\node at (6.5, 0.5) {5};
 								\node at (7.5, 0.5) {0};
 								\node at (8.5, 0.5) {0};
 								\node at (9.5, 0.5) {?};
 								\node at (10.5, 0.5) {?};
 								\node at (11.5, 0.5) {?};
 								\node at (12.5, 0.5) {?};
 								\node at (13.5, 0.5) {?};
 								\node at (14.5, 0.5) {?};
 								\node at (15.5, 0.5) {?};
 								\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
 								\node at (6.5,3.50) {$x$};
 								\node at (10.5,3.50) {$y$};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
 								\node at (14.5, 2.5) {14};
 								\node at (15.5, 2.5) {15};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\draw[thick,<->] (7.5,-0.25) .. controls (7,-1.25) and (2,-1.25) .. (1.5,-0.25);
 								\draw[thick,<->] (8.5,-0.25) .. controls (8,-1.25) and (3,-1.25) .. (2.5,-0.25);
 								\end{tikzpicture}
 								\end{center}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								Then, since $\texttt{z}[3]=2$, we know that $\texttt{z}[9] \ge 2$:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\fill[color=lightgray] (9,0) rectangle (10,1);
 								\draw (0,0) grid (16,2);
 								\node at (0.5, 1.5) {A};
 								\node at (1.5, 1.5) {C};
 								\node at (2.5, 1.5) {B};
 								\node at (3.5, 1.5) {A};
 								\node at (4.5, 1.5) {C};
 								\node at (5.5, 1.5) {D};
 								\node at (6.5, 1.5) {A};
 								\node at (7.5, 1.5) {C};
 								\node at (8.5, 1.5) {B};
 								\node at (9.5, 1.5) {A};
 								\node at (10.5, 1.5) {C};
 								\node at (11.5, 1.5) {B};
 								\node at (12.5, 1.5) {A};
 								\node at (13.5, 1.5) {C};
 								\node at (14.5, 1.5) {D};
 								\node at (15.5, 1.5) {A};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {0};
 								\node at (2.5, 0.5) {0};
 								\node at (3.5, 0.5) {2};
 								\node at (4.5, 0.5) {0};
 								\node at (5.5, 0.5) {0};
 								\node at (6.5, 0.5) {5};
 								\node at (7.5, 0.5) {0};
 								\node at (8.5, 0.5) {0};
 								\node at (9.5, 0.5) {?};
 								\node at (10.5, 0.5) {?};
 								\node at (11.5, 0.5) {?};
 								\node at (12.5, 0.5) {?};
 								\node at (13.5, 0.5) {?};
 								\node at (14.5, 0.5) {?};
 								\node at (15.5, 0.5) {?};
 								\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
 								\node at (6.5,3.50) {$x$};
 								\node at (10.5,3.50) {$y$};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
 								\node at (14.5, 2.5) {14};
 								\node at (15.5, 2.5) {15};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\draw[thick,<->] (9.5,-0.25) .. controls (9,-1.25) and (4,-1.25) .. (3.5,-0.25);
 								\end{tikzpicture}
 								\end{center}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								However, we have no information about the string
 								after position 10, so we need to compare the substrings
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								character by character:
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\fill[color=lightgray] (9,0) rectangle (10,1);
 								\fill[color=lightgray] (2,1) rectangle (7,2);
 								\fill[color=lightgray] (11,1) rectangle (16,2);
 								\draw (0,0) grid (16,2);
 								\node at (0.5, 1.5) {A};
 								\node at (1.5, 1.5) {C};
 								\node at (2.5, 1.5) {B};
 								\node at (3.5, 1.5) {A};
 								\node at (4.5, 1.5) {C};
 								\node at (5.5, 1.5) {D};
 								\node at (6.5, 1.5) {A};
 								\node at (7.5, 1.5) {C};
 								\node at (8.5, 1.5) {B};
 								\node at (9.5, 1.5) {A};
 								\node at (10.5, 1.5) {C};
 								\node at (11.5, 1.5) {B};
 								\node at (12.5, 1.5) {A};
 								\node at (13.5, 1.5) {C};
 								\node at (14.5, 1.5) {D};
 								\node at (15.5, 1.5) {A};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {0};
 								\node at (2.5, 0.5) {0};
 								\node at (3.5, 0.5) {2};
 								\node at (4.5, 0.5) {0};
 								\node at (5.5, 0.5) {0};
 								\node at (6.5, 0.5) {5};
 								\node at (7.5, 0.5) {0};
 								\node at (8.5, 0.5) {0};
 								\node at (9.5, 0.5) {?};
 								\node at (10.5, 0.5) {?};
 								\node at (11.5, 0.5) {?};
 								\node at (12.5, 0.5) {?};
 								\node at (13.5, 0.5) {?};
 								\node at (14.5, 0.5) {?};
 								\node at (15.5, 0.5) {?};
 								\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
 								\node at (6.5,3.50) {$x$};
 								\node at (10.5,3.50) {$y$};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
 								\node at (14.5, 2.5) {14};
 								\node at (15.5, 2.5) {15};
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
 								%\draw[thick,<->] (11.5,-0.25) .. controls (11,-1.25) and (3,-1.25) .. (2.5,-0.25);
 								\end{tikzpicture}
 								\end{center}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								It turns out that $\texttt{z}[9]=7$,
 								so the new $[x,y]$ range is $[9,15]$:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\fill[color=lightgray] (9,0) rectangle (10,1);
 								\draw (0,0) grid (16,2);
 								\node at (0.5, 1.5) {A};
 								\node at (1.5, 1.5) {C};
 								\node at (2.5, 1.5) {B};
 								\node at (3.5, 1.5) {A};
 								\node at (4.5, 1.5) {C};
 								\node at (5.5, 1.5) {D};
 								\node at (6.5, 1.5) {A};
 								\node at (7.5, 1.5) {C};
 								\node at (8.5, 1.5) {B};
 								\node at (9.5, 1.5) {A};
 								\node at (10.5, 1.5) {C};
 								\node at (11.5, 1.5) {B};
 								\node at (12.5, 1.5) {A};
 								\node at (13.5, 1.5) {C};
 								\node at (14.5, 1.5) {D};
 								\node at (15.5, 1.5) {A};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {0};
 								\node at (2.5, 0.5) {0};
 								\node at (3.5, 0.5) {2};
 								\node at (4.5, 0.5) {0};
 								\node at (5.5, 0.5) {0};
 								\node at (6.5, 0.5) {5};
 								\node at (7.5, 0.5) {0};
 								\node at (8.5, 0.5) {0};
 								\node at (9.5, 0.5) {7};
 								\node at (10.5, 0.5) {?};
 								\node at (11.5, 0.5) {?};
 								\node at (12.5, 0.5) {?};
 								\node at (13.5, 0.5) {?};
 								\node at (14.5, 0.5) {?};
 								\node at (15.5, 0.5) {?};
 								\draw [decoration={brace}, decorate, line width=0.5mm] (9,3.00) -- (16,3.00);
 								\node at (9.5,3.50) {$x$};
 								\node at (15.5,3.50) {$y$};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
 								\node at (14.5, 2.5) {14};
 								\node at (15.5, 2.5) {15};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								% \draw[thick,<->] (9.5,-0.25) .. controls (9,-1.25) and (4,-1.25) .. (3.5,-0.25);
 								\end{tikzpicture}
 								\end{center}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								After this, all the remaining Z-array values
 								can be determined by using the information
 								already stored in the Z-array:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\draw (0,0) grid (16,2);
 								\node at (0.5, 1.5) {A};
 								\node at (1.5, 1.5) {C};
 								\node at (2.5, 1.5) {B};
 								\node at (3.5, 1.5) {A};
 								\node at (4.5, 1.5) {C};
 								\node at (5.5, 1.5) {D};
 								\node at (6.5, 1.5) {A};
 								\node at (7.5, 1.5) {C};
 								\node at (8.5, 1.5) {B};
 								\node at (9.5, 1.5) {A};
 								\node at (10.5, 1.5) {C};
 								\node at (11.5, 1.5) {B};
 								\node at (12.5, 1.5) {A};
 								\node at (13.5, 1.5) {C};
 								\node at (14.5, 1.5) {D};
 								\node at (15.5, 1.5) {A};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {0};
 								\node at (2.5, 0.5) {0};
 								\node at (3.5, 0.5) {2};
 								\node at (4.5, 0.5) {0};
 								\node at (5.5, 0.5) {0};
 								\node at (6.5, 0.5) {5};
 								\node at (7.5, 0.5) {0};
 								\node at (8.5, 0.5) {0};
 								\node at (9.5, 0.5) {7};
 								\node at (10.5, 0.5) {0};
 								\node at (11.5, 0.5) {0};
 								\node at (12.5, 0.5) {2};
 								\node at (13.5, 0.5) {0};
 								\node at (14.5, 0.5) {0};
 								\node at (15.5, 0.5) {1};
 								\draw [decoration={brace}, decorate, line width=0.5mm] (9,3.00) -- (16,3.00);
 								\node at (9.5,3.50) {$x$};
 								\node at (15.5,3.50) {$y$};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
 								\node at (14.5, 2.5) {14};
 								\node at (15.5, 2.5) {15};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\end{tikzpicture}
 								\end{center}
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								\subsubsection{Using the Z-array}
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								It is often a matter of taste whether to use
 								string hashing or the Z-algorithm.
 								Unlike hashing, the Z-algorithm always works
 								and there is no risk for collisions.
 								On the other hand, the Z-algorithm is more difficult
 								to implement and some problems can only be solved
 								using hashing.
 								As an example, consider again
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								the pattern matching problem,
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								where our task is to find the occurrences
 								of a pattern $p$ in a string $s$.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								We already solved this problem efficiently
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								using string hashing, but the Z-algorithm
 								provides another way to solve the problem.
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								A usual idea in string processing is to
 								construct a string that consists of
 								multiple strings separated by special characters.
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								In this problem, we can construct a string
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								$p$\texttt{\#}$s$,
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								where $p$ and $s$ are separated by a special
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								character \texttt{\#} that does not occur
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								in the strings.
-												Corrections

											
										
										
											2017-02-18 17:27:38 +01:00
+								The Z-array of $p$\texttt{\#}$s$ tells us the positions
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								where $p$ occurs in $s$,
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								because such positions contain the length of $p$.
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT},
 								the Z-array is as follows:
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
 								\begin{center}
 								\begin{tikzpicture}[scale=0.7]
 								\draw (0,0) grid (14,2);
 								\node at (0.5, 1.5) {A};
 								\node at (1.5, 1.5) {T};
 								\node at (2.5, 1.5) {T};
 								\node at (3.5, 1.5) {\#};
 								\node at (4.5, 1.5) {H};
 								\node at (5.5, 1.5) {A};
 								\node at (6.5, 1.5) {T};
 								\node at (7.5, 1.5) {T};
 								\node at (8.5, 1.5) {I};
 								\node at (9.5, 1.5) {V};
 								\node at (10.5, 1.5) {A};
 								\node at (11.5, 1.5) {T};
 								\node at (12.5, 1.5) {T};
 								\node at (13.5, 1.5) {I};
 								\node at (0.5, 0.5) {--};
 								\node at (1.5, 0.5) {0};
 								\node at (2.5, 0.5) {0};
 								\node at (3.5, 0.5) {0};
 								\node at (4.5, 0.5) {0};
 								\node at (5.5, 0.5) {3};
 								\node at (6.5, 0.5) {0};
 								\node at (7.5, 0.5) {0};
 								\node at (8.5, 0.5) {0};
 								\node at (9.5, 0.5) {0};
 								\node at (10.5, 0.5) {3};
 								\node at (11.5, 0.5) {0};
 								\node at (12.5, 0.5) {0};
 								\node at (13.5, 0.5) {0};
 								\footnotesize
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								\node at (0.5, 2.5) {0};
 								\node at (1.5, 2.5) {1};
 								\node at (2.5, 2.5) {2};
 								\node at (3.5, 2.5) {3};
 								\node at (4.5, 2.5) {4};
 								\node at (5.5, 2.5) {5};
 								\node at (6.5, 2.5) {6};
 								\node at (7.5, 2.5) {7};
 								\node at (8.5, 2.5) {8};
 								\node at (9.5, 2.5) {9};
 								\node at (10.5, 2.5) {10};
 								\node at (11.5, 2.5) {11};
 								\node at (12.5, 2.5) {12};
 								\node at (13.5, 2.5) {13};
-												First commit

											
										
										
											2016-12-28 23:54:51 +01:00
+								\end{tikzpicture}
 								\end{center}
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
-												Use 0-indexing also in strings [closes #27]

											
										
										
											2017-04-17 17:20:20 +02:00
+								The positions 5 and 10 contain the value 3,
-												Corrections

											
										
										
											2017-02-11 18:11:50 +01:00
+								which means that the pattern \texttt{ATT}
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
+								occurs in the corresponding positions
-												Improve language

											
										
										
											2017-05-12 20:15:25 +02:00
+								of \texttt{HATTIVATTI}.
-												Chapter 26 first version

											
										
										
											2017-01-24 21:45:47 +01:00
 								The time complexity of the resulting algorithm
-												Small improvements

											
										
										
											2017-04-21 22:19:29 +02:00
+								is linear, because it suffices to construct
-												Add Z-algorithm implementation [closes #20]

											
										
										
											2017-04-17 17:26:00 +02:00
+								the Z-array and go through its values.
 								\subsubsection{Implementation}
 								Here is a short implementation of the Z-algorithm
 								that returns a vector that corresponds to the Z-array.
 								\begin{lstlisting}
 								vector<int> z(string s) {
 								    int n = s.size();
 								    vector<int> z(n);
 								    int x = 0, y = 0;
 								    for (int i = 1; i < n; i++) {
 								        z[i] = max(0,min(z[i-x],y-i+1));
 								        while (i+z[i] < n && s[z[i]] == s[i+z[i]]) {
 								            x = i; y = i+z[i]; z[i]++;
 								        }
 								    }
 								    return z;
 								}
 								\end{lstlisting}