Corrections

Antti H S Laaksonen 2017-02-11 19:11:50 +02:00
parent f728fdc84f
commit d858bb3c42
1 changed files with 285 additions and 209 deletions


@ -1,11 +1,35 @@
\chapter{String algorithms} \chapter{String algorithms}
\index{string} This chapter deals with efficient algorithms
\index{alphabet} for processing strings.
Many string problems can be easily solved
in $O(n^2)$ time, but the challenge is to
find algorithms that work in $O(n)$ or $O(n \log n)$
time and can process long strings.
A string $s$ of length $n$ \index{pattern matching}
is a sequence of characters
$s[1],s[2],\ldots,s[n]$. For example, a fundamental problem related to strings
is the \key{pattern matching} problem:
given a string of length $n$ and a pattern of length $m$,
our task is to find the positions where the pattern
occurs in the string.
For example, the pattern \texttt{ABC} occurs two
times in the string \texttt{ABABCBABC}.
The pattern matching problem is easy to solve
in $O(nm)$ time by a brute force algorithm that
goes through all positions where the pattern may
occur in the string.
However, in this chapter, we will see that there
are more efficient algorithms that require only
$O(n+m)$ time.
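The brute force approach described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `findOccurrencesBruteForce` is hypothetical, and the returned positions are 1-indexed only to match the notation $s[1],s[2],\ldots,s[n]$ used in this chapter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Brute force pattern matching in O(nm) time.
// Returns the 1-indexed starting positions of `pattern` in `text`.
std::vector<int> findOccurrencesBruteForce(const std::string& text,
                                           const std::string& pattern) {
    assert(!pattern.empty());  // assumption: the pattern has at least one character
    std::vector<int> positions;
    const std::size_t n = text.size(), m = pattern.size();
    // Try every position where the pattern could start.
    for (std::size_t start = 0; start + m <= n; start++) {
        std::size_t k = 0;
        // Compare character by character until a mismatch or the pattern ends.
        while (k < m && text[start + k] == pattern[k]) k++;
        if (k == m) positions.push_back(static_cast<int>(start) + 1);
    }
    return positions;
}
```

For the pattern \texttt{ABC} and the string \texttt{ABABCBABC}, this sketch reports the positions 3 and 7.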
\index{string}
\section{Terminology}
\index{alphabet}
An \key{alphabet} is a set of characters An \key{alphabet} is a set of characters
that may appear in strings. that may appear in strings.
@ -15,76 +39,73 @@ consists of the capital letters of English.
\index{substring} \index{substring}
A \key{substring} consists of consecutive A \key{substring} is a sequence of consecutive
characters in a string. characters of a string.
The number of substrings in a string is $n(n+1)/2$. The number of substrings of a string is $n(n+1)/2$.
For example, \texttt{ORITH} is a substring For example, the substrings of the string
in \texttt{ALGORITHM}, and it corresponds \texttt{ABCD} are
to \texttt{ALG\underline{ORITH}M}. \texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
\texttt{AB}, \texttt{BC}, \texttt{CD},
\texttt{ABC}, \texttt{BCD} and \texttt{ABCD}.
\index{subsequence} \index{subsequence}
A \key{subsequence} is a subset of characters A \key{subsequence} is a sequence of
in a string in their original order. (not necessarily consecutive) characters
The number of subsequences in a string is $2^n-1$. of a string in their original order.
For example, \texttt{LGRHM} is a subsequence The number of subsequences of a string is $2^n-1$.
in \texttt{ALGORITHM}, and it corresponds For example, the subsequences of the string
to \texttt{A\underline{LG}O\underline{R}IT\underline{HM}}. \texttt{ABCD} are
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
\texttt{AB}, \texttt{AC}, \texttt{AD},
\texttt{BC}, \texttt{BD}, \texttt{CD},
\texttt{ABC}, \texttt{ABD}, \texttt{ACD},
\texttt{BCD} and \texttt{ABCD}.
\index{prefix} \index{prefix}
\index{suffix} \index{suffix}
A \key{prefix} is a substring that contains the first A \key{prefix} is a substring that starts at the beginning
character of a string, of a string,
and a \key{suffix} is a substring that contains the last character. and a \key{suffix} is a substring that ends at the end
For example, the prefixes of of a string.
\texttt{STORY} are \texttt{S}, \texttt{ST}, For example, for the string \texttt{ABCD},
\texttt{STO}, \texttt{STOR} and \texttt{STORY}, the prefixes are
and the suffixes are \texttt{Y}, \texttt{RY}, \texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD}
\texttt{ORY}, \texttt{TORY} and \texttt{STORY}. and the suffixes are
A prefix or a suffix is \key{proper} \texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
if it is not the whole string.
\index{rotation} \index{rotation}
A \key{rotation} can be generated by moving A \key{rotation} can be generated by moving
characters one by one from the beginning to the end characters one by one from the beginning
in a string (or vice versa). to the end of a string (or vice versa).
For example, the rotations of \texttt{STORY} are For example, the rotations of the string
\texttt{STORY}, \texttt{ABCD} are
\texttt{TORYS}, \texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
\texttt{ORYST},
\texttt{RYSTO} and
\texttt{YSTOR}.
\index{period} \index{period}
A \key{period} is a prefix of a string such that A \key{period} is a prefix of a string such that
we can construct the string by repeating the period. the string can be constructed by repeating the period.
The last repetition may be partial and contain The last repetition may be partial and contain
only a prefix of the period. only a prefix of the period.
Often it is interesting to find the \key{shortest period}
of a string.
For example, the shortest period of For example, the shortest period of
\texttt{ABCABCA} is \texttt{ABC}. \texttt{ABCABCA} is \texttt{ABC}.
In this case, we first repeat the period twice
and then partially.
\index{border} \index{border}
A \key{border} is a string that is both A \key{border} is a string that is both
a prefix and a suffix of a string. a prefix and a suffix of a string.
For example, the borders for \texttt{ABADABA} For example, the borders of the string \texttt{ABACABA}
are \texttt{A}, \texttt{ABA} and \texttt{ABADABA}. are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
Often we want to find the \key{longest border}
that is not the whole string.
\index{lexicographical order} \index{lexicographical order}
Usually we compare strings using the \key{lexicographical order} Strings are usually compared using the \key{lexicographical order}
that corresponds to the alphabetical order. that corresponds to the alphabetical order.
It means that $x<y$ if either $x$ is a proper prefix of $y$, It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
or there is an index $k$ such that or there is a position $k$ such that
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$. $x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
\section{Trie structure} \section{Trie structure}
@ -93,15 +114,13 @@ $x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
A \key{trie} is a tree structure that A \key{trie} is a tree structure that
maintains a set of strings. maintains a set of strings.
Strings are stored in a trie as chains Each string in a trie corresponds to
of characters that start at the root a chain of characters starting at
of the tree. the root node.
If two strings have a common prefix, If two strings have a common prefix,
they also share a chain in the tree. they also have a common chain in the tree.
For example, the following trie corresponds For example, consider the following trie:
to the set
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.9] \begin{tikzpicture}[scale=0.9]
@ -133,36 +152,40 @@ $\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$:
\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13); \path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13);
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
This trie corresponds to the set
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
The character * in a node means that The character * in a node means that
a string ends at the node. one of the strings in the set ends at the node.
This character is needed because a string This character is needed, because a string
may be a prefix of another string. may be a prefix of another string.
For example, in this trie, \texttt{THE} For example, in this trie, \texttt{THE}
is a suffix of \texttt{THERE}. is a prefix of \texttt{THERE}.
Inserting and searching a string in a trie take $O(n)$ time We can check if a trie contains a string
where $n$ is the length of the string. in $O(n)$ time where $n$ is the length of the string,
Both operations can be implemented by because we can follow the chain that starts at the root node.
starting at the root node and following the We can also add a new string to the trie
chain of characters that appear in the string. in $O(n)$ time using a similar idea.
If needed, new nodes will be added to the trie. If needed, new nodes will be added to the trie.
Tries can be used for searching both strings Using a trie, we can also find the longest prefix
and prefixes of strings. of a string that belongs to the set.
In addition, it is possible to calculate numbers In addition, by storing additional information
of strings that correspond to each prefix, in each node,
which can be useful in some applications. it is possible to calculate the number of
strings that have a given prefix.
A trie can be stored as an array A trie can be stored in an array
\begin{lstlisting} \begin{lstlisting}
int t[N][A]; int t[N][A];
\end{lstlisting} \end{lstlisting}
where $N$ is the maximum number of nodes where $N$ is the maximum number of nodes
(the total length of the string to be stored) (the maximum total length of the strings in the set)
and $A$ is the size of the alphabet. and $A$ is the size of the alphabet.
The nodes of a trie are numbered The nodes of a trie are numbered
$1,2,3,\ldots$ so that the number of the root is 1, $1,2,3,\ldots$ so that the number of the root is 1,
and $\texttt{t}[s][c]$ is the next node in chain and $\texttt{t}[s][c]$ is the next node in the chain
from node $s$ using character $c$. from node $s$ using character $c$.
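The array representation and the two $O(n)$ operations can be sketched as follows. This is one possible realization under stated assumptions: the alphabet is taken to be the capital letters A to Z, the value 0 is used to mean that no child exists, and the names `Trie`, `child` and `ends` are hypothetical. A growing vector is used instead of a fixed array `t[N][A]` only to keep the sketch self-contained.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Assumption: the alphabet consists of the capital letters A-Z.
constexpr int ALPHABET = 26;

struct Trie {
    // child[s][c] is the next node from node s using character c, or 0 if absent.
    // Index 0 is unused so that the root is node 1, as in the text.
    std::vector<std::array<int, ALPHABET>> child;
    std::vector<bool> ends;  // ends[s] plays the role of the * marker

    Trie() : child(2, std::array<int, ALPHABET>{}), ends(2, false) {}

    static int index(char ch) {
        assert('A' <= ch && ch <= 'Z');
        return ch - 'A';
    }

    // Adds a string in O(n) time, creating new nodes when needed.
    void insert(const std::string& word) {
        int node = 1;
        for (char ch : word) {
            const int c = index(ch);
            if (child[node][c] == 0) {
                child.push_back(std::array<int, ALPHABET>{});
                ends.push_back(false);
                child[node][c] = static_cast<int>(child.size()) - 1;
            }
            node = child[node][c];
        }
        ends[node] = true;
    }

    // Checks in O(n) time whether the set contains the string.
    bool contains(const std::string& word) const {
        int node = 1;
        for (char ch : word) {
            node = child[node][index(ch)];
            if (node == 0) return false;
        }
        return ends[node];
    }
};
```

After inserting \texttt{CANAL}, \texttt{CANDY}, \texttt{THE} and \texttt{THERE}, the call `contains("THE")` succeeds, while `contains("THER")` fails because no string ends at that node.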
\section{String hashing} \section{String hashing}
@ -173,7 +196,7 @@ from node $s$ using character $c$.
\key{String hashing} is a technique that \key{String hashing} is a technique that
allows us to efficiently check whether two allows us to efficiently check whether two
substrings in a string are equal. substrings in a string are equal.
The idea is to compare hash values of the The idea is to compare the hash values of the
substrings instead of their individual characters. substrings instead of their individual characters.
\subsubsection*{Calculating hash values} \subsubsection*{Calculating hash values}
@ -190,7 +213,7 @@ which makes it possible to compare strings
based on their hash values. based on their hash values.
A usual way to implement string hashing A usual way to implement string hashing
is to use polynomial hashing, which means is polynomial hashing, which means
that the hash value is calculated using the formula that the hash value is calculated using the formula
\[(c[1] A^{n-1} + c[2] A^{n-2} + \cdots + c[n] A^0) \bmod B ,\] \[(c[1] A^{n-1} + c[2] A^{n-2} + \cdots + c[n] A^0) \bmod B ,\]
where $c[1],c[2],\ldots,c[n]$ where $c[1],c[2],\ldots,c[n]$
@ -218,7 +241,7 @@ in the string \texttt{ALLEY} are:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
If $A=3$ and $B=97$, the hash value Thus, if $A=3$ and $B=97$, the hash value
for the string \texttt{ALLEY} is for the string \texttt{ALLEY} is
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\] \[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
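The calculation above can be checked with a short function. This is a sketch that evaluates the polynomial using Horner's rule, which gives the same value as the formula; the name `polynomialHash` is hypothetical, and it is assumed that $A$ and $B$ are small enough for the intermediate products to fit in \texttt{long long}.

```cpp
#include <cassert>
#include <string>

// Polynomial hash (c[1] A^{n-1} + ... + c[n] A^0) mod B,
// evaluated from left to right with Horner's rule.
long long polynomialHash(const std::string& s, long long A, long long B) {
    assert(A > 0 && B > 0);
    long long hash = 0;
    for (unsigned char c : s) {
        hash = (hash * A + c) % B;  // shift previous characters by one power of A
    }
    return hash;
}
```

With $A=3$ and $B=97$, the call `polynomialHash("ALLEY", 3, 97)` yields 52, in agreement with the example.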
@ -232,8 +255,8 @@ we can calculate the hash value of any substring
in $O(1)$ time after an $O(n)$ time preprocessing. in $O(1)$ time after an $O(n)$ time preprocessing.
The idea is to construct an array $h$ such that The idea is to construct an array $h$ such that
$h[k]$ contains the hash value for the prefix $h[k]$ contains the hash value of the prefix
of the string that ends at index $k$. of the string that ends at position $k$.
The array values can be recursively calculated as follows: The array values can be recursively calculated as follows:
\[ \[
\begin{array}{lcl} \begin{array}{lcl}
@ -250,9 +273,8 @@ p[k] & = & (p[k-1] A) \bmod B. \\
\end{array} \end{array}
\] \]
Constructing these arrays takes $O(n)$ time. Constructing these arrays takes $O(n)$ time.
After this, the hash value for a substring After this, the hash value of a substring
of the string that begins at position $a$ and ends at position $b$
that begins at index $a$ and ends at index $b$
can be calculated in $O(1)$ time using the formula can be calculated in $O(1)$ time using the formula
\[(h[b]-h[a-1] p[b-a+1]) \bmod B.\] \[(h[b]-h[a-1] p[b-a+1]) \bmod B.\]
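A possible implementation of the preprocessing and the $O(1)$ query is sketched below. It assumes the base cases $h[0]=0$ and $p[0]=1$, which are consistent with the formula above, and uses 1-indexed positions as in the text. The name `PrefixHasher` and the default constants are illustrative choices. Since the `%` operator in C++ may return a negative value, the result is adjusted to the range $0 \ldots B-1$.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct PrefixHasher {
    long long A, B;
    std::vector<long long> h;  // h[k] = hash value of the prefix ending at position k
    std::vector<long long> p;  // p[k] = A^k mod B

    // Assumption: B is at most about 10^9, so products fit in long long.
    explicit PrefixHasher(const std::string& s,
                          long long base = 911382323,
                          long long mod = 972663749)
        : A(base), B(mod), h(s.size() + 1, 0), p(s.size() + 1, 1) {
        for (std::size_t k = 1; k <= s.size(); k++) {
            h[k] = (h[k - 1] * A + static_cast<unsigned char>(s[k - 1])) % B;
            p[k] = (p[k - 1] * A) % B;
        }
    }

    // Hash value of the substring from position a to position b (1-indexed, inclusive).
    long long substringHash(int a, int b) const {
        assert(1 <= a && a <= b && b < static_cast<int>(h.size()));
        long long value = (h[b] - h[a - 1] * p[b - a + 1]) % B;
        return value < 0 ? value + B : value;  // make the result non-negative
    }
};
```

For example, for the string \texttt{ABCABC} the hash values of the substrings at positions $[1,3]$ and $[4,6]$ are equal, because both substrings are \texttt{ABC}.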
@ -268,16 +290,15 @@ the strings are \emph{certainly} different.
Using hashing, we can often make a brute force Using hashing, we can often make a brute force
algorithm efficient. algorithm efficient.
As an example, let's consider a brute force As an example, consider the pattern matching problem:
algorithm that calculates how many times given a string $s$ and a pattern $p$,
a string $p$ occurs as a substring in find the positions where $p$ occurs in $s$.
a string $s$. A brute force algorithm goes through all positions
The algorithm goes through all locations where $p$ may occur, and compares the strings
where $p$ can occur, and compares the strings
character by character. character by character.
The time complexity of such an algorithm is $O(n^2)$. The time complexity of such an algorithm is $O(n^2)$.
However, we can make the algorithm more efficient We can make the brute force algorithm more efficient
using hashing, because the algorithm compares using hashing, because the algorithm compares
substrings of strings. substrings of strings.
Using hashing, each comparison only takes $O(1)$ time, Using hashing, each comparison only takes $O(1)$ time,
@ -286,23 +307,24 @@ This results in an algorithm with time complexity $O(n)$,
which is the best possible time complexity for this problem. which is the best possible time complexity for this problem.
By combining hashing and \emph{binary search}, By combining hashing and \emph{binary search},
it is also possible to check the lexicographic order of it is also possible to find out the lexicographic order of
two strings in logarithmic time. two strings in logarithmic time.
This can be done by finding out the length This can be done by calculating the length
of the common prefix of the strings using binary search. of the common prefix of the strings using binary search.
Once we know the common prefix, Once we know the length of the common prefix,
the next character after the prefix we can just check the next character after the prefix,
indicates the order of the strings. because this determines the order of the strings.
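The combination of hashing and binary search can be sketched as follows. To keep the example short, it compares two whole strings using only their prefix hash values; comparing arbitrary substrings would use the substring hash formula in the same way. The names `prefixHashes` and `compareWithHashes` are hypothetical, and, as with all hashing, the result may be wrong if a collision occurs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const long long HASH_A = 911382323, HASH_B = 972663749;

// h[k] = hash value of the prefix of length k (h[0] = 0).
std::vector<long long> prefixHashes(const std::string& s) {
    std::vector<long long> h(s.size() + 1, 0);
    for (std::size_t k = 1; k <= s.size(); k++) {
        h[k] = (h[k - 1] * HASH_A + static_cast<unsigned char>(s[k - 1])) % HASH_B;
    }
    return h;
}

// Returns -1 if x < y, 0 if x = y and 1 if x > y, in O(log n) time
// after the prefix hashes hx and hy have been calculated.
int compareWithHashes(const std::string& x, const std::vector<long long>& hx,
                      const std::string& y, const std::vector<long long>& hy) {
    assert(hx.size() == x.size() + 1 && hy.size() == y.size() + 1);
    // Binary search for the length of the common prefix: prefixes of
    // length mid are assumed equal exactly when their hash values are equal.
    std::size_t lo = 0, hi = std::min(x.size(), y.size());
    while (lo < hi) {
        const std::size_t mid = (lo + hi + 1) / 2;
        if (hx[mid] == hy[mid]) lo = mid;
        else hi = mid - 1;
    }
    // If one string is a prefix of the other, the shorter one is smaller.
    if (lo == x.size() || lo == y.size()) {
        if (x.size() == y.size()) return 0;
        return x.size() < y.size() ? -1 : 1;
    }
    // Otherwise the next character after the common prefix decides the order.
    return static_cast<unsigned char>(x[lo]) < static_cast<unsigned char>(y[lo]) ? -1 : 1;
}
```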
\subsubsection*{Collisions and parameters} \subsubsection*{Collisions and parameters}
\index{collision} \index{collision}
An evident risk in comparing hash values is An evident risk when comparing hash values is
\key{collision}, which means that two strings have a \key{collision}, which means that two strings have
different contents but equal hash values. different contents but equal hash values.
In this case, based on the hash values it seems that In this case, an algorithm that relies on
the strings are equal, but in reality they aren't, the hash values concludes that the strings are equal,
but in reality they are not,
and the algorithm may give incorrect results. and the algorithm may give incorrect results.
Collisions are always possible, Collisions are always possible,
@ -310,49 +332,41 @@ because the number of different strings is larger
than the number of different hash values. than the number of different hash values.
However, the probability of a collision is small However, the probability of a collision is small
if the constants $A$ and $B$ are carefully chosen. if the constants $A$ and $B$ are carefully chosen.
There are two goals: the hash values should be A usual way is to choose random constants
evenly distributed for the strings, near $10^9$, for example as follows:
and the number of different hash values should
be large enough.
A good solution is to use large random numbers
as constants.
A usual way is to choose constants that are
near $10^9$, for example
\[ \[
\begin{array}{lcl} \begin{array}{lcl}
A & = & 911382323 \\ A & = & 911382323 \\
B & = & 972663749 \\ B & = & 972663749 \\
\end{array} \end{array}
\] \]
This choice ensures that the hash values
are distributed evenly enough in the range $0 \ldots B-1$.
The benefit in $10^9$ is that
the \texttt{long long} type can be used
for calculating the hash values,
because the products $AB$ and $BB$ fit in \texttt{long long}.
But is it enough to have $10^9$ different hash values?
Let's consider three scenarios where hashing can be used: Using such constants,
the \texttt{long long} type can be used
when calculating the hash values,
because the products $AB$ and $BB$ will fit in \texttt{long long}.
But is it enough to have about $10^9$ different hash values?
Let us consider three scenarios where hashing can be used:
\textit{Scenario 1:} Strings $x$ and $y$ are compared with \textit{Scenario 1:} Strings $x$ and $y$ are compared with
each other. each other.
The probability of a collision is $1/B$ assuming that The probability of a collision is $1/B$ assuming that
all hash values are equally probable. all hash values are equally probable.
\textit{Tapaus 2:} A string $x$ is compared with strings \textit{Scenario 2:} A string $x$ is compared with strings
$y_1,y_2,\ldots,y_n$. $y_1,y_2,\ldots,y_n$.
The probability for one or more collisions is The probability of one or more collisions is
\[1-(1-1/B)^n.\] \[1-(1-\frac{1}{B})^n.\]
\textit{Tapaus 3:} Strings $x_1,x_2,\ldots,x_n$ \textit{Scenario 3:} Strings $x_1,x_2,\ldots,x_n$
are compared with each other. are compared with each other.
The probability for one or more collisions is The probability of one or more collisions is
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\] \[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
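The formulas for scenarios 2 and 3 can be evaluated numerically with the following sketch. It assumes, like the formulas, that all hash values are equally probable. The function names are hypothetical, and logarithms are used only to avoid rounding errors when $1/B$ is very small.

```cpp
#include <cassert>
#include <cmath>

// Scenario 2: a string is compared with n other strings.
// Evaluates 1 - (1 - 1/B)^n.
double collisionOneAgainstMany(double n, double B) {
    assert(n >= 0 && B > 1);
    return -std::expm1(n * std::log1p(-1.0 / B));
}

// Scenario 3: n strings are all compared with each other.
// Evaluates 1 - B (B-1) (B-2) ... (B-n+1) / B^n.
double collisionAllPairs(long long n, double B) {
    assert(n >= 0 && B >= 1);
    if (n > B) return 1.0;  // more strings than hash values: a collision is certain
    double logNoCollision = 0;
    for (long long i = 0; i < n; i++) {
        logNoCollision += std::log1p(-i / B);  // log of the factor (B - i) / B
    }
    return -std::expm1(logNoCollision);
}
```

For instance, `collisionOneAgainstMany(1e6, 1e9)` is approximately 0.001, whereas `collisionAllPairs(1000000, 1e9)` is practically 1. With $n=23$ and $B=365$, scenario 3 gives a probability of about 0.507, which is the classical birthday paradox.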
The following table shows the collision probabilities The following table shows the collision probabilities
when the value of $B$ varies and $n=10^6$: when $n=10^6$ and the value of $B$ varies:
\begin{center} \begin{center}
\begin{tabular}{rrrr} \begin{tabular}{rrrr}
@ -384,12 +398,12 @@ in a room, the probability that some two people
have the same birthday is large even if $n$ is quite small. have the same birthday is large even if $n$ is quite small.
In hashing, correspondingly, when all hash values are compared In hashing, correspondingly, when all hash values are compared
with each other, the probability that some two with each other, the probability that some two
hash values are the same is large. hash values are equal is large.
A good way to make the probability of a collision We can make the probability of a collision
smaller is to calculate \emph{multiple} hash values smaller by calculating \emph{multiple} hash values
using different parameters. using different parameters.
It is very unlikely that a collision would occur It is unlikely that a collision would occur
in all hash values at the same time. in all hash values at the same time.
For example, two hash values with parameter For example, two hash values with parameter
$B \approx 10^9$ correspond to one hash $B \approx 10^9$ correspond to one hash
@ -401,37 +415,25 @@ which is convenient, because operations with 32 and 64
bit integers are calculated modulo $2^{32}$ and $2^{64}$. bit integers are calculated modulo $2^{32}$ and $2^{64}$.
However, this is not a good choice, because it is possible However, this is not a good choice, because it is possible
to construct inputs that always generate collisions when to construct inputs that always generate collisions when
constants of the form $2^x$ are used\footnote{ constants of the form $2^x$ are used.
J. Pachocki and Jakub Radoszewski: % \footnote{
''Where to use and how not to use polynomial string hashing''. % J. Pachocki and Jakub Radoszewski:
\textit{Olympiads in Informatics}, 2013. % ''Where to use and how not to use polynomial string hashing''.
}. % \textit{Olympiads in Informatics}, 2013.
% }.
\section{Z-algorithm} \section{Z-algorithm}
\index{Z-algorithm} \index{Z-algorithm}
\index{Z-array} \index{Z-array}
The \key{Z-algorithm} generates a \key{Z-array} The \key{Z-array} of a string
for the string, that contains for each index $k$ contains for each position $k$ in the string
in the string the length of the longest substring the length of the longest substring
that begins at index $k$ and is a prefix of the string. that begins at position $k$ and is a prefix of the string.
Many string problems can be efficiently solved Such an array can be efficiently constructed
using the Z-algorithm. using the \key{Z-algorithm}.
It is often a matter of taste whether to use
the Z-algorithm or string hashing.
Unlike hashing, the Z-algorithm always works
and there is no risk for collisions.
On the other hand, the Z-algorithm is more difficult
to implement and some problems can only be solved
using hashing.
\subsubsection*{Description}
The Z-algorithm constructs a Z-array that
indicates for each position the length of the
longest substring that is also a prefix of the string.
For example, the Z-array for the string For example, the Z-array for the string
\texttt{ACBACDACBACBACDA} is as follows: \texttt{ACBACDACBACBACDA} is as follows:
@ -494,45 +496,50 @@ For example, the Z-array for the string
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
For example, the position 7 contains the value 5, For example, the value at position 7 in the
above Z-array is 5,
because the substring \texttt{ACBAC} of length 5 because the substring \texttt{ACBAC} of length 5
is a prefix of the string, is a prefix of the string,
but the substring \texttt{ACBACB} of length 6 but the substring \texttt{ACBACB} of length 6
is not a prefix of the string. is not a prefix of the string.
The Z-algorithm scans the string from the left It is often a matter of taste whether to use
to the right, and calculates for each position string hashing or the Z-algorithm.
Unlike hashing, the Z-algorithm always works
and there is no risk of collisions.
On the other hand, the Z-algorithm is more difficult
to implement and some problems can only be solved
using hashing.
\subsubsection*{Algorithm description}
The Z-algorithm scans the string from left
to right, and calculates for each position
the length of the longest substring that the length of the longest substring that
is a prefix of the string. is a prefix of the string.
The algorithm compares the first characters A straightforward algorithm
of the string would have a time complexity of $O(n^2)$,
and the active substring with each other to but the Z-algorithm has an important
find the length of the common prefix.
A straightforward implementation would yield
an algorithm with time complexity $O(n^2)$
because the common prefixes may be long.
However, the Z-algorithm has one important
optimization which ensures that the time complexity optimization which ensures that the time complexity
is only $O(n)$. is only $O(n)$.
The idea is to maintain a range $[x,y]$ such that The idea is to maintain a range $[x,y]$ such that
the substring from $x$ to $y$ is a prefix of the substring from $x$ to $y$ is a prefix of
the string and $y$ is as large as possible. the string and $y$ is as large as possible.
Since the Z-array already contains information Since the Z-array already contains information
about the characters in the range $[x,y]$, about the characters in the range $[x,y]$,
it is not needed to process them again later in the algorithm. we can use this information to calculate
values for elements in the range $[x,y]$.
The time complexity of the Z-algorithm is $O(n)$, The time complexity of the Z-algorithm is $O(n)$,
because the algorithm always compares substrings because the algorithm always compares strings
character by character only from index $y+1$. character by character starting at position $y+1$.
If the characters match, the value of $y$ increases, If the characters match, the value of $y$ increases,
and it is not needed to inspect the character again, and there is no need to compare the character at
position $y$ again,
but the information in the Z-array can be used. but the information in the Z-array can be used.
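The idea of maintaining the range $[x,y]$ can be sketched in code as follows. The sketch uses 0-indexed positions, unlike the 1-indexed positions in the text, and stores 0 at the first position, whose value is not used; both are choices made for this illustration. The function name `zArray` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Constructs the Z-array of s in O(n) time (0-indexed; z[0] is set to 0).
std::vector<int> zArray(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> z(n, 0);
    int x = 0, y = 0;  // s[x..y] is a prefix of s and y is as large as possible
    for (int i = 1; i < n; i++) {
        // Inside the range, reuse the value at the corresponding position i - x,
        // but never look beyond position y.
        if (i <= y) z[i] = std::min(z[i - x], y - i + 1);
        // Compare character by character; this only continues past position y.
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        assert(z[i] <= n - i);  // a match cannot extend past the end of s
        // Update the range if the new match ends further to the right.
        if (i + z[i] - 1 > y) {
            x = i;
            y = i + z[i] - 1;
        }
    }
    return z;
}
```

For the string \texttt{ACBACDACBACBACDA}, the value at index 6 (position 7 in the text) is 5 and the value at index 9 (position 10) is 7.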
\subsubsection*{Example} For example, let us construct the following Z-array:
Let's construct the following Z-array using
the Z-algorithm:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -595,7 +602,8 @@ the Z-algorithm:
The first interesting position is 7 where the The first interesting position is 7 where the
length of the common prefix is 5. length of the common prefix is 5.
The corresponding range in the string is $[7,11]$: After calculating this value,
the current $[x,y]$ range will be $[7,11]$:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -663,14 +671,17 @@ The corresponding range in the string is $[7,11]$:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
The benefit in the range $[7,11]$ is that the Now, it is possible to calculate the
algorithm can calculate the subsequent values subsequent values for the Z-array
for the Z-array more efficiently. more efficiently,
Since the ranges $[1,5]$ and $[7,11]$ contain because we know that
the same characters, also the Z-array will the ranges $[1,5]$ and $[7,11]$
contain similar values. contain the same characters.
First, the values at indices 8 and 9 First, since the values at
correspond to the values at indices 2 and 3: positions 2 and 3 are 0,
we immediately know that
the values at positions 8 and 9
are also 0:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -742,13 +753,9 @@ correspond to the values at indices 2 and 3:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
After this, the value for index 10 can be After this, we know that the value
calculated using the value at index 4. at position 10 will be at least 2,
The value at index 4 is 2, because the value at position 4 is 2:
so the first two characters
in the substring match the beginning of the string.
However, the characters after index $y=11$ have
not been inspected yet.
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -817,13 +824,85 @@ not been inspected yet.
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
The algorithm compares the substring Since we have no information about the characters
beginning at index $y+1=12$ character by character. after position 11, we have to begin to compare the strings
The previous values in the Z-array cannot be used, character by character:
because this is the first time the characters
after index 11 are inspected. \begin{center}
\begin{tikzpicture}[scale=0.7]
\fill[color=lightgray] (9,0) rectangle (10,1);
\fill[color=lightgray] (2,1) rectangle (7,2);
\fill[color=lightgray] (11,1) rectangle (16,2);
\draw (0,0) grid (16,2);
\node at (0.5, 1.5) {A};
\node at (1.5, 1.5) {C};
\node at (2.5, 1.5) {B};
\node at (3.5, 1.5) {A};
\node at (4.5, 1.5) {C};
\node at (5.5, 1.5) {D};
\node at (6.5, 1.5) {A};
\node at (7.5, 1.5) {C};
\node at (8.5, 1.5) {B};
\node at (9.5, 1.5) {A};
\node at (10.5, 1.5) {C};
\node at (11.5, 1.5) {B};
\node at (12.5, 1.5) {A};
\node at (13.5, 1.5) {C};
\node at (14.5, 1.5) {D};
\node at (15.5, 1.5) {A};
\node at (0.5, 0.5) {--};
\node at (1.5, 0.5) {0};
\node at (2.5, 0.5) {0};
\node at (3.5, 0.5) {2};
\node at (4.5, 0.5) {0};
\node at (5.5, 0.5) {0};
\node at (6.5, 0.5) {5};
\node at (7.5, 0.5) {0};
\node at (8.5, 0.5) {0};
\node at (9.5, 0.5) {?};
\node at (10.5, 0.5) {?};
\node at (11.5, 0.5) {?};
\node at (12.5, 0.5) {?};
\node at (13.5, 0.5) {?};
\node at (14.5, 0.5) {?};
\node at (15.5, 0.5) {?};
\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
\node at (6.5,3.50) {$x$};
\node at (10.5,3.50) {$y$};
\footnotesize
\node at (0.5, 2.5) {1};
\node at (1.5, 2.5) {2};
\node at (2.5, 2.5) {3};
\node at (3.5, 2.5) {4};
\node at (4.5, 2.5) {5};
\node at (5.5, 2.5) {6};
\node at (6.5, 2.5) {7};
\node at (7.5, 2.5) {8};
\node at (8.5, 2.5) {9};
\node at (9.5, 2.5) {10};
\node at (10.5, 2.5) {11};
\node at (11.5, 2.5) {12};
\node at (12.5, 2.5) {13};
\node at (13.5, 2.5) {14};
\node at (14.5, 2.5) {15};
\node at (15.5, 2.5) {16};
%\draw[thick,<->] (11.5,-0.25) .. controls (11,-1.25) and (3,-1.25) .. (2.5,-0.25);
\end{tikzpicture}
\end{center}
It turns out that the length of the common It turns out that the length of the common
prefix is 7, and the range $[x,y]$ will be updated: prefix at position 10 is 7,
and thus the new range $[x,y]$ is $[10,16]$:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -892,9 +971,9 @@ prefix is 7, and the range $[x,y]$ will be updated:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
After this, all subsequent values in the Z-array After this, all subsequent values for the Z-array
can be calculated using the information in can be calculated using the values already
the range $[x,y]$. All the remaining values can be stored in the array. All the remaining values can be
directly retrieved from the beginning of the Z-array: directly retrieved from the beginning of the Z-array:
\begin{center} \begin{center}
@ -964,29 +1043,26 @@ directly retrieved from the beginning of the Z-array:
\subsubsection*{Using the Z-array} \subsubsection*{Using the Z-array}
As an example, let's solve a problem As an example, let us once again consider
where our task is to calculate the pattern matching problem,
the number of times a string $p$ where our task is to find the positions
occurs as a substring in a string $s$. where a pattern $p$ occurs in a string $s$.
Previously, we solved this problem We already solved this problem efficiently
using string hashing, but the Z-algorithm using string hashing, but the Z-algorithm
provides another way to solve the problem. provides another way to solve the problem.
A usual idea when using the Z-algorithm A usual idea in string processing is to
is to construct a string that consists of construct a string that consists of
several strings separated by special characters. multiple strings separated by special characters.
In this problem, we can construct a string In this problem, we can construct a string
$p$\texttt{\#}$s$, $p$\texttt{\#}$s$,
where $p$ and $s$ are separated by a special where $p$ and $s$ are separated by a special
character \texttt{\#} that doesn't occur character \texttt{\#} that does not occur
in the strings. in the strings.
After this, the Z-array for the string The Z-array of $p$\texttt{\#}$s$ indicates the positions
$p$\texttt{\#}$s$ indicates the positions where $p$ occurs in $s$,
where $p$ occurs in $s$. because such positions contain the length of $p$.
Such positions are those positions in the Z-array
that contain the length of $p$.
\begin{samepage}
For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT}, For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT},
the Z-array is as follows: the Z-array is as follows:
@ -1041,12 +1117,12 @@ the Z-array is as follows:
\node at (13.5, 2.5) {14}; \node at (13.5, 2.5) {14};
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
\end{samepage}
The positions 6 and 11 contain the value 3, The positions 6 and 11 contain the value 3,
which means that the substring \texttt{ATT} which means that the pattern \texttt{ATT}
occurs in the corresponding positions occurs in the corresponding positions
in the string \texttt{HATTIVATTI}. in the string \texttt{HATTIVATTI}.
The time complexity of the resulting algorithm The time complexity of the resulting algorithm
is $O(n)$, because it suffices to construct and is $O(n)$, because it suffices to construct
go through the Z-array. the Z-array and go through its values.
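The whole approach can be sketched as follows. The sketch assumes that the character \texttt{\#} does not occur in either string and repeats a compact Z-array construction so that it is self-contained. The function names are hypothetical, and the reported positions are 1-indexed positions in $s$.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Z-array construction in O(n) time (0-indexed; z[0] is set to 0).
std::vector<int> zArray(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> z(n, 0);
    int x = 0, y = 0;
    for (int i = 1; i < n; i++) {
        if (i <= y) z[i] = std::min(z[i - x], y - i + 1);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        if (i + z[i] - 1 > y) { x = i; y = i + z[i] - 1; }
    }
    return z;
}

// Returns the 1-indexed positions in `text` where `pattern` occurs.
std::vector<int> findOccurrences(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    // Assumption: the separator '#' does not occur in the strings.
    assert(text.find('#') == std::string::npos);
    assert(pattern.find('#') == std::string::npos);
    const std::string combined = pattern + "#" + text;
    const std::vector<int> z = zArray(combined);
    const int m = static_cast<int>(pattern.size());
    std::vector<int> positions;
    // Index i of the combined string corresponds to position i - m in the text.
    for (int i = m + 1; i < static_cast<int>(combined.size()); i++) {
        if (z[i] == m) positions.push_back(i - m);
    }
    return positions;
}
```

For $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT}, the sketch reports the positions 2 and 7 in $s$, which correspond to the positions 6 and 11 in the string $p$\texttt{\#}$s$.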