Corrections
parent
f728fdc84f
commit
d858bb3c42
494
luku26.tex
|
@ -1,11 +1,35 @@
|
||||||
\chapter{String algorithms}
|
\chapter{String algorithms}
|
||||||
|
|
||||||
\index{string}
|
This chapter deals with efficient algorithms
|
||||||
\index{alphabet}
|
for processing strings.
|
||||||
|
Many string problems can be easily solved
|
||||||
|
in $O(n^2)$ time, but the challenge is to
|
||||||
|
find algorithms that work in $O(n)$ or $O(n \log n)$
|
||||||
|
time and can process long strings.
|
||||||
|
|
||||||
A string $s$ of length $n$
|
\index{pattern matching}
|
||||||
is a sequence of characters
|
|
||||||
$s[1],s[2],\ldots,s[n]$.
|
For example, a fundamental problem related to strings
|
||||||
|
is the \key{pattern matching} problem:
|
||||||
|
given a string of length $n$ and a pattern of length $m$,
|
||||||
|
our task is to find the positions where the pattern
|
||||||
|
occurs in the string.
|
||||||
|
For example, the pattern \texttt{ABC} occurs two
|
||||||
|
times in the string \texttt{ABABCBABC}.
|
||||||
|
|
||||||
|
The pattern matching problem is easy to solve
|
||||||
|
in $O(nm)$ time by a brute force algorithm that
|
||||||
|
goes through all positions where the pattern may
|
||||||
|
occur in the string.
|
||||||
|
However, in this chapter, we will see that there
|
||||||
|
are more efficient algorithms that require only
|
||||||
|
$O(n+m)$ time.
|
||||||
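The brute force approach described above can be sketched as follows. This is a minimal illustration, not the chapter's own code: the function name \texttt{find\_brute\_force} is hypothetical, and positions are reported 1-indexed to match the notation $s[1],s[2],\ldots,s[n]$.

```cpp
#include <string>
#include <vector>

// Returns the 1-indexed positions where pattern p occurs in string s.
// Every start position is tried, and the comparison at each position
// may take up to m steps, so the worst case is O(nm).
std::vector<int> find_brute_force(const std::string& s, const std::string& p) {
    std::vector<int> positions;
    int n = s.size(), m = p.size();
    for (int i = 0; i + m <= n; i++) {
        int j = 0;
        while (j < m && s[i + j] == p[j]) j++;   // compare character by character
        if (j == m) positions.push_back(i + 1);  // whole pattern matched
    }
    return positions;
}
```

For the string \texttt{ABABCBABC} and the pattern \texttt{ABC}, the function reports positions 3 and 7, which are the two occurrences mentioned above.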
|
|
||||||
|
\index{string}
|
||||||
|
|
||||||
|
\section{Terminology}
|
||||||
|
|
||||||
|
\index{alphabet}
|
||||||
|
|
||||||
An \key{alphabet} is a set of characters
|
An \key{alphabet} is a set of characters
|
||||||
that may appear in strings.
|
that may appear in strings.
|
||||||
|
@ -15,76 +39,73 @@ consists of the capital letters of English.
|
||||||
|
|
||||||
\index{substring}
|
\index{substring}
|
||||||
|
|
||||||
A \key{substring} consists of consecutive
|
A \key{substring} is a sequence of consecutive
|
||||||
characters in a string.
|
characters of a string.
|
||||||
The number of substrings in a string is $n(n+1)/2$.
|
The number of substrings of a string is $n(n+1)/2$.
|
||||||
For example, \texttt{ORITH} is a substring
|
For example, the substrings of the string
|
||||||
in \texttt{ALGORITHM}, and it corresponds
|
\texttt{ABCD} are
|
||||||
to \texttt{ALG\underline{ORITH}M}.
|
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
|
||||||
|
\texttt{AB}, \texttt{BC}, \texttt{CD},
|
||||||
|
\texttt{ABC}, \texttt{BCD} and \texttt{ABCD}.
|
||||||
|
|
||||||
\index{subsequence}
|
\index{subsequence}
|
||||||
|
|
||||||
A \key{subsequence} is a subset of characters
|
A \key{subsequence} is a sequence of
|
||||||
in a string in their original order.
|
(not necessarily consecutive) characters
|
||||||
The number of subsequences in a string is $2^n-1$.
|
of a string in their original order.
|
||||||
For example, \texttt{LGRHM} is a subsequece
|
The number of subsequences of a string is $2^n-1$.
|
||||||
in \texttt{ALGORITHM}, and it corresponds
|
For example, the subsequences of the string
|
||||||
to \texttt{A\underline{LG}O\underline{R}IT\underline{HM}}.
|
\texttt{ABCD} are
|
||||||
|
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
|
||||||
|
\texttt{AB}, \texttt{AC}, \texttt{AD},
|
||||||
|
\texttt{BC}, \texttt{BD}, \texttt{CD},
|
||||||
|
\texttt{ABC}, \texttt{ABD}, \texttt{ACD},
|
||||||
|
\texttt{BCD} and \texttt{ABCD}.
|
||||||
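The two counts above can be checked by direct enumeration. The following sketch uses hypothetical function names and assumes a short string (the subsequence enumeration uses a bitmask, so $n$ should stay well below 30). Both functions count by position, so equal strings taken from different positions are listed separately, which is what the formulas $n(n+1)/2$ and $2^n-1$ count.

```cpp
#include <string>
#include <vector>

// All substrings, ordered by length and then by start position.
std::vector<std::string> substrings(const std::string& s) {
    std::vector<std::string> result;
    for (size_t len = 1; len <= s.size(); len++)
        for (size_t start = 0; start + len <= s.size(); start++)
            result.push_back(s.substr(start, len));
    return result;
}

// All non-empty subsequences: bit i of mask decides whether s[i] is kept.
std::vector<std::string> subsequences(const std::string& s) {
    std::vector<std::string> result;
    int n = s.size();
    for (int mask = 1; mask < (1 << n); mask++) {
        std::string t;
        for (int i = 0; i < n; i++)
            if (mask & (1 << i)) t += s[i];
        result.push_back(t);
    }
    return result;
}
```

For \texttt{ABCD}, the first function returns $4 \cdot 5 / 2 = 10$ strings and the second returns $2^4-1 = 15$ strings.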
|
|
||||||
\index{prefix}
|
\index{prefix}
|
||||||
\index{suffix}
|
\index{suffix}
|
||||||
|
|
||||||
A \key{prefix} is a subtring that contains the first
|
A \key{prefix} is a substring that starts at the beginning
|
||||||
character of a string,
|
of a string,
|
||||||
and a \key{suffix} is a substring that contains the last character.
|
and a \key{suffix} is a substring that ends at the end
|
||||||
For example, the prefixes of
|
of a string.
|
||||||
\texttt{STORY} are \texttt{S}, \texttt{ST},
|
For example, for the string \texttt{ABCD},
|
||||||
\texttt{STO}, \texttt{STOR} and \texttt{STORY},
|
the prefixes are
|
||||||
and the suffixes are \texttt{Y}, \texttt{RY},
|
\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD}
|
||||||
\texttt{ORY}, \texttt{TORY} and \texttt{STORY}.
|
and the suffixes are
|
||||||
A prefix or a suffix is \key{proper}
|
\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
|
||||||
if it is not the whole string.
|
|
||||||
|
|
||||||
\index{rotation}
|
\index{rotation}
|
||||||
|
|
||||||
A \key{rotation} can be generated by moving
|
A \key{rotation} can be generated by moving
|
||||||
characters one by one from the beginning to the end
|
characters one by one from the beginning
|
||||||
in a string (or vice versa).
|
to the end of a string (or vice versa).
|
||||||
For example, the rotations of \texttt{STORY} are
|
For example, the rotations of the string
|
||||||
\texttt{STORY},
|
\texttt{ABCD} are
|
||||||
\texttt{TORYS},
|
\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
|
||||||
\texttt{ORYST},
|
|
||||||
\texttt{RYSTO} and
|
|
||||||
\texttt{YSTOR}.
|
|
||||||
|
|
||||||
\index{period}
|
\index{period}
|
||||||
|
|
||||||
A \key{period} is a prefix of a string such that
|
A \key{period} is a prefix of a string such that
|
||||||
we can construct the string by repeating the period.
|
the string can be constructed by repeating the period.
|
||||||
The last repetition may be partial and contain
|
The last repetition may be partial and contain
|
||||||
only a prefix of the period.
|
only a prefix of the period.
|
||||||
Often it is interesting to find the \key{shortest period}
|
|
||||||
of a string.
|
|
||||||
For example, the shortest period of
|
For example, the shortest period of
|
||||||
\texttt{ABCABCA} is \texttt{ABC}.
|
\texttt{ABCABCA} is \texttt{ABC}.
|
||||||
In this case, we first repeat the period twice
|
|
||||||
and then partially.
|
|
||||||
|
|
||||||
\index{border}
|
\index{border}
|
||||||
|
|
||||||
A \key{border} is a string that is both
|
A \key{border} is a string that is both
|
||||||
a prefix and a suffix of a string.
|
a prefix and a suffix of a string.
|
||||||
For example, the borders for \texttt{ABADABA}
|
For example, the borders of the string \texttt{ABACABA}
|
||||||
are \texttt{A}, \texttt{ABA} and \texttt{ABADABA}.
|
are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
|
||||||
Often we want to find the \key{longest border}
|
|
||||||
that is not the whole string.
|
|
||||||
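Borders and periods are closely related: a string of length $n$ has a border of length $k<n$ exactly when its prefix of length $n-k$ is a period. The sketch below finds the borders by checking every length directly and derives the shortest period from the longest border that is not the whole string. The function names are hypothetical, and the direct check takes $O(n^2)$ time, so this is only meant to illustrate the definitions.

```cpp
#include <string>
#include <vector>

// Returns all borders of s, shortest first.
std::vector<std::string> borders(const std::string& s) {
    std::vector<std::string> result;
    size_t n = s.size();
    for (size_t len = 1; len <= n; len++)
        // compare the prefix of length len with the suffix of length len
        if (s.compare(0, len, s, n - len, len) == 0)
            result.push_back(s.substr(0, len));
    return result;
}

// Shortest period = prefix of length n minus the longest proper border.
std::string shortest_period(const std::string& s) {
    std::vector<std::string> b = borders(s);
    size_t longest_proper = b.size() >= 2 ? b[b.size() - 2].size() : 0;
    return s.substr(0, s.size() - longest_proper);
}
```

For \texttt{ABACABA} the borders are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}, and for \texttt{ABCABCA} the longest proper border is \texttt{ABCA}, which gives the shortest period \texttt{ABC}.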
|
|
||||||
\index{lexicographical order}
|
\index{lexicographical order}
|
||||||
|
|
||||||
Usually we compare string using the \key{lexicographical order}
|
Strings are usually compared using the \key{lexicographical order}
|
||||||
that corresponds to the alphabetical order.
|
that corresponds to the alphabetical order.
|
||||||
It means that $x<y$ if either $x$ is a proper prefix of $y$,
|
It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
|
||||||
or there is an index $k$ such that
|
or there is a position $k$ such that
|
||||||
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
|
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
|
||||||
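The definition translates almost directly into code. The following sketch (the name \texttt{lex\_less} is hypothetical) finds the first position $k$ where the strings differ and applies the two cases of the definition. It assumes plain ASCII characters, so that comparing \texttt{char} values agrees with the alphabetical order.

```cpp
#include <algorithm>
#include <string>

// Returns true if x < y in lexicographical order.
bool lex_less(const std::string& x, const std::string& y) {
    size_t common = std::min(x.size(), y.size());
    size_t k = 0;
    while (k < common && x[k] == y[k]) k++;   // skip the common prefix
    if (k == common)
        return x.size() < y.size();           // x is a prefix of y and x != y
    return x[k] < y[k];                       // first differing position decides
}
```

In practice the built-in \texttt{operator<} of \texttt{std::string} gives the same ordering. The explicit version is shown only because the same two cases reappear when strings are compared using hashing.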
|
|
||||||
\section{Trie structure}
|
\section{Trie structure}
|
||||||
|
@ -93,15 +114,13 @@ $x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
|
||||||
|
|
||||||
A \key{trie} is a tree structure that
|
A \key{trie} is a tree structure that
|
||||||
maintains a set of strings.
|
maintains a set of strings.
|
||||||
Strings are stored in a trie as chains
|
Each string in a trie corresponds to
|
||||||
of characters that start at the root
|
a chain of characters starting at
|
||||||
of the tree.
|
the root node.
|
||||||
If two strings have a common prefix,
|
If two strings have a common prefix,
|
||||||
they also share a chain in the tree.
|
they also have a common chain in the tree.
|
||||||
|
|
||||||
For example, the following trie corresponds
|
For example, consider the following trie:
|
||||||
to the set
|
|
||||||
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$:
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.9]
|
\begin{tikzpicture}[scale=0.9]
|
||||||
|
@ -133,36 +152,40 @@ $\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$:
|
||||||
\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13);
|
\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13);
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
|
This trie corresponds to the set
|
||||||
|
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
|
||||||
The character * in a node means that
|
The character * in a node means that
|
||||||
a string ends at the node.
|
one of the strings in the set ends at the node.
|
||||||
This character is needed because a string
|
This character is needed because a string
|
||||||
may be a prefix of another string.
|
may be a prefix of another string.
|
||||||
For example, in this trie, \texttt{THE}
|
For example, in this trie, \texttt{THE}
|
||||||
is a suffix of \texttt{THERE}.
|
is a prefix of \texttt{THERE}.
|
||||||
|
|
||||||
Inserting and searching a string in a trie take $O(n)$ time
|
We can check if a trie contains a string
|
||||||
where $n$ is the length of the string.
|
in $O(n)$ time where $n$ is the length of the string,
|
||||||
Both operations can be implemented by
|
because we can follow the chain that starts at the root node.
|
||||||
starting at the root node and following the
|
We can also add a new string to the trie
|
||||||
chain of characters that appear in the string.
|
in $O(n)$ time using a similar idea.
|
||||||
If needed, new nodes will be added to the trie.
|
If needed, new nodes will be added to the trie.
|
||||||
|
|
||||||
Tries can be used for searching both strings
|
Using a trie, we can also find the longest prefix
|
||||||
and prefixes of strings.
|
of a string that belongs to the set.
|
||||||
In addition, it is possible to calculate numbers
|
In addition, by storing additional information
|
||||||
of strings that correspond to each prefix,
|
in each node,
|
||||||
which can be useful in some applications.
|
it is possible to calculate the number of
|
||||||
|
strings that have a given prefix.
|
||||||
|
|
||||||
A trie can be stored as an array
|
A trie can be stored in an array
|
||||||
\begin{lstlisting}
|
\begin{lstlisting}
|
||||||
int t[N][A];
|
int t[N][A];
|
||||||
\end{lstlisting}
|
\end{lstlisting}
|
||||||
where $N$ is the maximum number of nodes
|
where $N$ is the maximum number of nodes
|
||||||
(the total length of the string to be stored)
|
(the maximum total length of the strings in the set)
|
||||||
and $A$ is the size of the alphabet.
|
and $A$ is the size of the alphabet.
|
||||||
The nodes of a trie are numbered
|
The nodes of a trie are numbered
|
||||||
$1,2,3,\ldots$ so that the number of the root is 1,
|
$1,2,3,\ldots$ so that the number of the root is 1,
|
||||||
and $\texttt{t}[s][c]$ is the next node in chain
|
and $\texttt{t}[s][c]$ is the next node in the chain
|
||||||
from node $s$ using character $c$.
|
from node $s$ using character $c$.
|
||||||
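A minimal sketch of the two operations on this array representation is given below. Several details are assumptions made for the illustration: the alphabet is the capital letters \texttt{A}...\texttt{Z} (so $A=26$), the bound \texttt{N} is chosen arbitrarily, the value 0 in \texttt{t} means that there is no next node, and the array \texttt{is\_end} plays the role of the character * in the picture. The names \texttt{add\_string}, \texttt{contains}, \texttt{is\_end} and \texttt{node\_count} are hypothetical.

```cpp
#include <string>

const int N = 1000;      // assumed bound: total length of the strings + 2
const int A = 26;        // assumed alphabet: capital letters A..Z
int t[N][A];             // t[s][c] = next node from s using c, 0 = none
bool is_end[N];          // is_end[s] = some string of the set ends at node s
int node_count = 1;      // nodes are numbered 1,2,3,... and the root is 1

// Adds string w to the trie, creating new nodes when needed.
void add_string(const std::string& w) {
    int s = 1;
    for (char ch : w) {
        int c = ch - 'A';
        if (t[s][c] == 0) t[s][c] = ++node_count;
        s = t[s][c];
    }
    is_end[s] = true;
}

// Checks whether w belongs to the set by following its chain from the root.
bool contains(const std::string& w) {
    int s = 1;
    for (char ch : w) {
        int c = ch - 'A';
        if (t[s][c] == 0) return false;
        s = t[s][c];
    }
    return is_end[s];
}
```

After adding \texttt{CANAL}, \texttt{CANDY}, \texttt{THE} and \texttt{THERE}, the trie has 13 nodes, \texttt{contains("THE")} is true, and \texttt{contains("THER")} is false because no string of the set ends at that node.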
|
|
||||||
\section{String hashing}
|
\section{String hashing}
|
||||||
|
@ -173,7 +196,7 @@ from node $s$ using character $c$.
|
||||||
\key{String hashing} is a technique that
|
\key{String hashing} is a technique that
|
||||||
allows us to efficiently check whether two
|
allows us to efficiently check whether two
|
||||||
substrings in a string are equal.
|
substrings in a string are equal.
|
||||||
The idea is to compare hash values of the
|
The idea is to compare the hash values of the
|
||||||
substrings instead of their individual characters.
|
substrings instead of their individual characters.
|
||||||
|
|
||||||
\subsubsection*{Calculating hash values}
|
\subsubsection*{Calculating hash values}
|
||||||
|
@ -190,7 +213,7 @@ which makes it possible to compare strings
|
||||||
based on their hash values.
|
based on their hash values.
|
||||||
|
|
||||||
A usual way to implement string hashing
|
A usual way to implement string hashing
|
||||||
is to use polynomial hashing, which means
|
is polynomial hashing, which means
|
||||||
that the hash value is calculated using the formula
|
that the hash value is calculated using the formula
|
||||||
\[(c[1] A^{n-1} + c[2] A^{n-2} + \cdots + c[n] A^0) \bmod B ,\]
|
\[(c[1] A^{n-1} + c[2] A^{n-2} + \cdots + c[n] A^0) \bmod B ,\]
|
||||||
where $c[1],c[2],\ldots,c[n]$
|
where $c[1],c[2],\ldots,c[n]$
|
||||||
|
@ -218,7 +241,7 @@ in the string \texttt{ALLEY} are:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
If $A=3$ and $B=97$, the hash value
|
Thus, if $A=3$ and $B=97$, the hash value
|
||||||
for the string \texttt{ALLEY} is
|
for the string \texttt{ALLEY} is
|
||||||
|
|
||||||
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
|
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
|
||||||
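The formula can be evaluated from left to right without computing the powers of $A$ separately, because multiplying the running value by $A$ before adding the next character gives the same polynomial. The following sketch assumes that $A$ and $B$ are below about $10^9$, so that the intermediate product fits in \texttt{long long}; the function name \texttt{poly\_hash} is hypothetical.

```cpp
#include <string>

// Polynomial hash (c[1] A^{n-1} + ... + c[n] A^0) mod B,
// where the character codes are used as the values c[k].
long long poly_hash(const std::string& s, long long A, long long B) {
    long long h = 0;
    for (unsigned char c : s)
        h = (h * A + c) % B;
    return h;
}
```

With $A=3$ and $B=97$, the call \texttt{poly\_hash("ALLEY", 3, 97)} returns 52, in agreement with the calculation above.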
|
@ -232,8 +255,8 @@ we can calculate the hash value of any substring
|
||||||
in $O(1)$ time after an $O(n)$ time preprocessing.
|
in $O(1)$ time after an $O(n)$ time preprocessing.
|
||||||
|
|
||||||
The idea is to construct an array $h$ such that
|
The idea is to construct an array $h$ such that
|
||||||
$h[k]$ contains the hash value for the prefix
|
$h[k]$ contains the hash value of the prefix
|
||||||
of the string that ends at index $k$.
|
of the string that ends at position $k$.
|
||||||
The array values can be recursively calculated as follows:
|
The array values can be recursively calculated as follows:
|
||||||
\[
|
\[
|
||||||
\begin{array}{lcl}
|
\begin{array}{lcl}
|
||||||
|
@ -250,9 +273,8 @@ p[k] & = & (p[k-1] A) \bmod B. \\
|
||||||
\end{array}
|
\end{array}
|
||||||
\]
|
\]
|
||||||
Constructing these arrays takes $O(n)$ time.
|
Constructing these arrays takes $O(n)$ time.
|
||||||
After this, the hash value for a substring
|
After this, the hash value of a substring
|
||||||
of the string
|
that begins at position $a$ and ends at position $b$
|
||||||
that begins at index $a$ and ends at index $b$
|
|
||||||
can be calculated in $O(1)$ time using the formula
|
can be calculated in $O(1)$ time using the formula
|
||||||
\[(h[b]-h[a-1] p[b-a+1]) \bmod B.\]
|
\[(h[b]-h[a-1] p[b-a+1]) \bmod B.\]
|
||||||
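The arrays and the substring formula can be sketched as follows. The struct name \texttt{PrefixHash} and its members are hypothetical, the arrays are 1-indexed with $h[0]=0$ and $p[0]=1$, and the constants are assumed to be below about $10^9$. One detail matters in C++: the operator \texttt{\%} can return a negative value when its left operand is negative, so the result is shifted back into the range $0 \ldots B-1$.

```cpp
#include <string>
#include <vector>

struct PrefixHash {
    long long A, B;
    std::vector<long long> h, p;   // h[k]: hash of s[1..k], p[k]: A^k mod B

    PrefixHash(const std::string& s, long long A, long long B)
        : A(A), B(B), h(s.size() + 1, 0), p(s.size() + 1, 1) {
        for (size_t k = 1; k <= s.size(); k++) {
            h[k] = (h[k - 1] * A + (unsigned char)s[k - 1]) % B;
            p[k] = (p[k - 1] * A) % B;
        }
    }

    // Hash of the substring from position a to position b (1-indexed, inclusive).
    long long substring(int a, int b) const {
        long long v = (h[b] - h[a - 1] * p[b - a + 1]) % B;
        return v < 0 ? v + B : v;
    }
};
```

For example, in the string \texttt{ABABCBABC} the substrings at positions $3 \ldots 5$ and $7 \ldots 9$ are both \texttt{ABC}, so \texttt{substring(3,5)} and \texttt{substring(7,9)} return the same value.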
|
|
||||||
|
@ -268,16 +290,15 @@ the strings are \emph{certainly} different.
|
||||||
|
|
||||||
Using hashing, we can often make a brute force
|
Using hashing, we can often make a brute force
|
||||||
algorithm efficient.
|
algorithm efficient.
|
||||||
As an example, let's consider a brute force
|
As an example, consider the pattern matching problem:
|
||||||
algorithm that calculates how many times
|
given a string $s$ and a pattern $p$,
|
||||||
a string $p$ occurs as a substring in
|
find the positions where $p$ occurs in $s$.
|
||||||
a string $s$.
|
A brute force algorithm goes through all positions
|
||||||
The algorithm goes through all locations
|
where $p$ may occur, and compares the strings
|
||||||
where $p$ can occur, and compares the strings
|
|
||||||
character by character.
|
character by character.
|
||||||
The time complexity of such an algorithm is $O(n^2)$.
|
The time complexity of such an algorithm is $O(n^2)$.
|
||||||
|
|
||||||
However, we can make the algorithm more efficient
|
We can make the brute force algorithm more efficient
|
||||||
using hashing, because the algorithm compares
|
using hashing, because the algorithm compares
|
||||||
substrings of strings.
|
substrings of strings.
|
||||||
Using hashing, each comparison only takes $O(1)$ time,
|
Using hashing, each comparison only takes $O(1)$ time,
|
||||||
|
@ -286,23 +307,24 @@ This results in an algorithm with time complexity $O(n)$,
|
||||||
which is the best possible time complexity for this problem.
|
which is the best possible time complexity for this problem.
|
||||||
|
|
||||||
By combining hashing and \emph{binary search},
|
By combining hashing and \emph{binary search},
|
||||||
it is also possible to check the lexicographic order of
|
it is also possible to find out the lexicographic order of
|
||||||
two strings in logarithmic time.
|
two strings in logarithmic time.
|
||||||
This can be done by finding out the length
|
This can be done by calculating the length
|
||||||
of the common prefix of the strings using binary search.
|
of the common prefix of the strings using binary search.
|
||||||
Once we know the common prefix,
|
Once we know the length of the common prefix,
|
||||||
the next character after the prefix
|
we can just check the next character after the prefix,
|
||||||
indicates the order of the strings.
|
because this determines the order of the strings.
|
||||||
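The binary search can be sketched as follows. The names and constants are assumptions, and for the sake of a self-contained example the prefix hashes are built inside the function, which takes linear time; in an actual application they would be computed once in advance, so that each comparison takes only logarithmic time. The search relies on the fact that if the prefixes of length $k$ are equal, then all shorter prefixes are equal too.

```cpp
#include <algorithm>
#include <string>
#include <vector>

const long long HA = 911382323, HB = 972663749;   // assumed constants

// h[k] = hash of the prefix of length k.
std::vector<long long> prefix_hashes(const std::string& s) {
    std::vector<long long> h(s.size() + 1, 0);
    for (size_t k = 1; k <= s.size(); k++)
        h[k] = (h[k - 1] * HA + (unsigned char)s[k - 1]) % HB;
    return h;
}

// Returns -1 if x < y, 0 if x = y and 1 if x > y (ignoring collisions).
int compare_by_hash(const std::string& x, const std::string& y) {
    std::vector<long long> hx = prefix_hashes(x), hy = prefix_hashes(y);
    // binary search for the length of the common prefix
    size_t lo = 0, hi = std::min(x.size(), y.size());
    while (lo < hi) {
        size_t mid = (lo + hi + 1) / 2;
        if (hx[mid] == hy[mid]) lo = mid;
        else hi = mid - 1;
    }
    if (lo == x.size() || lo == y.size()) {        // one string is a prefix
        if (x.size() == y.size()) return 0;
        return x.size() < y.size() ? -1 : 1;
    }
    return x[lo] < y[lo] ? -1 : 1;                 // next character decides
}
```

For \texttt{CANAL} and \texttt{CANDY}, the search finds a common prefix of length 3, and the characters \texttt{A} and \texttt{D} at the next position determine that \texttt{CANAL} is smaller.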
|
|
||||||
\subsubsection*{Collisions and parameters}
|
\subsubsection*{Collisions and parameters}
|
||||||
|
|
||||||
\index{collision}
|
\index{collision}
|
||||||
|
|
||||||
An evident risk in comparing hash values is
|
An evident risk when comparing hash values is
|
||||||
\key{collision}, which means that two strings have
|
a \key{collision}, which means that two strings have
|
||||||
different contents but equal hash values.
|
different contents but equal hash values.
|
||||||
In this case, based on the hash values it seems that
|
In this case, an algorithm that relies on
|
||||||
the strings are equal, but in reality they aren't,
|
the hash values concludes that the strings are equal,
|
||||||
|
but in reality they are not,
|
||||||
and the algorithm may give incorrect results.
|
and the algorithm may give incorrect results.
|
||||||
|
|
||||||
Collisions are always possible,
|
Collisions are always possible,
|
||||||
|
@ -310,49 +332,41 @@ because the number of different strings is larger
|
||||||
than the number of different hash values.
|
than the number of different hash values.
|
||||||
However, the probability of a collision is small
|
However, the probability of a collision is small
|
||||||
if the constants $A$ and $B$ are carefully chosen.
|
if the constants $A$ and $B$ are carefully chosen.
|
||||||
There are two goals: the hash values should be
|
A usual way is to choose random constants
|
||||||
evenly distributed for the strings,
|
near $10^9$, for example as follows:
|
||||||
and the number of different hash values should
|
|
||||||
be large enough.
|
|
||||||
|
|
||||||
A good solution is to use large random numbers
|
|
||||||
as constants.
|
|
||||||
A usual way is to choose constants that are
|
|
||||||
near $10^9$, for example
|
|
||||||
\[
|
\[
|
||||||
\begin{array}{lcl}
|
\begin{array}{lcl}
|
||||||
A & = & 911382323 \\
|
A & = & 911382323 \\
|
||||||
B & = & 972663749 \\
|
B & = & 972663749 \\
|
||||||
\end{array}
|
\end{array}
|
||||||
\]
|
\]
|
||||||
This choice ensures that the hash values
|
|
||||||
are distributed evenly enough in the range $0 \ldots B-1$.
|
|
||||||
The benefit in $10^9$ is that
|
|
||||||
the \texttt{long long} type can be used
|
|
||||||
for calculating the hash values,
|
|
||||||
because the products $AB$ and $BB$ fit in \texttt{long long}.
|
|
||||||
But is it enough to have $10^9$ different hash values?
|
|
||||||
|
|
||||||
Let's consider three scenarios where hashing can be used:
|
Using such constants,
|
||||||
|
the \texttt{long long} type can be used
|
||||||
|
when calculating the hash values,
|
||||||
|
because the products $AB$ and $BB$ will fit in \texttt{long long}.
|
||||||
|
But is it enough to have about $10^9$ different hash values?
|
||||||
|
|
||||||
|
Let us consider three scenarios where hashing can be used:
|
||||||
|
|
||||||
\textit{Scenario 1:} Strings $x$ and $y$ are compared with
|
\textit{Scenario 1:} Strings $x$ and $y$ are compared with
|
||||||
each other.
|
each other.
|
||||||
The probability of a collision is $1/B$ assuming that
|
The probability of a collision is $1/B$ assuming that
|
||||||
all hash values are equally probable.
|
all hash values are equally probable.
|
||||||
|
|
||||||
\textit{Tapaus 2:} A string $x$ is compared with strings
|
\textit{Scenario 2:} A string $x$ is compared with strings
|
||||||
$y_1,y_2,\ldots,y_n$.
|
$y_1,y_2,\ldots,y_n$.
|
||||||
The probability for one or more collisions is
|
The probability of one or more collisions is
|
||||||
|
|
||||||
\[1-(1-1/B)^n.\]
|
\[1-\left(1-\frac{1}{B}\right)^n.\]
|
||||||
|
|
||||||
\textit{Tapaus 3:} Strings $x_1,x_2,\ldots,x_n$
|
\textit{Scenario 3:} Strings $x_1,x_2,\ldots,x_n$
|
||||||
are compared with each other.
|
are compared with each other.
|
||||||
The probability for one or more collisions is
|
The probability of one or more collisions is
|
||||||
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
|
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
|
||||||
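The formulas for scenarios 2 and 3 can be evaluated numerically with the following sketch, which assumes, like the formulas, that all hash values are equally probable. The function names are hypothetical. The calculation is done with logarithms, because the probabilities involved can be very close to 0 or 1 and a direct product would lose precision.

```cpp
#include <cmath>

// Scenario 2: 1 - (1 - 1/B)^n
double collision_one_vs_many(double n, double B) {
    return -std::expm1(n * std::log1p(-1.0 / B));
}

// Scenario 3: 1 - B(B-1)(B-2)...(B-n+1) / B^n
double collision_all_pairs(long long n, double B) {
    double log_no_collision = 0.0;
    for (long long i = 0; i < n; i++)
        log_no_collision += std::log1p(-i / B);   // log of the factor (B-i)/B
    return -std::expm1(log_no_collision);
}
```

As a sanity check, \texttt{collision\_all\_pairs(23, 365)} is approximately 0.507, which is the well-known probability that some two of 23 people have the same birthday.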
|
|
||||||
The following table shows the collision probabilities
|
The following table shows the collision probabilities
|
||||||
when the value of $B$ varies and $n=10^6$:
|
when $n=10^6$ and the value of $B$ varies:
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tabular}{rrrr}
|
\begin{tabular}{rrrr}
|
||||||
|
@ -384,12 +398,12 @@ in a room, the probability that some two people
|
||||||
have the same birthday is large even if $n$ is quite small.
|
have the same birthday is large even if $n$ is quite small.
|
||||||
In hashing, correspondingly, when all hash values are compared
|
In hashing, correspondingly, when all hash values are compared
|
||||||
with each other, the probability that some two
|
with each other, the probability that some two
|
||||||
hash values are the same is large.
|
hash values are equal is large.
|
||||||
|
|
||||||
A good way to make the probability of a collision
|
We can make the probability of a collision
|
||||||
smaller is to calculate \emph{multiple} hash values
|
smaller by calculating \emph{multiple} hash values
|
||||||
using different parameters.
|
using different parameters.
|
||||||
It is very unlikely that a collision would occur
|
It is unlikely that a collision would occur
|
||||||
in all hash values at the same time.
|
in all hash values at the same time.
|
||||||
For example, two hash values with parameter
|
For example, two hash values with parameter
|
||||||
$B \approx 10^9$ correspond to one hash
|
$B \approx 10^9$ correspond to one hash
|
||||||
|
@ -401,37 +415,25 @@ which is convenient, because operations with 32 and 64
|
||||||
bit integers are calculated modulo $2^{32}$ and $2^{64}$.
|
bit integers are calculated modulo $2^{32}$ and $2^{64}$.
|
||||||
However, this is not a good choice, because it is possible
|
However, this is not a good choice, because it is possible
|
||||||
to construct inputs that always generate collisions when
|
to construct inputs that always generate collisions when
|
||||||
constants of the form $2^x$ are used\footnote{
|
constants of the form $2^x$ are used.
|
||||||
J. Pachocki and Jakub Radoszweski:
|
% \footnote{
|
||||||
''Where to use and how not to use polynomial string hashing''.
|
% J. Pachocki and J. Radoszewski:
|
||||||
\textit{Olympiads in Informatics}, 2013.
|
% ''Where to use and how not to use polynomial string hashing''.
|
||||||
}.
|
% \textit{Olympiads in Informatics}, 2013.
|
||||||
|
% }.
|
||||||
|
|
||||||
\section{Z-algorithm}
|
\section{Z-algorithm}
|
||||||
|
|
||||||
\index{Z-algorithm}
|
\index{Z-algorithm}
|
||||||
\index{Z-array}
|
\index{Z-array}
|
||||||
|
|
||||||
The \key{Z-algorithm} generates a \key{Z-array}
|
The \key{Z-array} of a string
|
||||||
for the string, that contains for each index $k$
|
contains for each position $k$ in the string
|
||||||
in the string the length of the longest substring
|
the length of the longest substring
|
||||||
that begins at index $k$ and is a prefix of the string.
|
that begins at position $k$ and is a prefix of the string.
|
||||||
Many string problems can be efficiently solved
|
Such an array can be efficiently constructed
|
||||||
using the Z-algorithm.
|
using the \key{Z-algorithm}.
|
||||||
|
|
||||||
It is often a matter of taste whether to use
|
|
||||||
the Z-algorithm or string hashing.
|
|
||||||
Unlike hashing, the Z-algorithm always works
|
|
||||||
and there is no risk for collisions.
|
|
||||||
On the other hand, the Z-algorithm is more difficult
|
|
||||||
to implement and some problems can only be solved
|
|
||||||
using hashing.
|
|
||||||
|
|
||||||
\subsubsection*{Description}
|
|
||||||
|
|
||||||
The Z-algorithm constructs a Z-array that
|
|
||||||
indicates for each position the length of the
|
|
||||||
longest substring that is also a prefix of the string.
|
|
||||||
For example, the Z-array for the string
|
For example, the Z-array for the string
|
||||||
\texttt{ACBACDACBACBACDA} is as follows:
|
\texttt{ACBACDACBACBACDA} is as follows:
|
||||||
|
|
||||||
|
@ -494,45 +496,50 @@ For example, the Z-array for the string
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
For example, the position 7 contains the value 5,
|
For example, the value at position 7 in the
|
||||||
|
above Z-array is 5,
|
||||||
because the substring \texttt{ACBAC} of length 5
|
because the substring \texttt{ACBAC} of length 5
|
||||||
is a prefix of the string,
|
is a prefix of the string,
|
||||||
but the substring \texttt{ACBACB} of length 6
|
but the substring \texttt{ACBACB} of length 6
|
||||||
is not a prefix of the string.
|
is not a prefix of the string.
|
||||||
|
|
||||||
The Z-algorithm scans the string from the left
|
It is often a matter of taste whether to use
|
||||||
to the right, and calculates for each position
|
string hashing or the Z-algorithm.
|
||||||
|
Unlike hashing, the Z-algorithm always works
|
||||||
|
and there is no risk of collisions.
|
||||||
|
On the other hand, the Z-algorithm is more difficult
|
||||||
|
to implement and some problems can only be solved
|
||||||
|
using hashing.
|
||||||
|
|
||||||
|
\subsubsection*{Algorithm description}
|
||||||
|
|
||||||
|
The Z-algorithm scans the string from left
|
||||||
|
to right, and calculates for each position
|
||||||
the length of the longest substring that
|
the length of the longest substring that
|
||||||
is a prefix of the string.
|
is a prefix of the string.
|
||||||
The algorithm compares the first characters
|
A straightforward algorithm
|
||||||
of the string
|
would have a time complexity of $O(n^2)$,
|
||||||
and the active substring with each other to
|
but the Z-algorithm has an important
|
||||||
find the length of the common prefix.
|
|
||||||
|
|
||||||
A straightforward implementation would yield
|
|
||||||
an algorithm with time complexity $O(n^2)$
|
|
||||||
because the common prefixes may be long.
|
|
||||||
However, the Z-algorithm has one important
|
|
||||||
optimization which ensures that the time complexity
|
optimization which ensures that the time complexity
|
||||||
is only $O(n)$.
|
is only $O(n)$.
|
||||||
|
|
||||||
The idea is to maintain a range $[x,y]$ such that
|
The idea is to maintain a range $[x,y]$ such that
|
||||||
the substring from $x$ to $y$ is a prefix of
|
the substring from $x$ to $y$ is a prefix of
|
||||||
the string and $y$ is as large as possible.
|
the string and $y$ is as large as possible.
|
||||||
Since the Z-array already contains information
|
Since the Z-array already contains information
|
||||||
about the characters in the range $[x,y]$,
|
about the characters in the range $[x,y]$,
|
||||||
it is not needed to process them again later in the algorithm.
|
we can use this information to calculate
|
||||||
|
the Z-values of later positions in the range $[x,y]$.
|
||||||
|
|
||||||
The time complexity of the Z-algorithm is $O(n)$,
|
The time complexity of the Z-algorithm is $O(n)$,
|
||||||
because the algorithm always compares substrings
|
because the algorithm always compares strings
|
||||||
character by character only from index $y+1$.
|
character by character starting at position $y+1$.
|
||||||
If the characters match, the value of $y$ increases,
|
If the characters match, the value of $y$ increases,
|
||||||
and it is not needed to inspect the character again,
|
and there is no need to compare the character at
|
||||||
|
position $y$ again,
|
||||||
but the information in the Z-array can be used.
|
but the information in the Z-array can be used.
|
||||||
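One possible implementation of this idea is sketched below. It is an illustration under stated assumptions rather than a definitive version: the indices are 0-based (so position $k$ in the text corresponds to index $k-1$), the function name \texttt{z\_array} is hypothetical, and the first value, which is not defined in the Z-array, is simply left as 0.

```cpp
#include <algorithm>
#include <string>
#include <vector>

std::vector<int> z_array(const std::string& s) {
    int n = s.size();
    std::vector<int> z(n, 0);
    int x = 0, y = 0;   // current range [x,y] that matches a prefix of s
    for (int i = 1; i < n; i++) {
        // inside [x,y], reuse the value of the corresponding position i-x,
        // but never trust it beyond position y
        z[i] = std::max(0, std::min(z[i - x], y - i + 1));
        // extend the match character by character; this only happens
        // when the match reaches past y, so y never decreases
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
            x = i;
            y = i + z[i];
            z[i]++;
        }
    }
    return z;
}
```

Every successful comparison in the inner loop increases $y$, and $y$ can increase at most $n$ times, which gives the $O(n)$ bound. For the string \texttt{ACBACDACBACBACDA}, the function returns the values 0, 0, 2, 0, 0, 5, 0, 0, 7, 0, 0, 2, 0, 0, 1 after the first, undefined element.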
|
|
||||||
\subsubsection*{Example}
|
For example, let us construct the following Z-array:
|
||||||
|
|
||||||
Let's construct the following Z-array using
|
|
||||||
the Z-algorithm:
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -595,7 +602,8 @@ the Z-algorithm:
|
||||||
|
|
||||||
The first interesting position is 7 where the
|
The first interesting position is 7 where the
|
||||||
length of the common prefix is 5.
|
length of the common prefix is 5.
|
||||||
The corresponding range in the string is $[7,11]$:
|
After calculating this value,
|
||||||
|
the current $[x,y]$ range will be $[7,11]$:
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -663,14 +671,17 @@ The corresponding range in the string is $[7,11]$:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
The benefit in the range $[7,11]$ is that the
|
Now, it is possible to calculate the
|
||||||
algorithm can calculate the subsequent values
|
subsequent values for the Z-array
|
||||||
for the Z-array more efficiently.
|
more efficiently,
|
||||||
Since the ranges $[1,5]$ and $[7,11]$ contain
|
because we know that
|
||||||
the same characters, also the Z-array will
|
the ranges $[1,5]$ and $[7,11]$
|
||||||
contain similar values.
|
contain the same characters.
|
||||||
First, the values at indices 8 and 9
|
First, since the values at
|
||||||
correspond to the values at indices 2 and 3:
|
positions 2 and 3 are 0,
|
||||||
|
we immediately know that
|
||||||
|
the values at positions 8 and 9
|
||||||
|
are also 0:
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -742,13 +753,9 @@ correspond to the values at indices 2 and 3:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
After this, the value for index 10 can be
|
After this, we know that the value
|
||||||
calculated using the value at index 4.
|
at position 10 will be at least 2,
|
||||||
The value at index 4 is 2,
|
because the value at position 4 is 2:
|
||||||
so the first two characters
|
|
||||||
in the substring match the beginning of the string.
|
|
||||||
However, the characters after index $y=11$ have
|
|
||||||
not been inspected yet.
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -817,13 +824,85 @@ not been inspected yet.
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
The algorithm compares the substring
|
Since we have no information about the characters
|
||||||
beginning at index $y+1=12$ character by character.
|
after position 11, we have to begin to compare the strings
|
||||||
The previous values in the Z-array cannot be used,
|
character by character:
|
||||||
because this is the first time the characters
|
|
||||||
after index 11 are inspected.
|
\begin{center}
|
||||||
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
\fill[color=lightgray] (9,0) rectangle (10,1);
|
||||||
|
\fill[color=lightgray] (2,1) rectangle (7,2);
|
||||||
|
\fill[color=lightgray] (11,1) rectangle (16,2);
|
||||||
|
|
||||||
|
|
||||||
|
\draw (0,0) grid (16,2);
|
||||||
|
|
||||||
|
\node at (0.5, 1.5) {A};
|
||||||
|
\node at (1.5, 1.5) {C};
|
||||||
|
\node at (2.5, 1.5) {B};
|
||||||
|
\node at (3.5, 1.5) {A};
|
||||||
|
\node at (4.5, 1.5) {C};
|
||||||
|
\node at (5.5, 1.5) {D};
|
||||||
|
\node at (6.5, 1.5) {A};
|
||||||
|
\node at (7.5, 1.5) {C};
|
||||||
|
\node at (8.5, 1.5) {B};
|
||||||
|
\node at (9.5, 1.5) {A};
|
||||||
|
\node at (10.5, 1.5) {C};
|
||||||
|
\node at (11.5, 1.5) {B};
|
||||||
|
\node at (12.5, 1.5) {A};
|
||||||
|
\node at (13.5, 1.5) {C};
|
||||||
|
\node at (14.5, 1.5) {D};
|
||||||
|
\node at (15.5, 1.5) {A};
|
||||||
|
|
||||||
|
\node at (0.5, 0.5) {--};
|
||||||
|
\node at (1.5, 0.5) {0};
|
||||||
|
\node at (2.5, 0.5) {0};
|
||||||
|
\node at (3.5, 0.5) {2};
|
||||||
|
\node at (4.5, 0.5) {0};
|
||||||
|
\node at (5.5, 0.5) {0};
|
||||||
|
\node at (6.5, 0.5) {5};
|
||||||
|
\node at (7.5, 0.5) {0};
|
||||||
|
\node at (8.5, 0.5) {0};
|
||||||
|
\node at (9.5, 0.5) {?};
|
||||||
|
\node at (10.5, 0.5) {?};
|
||||||
|
\node at (11.5, 0.5) {?};
|
||||||
|
\node at (12.5, 0.5) {?};
|
||||||
|
\node at (13.5, 0.5) {?};
|
||||||
|
\node at (14.5, 0.5) {?};
|
||||||
|
\node at (15.5, 0.5) {?};
|
||||||
|
|
||||||
|
\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00);
|
||||||
|
|
||||||
|
\node at (6.5,3.50) {$x$};
|
||||||
|
\node at (10.5,3.50) {$y$};
|
||||||
|
|
||||||
|
|
||||||
|
\footnotesize
|
||||||
|
\node at (0.5, 2.5) {1};
|
||||||
|
\node at (1.5, 2.5) {2};
|
||||||
|
\node at (2.5, 2.5) {3};
|
||||||
|
\node at (3.5, 2.5) {4};
|
||||||
|
\node at (4.5, 2.5) {5};
|
||||||
|
\node at (5.5, 2.5) {6};
|
||||||
|
\node at (6.5, 2.5) {7};
|
||||||
|
\node at (7.5, 2.5) {8};
|
||||||
|
\node at (8.5, 2.5) {9};
|
||||||
|
\node at (9.5, 2.5) {10};
|
||||||
|
\node at (10.5, 2.5) {11};
|
||||||
|
\node at (11.5, 2.5) {12};
|
||||||
|
\node at (12.5, 2.5) {13};
|
||||||
|
\node at (13.5, 2.5) {14};
|
||||||
|
\node at (14.5, 2.5) {15};
|
||||||
|
\node at (15.5, 2.5) {16};
|
||||||
|
|
||||||
|
%\draw[thick,<->] (11.5,-0.25) .. controls (11,-1.25) and (3,-1.25) .. (2.5,-0.25);
|
||||||
|
\end{tikzpicture}
|
||||||
|
\end{center}
|
||||||
|
|
||||||
|
|
||||||
It turns out that the length of the common
|
It turns out that the length of the common
|
||||||
prefix is 7, and the range $[x,y]$ will be updated:
|
prefix at position 10 is 7,
|
||||||
|
and thus the new range $[x,y]$ is $[10,16]$:
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -892,9 +971,9 @@ prefix is 7, and the range $[x,y]$ will be updated:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
After this, all subsequent values in the Z-array
|
After this, all subsequent values for the Z-array
|
||||||
can be calculated using the information in
|
can be calculated using the values already
|
||||||
the range $[x,y]$. All the remaining values can be
|
stored in the array. All the remaining values can be
|
||||||
directly retrieved from the beginning of the Z-array:
|
directly retrieved from the beginning of the Z-array:
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
|
@ -964,29 +1043,26 @@ directly retrieved from the beginning of the Z-array:
|
||||||
|
|
||||||
\subsubsection{Using the Z-array}
|
\subsubsection{Using the Z-array}
|
||||||
|
|
||||||
As an example, let's solve a problem
|
As an example, let us once again consider
|
||||||
where our task is to calculate
|
the pattern matching problem,
|
||||||
the number of times a string $p$
|
where our task is to find the positions
|
||||||
occurs as a substring in a string $s$.
|
where a pattern $p$ occurs in a string $s$.
|
||||||
Previously, we solved this problem
|
We already solved this problem efficiently
|
||||||
using string hashing, but the Z-algorithm
|
using string hashing, but the Z-algorithm
|
||||||
provides another way to solve the problem.
|
provides another way to solve the problem.
|
||||||
|
|
||||||
A usual idea when using the Z-algorithm
|
A common idea in string processing is to
|
||||||
is to construct a string that consists of
|
construct a string that consists of
|
||||||
several strings separated by special characters.
|
multiple strings separated by special characters.
|
||||||
In this problem, we can construct a string
|
In this problem, we can construct a string
|
||||||
$p$\texttt{\#}$s$,
|
$p$\texttt{\#}$s$,
|
||||||
where $p$ and $s$ are separated by a special
|
where $p$ and $s$ are separated by a special
|
||||||
character \texttt{\#} that doesn't occur
|
character \texttt{\#} that does not occur
|
||||||
in the strings.
|
in the strings.
|
||||||
After this, the Z-array for the string
|
The Z-array of $p$\texttt{\#}$s$ indicates the positions
|
||||||
$p$\texttt{\#}$s$ indicates the positions
|
where $p$ occurs in $s$,
|
||||||
where $p$ occurs in $s$.
|
because such positions contain the length of $p$.
|
||||||
Such positions are those positions in the Z-array
|
|
||||||
that contain the value $p$.
|
|
||||||
|
|
||||||
\begin{samepage}
|
|
||||||
For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT},
|
For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT},
|
||||||
the Z-array is as follows:
|
the Z-array is as follows:
|
||||||
|
|
||||||
|
@ -1041,12 +1117,12 @@ the Z-array is as follows:
|
||||||
\node at (13.5, 2.5) {14};
|
\node at (13.5, 2.5) {14};
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
\end{samepage}
|
|
||||||
The positions 6 and 11 contain the value 3,
|
The positions 6 and 11 contain the value 3,
|
||||||
which means that the substring \texttt{ATT}
|
which means that the pattern \texttt{ATT}
|
||||||
occurs in the corresponding positions
|
occurs in the corresponding positions
|
||||||
in the string \texttt{HATTIVATTI}.
|
in the string \texttt{HATTIVATTI}.
|
||||||
|
|
||||||
The time complexity of the resulting algorithm
|
The time complexity of the resulting algorithm
|
||||||
is $O(n)$, because it suffices to construct and
|
is $O(n)$, because it suffices to construct
|
||||||
go through the Z-array.
|
the Z-array and go through its values.
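
The idea above can be sketched in C++ as follows.
This is only a minimal sketch and not necessarily the implementation
used elsewhere in the chapter:
the function names \texttt{z\_array} and \texttt{find\_occurrences}
are hypothetical, the code uses 0-indexed positions
(the text uses 1-indexed positions),
and it assumes that the character \texttt{\#} does not occur
in $p$ or $s$.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the Z-array of s using 0-indexed positions:
// z[i] is the length of the longest common prefix of s and s[i..].
// z[0] is left as 0, corresponding to the "--" entry in the figures.
std::vector<int> z_array(const std::string& s) {
    int n = s.size();
    std::vector<int> z(n, 0);
    int x = 0, y = 0; // [x,y]: rightmost range known to match a prefix of s
    for (int i = 1; i < n; i++) {
        // Inside [x,y] we can reuse a value from the beginning of the array,
        // but only up to the end of the range.
        if (i <= y) z[i] = std::min(z[i - x], y - i + 1);
        // Characters after y are unknown: compare character by character.
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        // Update the range if the new match extends further to the right.
        if (i + z[i] - 1 > y) {
            x = i;
            y = i + z[i] - 1;
        }
    }
    return z;
}

// Returns the 0-indexed starting positions in s where p occurs.
// Assumes that '#' occurs in neither p nor s.
std::vector<int> find_occurrences(const std::string& p, const std::string& s) {
    std::vector<int> result;
    int m = p.size();
    if (m == 0) return result; // an empty pattern is not handled in this sketch
    std::string t = p + "#" + s;
    std::vector<int> z = z_array(t);
    for (int i = m + 1; i < (int)t.size(); i++) {
        // A value equal to the length of p marks an occurrence of p.
        if (z[i] == m) result.push_back(i - (m + 1));
    }
    return result;
}
```

For example, \texttt{find\_occurrences("ATT", "HATTIVATTI")}
returns the positions 1 and 6,
which correspond to the 1-indexed positions 6 and 11
in the string \texttt{ATT\#HATTIVATTI}.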
|