Improve language

Antti H S Laaksonen 2017-05-12 21:15:25 +03:00
parent bf51e8cb23
commit 41cc186beb
1 changed files with 147 additions and 149 deletions


@ -9,17 +9,17 @@ time.
\index{pattern matching} \index{pattern matching}
For example, a fundamental problem related to strings For example, a fundamental string processing
is the \key{pattern matching} problem: problem is the \key{pattern matching} problem:
given a string of length $n$ and a pattern of length $m$, given a string of length $n$ and a pattern of length $m$,
our task is to find the positions where the pattern our task is to find the occurrences of the pattern
occurs in the string. in the string.
For example, the pattern \texttt{ABC} occurs two For example, the pattern \texttt{ABC} occurs two
times in the string \texttt{ABABCBABC}. times in the string \texttt{ABABCBABC}.
The pattern matching problem is easy to solve The pattern matching problem can be easily solved
in $O(nm)$ time by a brute force algorithm that in $O(nm)$ time by a brute force algorithm that
goes through all positions where the pattern may tests all positions where the pattern may
occur in the string. occur in the string.
However, in this chapter, we will see that there However, in this chapter, we will see that there
are more efficient algorithms that require only are more efficient algorithms that require only
@ -31,8 +31,13 @@ $O(n+m)$ time.
\index{alphabet} \index{alphabet}
An \key{alphabet} is a set of characters Throughout the chapter, we assume that
that may appear in strings. zero-based indexing is used in strings.
Thus, a string \texttt{s} of length $n$
consists of characters
$\texttt{s}[0],\texttt{s}[1],\ldots,\texttt{s}[n-1]$.
The set of characters that may appear
in strings is called an \key{alphabet}.
For example, the alphabet For example, the alphabet
$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$ $\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
consists of the capital letters of English. consists of the capital letters of English.
@ -40,9 +45,12 @@ consists of the capital letters of English.
\index{substring} \index{substring}
A \key{substring} is a sequence of consecutive A \key{substring} is a sequence of consecutive
characters of a string. characters in a string.
The number of substrings of a string is $n(n+1)/2$. We use the notation $\texttt{s}[a \ldots b]$
For example, the substrings of the string to refer to a substring of \texttt{s}
that begins at position $a$ and ends at position $b$.
A string of length $n$ has $n(n+1)/2$ substrings.
For example, the substrings of
\texttt{ABCD} are \texttt{ABCD} are
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
\texttt{AB}, \texttt{BC}, \texttt{CD}, \texttt{AB}, \texttt{BC}, \texttt{CD},
@ -52,9 +60,9 @@ For example, the substrings of the string
A \key{subsequence} is a sequence of A \key{subsequence} is a sequence of
(not necessarily consecutive) characters (not necessarily consecutive) characters
of a string in their original order. in a string in their original order.
The number of subsequences of a string is $2^n-1$. A string of length $n$ has $2^n-1$ subsequences.
For example, the subsequences of the string For example, the subsequences of
\texttt{ABCD} are \texttt{ABCD} are
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
\texttt{AB}, \texttt{AC}, \texttt{AD}, \texttt{AB}, \texttt{AC}, \texttt{AD},
@ -69,19 +77,18 @@ A \key{prefix} is a substring that starts at the beginning
of a string, of a string,
and a \key{suffix} is a substring that ends at the end and a \key{suffix} is a substring that ends at the end
of a string. of a string.
For example, for the string \texttt{ABCD}, For example,
the prefixes are the prefixes of \texttt{ABCD} are
\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD} \texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD},
and the suffixes are and the suffixes of \texttt{ABCD} are
\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}. \texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
\index{rotation} \index{rotation}
A \key{rotation} can be generated by moving A \key{rotation} can be generated by moving
characters one by one from the beginning the characters of a string one by one from the beginning
to the end of a string (or vice versa). to the end (or vice versa).
For example, the rotations of the string For example, the rotations of \texttt{ABCD} are
\texttt{ABCD} are
\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}. \texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
\index{period} \index{period}
@ -97,13 +104,13 @@ For example, the shortest period of
A \key{border} is a string that is both A \key{border} is a string that is both
a prefix and a suffix of a string. a prefix and a suffix of a string.
For example, the borders of the string \texttt{ABACABA} For example, the borders of \texttt{ABACABA}
are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}. are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
\index{lexicographical order} \index{lexicographical order}
Strings are usually compared using the \key{lexicographical order} Strings are compared using the \key{lexicographical order}
that corresponds to the alphabetical order. (which corresponds to the alphabetical order).
It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$, It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
or there is a position $k$ such that or there is a position $k$ such that
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$. $x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
@ -112,11 +119,10 @@ $x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
\index{trie} \index{trie}
A \key{trie} is a tree structure that A \key{trie} is a rooted tree that
maintains a set of strings. maintains a set of strings.
Each string is stored as Each string in the set is stored as
a chain of characters starting at a chain of characters that starts at the root.
the root node.
If two strings have a common prefix, If two strings have a common prefix,
they also have a common chain in the tree. they also have a common chain in the tree.
@ -156,38 +162,39 @@ For example, consider the following trie:
This trie corresponds to the set This trie corresponds to the set
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$. $\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
The character * in a node means that The character * in a node means that
one of the strings in the set ends at the node. a string in the set ends at the node.
Such a character is needed, because a string Such a character is needed, because a string
may be a prefix of another string. may be a prefix of another string.
For example, in the above trie, \texttt{THE} For example, in the above trie, \texttt{THE}
is a prefix of \texttt{THERE}. is a prefix of \texttt{THERE}.
We can check if a trie contains a string We can check in $O(n)$ time whether a trie
in $O(n)$ time where $n$ is the length of the string, contains a string of length $n$,
because we can follow the chain that starts at the root node. because we can follow the chain that starts at the root node.
We can also add a new string to the trie We can also add a string of length $n$ to the trie
in $O(n)$ time using a similar idea. in $O(n)$ time by first following the chain
If needed, new nodes will be added to the trie. and then adding new nodes to the trie if necessary.
Using a trie, we can also find Using a trie, we can find
for a given string the longest prefix the longest prefix of a given string
that belongs to the set. such that the prefix belongs to the set.
In addition, by storing additional information Moreover, by storing additional information
in each node, in each node,
it is possible to calculate the number of we can calculate the number of
strings that have a given prefix. strings that belong to the set and have a
given string as a prefix.
A trie can be stored in an array A trie can be stored in an array
\begin{lstlisting} \begin{lstlisting}
int t[N][A]; int trie[N][A];
\end{lstlisting} \end{lstlisting}
where $N$ is the maximum number of nodes where $N$ is the maximum number of nodes
(the maximum total length of the strings in the set) (the maximum total length of the strings in the set)
and $A$ is the size of the alphabet. and $A$ is the size of the alphabet.
The nodes of a trie are numbered The nodes of a trie are numbered
$1,2,3,\ldots$ so that the number of the root is 1, $0,1,2,\ldots$ so that the number of the root is 0,
and $\texttt{t}[s][c]$ is the next node in the chain and $\texttt{trie}[s][c]$ is the next node in the chain
from node $s$ using character $c$. when we move from node $s$ using character $c$.
\section{String hashing} \section{String hashing}
@ -199,7 +206,7 @@ allows us to efficiently check whether two
strings are equal\footnote{The technique strings are equal\footnote{The technique
was popularized by the Karp--Rabin pattern matching was popularized by the Karp--Rabin pattern matching
algorithm \cite{kar87}.}. algorithm \cite{kar87}.}.
The idea is to compare the hash values of the The idea in string hashing is to compare hash values of
strings instead of their individual characters. strings instead of their individual characters.
\subsubsection*{Calculating hash values} \subsubsection*{Calculating hash values}
@ -217,15 +224,15 @@ based on their hash values.
A usual way to implement string hashing A usual way to implement string hashing
is \key{polynomial hashing}, which means is \key{polynomial hashing}, which means
that the hash value is calculated using the formula that the hash value of a string \texttt{s}
of length $n$ is
\[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B ,\] \[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B ,\]
where \texttt{s} is a string of length $n$ where $s[0],s[1],\ldots,s[n-1]$
(so $s[0],s[1],\ldots,s[n-1]$ are interpreted as the codes of the characters of \texttt{s},
are the codes of the characters),
and $A$ and $B$ are pre-chosen constants. and $A$ and $B$ are pre-chosen constants.
For example, the codes of the characters For example, the codes of the characters
in the string \texttt{ALLEY} are: of \texttt{ALLEY} are:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
\draw (0,0) grid (5,2); \draw (0,0) grid (5,2);
@ -246,41 +253,37 @@ in the string \texttt{ALLEY} are:
\end{center} \end{center}
Thus, if $A=3$ and $B=97$, the hash value Thus, if $A=3$ and $B=97$, the hash value
of the string \texttt{ALLEY} is of \texttt{ALLEY} is
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\] \[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
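The formula can be evaluated from left to right without computing the powers of $A$ separately. The following sketch does this; the function name \texttt{polyHash} is an assumption made for the example, and the code assumes that $A$ and $B$ are small enough (below about $3 \cdot 10^9$) for the intermediate products to fit in a 64-bit integer.

```cpp
#include <string>

// Returns (s[0]*A^(n-1) + s[1]*A^(n-2) + ... + s[n-1]*A^0) mod B.
// Each step multiplies the hash so far by A and adds the next character code.
long long polyHash(const std::string& s, long long A, long long B) {
    long long h = 0;
    for (unsigned char c : s) {
        h = (h * A + c) % B;
    }
    return h;
}
```

With $A=3$ and $B=97$, \texttt{polyHash("ALLEY", 3, 97)} returns 52, which agrees with the calculation above.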
\subsubsection*{Preprocessing} \subsubsection*{Preprocessing}
It turns out that using polynomial hashing, Using polynomial hashing, we can calculate the hash value of any substring
we can calculate the hash value of any substring of a string \texttt{s} in $O(1)$ time after an $O(n)$ time preprocessing.
of a string The idea is to construct an array \texttt{h} such that
in $O(1)$ time after an $O(n)$ time preprocessing. $\texttt{h}[k]$ contains the hash value of the prefix $\texttt{s}[0 \ldots k]$.
The idea is to construct an array $h$ such that
$h[k]$ contains the hash value of the prefix
of the string that ends at position $k$.
The array values can be recursively calculated as follows: The array values can be recursively calculated as follows:
\[ \[
\begin{array}{lcl} \begin{array}{lcl}
h[0] & = & \texttt{s}[0] \\ \texttt{h}[0] & = & \texttt{s}[0] \\
h[k] & = & (h[k-1] A + \texttt{s}[k]) \bmod B \\ \texttt{h}[k] & = & (\texttt{h}[k-1] A + \texttt{s}[k]) \bmod B \\
\end{array} \end{array}
\] \]
In addition, we construct an array $p$ In addition, we construct an array $\texttt{p}$
where $p[k]=A^k \bmod B$: where $\texttt{p}[k]=A^k \bmod B$:
\[ \[
\begin{array}{lcl} \begin{array}{lcl}
p[0] & = & 1 \\ \texttt{p}[0] & = & 1 \\
p[k] & = & (p[k-1] A) \bmod B. \\ \texttt{p}[k] & = & (\texttt{p}[k-1] A) \bmod B. \\
\end{array} \end{array}
\] \]
Constructing these arrays takes $O(n)$ time. Constructing these arrays takes $O(n)$ time.
After this, the hash value of a substring After this, the hash value of any substring
that begins at position $a$ and ends at position $b$ $\texttt{s}[a \ldots b]$
can be calculated in $O(1)$ time using the formula can be calculated in $O(1)$ time using the formula
\[(h[b]-h[a-1] p[b-a+1]) \bmod B\] \[(\texttt{h}[b]-\texttt{h}[a-1] \texttt{p}[b-a+1]) \bmod B\]
assuming that $a>0$. assuming that $a>0$.
If $a=0$, the hash value is simply $h[b]$. If $a=0$, the hash value is simply $\texttt{h}[b]$.
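The preprocessing and the substring formula can be sketched as follows. The structure name \texttt{PrefixHasher} and its member \texttt{get} are assumptions made for the example. The arrays follow the definitions above, except that $\texttt{h}[0]$ is also reduced modulo $B$. One detail is specific to C++: the \texttt{\%} operator may return a negative value when the left operand is negative, so the result is adjusted to lie in the range $0 \ldots B-1$.

```cpp
#include <string>
#include <vector>

struct PrefixHasher {
    long long A, B;
    std::vector<long long> h;  // h[k] = hash value of s[0..k]
    std::vector<long long> p;  // p[k] = A^k mod B

    // O(n) preprocessing; assumes A*B fits in a 64-bit integer.
    PrefixHasher(const std::string& s, long long A, long long B)
        : A(A), B(B), h(s.size()), p(s.size()) {
        for (size_t k = 0; k < s.size(); k++) {
            long long c = (unsigned char)s[k];
            h[k] = (k == 0) ? c % B : (h[k-1] * A + c) % B;
            p[k] = (k == 0) ? 1 % B : (p[k-1] * A) % B;
        }
    }

    // Hash value of s[a..b] in O(1) time, assuming 0 <= a <= b < n.
    long long get(int a, int b) const {
        if (a == 0) return h[b];
        long long v = (h[b] - h[a-1] * p[b-a+1]) % B;
        if (v < 0) v += B;  // make the result non-negative
        return v;
    }
};
```

For example, for the string \texttt{ABABCBABC}, the values \texttt{get(2,4)} and \texttt{get(6,8)} are equal, because both substrings are \texttt{ABC}.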
\subsubsection*{Using hash values} \subsubsection*{Using hash values}
@ -364,7 +367,7 @@ The probability of one or more collisions is
\[1-(1-\frac{1}{B})^n.\] \[1-(1-\frac{1}{B})^n.\]
\textit{Scenario 3:} Strings $x_1,x_2,\ldots,x_n$ \textit{Scenario 3:} All pairs of strings $x_1,x_2,\ldots,x_n$
are compared with each other. are compared with each other.
The probability of one or more collisions is The probability of one or more collisions is
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\] \[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
@ -398,7 +401,7 @@ $B \approx 10^9$.
The phenomenon in scenario 3 is known as the The phenomenon in scenario 3 is known as the
\key{birthday paradox}: if there are $n$ people \key{birthday paradox}: if there are $n$ people
in a room, the probability that some two people in a room, the probability that \emph{some} two people
have the same birthday is large even if $n$ is quite small. have the same birthday is large even if $n$ is quite small.
In hashing, correspondingly, when all hash values are compared In hashing, correspondingly, when all hash values are compared
with each other, the probability that some two with each other, the probability that some two
@ -417,7 +420,7 @@ which makes the probability of a collision very small.
Some people use constants $B=2^{32}$ and $B=2^{64}$, Some people use constants $B=2^{32}$ and $B=2^{64}$,
which is convenient, because operations with 32 and 64 which is convenient, because operations with 32 and 64
bit integers are calculated modulo $2^{32}$ and $2^{64}$. bit integers are calculated modulo $2^{32}$ and $2^{64}$.
However, this is not a good choice, because it is possible However, this is \emph{not} a good choice, because it is possible
to construct inputs that always generate collisions when to construct inputs that always generate collisions when
constants of the form $2^x$ are used \cite{pac13}. constants of the form $2^x$ are used \cite{pac13}.
@ -426,17 +429,16 @@ constants of the form $2^x$ are used \cite{pac13}.
\index{Z-algorithm} \index{Z-algorithm}
\index{Z-array} \index{Z-array}
The \key{Z-array} of a string The \key{Z-array} \texttt{z} of a string \texttt{s}
contains for each position of the string of length $n$ contains for each $k=0,1,\ldots,n-1$
the length of the longest substring the length of the longest substring of \texttt{s}
that begins at that position and is a prefix of the string. that begins at position $k$ and is a prefix of \texttt{s}.
Such an array can be efficiently constructed Thus, $\texttt{z}[k]=p$ tells us that
using the \key{Z-algorithm}\footnote{The Z-algorithm $\texttt{s}[0 \ldots p-1]$ equals $\texttt{s}[k \ldots k+p-1]$.
was presented in \cite{gus97} as the simplest known Many string processing problems can be efficiently solved
method for linear-time pattern matching, and the original idea using the Z-array.
was attributed to \cite{mai84}.}.
For example, the Z-array of the string For example, the Z-array of
\texttt{ACBACDACBACBACDA} is as follows: \texttt{ACBACDACBACBACDA} is as follows:
\begin{center} \begin{center}
@ -498,48 +500,45 @@ For example, the Z-array of the string
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
For example, the value at position 6 of the In this case, for example, $\texttt{z}[6]=5$,
above Z-array is 5,
because the substring \texttt{ACBAC} of length 5 because the substring \texttt{ACBAC} of length 5
is a prefix of the string, is a prefix of \texttt{s},
but the substring \texttt{ACBACB} of length 6 but the substring \texttt{ACBACB} of length 6
is not a prefix of the string. is not a prefix of \texttt{s}.
It is often a matter of taste whether to use
string hashing or the Z-algorithm.
Unlike hashing, the Z-algorithm always works
and there is no risk for collisions.
On the other hand, the Z-algorithm is more difficult
to implement and some problems can only be solved
using hashing.
\subsubsection*{Algorithm description} \subsubsection*{Algorithm description}
The Z-algorithm scans the string from left Next we describe an algorithm,
to right, and calculates for each position called the \key{Z-algorithm}\footnote{The Z-algorithm
the length of the longest substring that was presented in \cite{gus97} as the simplest known
is a prefix of the string. method for linear-time pattern matching, and the original idea
A straightforward algorithm was attributed to \cite{mai84}.},
would have a time complexity of $O(n^2)$, that efficiently constructs the Z-array in $O(n)$ time.
but the Z-algorithm has an important The algorithm calculates the Z-array values
optimization which ensures that the time complexity from left to right by both using information
is only $O(n)$. already stored in the Z-array and comparing substrings
character by character.
The idea is to maintain a range $[x,y]$ such that To efficiently calculate the Z-array values,
the substring from $x$ to $y$ is a prefix of the algorithm maintains a range $[x,y]$ such that
the string and $y$ is as large as possible. $\texttt{s}[x \ldots y]$ is a prefix of \texttt{s}
Since the characters in the ranges $[0,y-x]$ and $y$ is as large as possible.
and $[x,y]$ are the same, Since we know that $\texttt{s}[0 \ldots y-x]$
we can use this information to calculate and $\texttt{s}[x \ldots y]$ are equal,
the Z-array values in the range $[x,y]$. we can use this information when calculating
Z-values for positions $x+1,x+2,\ldots,y$.
The time complexity of the Z-algorithm is $O(n)$, At each position $k$, we first
because the algorithm only compares strings check the value of $\texttt{z}[k-x]$.
character by character starting at position $y+1$. If $k+\texttt{z}[k-x]<y$, we know that $\texttt{z}[k]=\texttt{z}[k-x]$.
If the characters match, the value of $y$ increases, However, if $k+\texttt{z}[k-x] \ge y$,
and it is not needed to compare the character at $\texttt{s}[0 \ldots y-k]$ equals
position $y$ again $\texttt{s}[k \ldots y]$, and to determine the
but the information in the Z-array can be used. value of $\texttt{z}[k]$ we need to compare
the substrings character by character.
Still, the algorithm works in $O(n)$ time,
because we start comparing at positions
$y-k+1$ and $y+1$.
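The description above can be turned into code roughly as follows. This is one common way to write the algorithm, not necessarily the exact formulation intended here: instead of distinguishing the two cases explicitly, it starts from $\min(\texttt{z}[k-x],\,y-k+1)$ and then extends the match character by character, which covers both cases. The function name \texttt{zArray} is an assumption, and $\texttt{z}[0]$ is set to $n$ because the whole string is a prefix of itself.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Constructs the Z-array of s in O(n) time.
std::vector<int> zArray(const std::string& s) {
    int n = s.size();
    std::vector<int> z(n, 0);
    if (n > 0) z[0] = n;
    int x = 0, y = 0;  // s[x..y] is a prefix of s and y is as large as possible
    for (int k = 1; k < n; k++) {
        // Inside the range, reuse the value at the mirrored position k-x,
        // but never claim a match beyond position y.
        if (k <= y) z[k] = std::min(z[k-x], y-k+1);
        // Extend the match character by character.
        while (k + z[k] < n && s[z[k]] == s[k + z[k]]) z[k]++;
        // Update the range if the match reaches further than y.
        if (k + z[k] - 1 > y) {
            x = k;
            y = k + z[k] - 1;
        }
    }
    return z;
}
```

The running time is $O(n)$, because each successful comparison in the \texttt{while} loop increases $y$, and $y$ never decreases.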
For example, let us construct the following Z-array: For example, let us construct the following Z-array:
@ -602,10 +601,8 @@ For example, let us construct the following Z-array:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
The first interesting position is 6 where the After calculating the value $\texttt{z}[6]=5$,
length of the common prefix is 5. the current $[x,y]$ range is $[6,10]$:
After calculating this value,
the current $[x,y]$ range will be $[6,10]$:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -673,17 +670,15 @@ the current $[x,y]$ range will be $[6,10]$:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
Now, it is possible to calculate the Now we can calculate
subsequent values of the Z-array subsequent Z-array values
more efficiently, efficiently,
because we know that because we know that
the ranges $[0,4]$ and $[6,10]$ $\texttt{s}[0 \ldots 4]$ and
contain the same characters. $\texttt{s}[6 \ldots 10]$ are equal.
First, since the values at First, since $\texttt{z}[1] = \texttt{z}[2] = 0$,
positions 1 and 2 are 0, we immediately know that also
we immediately know that $\texttt{z}[7] = \texttt{z}[8] = 0$:
the values at positions 7 and 8
are also 0:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -755,9 +750,7 @@ are also 0:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
After this, we know that the value Then, since $\texttt{z}[3]=2$, we know that $\texttt{z}[9] \ge 2$:
at position 9 will be at least 2,
because the value at position 3 is 2:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -826,8 +819,8 @@ because the value at position 3 is 2:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
Since we have no information about the characters However, we have no information about the string
after position 10, we have to begin to compare the strings after position 10, so we need to compare the substrings
character by character: character by character:
\begin{center} \begin{center}
@ -901,10 +894,8 @@ character by character:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
It turns out that $\texttt{z}[9]=7$,
It turns out that the length of the common so the new $[x,y]$ range is $[9,15]$:
prefix at position 9 is 7,
and thus the new range $[x,y]$ is $[9,15]$:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -973,10 +964,9 @@ and thus the new range $[x,y]$ is $[9,15]$:
\end{tikzpicture} \end{tikzpicture}
\end{center} \end{center}
After this, all subsequent values of the Z-array After this, all the remaining Z-array values
can be calculated using the values already can be determined by using the information
stored in the array. All the remaining values can be already stored in the Z-array:
directly retrieved from the beginning of the Z-array:
\begin{center} \begin{center}
\begin{tikzpicture}[scale=0.7] \begin{tikzpicture}[scale=0.7]
@ -1045,10 +1035,18 @@ directly retrieved from the beginning of the Z-array:
\subsubsection{Using the Z-array} \subsubsection{Using the Z-array}
As an example, let us consider again It is often a matter of taste whether to use
string hashing or the Z-algorithm.
Unlike hashing, the Z-algorithm always works
and there is no risk of collisions.
On the other hand, the Z-algorithm is more difficult
to implement and some problems can only be solved
using hashing.
As an example, consider again
the pattern matching problem, the pattern matching problem,
where our task is to find the positions where our task is to find the occurrences
where a pattern $p$ occurs in a string $s$. of a pattern $p$ in a string $s$.
We already solved this problem efficiently We already solved this problem efficiently
using string hashing, but the Z-algorithm using string hashing, but the Z-algorithm
provides another way to solve the problem. provides another way to solve the problem.
@ -1123,7 +1121,7 @@ the Z-array is as follows:
The positions 5 and 10 contain the value 3, The positions 5 and 10 contain the value 3,
which means that the pattern \texttt{ATT} which means that the pattern \texttt{ATT}
occurs in the corresponding positions occurs in the corresponding positions
in the string \texttt{HATTIVATTI}. of \texttt{HATTIVATTI}.
The time complexity of the resulting algorithm The time complexity of the resulting algorithm
is linear, because it suffices to construct is linear, because it suffices to construct