Improve language
parent
bf51e8cb23
commit
41cc186beb
296
chapter26.tex
|
@ -9,17 +9,17 @@ time.
|
||||||
|
|
||||||
\index{pattern matching}
|
\index{pattern matching}
|
||||||
|
|
||||||
For example, a fundamental problem related to strings
|
For example, a fundamental string processing
|
||||||
is the \key{pattern matching} problem:
|
problem is the \key{pattern matching} problem:
|
||||||
given a string of length $n$ and a pattern of length $m$,
|
given a string of length $n$ and a pattern of length $m$,
|
||||||
our task is to find the positions where the pattern
|
our task is to find the occurrences of the pattern
|
||||||
occurs in the string.
|
in the string.
|
||||||
For example, the pattern \texttt{ABC} occurs two
|
For example, the pattern \texttt{ABC} occurs two
|
||||||
times in the string \texttt{ABABCBABC}.
|
times in the string \texttt{ABABCBABC}.
|
||||||
|
|
||||||
The pattern matching problem is easy to solve
|
The pattern matching problem can be easily solved
|
||||||
in $O(nm)$ time by a brute force algorithm that
|
in $O(nm)$ time by a brute force algorithm that
|
||||||
goes through all positions where the pattern may
|
tests all positions where the pattern may
|
||||||
occur in the string.
|
occur in the string.
|
||||||
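As a minimal sketch of the brute force approach, the following C++ function tests every position where the pattern may begin and compares the characters one by one. The function name \texttt{find\_occurrences} and the use of \texttt{std::string} are our own assumptions for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the zero-based starting positions of all occurrences of
// pattern p in string s. Every candidate position is tested, so the
// running time is O(nm) for a string of length n and a pattern of length m.
std::vector<int> find_occurrences(const std::string& s, const std::string& p) {
    assert(!p.empty());  // an empty pattern is not meaningful here
    std::vector<int> result;
    int n = s.size(), m = p.size();
    for (int i = 0; i + m <= n; i++) {
        int j = 0;
        // compare the pattern with s[i..i+m-1] character by character
        while (j < m && s[i + j] == p[j]) j++;
        if (j == m) result.push_back(i);
    }
    return result;
}
```

For the string \texttt{ABABCBABC} and the pattern \texttt{ABC}, the function reports the two occurrences at positions 2 and 6.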
However, in this chapter, we will see that there
|
However, in this chapter, we will see that there
|
||||||
are more efficient algorithms that require only
|
are more efficient algorithms that require only
|
||||||
|
@ -31,8 +31,13 @@ $O(n+m)$ time.
|
||||||
|
|
||||||
\index{alphabet}
|
\index{alphabet}
|
||||||
|
|
||||||
An \key{alphabet} is a set of characters
|
Throughout the chapter, we assume that
|
||||||
that may appear in strings.
|
zero-based indexing is used in strings.
|
||||||
|
Thus, a string \texttt{s} of length $n$
|
||||||
|
consists of characters
|
||||||
|
$\texttt{s}[0],\texttt{s}[1],\ldots,\texttt{s}[n-1]$.
|
||||||
|
The set of characters that may appear
|
||||||
|
in strings is called an \key{alphabet}.
|
||||||
For example, the alphabet
|
For example, the alphabet
|
||||||
$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
|
$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
|
||||||
consists of the capital letters of English.
|
consists of the capital letters of English.
|
||||||
|
@ -40,9 +45,12 @@ consists of the capital letters of English.
|
||||||
\index{substring}
|
\index{substring}
|
||||||
|
|
||||||
A \key{substring} is a sequence of consecutive
|
A \key{substring} is a sequence of consecutive
|
||||||
characters of a string.
|
characters in a string.
|
||||||
The number of substrings of a string is $n(n+1)/2$.
|
We use the notation $\texttt{s}[a \ldots b]$
|
||||||
For example, the substrings of the string
|
to refer to a substring of \texttt{s}
|
||||||
|
that begins at position $a$ and ends at position $b$.
|
||||||
|
A string of length $n$ has $n(n+1)/2$ substrings.
|
||||||
|
For example, the substrings of
|
||||||
\texttt{ABCD} are
|
\texttt{ABCD} are
|
||||||
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
|
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
|
||||||
\texttt{AB}, \texttt{BC}, \texttt{CD},
|
\texttt{AB}, \texttt{BC}, \texttt{CD},
|
||||||
|
@ -52,9 +60,9 @@ For example, the substrings of the string
|
||||||
|
|
||||||
A \key{subsequence} is a sequence of
|
A \key{subsequence} is a sequence of
|
||||||
(not necessarily consecutive) characters
|
(not necessarily consecutive) characters
|
||||||
of a string in their original order.
|
in a string in their original order.
|
||||||
The number of subsequences of a string is $2^n-1$.
|
A string of length $n$ has $2^n-1$ subsequences.
|
||||||
For example, the subsequences of the string
|
For example, the subsequences of
|
||||||
\texttt{ABCD} are
|
\texttt{ABCD} are
|
||||||
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
|
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
|
||||||
\texttt{AB}, \texttt{AC}, \texttt{AD},
|
\texttt{AB}, \texttt{AC}, \texttt{AD},
|
||||||
|
@ -69,19 +77,18 @@ A \key{prefix} is a subtring that starts at the beginning
|
||||||
of a string,
|
of a string,
|
||||||
and a \key{suffix} is a substring that ends at the end
|
and a \key{suffix} is a substring that ends at the end
|
||||||
of a string.
|
of a string.
|
||||||
For example, for the string \texttt{ABCD},
|
For example,
|
||||||
the prefixes are
|
the prefixes of \texttt{ABCD} are
|
||||||
\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD}
|
\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD},
|
||||||
and the suffixes are
|
and the suffixes of \texttt{ABCD} are
|
||||||
\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
|
\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
|
||||||
|
|
||||||
\index{rotation}
|
\index{rotation}
|
||||||
|
|
||||||
A \key{rotation} can be generated by moving
|
A \key{rotation} can be generated by moving
|
||||||
characters one by one from the beginning
|
the characters of a string one by one from the beginning
|
||||||
to the end of a string (or vice versa).
|
to the end (or vice versa).
|
||||||
For example, the rotations of the string
|
For example, the rotations of \texttt{ABCD} are
|
||||||
\texttt{ABCD} are
|
|
||||||
\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
|
\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
|
||||||
|
|
||||||
\index{period}
|
\index{period}
|
||||||
|
@ -97,13 +104,13 @@ For example, the shortest period of
|
||||||
|
|
||||||
A \key{border} is a string that is both
|
A \key{border} is a string that is both
|
||||||
a prefix and a suffix of a string.
|
a prefix and a suffix of a string.
|
||||||
For example, the borders of the string \texttt{ABACABA}
|
For example, the borders of \texttt{ABACABA}
|
||||||
are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
|
are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
|
||||||
|
|
||||||
\index{lexicographical order}
|
\index{lexicographical order}
|
||||||
|
|
||||||
Strings are usually compared using the \key{lexicographical order}
|
Strings are compared using the \key{lexicographical order}
|
||||||
that corresponds to the alphabetical order.
|
(which corresponds to the alphabetical order).
|
||||||
It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
|
It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
|
||||||
or there is a position $k$ such that
|
or there is a position $k$ such that
|
||||||
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
|
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
|
||||||
|
@ -112,11 +119,10 @@ $x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
|
||||||
|
|
||||||
\index{trie}
|
\index{trie}
|
||||||
|
|
||||||
A \key{trie} is a tree structure that
|
A \key{trie} is a rooted tree that
|
||||||
maintains a set of strings.
|
maintains a set of strings.
|
||||||
Each string is stored as
|
Each string in the set is stored as
|
||||||
a chain of characters starting at
|
a chain of characters that starts at the root.
|
||||||
the root node.
|
|
||||||
If two strings have a common prefix,
|
If two strings have a common prefix,
|
||||||
they also have a common chain in the tree.
|
they also have a common chain in the tree.
|
||||||
|
|
||||||
|
@ -156,38 +162,39 @@ For example, consider the following trie:
|
||||||
This trie corresponds to the set
|
This trie corresponds to the set
|
||||||
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
|
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
|
||||||
The character * in a node means that
|
The character * in a node means that
|
||||||
one of the strings in the set ends at the node.
|
a string in the set ends at the node.
|
||||||
Such a character is needed, because a string
|
Such a character is needed, because a string
|
||||||
may be a prefix of another string.
|
may be a prefix of another string.
|
||||||
For example, in the above trie, \texttt{THE}
|
For example, in the above trie, \texttt{THE}
|
||||||
is a prefix of \texttt{THERE}.
|
is a prefix of \texttt{THERE}.
|
||||||
|
|
||||||
We can check if a trie contains a string
|
We can check in $O(n)$ time whether a trie
|
||||||
in $O(n)$ time where $n$ is the length of the string,
|
contains a string of length $n$,
|
||||||
because we can follow the chain that starts at the root node.
|
because we can follow the chain that starts at the root node.
|
||||||
We can also add a new string to the trie
|
We can also add a string of length $n$ to the trie
|
||||||
in $O(n)$ time using a similar idea.
|
in $O(n)$ time by first following the chain
|
||||||
If needed, new nodes will be added to the trie.
|
and then adding new nodes to the trie if necessary.
|
||||||
|
|
||||||
Using a trie, we can also find
|
Using a trie, we can find
|
||||||
for a given string the longest prefix
|
the longest prefix of a given string
|
||||||
that belongs to the set.
|
such that the prefix belongs to the set.
|
||||||
In addition, by storing additional information
|
Moreover, by storing additional information
|
||||||
in each node,
|
in each node,
|
||||||
it is possible to calculate the number of
|
we can calculate the number of
|
||||||
strings that have a given prefix.
|
strings that belong to the set and have a
|
||||||
|
given string as a prefix.
|
||||||
|
|
||||||
A trie can be stored in an array
|
A trie can be stored in an array
|
||||||
\begin{lstlisting}
|
\begin{lstlisting}
|
||||||
int t[N][A];
|
int trie[N][A];
|
||||||
\end{lstlisting}
|
\end{lstlisting}
|
||||||
where $N$ is the maximum number of nodes
|
where $N$ is the maximum number of nodes
|
||||||
(the maximum total length of the strings in the set)
|
(the maximum total length of the strings in the set)
|
||||||
and $A$ is the size of the alphabet.
|
and $A$ is the size of the alphabet.
|
||||||
The nodes of a trie are numbered
|
The nodes of a trie are numbered
|
||||||
$1,2,3,\ldots$ so that the number of the root is 1,
|
$0,1,2,\ldots$ so that the number of the root is 0,
|
||||||
and $\texttt{t}[s][c]$ is the next node in the chain
|
and $\texttt{trie}[s][c]$ is the next node in the chain
|
||||||
from node $s$ using character $c$.
|
when we move from node $s$ using character $c$.
|
||||||
|
|
||||||
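The operations described above can be sketched as follows. This is only one possible implementation under stated assumptions: the alphabet is assumed to consist of the capital letters \texttt{A}--\texttt{Z}, the array grows dynamically instead of having a fixed size $N$, the value 0 marks a missing child (the root can never be a child), and the names \texttt{Trie}, \texttt{ends}, \texttt{add} and \texttt{contains} are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

const int A = 26;  // size of the alphabet (assumption: letters A..Z)

struct Trie {
    // trie[s][c] is the next node when we move from node s using
    // character c, or 0 if there is no such node. The root is node 0.
    std::vector<std::vector<int>> trie;
    // ends[s] is true if a string in the set ends at node s
    // (this plays the role of the character * in the picture).
    std::vector<bool> ends;

    Trie() : trie(1, std::vector<int>(A, 0)), ends(1, false) {}

    // Adds a string of length n in O(n) time: follow the chain
    // and create new nodes when necessary.
    void add(const std::string& str) {
        int s = 0;
        for (char ch : str) {
            int c = ch - 'A';
            assert(0 <= c && c < A);
            if (trie[s][c] == 0) {
                trie[s][c] = static_cast<int>(trie.size());
                trie.push_back(std::vector<int>(A, 0));
                ends.push_back(false);
            }
            s = trie[s][c];
        }
        ends[s] = true;
    }

    // Checks in O(n) time whether the set contains the string.
    bool contains(const std::string& str) const {
        int s = 0;
        for (char ch : str) {
            int c = ch - 'A';
            if (c < 0 || c >= A || trie[s][c] == 0) return false;
            s = trie[s][c];
        }
        return ends[s];
    }
};
```

After adding \texttt{CANAL}, \texttt{CANDY}, \texttt{THE} and \texttt{THERE}, the structure has 13 nodes including the root, and \texttt{contains} accepts \texttt{THE} but rejects \texttt{THER}, because no string of the set ends at that node.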
\section{String hashing}
|
\section{String hashing}
|
||||||
|
|
||||||
|
@ -199,7 +206,7 @@ allows us to efficiently check whether two
|
||||||
strings are equal\footnote{The technique
|
strings are equal\footnote{The technique
|
||||||
was popularized by the Karp–Rabin pattern matching
|
was popularized by the Karp–Rabin pattern matching
|
||||||
algorithm \cite{kar87}.}.
|
algorithm \cite{kar87}.}.
|
||||||
The idea is to compare the hash values of the
|
The idea in string hashing is to compare hash values of
|
||||||
strings instead of their individual characters.
|
strings instead of their individual characters.
|
||||||
|
|
||||||
\subsubsection*{Calculating hash values}
|
\subsubsection*{Calculating hash values}
|
||||||
|
@ -217,15 +224,15 @@ based on their hash values.
|
||||||
|
|
||||||
A usual way to implement string hashing
|
A usual way to implement string hashing
|
||||||
is \key{polynomial hashing}, which means
|
is \key{polynomial hashing}, which means
|
||||||
that the hash value is calculated using the formula
|
that the hash value of a string \texttt{s}
|
||||||
|
of length $n$ is
|
||||||
\[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B ,\]
|
\[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B ,\]
|
||||||
where \texttt{s} is a string of length $n$
|
where $\texttt{s}[0],\texttt{s}[1],\ldots,\texttt{s}[n-1]$
|
||||||
(so $s[0],s[1],\ldots,s[n-1]$
|
are interpreted as the codes of the characters of \texttt{s},
|
||||||
are the codes of the characters),
|
|
||||||
and $A$ and $B$ are pre-chosen constants.
|
and $A$ and $B$ are pre-chosen constants.
|
||||||
|
|
||||||
For example, the codes of the characters
|
For example, the codes of the characters
|
||||||
in the string \texttt{ALLEY} are:
|
of \texttt{ALLEY} are:
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
\draw (0,0) grid (5,2);
|
\draw (0,0) grid (5,2);
|
||||||
|
@ -246,41 +253,37 @@ in the string \texttt{ALLEY} are:
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
Thus, if $A=3$ and $B=97$, the hash value
|
Thus, if $A=3$ and $B=97$, the hash value
|
||||||
of the string \texttt{ALLEY} is
|
of \texttt{ALLEY} is
|
||||||
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
|
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
|
||||||
|
|
||||||
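The formula can be evaluated from left to right without computing the powers of $A$ separately. The following sketch does this; the function name \texttt{polynomial\_hash} is hypothetical, and we assume that $B$ is small enough (at most about $10^9$) that the intermediate products fit in a 64-bit integer.

```cpp
#include <cassert>
#include <string>

// Calculates (s[0]*A^(n-1) + s[1]*A^(n-2) + ... + s[n-1]*A^0) mod B
// by processing the characters from left to right.
long long polynomial_hash(const std::string& s, long long a, long long b) {
    assert(b > 0);
    long long h = 0;
    for (unsigned char c : s) {
        // multiplying by a shifts all earlier terms to the next power
        h = (h * a + c) % b;
    }
    return h;
}
```

With $A=3$ and $B=97$, calling \texttt{polynomial\_hash("ALLEY", 3, 97)} gives the intermediate values 65, 77, 16, 20 and finally 52, which agrees with the calculation above.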
\subsubsection*{Preprocessing}
|
\subsubsection*{Preprocessing}
|
||||||
|
|
||||||
It turns out that using polynomial hashing,
|
Using polynomial hashing, we can calculate the hash value of any substring
|
||||||
we can calculate the hash value of any substring
|
of a string \texttt{s} in $O(1)$ time after an $O(n)$ time preprocessing.
|
||||||
of a string
|
The idea is to construct an array \texttt{h} such that
|
||||||
in $O(1)$ time after an $O(n)$ time preprocessing.
|
$\texttt{h}[k]$ contains the hash value of the prefix $\texttt{s}[0 \ldots k]$.
|
||||||
|
|
||||||
The idea is to construct an array $h$ such that
|
|
||||||
$h[k]$ contains the hash value of the prefix
|
|
||||||
of the string that ends at position $k$.
|
|
||||||
The array values can be recursively calculated as follows:
|
The array values can be recursively calculated as follows:
|
||||||
\[
|
\[
|
||||||
\begin{array}{lcl}
|
\begin{array}{lcl}
|
||||||
h[0] & = & \texttt{s}[0] \\
|
\texttt{h}[0] & = & \texttt{s}[0] \\
|
||||||
h[k] & = & (h[k-1] A + \texttt{s}[k]) \bmod B \\
|
\texttt{h}[k] & = & (\texttt{h}[k-1] A + \texttt{s}[k]) \bmod B \\
|
||||||
\end{array}
|
\end{array}
|
||||||
\]
|
\]
|
||||||
In addition, we construct an array $p$
|
In addition, we construct an array $\texttt{p}$
|
||||||
where $p[k]=A^k \bmod B$:
|
where $\texttt{p}[k]=A^k \bmod B$:
|
||||||
\[
|
\[
|
||||||
\begin{array}{lcl}
|
\begin{array}{lcl}
|
||||||
p[0] & = & 1 \\
|
\texttt{p}[0] & = & 1 \\
|
||||||
p[k] & = & (p[k-1] A) \bmod B. \\
|
\texttt{p}[k] & = & (\texttt{p}[k-1] A) \bmod B. \\
|
||||||
\end{array}
|
\end{array}
|
||||||
\]
|
\]
|
||||||
Constructing these arrays takes $O(n)$ time.
|
Constructing these arrays takes $O(n)$ time.
|
||||||
After this, the hash value of a substring
|
After this, the hash value of any substring
|
||||||
that begins at position $a$ and ends at position $b$
|
$\texttt{s}[a \ldots b]$
|
||||||
can be calculated in $O(1)$ time using the formula
|
can be calculated in $O(1)$ time using the formula
|
||||||
\[(h[b]-h[a-1] p[b-a+1]) \bmod B\]
|
\[(\texttt{h}[b]-\texttt{h}[a-1] \texttt{p}[b-a+1]) \bmod B\]
|
||||||
assuming that $a>0$.
|
assuming that $a>0$.
|
||||||
If $a=0$, the hash value is simply $h[b]$.
|
If $a=0$, the hash value is simply $\texttt{h}[b]$.
|
||||||
|
|
||||||
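The preprocessing and the substring formula can be sketched as follows. The structure name \texttt{PrefixHash} and the method name \texttt{substring\_hash} are our own assumptions. Note one detail that the formula does not show: in C++ the remainder of a negative number may be negative, so the sketch adds $B$ to the result when needed.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct PrefixHash {
    long long A, B;            // constants of the polynomial hash
    std::vector<long long> h;  // h[k] = hash value of the prefix s[0..k]
    std::vector<long long> p;  // p[k] = A^k mod B

    // O(n) time preprocessing for a nonempty string s
    PrefixHash(const std::string& s, long long a_const, long long b_const)
        : A(a_const), B(b_const), h(s.size()), p(s.size()) {
        assert(!s.empty() && B > 0);
        h[0] = static_cast<unsigned char>(s[0]) % B;
        p[0] = 1 % B;
        for (std::size_t k = 1; k < s.size(); k++) {
            h[k] = (h[k - 1] * A + static_cast<unsigned char>(s[k])) % B;
            p[k] = (p[k - 1] * A) % B;
        }
    }

    // Hash value of the substring s[a..b] in O(1) time
    long long substring_hash(int a, int b) const {
        if (a == 0) return h[b];
        long long value = (h[b] - h[a - 1] * p[b - a + 1]) % B;
        // the remainder may be negative in C++, so normalize it
        return value < 0 ? value + B : value;
    }
};
```

For example, for \texttt{ALLEY} with $A=3$ and $B=97$, the substring \texttt{LL} at positions $1 \ldots 2$ has the hash value $(16 - 65 \cdot 9) \bmod 97 = 13$, which equals $(76 \cdot 3 + 76) \bmod 97$ calculated directly.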
\subsubsection*{Using hash values}
|
\subsubsection*{Using hash values}
|
||||||
|
|
||||||
|
@ -364,7 +367,7 @@ The probability of one or more collisions is
|
||||||
|
|
||||||
\[1-(1-\frac{1}{B})^n.\]
|
\[1-(1-\frac{1}{B})^n.\]
|
||||||
|
|
||||||
\textit{Scenario 3:} Strings $x_1,x_2,\ldots,x_n$
|
\textit{Scenario 3:} All pairs of strings $x_1,x_2,\ldots,x_n$
|
||||||
are compared with each other.
|
are compared with each other.
|
||||||
The probability of one or more collisions is
|
The probability of one or more collisions is
|
||||||
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
|
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
|
||||||
|
@ -398,7 +401,7 @@ $B \approx 10^9$.
|
||||||
|
|
||||||
The phenomenon in scenario 3 is known as the
|
The phenomenon in scenario 3 is known as the
|
||||||
\key{birthday paradox}: if there are $n$ people
|
\key{birthday paradox}: if there are $n$ people
|
||||||
in a room, the probability that some two people
|
in a room, the probability that \emph{some} two people
|
||||||
have the same birthday is large even if $n$ is quite small.
|
have the same birthday is large even if $n$ is quite small.
|
||||||
In hashing, correspondingly, when all hash values are compared
|
In hashing, correspondingly, when all hash values are compared
|
||||||
with each other, the probability that some two
|
with each other, the probability that some two
|
||||||
|
@ -417,7 +420,7 @@ which makes the probability of a collision very small.
|
||||||
Some people use constants $B=2^{32}$ and $B=2^{64}$,
|
Some people use constants $B=2^{32}$ and $B=2^{64}$,
|
||||||
which is convenient, because operations with 32 and 64
|
which is convenient, because operations with 32 and 64
|
||||||
bit integers are calculated modulo $2^{32}$ and $2^{64}$.
|
bit integers are calculated modulo $2^{32}$ and $2^{64}$.
|
||||||
However, this is not a good choice, because it is possible
|
However, this is \emph{not} a good choice, because it is possible
|
||||||
to construct inputs that always generate collisions when
|
to construct inputs that always generate collisions when
|
||||||
constants of the form $2^x$ are used \cite{pac13}.
|
constants of the form $2^x$ are used \cite{pac13}.
|
||||||
|
|
||||||
|
@ -426,17 +429,16 @@ constants of the form $2^x$ are used \cite{pac13}.
|
||||||
\index{Z-algorithm}
|
\index{Z-algorithm}
|
||||||
\index{Z-array}
|
\index{Z-array}
|
||||||
|
|
||||||
The \key{Z-array} of a string
|
The \key{Z-array} \texttt{z} of a string \texttt{s}
|
||||||
contains for each position of the string
|
of length $n$ contains for each $k=0,1,\ldots,n-1$
|
||||||
the length of the longest substring
|
the length of the longest substring of \texttt{s}
|
||||||
that begins at that position and is a prefix of the string.
|
that begins at position $k$ and is a prefix of \texttt{s}.
|
||||||
Such an array can be efficiently constructed
|
Thus, $\texttt{z}[k]=p$ tells us that
|
||||||
using the \key{Z-algorithm}\footnote{The Z-algorithm
|
$\texttt{s}[0 \ldots p-1]$ equals $\texttt{s}[k \ldots k+p-1]$.
|
||||||
was presented in \cite{gus97} as the simplest known
|
Many string processing problems can be efficiently solved
|
||||||
method for linear-time pattern matching, and the original idea
|
using the Z-array.
|
||||||
was attributed to \cite{mai84}.}.
|
|
||||||
|
|
||||||
For example, the Z-array of the string
|
For example, the Z-array of
|
||||||
\texttt{ACBACDACBACBACDA} is as follows:
|
\texttt{ACBACDACBACBACDA} is as follows:
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
|
@ -498,48 +500,45 @@ For example, the Z-array of the string
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
For example, the value at position 6 of the
|
In this case, for example, $\texttt{z}[6]=5$,
|
||||||
above Z-array is 5,
|
|
||||||
because the substring \texttt{ACBAC} of length 5
|
because the substring \texttt{ACBAC} of length 5
|
||||||
is a prefix of the string,
|
is a prefix of \texttt{s},
|
||||||
but the substring \texttt{ACBACB} of length 6
|
but the substring \texttt{ACBACB} of length 6
|
||||||
is not a prefix of the string.
|
is not a prefix of \texttt{s}.
|
||||||
|
|
||||||
It is often a matter of taste whether to use
|
|
||||||
string hashing or the Z-algorithm.
|
|
||||||
Unlike hashing, the Z-algorithm always works
|
|
||||||
and there is no risk for collisions.
|
|
||||||
On the other hand, the Z-algorithm is more difficult
|
|
||||||
to implement and some problems can only be solved
|
|
||||||
using hashing.
|
|
||||||
|
|
||||||
\subsubsection*{Algorithm description}
|
\subsubsection*{Algorithm description}
|
||||||
|
|
||||||
The Z-algorithm scans the string from left
|
Next we describe an algorithm,
|
||||||
to right, and calculates for each position
|
called the \key{Z-algorithm}\footnote{The Z-algorithm
|
||||||
the length of the longest substring that
|
was presented in \cite{gus97} as the simplest known
|
||||||
is a prefix of the string.
|
method for linear-time pattern matching, and the original idea
|
||||||
A straightforward algorithm
|
was attributed to \cite{mai84}.},
|
||||||
would have a time complexity of $O(n^2)$,
|
that efficiently constructs the Z-array in $O(n)$ time.
|
||||||
but the Z-algorithm has an important
|
The algorithm calculates the Z-array values
|
||||||
optimization which ensures that the time complexity
|
from left to right by both using information
|
||||||
is only $O(n)$.
|
already stored in the Z-array and comparing substrings
|
||||||
|
character by character.
|
||||||
|
|
||||||
The idea is to maintain a range $[x,y]$ such that
|
To efficiently calculate the Z-array values,
|
||||||
the substring from $x$ to $y$ is a prefix of
|
the algorithm maintains a range $[x,y]$ such that
|
||||||
the string and $y$ is as large as possible.
|
$\texttt{s}[x \ldots y]$ is a prefix of \texttt{s}
|
||||||
Since the characters in the ranges $[0,y-x]$
|
and $y$ is as large as possible.
|
||||||
and $[x,y]$ are the same,
|
Since we know that $\texttt{s}[0 \ldots y-x]$
|
||||||
we can use this information to calculate
|
and $\texttt{s}[x \ldots y]$ are equal,
|
||||||
the Z-array values in the range $[x,y]$.
|
we can use this information when calculating
|
||||||
|
Z-values for positions $x+1,x+2,\ldots,y$.
|
||||||
|
|
||||||
The time complexity of the Z-algorithm is $O(n)$,
|
At each position $k$, we first
|
||||||
because the algorithm only compares strings
|
check the value of $\texttt{z}[k-x]$.
|
||||||
character by character starting at position $y+1$.
|
If $k+\texttt{z}[k-x] \le y$, we know that $\texttt{z}[k]=\texttt{z}[k-x]$.
|
||||||
If the characters match, the value of $y$ increases,
|
However, if $k+\texttt{z}[k-x] > y$,
|
||||||
and it is not needed to compare the character at
|
$\texttt{s}[0 \ldots y-k]$ equals
|
||||||
position $y$ again
|
$\texttt{s}[k \ldots y]$, and to determine the
|
||||||
but the information in the Z-array can be used.
|
value of $\texttt{z}[k]$ we need to compare
|
||||||
|
the substrings character by character.
|
||||||
|
Still, the algorithm works in $O(n)$ time,
|
||||||
|
because we start comparing at positions
|
||||||
|
$y-k+1$ and $y+1$.
|
||||||
|
|
||||||
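The description above can be turned into code roughly as follows. This is a sketch rather than a definitive implementation: the function name \texttt{z\_array} is hypothetical, and we assume the convention that $\texttt{z}[0]=0$. The value copied from the beginning of the array is limited to $y-k+1$, so the character-by-character comparison never restarts before position $y+1$.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Constructs the Z-array of s in O(n) time.
// Assumption: z[0] is left as 0.
std::vector<int> z_array(const std::string& s) {
    int n = s.size();
    std::vector<int> z(n, 0);
    int x = 0, y = 0;  // s[x..y] is a prefix of s and y is as large as possible
    for (int k = 1; k < n; k++) {
        if (k <= y) {
            // reuse the value stored at position k-x, but do not
            // trust anything beyond position y
            z[k] = std::min(z[k - x], y - k + 1);
        }
        // extend the match character by character
        while (k + z[k] < n && s[z[k]] == s[k + z[k]]) z[k]++;
        // update the range if the new match reaches further right
        if (k + z[k] - 1 > y) {
            x = k;
            y = k + z[k] - 1;
        }
    }
    return z;
}
```

For \texttt{ACBACDACBACBACDA}, the function yields $\texttt{z}[3]=2$, $\texttt{z}[6]=5$, $\texttt{z}[9]=7$, $\texttt{z}[12]=2$ and $\texttt{z}[15]=1$, and all other values are 0.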
For example, let us construct the following Z-array:
|
For example, let us construct the following Z-array:
|
||||||
|
|
||||||
|
@ -602,10 +601,8 @@ For example, let us construct the following Z-array:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
The first interesting position is 6 where the
|
After calculating the value $\texttt{z}[6]=5$,
|
||||||
length of the common prefix is 5.
|
the current $[x,y]$ range is $[6,10]$:
|
||||||
After calculating this value,
|
|
||||||
the current $[x,y]$ range will be $[6,10]$:
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -673,17 +670,15 @@ the current $[x,y]$ range will be $[6,10]$:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
Now, it is possible to calculate the
|
Now we can calculate
|
||||||
subsequent values of the Z-array
|
subsequent Z-array values
|
||||||
more efficiently,
|
efficiently,
|
||||||
because we know that
|
because we know that
|
||||||
the ranges $[0,4]$ and $[6,10]$
|
$\texttt{s}[0 \ldots 4]$ and
|
||||||
contain the same characters.
|
$\texttt{s}[6 \ldots 10]$ are equal.
|
||||||
First, since the values at
|
First, since $\texttt{z}[1] = \texttt{z}[2] = 0$,
|
||||||
positions 1 and 2 are 0,
|
we immediately know that also
|
||||||
we immediately know that
|
$\texttt{z}[7] = \texttt{z}[8] = 0$:
|
||||||
the values at positions 7 and 8
|
|
||||||
are also 0:
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -755,9 +750,7 @@ are also 0:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
After this, we know that the value
|
Then, since $\texttt{z}[3]=2$, we know that $\texttt{z}[9] \ge 2$:
|
||||||
at position 9 will be at least 2,
|
|
||||||
because the value at position 3 is 2:
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -826,8 +819,8 @@ because the value at position 3 is 2:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
Since we have no information about the characters
|
However, we have no information about the string
|
||||||
after position 10, we have to begin to compare the strings
|
after position 10, so we need to compare the substrings
|
||||||
character by character:
|
character by character:
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
|
@ -901,10 +894,8 @@ character by character:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
|
It turns out that $\texttt{z}[9]=7$,
|
||||||
It turns out that the length of the common
|
so the new $[x,y]$ range is $[9,15]$:
|
||||||
prefix at position 9 is 7,
|
|
||||||
and thus the new range $[x,y]$ is $[9,15]$:
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -973,10 +964,9 @@ and thus the new range $[x,y]$ is $[9,15]$:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
After this, all subsequent values of the Z-array
|
After this, all the remaining Z-array values
|
||||||
can be calculated using the values already
|
can be determined by using the information
|
||||||
stored in the array. All the remaining values can be
|
already stored in the Z-array:
|
||||||
directly retrieved from the beginning of the Z-array:
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=0.7]
|
\begin{tikzpicture}[scale=0.7]
|
||||||
|
@ -1045,10 +1035,18 @@ directly retrieved from the beginning of the Z-array:
|
||||||
|
|
||||||
\subsubsection{Using the Z-array}
|
\subsubsection{Using the Z-array}
|
||||||
|
|
||||||
As an example, let us consider again
|
It is often a matter of taste whether to use
|
||||||
|
string hashing or the Z-algorithm.
|
||||||
|
Unlike hashing, the Z-algorithm always works
|
||||||
|
and there is no risk of collisions.
|
||||||
|
On the other hand, the Z-algorithm is more difficult
|
||||||
|
to implement and some problems can only be solved
|
||||||
|
using hashing.
|
||||||
|
|
||||||
|
As an example, consider again
|
||||||
the pattern matching problem,
|
the pattern matching problem,
|
||||||
where our task is to find the positions
|
where our task is to find the occurrences
|
||||||
where a pattern $p$ occurs in a string $s$.
|
of a pattern $p$ in a string $s$.
|
||||||
We already solved this problem efficiently
|
We already solved this problem efficiently
|
||||||
using string hashing, but the Z-algorithm
|
using string hashing, but the Z-algorithm
|
||||||
provides another way to solve the problem.
|
provides another way to solve the problem.
|
||||||
|
@ -1123,7 +1121,7 @@ the Z-array is as follows:
|
||||||
The positions 5 and 10 contain the value 3,
|
The positions 5 and 10 contain the value 3,
|
||||||
which means that the pattern \texttt{ATT}
|
which means that the pattern \texttt{ATT}
|
||||||
occurs in the corresponding positions
|
occurs in the corresponding positions
|
||||||
in the string \texttt{HATTIVATTI}.
|
of \texttt{HATTIVATTI}.
|
||||||
|
|
||||||
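The approach can be sketched in code as follows, assuming the usual construction in which the pattern, a separator character and the string are concatenated before the Z-array is built. The separator \texttt{\#} is an assumption (any character that occurs in neither string works), and the names \texttt{z\_array} and \texttt{find\_with\_z} are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Z-array in O(n) time (assumption: z[0] is left as 0)
std::vector<int> z_array(const std::string& s) {
    int n = s.size();
    std::vector<int> z(n, 0);
    int x = 0, y = 0;
    for (int k = 1; k < n; k++) {
        if (k <= y) z[k] = std::min(z[k - x], y - k + 1);
        while (k + z[k] < n && s[z[k]] == s[k + z[k]]) z[k]++;
        if (k + z[k] - 1 > y) { x = k; y = k + z[k] - 1; }
    }
    return z;
}

// Returns the zero-based positions in s where the pattern p occurs.
std::vector<int> find_with_z(const std::string& p, const std::string& s,
                             char separator = '#') {
    assert(!p.empty());
    // the separator must not occur in either string
    assert(p.find(separator) == std::string::npos);
    assert(s.find(separator) == std::string::npos);
    std::string combined = p + separator + s;
    std::vector<int> z = z_array(combined);
    std::vector<int> result;
    int m = p.size();
    for (int i = m + 1; i < static_cast<int>(combined.size()); i++) {
        // a Z-value equal to the pattern length marks an occurrence;
        // subtract m+1 to get the position in the original string
        if (z[i] == m) result.push_back(i - m - 1);
    }
    return result;
}
```

For the pattern \texttt{ATT} and the string \texttt{HATTIVATTI}, the combined string is \texttt{ATT\#HATTIVATTI}, and the Z-values at positions 5 and 10 correspond to occurrences at positions 1 and 6 of \texttt{HATTIVATTI}.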
The time complexity of the resulting algorithm
|
The time complexity of the resulting algorithm
|
||||||
is linear, because it suffices to construct
|
is linear, because it suffices to construct
|
||||||
|
|