diff --git a/chapter26.tex b/chapter26.tex
index c5bc1a8..a1011ad 100644
--- a/chapter26.tex
+++ b/chapter26.tex
@@ -9,17 +9,17 @@ time.
 
 \index{pattern matching}
 
-For example, a fundamental problem related to strings
-is the \key{pattern matching} problem:
+For example, a fundamental string processing
+problem is the \key{pattern matching} problem:
 given a string of length $n$ and a pattern of length $m$,
-our task is to find the positions where the pattern
-occurs in the string.
+our task is to find the occurrences of the pattern
+in the string.
 For example, the pattern \texttt{ABC} occurs two times
 in the string \texttt{ABABCBABC}.
 
-The pattern matching problem is easy to solve
+The pattern matching problem can be easily solved
 in $O(nm)$ time by a brute force algorithm that
-goes through all positions where the pattern may
+tests all positions where the pattern may
 occur in the string.
 However, in this chapter, we will see that
 there are more efficient algorithms that require only
@@ -31,8 +31,13 @@ $O(n+m)$ time.
 
 \index{alphabet}
 
-An \key{alphabet} is a set of characters
-that may appear in strings.
+Throughout the chapter, we assume that
+zero-based indexing is used in strings.
+Thus, a string \texttt{s} of length $n$
+consists of characters
+$\texttt{s}[0],\texttt{s}[1],\ldots,\texttt{s}[n-1]$.
+The set of characters that may appear
+in strings is called an \key{alphabet}.
 For example, the alphabet
 $\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
 consists of the capital letters of English.
@@ -40,9 +45,12 @@ consists of the capital letters of English.
 \index{substring}
 
 A \key{substring} is a sequence of consecutive
-characters of a string.
-The number of substrings of a string is $n(n+1)/2$.
-For example, the substrings of the string
+characters in a string.
+We use the notation $\texttt{s}[a \ldots b]$
+to refer to a substring of \texttt{s}
+that begins at position $a$ and ends at position $b$.
+A string of length $n$ has $n(n+1)/2$ substrings.
+For example, the substrings of
 \texttt{ABCD} are
 \texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
 \texttt{AB}, \texttt{BC}, \texttt{CD},
@@ -52,9 +60,9 @@ For example, the substrings of the string
 
 A \key{subsequence} is a sequence of
 (not necessarily consecutive) characters
-of a string in their original order.
-The number of subsequences of a string is $2^n-1$.
-For example, the subsequences of the string
+in a string in their original order.
+A string of length $n$ has $2^n-1$ subsequences.
+For example, the subsequences of
 \texttt{ABCD} are
 \texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
 \texttt{AB}, \texttt{AC}, \texttt{AD},
@@ -69,19 +77,18 @@ A \key{prefix} is a subtring that starts at the beginning
 of a string,
 and a \key{suffix} is a substring that ends at the end
 of a string.
-For example, for the string \texttt{ABCD},
-the prefixes are
-\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD}
-and the suffixes are
+For example,
+the prefixes of \texttt{ABCD} are
+\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD},
+and the suffixes of \texttt{ABCD} are
 \texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
 
 \index{rotation}
 
 A \key{rotation} can be generated by moving
-characters one by one from the beginning
-to the end of a string (or vice versa).
-For example, the rotations of the string
-\texttt{ABCD} are
+the characters of a string one by one from the beginning
+to the end (or vice versa).
+For example, the rotations of \texttt{ABCD} are
 \texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
 
 \index{period}
@@ -97,13 +104,13 @@ For example, the shortest period of
 
 A \key{border} is a string that is both a
 prefix and a suffix of a string.
-For example, the borders of the string \texttt{ABACABA}
+For example, the borders of \texttt{ABACABA}
 are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
 
 \index{lexicographical order}
 
-Strings are usually compared using the \key{lexicographical order}
-that corresponds to the alphabetical order.
+Strings are compared using the \key{lexicographical order}
+(which corresponds to the alphabetical order).
 It means that $x0$.
-If $a=0$, the hash value is simply $h[b]$.
+If $a=0$, the hash value is simply $\texttt{h}[b]$.
 
 \subsubsection*{Using hash values}
 
@@ -364,7 +367,7 @@ The probability of one or more collisions is
 \[1-(1-\frac{1}{B})^n.\]
 
-\textit{Scenario 3:} Strings $x_1,x_2,\ldots,x_n$
+\textit{Scenario 3:} All pairs of strings $x_1,x_2,\ldots,x_n$
 are compared with each other.
 The probability of one or more collisions is
 \[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
@@ -398,7 +401,7 @@ $B \approx 10^9$.
 
 The phenomenon in scenario 3 is known as the
 \key{birthday paradox}: if there are $n$ people
-in a room, the probability that some two people
+in a room, the probability that \emph{some} two people
 have the same birthday is large even if $n$ is quite small.
 In hashing, correspondingly, when all hash values are compared with
 each other, the probability that some two
@@ -417,7 +420,7 @@ which makes the probability of a collision very small.
 Some people use constants $B=2^{32}$ and $B=2^{64}$,
 which is convenient, because operations with 32 and 64
 bit integers are calculated modulo $2^{32}$ and $2^{64}$.
-However, this is not a good choice, because it is possible
+However, this is \emph{not} a good choice, because it is possible
 to construct inputs that always generate collisions when
 constants of the form $2^x$ are used \cite{pac13}.
 
@@ -426,17 +429,16 @@ constants of the form $2^x$ are used \cite{pac13}.
 \index{Z-algorithm}
 \index{Z-array}
 
-The \key{Z-array} of a string
-contains for each position of the string
-the length of the longest substring
-that begins at that position and is a prefix of the string.
-Such an array can be efficiently constructed
-using the \key{Z-algorithm}\footnote{The Z-algorithm
-was presented in \cite{gus97} as the simplest known
-method for linear-time pattern matching, and the original idea
-was attributed to \cite{mai84}.}.
+The \key{Z-array} \texttt{z} of a string \texttt{s}
+of length $n$ contains for each $k=0,1,\ldots,n-1$
+the length of the longest substring of \texttt{s}
+that begins at position $k$ and is a prefix of \texttt{s}.
+Thus, $\texttt{z}[k]=p$ tells us that
+$\texttt{s}[0 \ldots p-1]$ equals $\texttt{s}[k \ldots k+p-1]$.
+Many string processing problems can be efficiently solved
+using the Z-array.
 
-For example, the Z-array of the string
+For example, the Z-array of
 \texttt{ACBACDACBACBACDA} is as follows:
 
 \begin{center}
@@ -498,48 +500,45 @@ For example, the Z-array of the string
 \end{tikzpicture}
 \end{center}
 
-For example, the value at position 6 of the
-above Z-array is 5,
+In this case, for example, $\texttt{z}[6]=5$,
 because the substring \texttt{ACBAC} of length 5
-is a prefix of the string,
+is a prefix of \texttt{s},
 but the substring \texttt{ACBACB} of length 6
-is not a prefix of the string.
-
-It is often a matter of taste whether to use
-string hashing or the Z-algorithm.
-Unlike hashing, the Z-algorithm always works
-and there is no risk for collisions.
-On the other hand, the Z-algorithm is more difficult
-to implement and some problems can only be solved
-using hashing.
+is not a prefix of \texttt{s}.
 
 \subsubsection*{Algorithm description}
 
-The Z-algorithm scans the string from left
-to right, and calculates for each position
-the length of the longest substring that
-is a prefix of the string.
-A straightforward algorithm
-would have a time complexity of $O(n^2)$,
-but the Z-algorithm has an important
-optimization which ensures that the time complexity
-is only $O(n)$.
+Next we describe an algorithm,
+called the \key{Z-algorithm}\footnote{The Z-algorithm
+was presented in \cite{gus97} as the simplest known
+method for linear-time pattern matching, and the original idea
+was attributed to \cite{mai84}.},
+that efficiently constructs the Z-array in $O(n)$ time.
+The algorithm calculates the Z-array values
+from left to right by both using information
+already stored in the Z-array and comparing substrings
+character by character.
 
-The idea is to maintain a range $[x,y]$ such that
-the substring from $x$ to $y$ is a prefix of
-the string and $y$ is as large as possible.
-Since the characters in the ranges $[0,y-x]$
-and $[x,y]$ are the same,
-we can use this information to calculate
-the Z-array values in the range $[x,y]$.
+To efficiently calculate the Z-array values,
+the algorithm maintains a range $[x,y]$ such that
+$\texttt{s}[x \ldots y]$ is a prefix of \texttt{s}
+and $y$ is as large as possible.
+Since we know that $\texttt{s}[0 \ldots y-x]$
+and $\texttt{s}[x \ldots y]$ are equal,
+we can use this information when calculating
+Z-values for positions $x+1,x+2,\ldots,y$.
 
-The time complexity of the Z-algorithm is $O(n)$,
-because the algorithm only compares strings
-character by character starting at position $y+1$.
-If the characters match, the value of $y$ increases,
-and it is not needed to compare the character at
-position $y$ again
-but the information in the Z-array can be used.
+At each position $k$, we first
+check the value of $\texttt{z}[k-x]$.
+If $k+\texttt{z}[k-x]