Improve language

Antti H S Laaksonen 2017-05-12 21:15:25 +03:00
parent bf51e8cb23
commit 41cc186beb
1 changed file with 147 additions and 149 deletions


@ -9,17 +9,17 @@ time.
\index{pattern matching}
For example, a fundamental string processing
problem is the \key{pattern matching} problem:
given a string of length $n$ and a pattern of length $m$,
our task is to find the occurrences of the pattern
in the string.
For example, the pattern \texttt{ABC} occurs two
times in the string \texttt{ABABCBABC}.
The pattern matching problem can be easily solved
in $O(nm)$ time by a brute force algorithm that
tests all positions where the pattern may
occur in the string.
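As a point of comparison, the brute force algorithm could be implemented for example as follows (a minimal sketch using zero-based indexing; the function name is our own choice):
\begin{lstlisting}
// brute force pattern matching in O(nm) time:
// test every position of s where p could start
vector<int> find_brute(string s, string p) {
    int n = s.size(), m = p.size();
    vector<int> pos;
    for (int i = 0; i+m <= n; i++) {
        bool match = true;
        for (int j = 0; j < m; j++) {
            if (s[i+j] != p[j]) {match = false; break;}
        }
        if (match) pos.push_back(i);
    }
    return pos;
}
\end{lstlisting}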
However, in this chapter, we will see that there
are more efficient algorithms that require only
@ -31,8 +31,13 @@ $O(n+m)$ time.
\index{alphabet}
Throughout the chapter, we assume that
zero-based indexing is used in strings.
Thus, a string \texttt{s} of length $n$
consists of characters
$\texttt{s}[0],\texttt{s}[1],\ldots,\texttt{s}[n-1]$.
The set of characters that may appear
in strings is called an \key{alphabet}.
For example, the alphabet
$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
consists of the capital letters of English.
@ -40,9 +45,12 @@ consists of the capital letters of English.
\index{substring}
A \key{substring} is a sequence of consecutive
characters in a string.
We use the notation $\texttt{s}[a \ldots b]$
to refer to a substring of \texttt{s}
that begins at position $a$ and ends at position $b$.
A string of length $n$ has $n(n+1)/2$ substrings.
For example, the substrings of
\texttt{ABCD} are
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
\texttt{AB}, \texttt{BC}, \texttt{CD},
@ -52,9 +60,9 @@ For example, the substrings of the string
A \key{subsequence} is a sequence of
(not necessarily consecutive) characters
in a string in their original order.
A string of length $n$ has $2^n-1$ subsequences.
For example, the subsequences of
\texttt{ABCD} are
\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D},
\texttt{AB}, \texttt{AC}, \texttt{AD},
@ -69,19 +77,18 @@ A \key{prefix} is a substring that starts at the beginning
of a string,
and a \key{suffix} is a substring that ends at the end
of a string.
For example,
the prefixes of \texttt{ABCD} are
\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD},
and the suffixes of \texttt{ABCD} are
\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}.
\index{rotation}
A \key{rotation} can be generated by moving
the characters of a string one by one from the beginning
to the end (or vice versa).
For example, the rotations of \texttt{ABCD} are
\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}.
\index{period}
@ -97,13 +104,13 @@ For example, the shortest period of
A \key{border} is a string that is both
a prefix and a suffix of a string.
For example, the borders of \texttt{ABACABA}
are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}.
\index{lexicographical order}
Strings are compared using the \key{lexicographical order}
(which corresponds to the alphabetical order).
It means that $x<y$ if either $x \neq y$ and $x$ is a prefix of $y$,
or there is a position $k$ such that
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
@ -112,11 +119,10 @@ $x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
\index{trie}
A \key{trie} is a rooted tree that
maintains a set of strings.
Each string in the set is stored as
a chain of characters that starts at the root.
If two strings have a common prefix,
they also have a common chain in the tree.
@ -156,38 +162,39 @@ For example, consider the following trie:
This trie corresponds to the set
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
The character * in a node means that
a string in the set ends at the node.
Such a character is needed, because a string
may be a prefix of another string.
For example, in the above trie, \texttt{THE}
is a prefix of \texttt{THERE}.
We can check in $O(n)$ time whether a trie
contains a string of length $n$,
because we can follow the chain that starts at the root node.
We can also add a string of length $n$ to the trie
in $O(n)$ time by first following the chain
and then adding new nodes to the trie if necessary.
Using a trie, we can find
the longest prefix of a given string
such that the prefix belongs to the set.
Moreover, by storing additional information
in each node,
we can calculate the number of
strings that belong to the set and have a
given string as a prefix.
A trie can be stored in an array
\begin{lstlisting}
int trie[N][A];
\end{lstlisting}
where $N$ is the maximum number of nodes
(the maximum total length of the strings in the set)
and $A$ is the size of the alphabet.
The nodes of a trie are numbered
$0,1,2,\ldots$ so that the number of the root is 0,
and $\texttt{trie}[s][c]$ is the next node in the chain
when we move from node $s$ using character $c$.
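For illustration, adding and searching for strings could be implemented as follows. This is only a sketch with our own assumptions: the alphabet is $\{\texttt{A},\ldots,\texttt{Z}\}$, the value $-1$ in \texttt{trie} marks a missing edge (so the array has to be initialized with $-1$ before use), and a separate array \texttt{ends} marks the nodes where a string of the set ends.
\begin{lstlisting}
int nodes = 1;   // node 0 is the root
bool ends[N];    // ends[s]: does a string of the set end at node s?

// add a string to the trie in O(n) time
void add_string(string s) {
    int v = 0;
    for (char x : s) {
        int c = x-'A';
        if (trie[v][c] == -1) trie[v][c] = nodes++;
        v = trie[v][c];
    }
    ends[v] = true;
}

// check in O(n) time whether the trie contains a string
bool contains(string s) {
    int v = 0;
    for (char x : s) {
        int c = x-'A';
        if (trie[v][c] == -1) return false;
        v = trie[v][c];
    }
    return ends[v];
}
\end{lstlisting}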
\section{String hashing}
@ -199,7 +206,7 @@ allows us to efficiently check whether two
strings are equal\footnote{The technique
was popularized by the Karp--Rabin pattern matching
algorithm \cite{kar87}.}.
The idea in string hashing is to compare hash values of
strings instead of their individual characters.
\subsubsection*{Calculating hash values}
@ -217,15 +224,15 @@ based on their hash values.
A usual way to implement string hashing
is \key{polynomial hashing}, which means
that the hash value of a string \texttt{s}
of length $n$ is
\[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B ,\]
where $s[0],s[1],\ldots,s[n-1]$
are interpreted as the codes of the characters of \texttt{s},
and $A$ and $B$ are pre-chosen constants.
For example, the codes of the characters
of \texttt{ALLEY} are:
\begin{center}
\begin{tikzpicture}[scale=0.7]
\draw (0,0) grid (5,2);
@ -246,41 +253,37 @@ in the string \texttt{ALLEY} are:
\end{center}
Thus, if $A=3$ and $B=97$, the hash value
of \texttt{ALLEY} is
\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
\subsubsection*{Preprocessing}
Using polynomial hashing, we can calculate the hash value of any substring
of a string \texttt{s} in $O(1)$ time after an $O(n)$ time preprocessing.
The idea is to construct an array \texttt{h} such that
$\texttt{h}[k]$ contains the hash value of the prefix $\texttt{s}[0 \ldots k]$.
The array values can be recursively calculated as follows:
\[
\begin{array}{lcl}
\texttt{h}[0] & = & \texttt{s}[0] \\
\texttt{h}[k] & = & (\texttt{h}[k-1] A + \texttt{s}[k]) \bmod B \\
\end{array}
\]
In addition, we construct an array $\texttt{p}$
where $\texttt{p}[k]=A^k \bmod B$:
\[
\begin{array}{lcl}
\texttt{p}[0] & = & 1 \\
\texttt{p}[k] & = & (\texttt{p}[k-1] A) \bmod B. \\
\end{array}
\]
Constructing these arrays takes $O(n)$ time.
After this, the hash value of any substring
$\texttt{s}[a \ldots b]$
can be calculated in $O(1)$ time using the formula
\[(\texttt{h}[b]-\texttt{h}[a-1] \texttt{p}[b-a+1]) \bmod B\]
assuming that $a>0$.
If $a=0$, the hash value is simply $\texttt{h}[b]$.
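For example, the preprocessing and the substring hash queries could be implemented as follows. This is a sketch only: the values of $A$ and $B$ below are arbitrary example constants (the text does not fix them), and the extra $+B$ in the last line keeps the result of the C++ remainder operation non-negative.
\begin{lstlisting}
typedef long long ll;
const ll A = 911382323, B = 972663749; // example constants
int n;
vector<ll> h, p;

// O(n) preprocessing for a string s
void preprocess(string s) {
    n = s.size();
    h.assign(n, 0); p.assign(n, 0);
    h[0] = s[0];
    p[0] = 1;
    for (int k = 1; k < n; k++) {
        h[k] = (h[k-1]*A+s[k])%B;
        p[k] = (p[k-1]*A)%B;
    }
}

// hash value of the substring s[a..b] in O(1) time
ll hashval(int a, int b) {
    if (a == 0) return h[b];
    return ((h[b]-h[a-1]*p[b-a+1])%B+B)%B;
}
\end{lstlisting}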
\subsubsection*{Using hash values}
@ -364,7 +367,7 @@ The probability of one or more collisions is
\[1-(1-\frac{1}{B})^n.\]
\textit{Scenario 3:} All pairs of strings $x_1,x_2,\ldots,x_n$
are compared with each other.
The probability of one or more collisions is
\[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
@ -398,7 +401,7 @@ $B \approx 10^9$.
The phenomenon in scenario 3 is known as the
\key{birthday paradox}: if there are $n$ people
in a room, the probability that \emph{some} two people
have the same birthday is large even if $n$ is quite small.
In hashing, correspondingly, when all hash values are compared
with each other, the probability that some two
@ -417,7 +420,7 @@ which makes the probability of a collision very small.
Some people use constants $B=2^{32}$ and $B=2^{64}$,
which is convenient, because operations with 32 and 64
bit integers are calculated modulo $2^{32}$ and $2^{64}$.
However, this is \emph{not} a good choice, because it is possible
to construct inputs that always generate collisions when
constants of the form $2^x$ are used \cite{pac13}.
@ -426,17 +429,16 @@ constants of the form $2^x$ are used \cite{pac13}.
\index{Z-algorithm}
\index{Z-array}
The \key{Z-array} \texttt{z} of a string \texttt{s}
of length $n$ contains for each $k=0,1,\ldots,n-1$
the length of the longest substring of \texttt{s}
that begins at position $k$ and is a prefix of \texttt{s}.
Thus, $\texttt{z}[k]=p$ tells us that
$\texttt{s}[0 \ldots p-1]$ equals $\texttt{s}[k \ldots k+p-1]$.
Many string processing problems can be efficiently solved
using the Z-array.
For example, the Z-array of
\texttt{ACBACDACBACBACDA} is as follows:
\begin{center}
@ -498,48 +500,45 @@ For example, the Z-array of the string
\end{tikzpicture}
\end{center}
In this case, for example, $\texttt{z}[6]=5$,
because the substring \texttt{ACBAC} of length 5
is a prefix of \texttt{s},
but the substring \texttt{ACBACB} of length 6
is not a prefix of \texttt{s}.
\subsubsection*{Algorithm description}
Next we describe an algorithm,
called the \key{Z-algorithm}\footnote{The Z-algorithm
was presented in \cite{gus97} as the simplest known
method for linear-time pattern matching, and the original idea
was attributed to \cite{mai84}.},
that efficiently constructs the Z-array in $O(n)$ time.
The algorithm calculates the Z-array values
from left to right by both using information
already stored in the Z-array and comparing substrings
character by character.
To efficiently calculate the Z-array values,
the algorithm maintains a range $[x,y]$ such that
$\texttt{s}[x \ldots y]$ is a prefix of \texttt{s}
and $y$ is as large as possible.
Since we know that $\texttt{s}[0 \ldots y-x]$
and $\texttt{s}[x \ldots y]$ are equal,
we can use this information when calculating
Z-values for positions $x+1,x+2,\ldots,y$.
At each position $k$, we first
check the value of $\texttt{z}[k-x]$.
If $k+\texttt{z}[k-x]<y$, we know that $\texttt{z}[k]=\texttt{z}[k-x]$.
However, if $k+\texttt{z}[k-x] \ge y$,
$\texttt{s}[0 \ldots y-k]$ equals
$\texttt{s}[k \ldots y]$, and to determine the
value of $\texttt{z}[k]$ we need to compare
the substrings character by character.
Still, the algorithm works in $O(n)$ time,
because we start comparing at positions
$y-k+1$ and $y+1$.
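One possible implementation of this idea is sketched below (it follows the above description; in particular, the range $[x,y]$ is updated whenever a character comparison succeeds):
\begin{lstlisting}
// a sketch of the Z-algorithm: constructs the Z-array of s in O(n) time
vector<int> z_array(string s) {
    int n = s.size();
    vector<int> z(n, 0);
    int x = 0, y = 0; // s[x..y] is a prefix of s, y as large as possible
    for (int k = 1; k < n; k++) {
        // reuse the previously computed value z[k-x] when possible
        z[k] = max(0, min(z[k-x], y-k+1));
        // extend the match character by character
        while (k+z[k] < n && s[z[k]] == s[k+z[k]]) {
            x = k; y = k+z[k]; z[k]++;
        }
    }
    return z;
}
\end{lstlisting}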
For example, let us construct the following Z-array:
@ -602,10 +601,8 @@ For example, let us construct the following Z-array:
\end{tikzpicture}
\end{center}
After calculating the value $\texttt{z}[6]=5$,
the current $[x,y]$ range is $[6,10]$:
\begin{center}
\begin{tikzpicture}[scale=0.7]
@ -673,17 +670,15 @@ the current $[x,y]$ range will be $[6,10]$:
\end{tikzpicture}
\end{center}
Now we can calculate
subsequent Z-array values
efficiently,
because we know that
$\texttt{s}[0 \ldots 4]$ and
$\texttt{s}[6 \ldots 10]$ are equal.
First, since $\texttt{z}[1] = \texttt{z}[2] = 0$,
we immediately know that also
$\texttt{z}[7] = \texttt{z}[8] = 0$:
\begin{center}
\begin{tikzpicture}[scale=0.7]
@ -755,9 +750,7 @@ are also 0:
\end{tikzpicture}
\end{center}
Then, since $\texttt{z}[3]=2$, we know that $\texttt{z}[9] \ge 2$:
\begin{center}
\begin{tikzpicture}[scale=0.7]
@ -826,8 +819,8 @@ because the value at position 3 is 2:
\end{tikzpicture}
\end{center}
However, we have no information about the string
after position 10, so we need to compare the substrings
character by character:
\begin{center}
@ -901,10 +894,8 @@ character by character:
\end{tikzpicture}
\end{center}
It turns out that $\texttt{z}[9]=7$,
so the new $[x,y]$ range is $[9,15]$:
\begin{center}
\begin{tikzpicture}[scale=0.7]
@ -973,10 +964,9 @@ and thus the new range $[x,y]$ is $[9,15]$:
\end{tikzpicture}
\end{center}
After this, all the remaining Z-array values
can be determined by using the information
already stored in the Z-array:
\begin{center}
\begin{tikzpicture}[scale=0.7]
@ -1045,10 +1035,18 @@ directly retrieved from the beginning of the Z-array:
\subsubsection*{Using the Z-array}
It is often a matter of taste whether to use
string hashing or the Z-algorithm.
Unlike hashing, the Z-algorithm always works
and there is no risk of collisions.
On the other hand, the Z-algorithm is more difficult
to implement and some problems can only be solved
using hashing.
As an example, consider again
the pattern matching problem,
where our task is to find the occurrences
of a pattern $p$ in a string $s$.
We already solved this problem efficiently
using string hashing, but the Z-algorithm
provides another way to solve the problem.
@ -1123,7 +1121,7 @@ the Z-array is as follows:
The positions 5 and 10 contain the value 3,
which means that the pattern \texttt{ATT}
occurs in the corresponding positions
of \texttt{HATTIVATTI}.
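A sketch of this approach: we form the combined string (here \texttt{\#} is assumed to be a separator character that does not occur in the pattern or in the string), construct its Z-array using the \texttt{z\_array} function sketched earlier, and report the positions whose Z-value equals the length of the pattern.
\begin{lstlisting}
// find the occurrences of p in s using the Z-array
vector<int> find_occurrences(string p, string s) {
    string t = p+"#"+s;
    vector<int> z = z_array(t);
    vector<int> pos;
    int m = p.size();
    for (int k = m+1; k < (int)t.size(); k++) {
        if (z[k] == m) pos.push_back(k-m-1);
    }
    return pos;
}
\end{lstlisting}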
The time complexity of the resulting algorithm
is linear, because it suffices to construct