diff --git a/luku26.tex b/luku26.tex index c8c09fb..d39577e 100644 --- a/luku26.tex +++ b/luku26.tex @@ -1,11 +1,35 @@ \chapter{String algorithms} -\index{string} -\index{alphabet} +This chapter deals with efficient algorithms +for processing strings. +Many string problems can be easily solved +in $O(n^2)$ time, but the challenge is to +find algorithms that work in $O(n)$ or $O(n \log n)$ +time and can process long strings. -A string $s$ of length $n$ -is a sequence of characters -$s[1],s[2],\ldots,s[n]$. +\index{pattern matching} + +For example, a fundamental problem related to strings +is the \key{pattern matching} problem: +given a string of length $n$ and a pattern of length $m$, +our task is to find the positions where the pattern +occurs in the string. +For example, the pattern \texttt{ABC} occurs two +times in the string \texttt{ABABCBABC}. + +The pattern matching problem is easy to solve +in $O(nm)$ time by a brute force algorithm that +goes through all positions where the pattern may +occur in the string. +However, in this chapter we will see that there +are more efficient algorithms that require only +$O(n+m)$ time. + +\index{string} + +\section{Terminology} + +\index{alphabet} An \key{alphabet} is a set of characters that may appear in strings. @@ -15,76 +39,73 @@ consists of the capital letters of English. \index{substring} -A \key{substring} consists of consecutive -characters in a string. -The number of substrings in a string is $n(n+1)/2$. -For example, \texttt{ORITH} is a substring -in \texttt{ALGORITHM}, and it corresponds -to \texttt{ALG\underline{ORITH}M}. +A \key{substring} is a sequence of consecutive +characters of a string. +The number of substrings of a string of length $n$ is $n(n+1)/2$. +For example, the substrings of the string +\texttt{ABCD} are +\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, +\texttt{AB}, \texttt{BC}, \texttt{CD}, +\texttt{ABC}, \texttt{BCD} and \texttt{ABCD}. 
\index{subsequence} -A \key{subsequence} is a subset of characters -in a string in their original order. -The number of subsequences in a string is $2^n-1$. -For example, \texttt{LGRHM} is a subsequece -in \texttt{ALGORITHM}, and it corresponds -to \texttt{A\underline{LG}O\underline{R}IT\underline{HM}}. +A \key{subsequence} is a sequence of +(not necessarily consecutive) characters +of a string in their original order. +The number of subsequences of a string of length $n$ is $2^n-1$. +For example, the subsequences of the string +\texttt{ABCD} are +\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, +\texttt{AB}, \texttt{AC}, \texttt{AD}, +\texttt{BC}, \texttt{BD}, \texttt{CD}, +\texttt{ABC}, \texttt{ABD}, \texttt{ACD}, +\texttt{BCD} and \texttt{ABCD}. \index{prefix} \index{suffix} -A \key{prefix} is a subtring that contains the first -character of a string, -and a \key{suffix} is a substring that contains the last character. -For example, the prefixes of -\texttt{STORY} are \texttt{S}, \texttt{ST}, -\texttt{STO}, \texttt{STOR} and \texttt{STORY}, -and the suffixes are \texttt{Y}, \texttt{RY}, -\texttt{ORY}, \texttt{TORY} and \texttt{STORY}. -A prefix or a suffix is \key{proper} -if it is not the whole string. +A \key{prefix} is a substring that starts at the beginning +of a string, +and a \key{suffix} is a substring that ends at the end +of a string. +For example, for the string \texttt{ABCD}, +the prefixes are +\texttt{A}, \texttt{AB}, \texttt{ABC} and \texttt{ABCD} +and the suffixes are +\texttt{D}, \texttt{CD}, \texttt{BCD} and \texttt{ABCD}. \index{rotation} A \key{rotation} can be generated by moving -characters one by one from the beginning to the end -in a string (or vice versa). -For example, the rotations of \texttt{STORY} are -\texttt{STORY}, -\texttt{TORYS}, -\texttt{ORYST}, -\texttt{RYSTO} and -\texttt{YSTOR}. +characters one by one from the beginning +to the end of a string (or vice versa). 
+For example, the rotations of the string +\texttt{ABCD} are +\texttt{ABCD}, \texttt{BCDA}, \texttt{CDAB} and \texttt{DABC}. \index{period} A \key{period} is a prefix of a string such that -we can construct the string by repeating the period. +the string can be constructed by repeating the period. The last repetition may be partial and contain only a prefix of the period. -Often it is interesting to find the \key{shortest period} -of a string. For example, the shortest period of \texttt{ABCABCA} is \texttt{ABC}. -In this case, we first repeat the period twice -and then partially. \index{border} A \key{border} is a string that is both a prefix and a suffix of a string. -For example, the borders for \texttt{ABADABA} -are \texttt{A}, \texttt{ABA} and \texttt{ABADABA}. -Often we want to find the \key{longest border} -that is not the whole string. +For example, the borders of the string \texttt{ABACABA} +are \texttt{A}, \texttt{ABA} and \texttt{ABACABA}. \index{lexicographical order} -Usually we compare string using the \key{lexicographical order} +Strings are usually compared using the \key{lexicographical order} that corresponds to the alphabetical order. -It means that $x] (12) -- node[font=\small,label=right:\texttt{E}] {} (13); \end{tikzpicture} \end{center} + +This trie corresponds to the set +$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$. The character * in a node means that -a string ends at the node. -This character is needed because a string +one of the strings in the set ends at the node. +This character is needed because a string may be a prefix of another string. For example, in this trie, \texttt{THE} -is a suffix of \texttt{THERE}. +is a prefix of \texttt{THERE}. -Inserting and searching a string in a trie take $O(n)$ time -where $n$ is the length of the string. -Both operations can be implemented by -starting at the root node and following the -chain of characters that appear in the string. 
+We can check if a trie contains a string +in $O(n)$ time where $n$ is the length of the string, +because we can follow the chain that starts at the root node. +We can also add a new string to the trie +in $O(n)$ time using a similar idea. If needed, new nodes will be added to the trie. -Tries can be used for searching both strings -and prefixes of strings. -In addition, it is possible to calculate numbers -of strings that correspond to each prefix, -which can be useful in some applications. +Using a trie, we can also find the longest prefix +of a string that belongs to the set. +In addition, by storing additional information +in each node, +it is possible to calculate the number of +strings that have a given prefix. -A trie can be stored as an array +A trie can be stored in an array \begin{lstlisting} int t[N][A]; \end{lstlisting} where $N$ is the maximum number of nodes -(the total length of the string to be stored) +(the maximum total length of the strings in the set) and $A$ is the size of the alphabet. The nodes of a trie are numbered $1,2,3,\ldots$ so that the number of the root is 1, -and $\texttt{t}[s][c]$ is the next node in chain +and $\texttt{t}[s][c]$ is the next node in the chain from node $s$ using character $c$. \section{String hashing} @@ -173,7 +196,7 @@ from node $s$ using character $c$. \key{String hashing} is a technique that allows us to efficiently check whether two substrings in a string are equal. -The idea is to compare hash values of the +The idea is to compare the hash values of the substrings instead of their individual characters. \subsubsection*{Calculating hash values} @@ -190,7 +213,7 @@ which makes it possible to compare strings based on their hash values. 
A usual way to implement string hashing -is to use polynomial hashing, which means +is polynomial hashing, which means that the hash value is calculated using the formula \[(c[1] A^{n-1} + c[2] A^{n-2} + \cdots + c[n] A^0) \bmod B ,\] where $c[1],c[2],\ldots,c[n]$ @@ -218,7 +241,7 @@ in the string \texttt{ALLEY} are: \end{tikzpicture} \end{center} -If $A=3$ and $B=97$, the hash value +Thus, if $A=3$ and $B=97$, the hash value for the string \texttt{ALLEY} is \[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\] @@ -232,8 +255,8 @@ we can calculate the hash value of any substring in $O(1)$ time after an $O(n)$ time preprocessing. The idea is to construct an array $h$ such that -$h[k]$ contains the hash value for the prefix -of the string that ends at index $k$. +$h[k]$ contains the hash value of the prefix +of the string that ends at position $k$. The array values can be recursively calculated as follows: \[ \begin{array}{lcl} @@ -250,9 +273,8 @@ p[k] & = & (p[k-1] A) \bmod B. \\ \end{array} \] Constructing these arrays takes $O(n)$ time. -After this, the hash value for a substring -of the string -that begins at index $a$ and ends at index $b$ +After this, the hash value of a substring +that begins at position $a$ and ends at position $b$ can be calculated in $O(1)$ time using the formula \[(h[b]-h[a-1] p[b-a+1]) \bmod B.\] @@ -268,16 +290,15 @@ the strings are \emph{certainly} different. Using hashing, we can often make a brute force algorithm efficient. -As an example, let's consider a brute force -algorithm that calculates how many times -a string $p$ occurs as a substring in -a string $s$. -The algorithm goes through all locations -where $p$ can occur, and compares the strings +As an example, consider the pattern matching problem: +given a string $s$ and a pattern $p$, +find the positions where $p$ occurs in $s$. 
+A brute force algorithm goes through all positions +where $p$ may occur, and compares the strings character by character. The time complexity of such an algorithm is $O(n^2)$. -However, we can make the algorithm more efficient +We can make the brute force algorithm more efficient using hashing, because the algorithm compares substrings of strings. Using hashing, each comparison only takes $O(1)$ time, @@ -286,23 +307,24 @@ This results in an algorithm with time complexity $O(n)$, which is the best possible time complexity for this problem. By combining hashing and \emph{binary search}, -it is also possible to check the lexicographic order of +it is also possible to find out the lexicographic order of two strings in logarithmic time. -This can be done by finding out the length +This can be done by calculating the length of the common prefix of the strings using binary search. -Once we know the common prefix, -the next character after the prefix -indicates the order of the strings. +Once we know the length of the common prefix, +we can just check the next character after the prefix, +because this determines the order of the strings. \subsubsection*{Collisions and parameters} \index{collision} -An evident risk in comparing hash values is -\key{collision}, which means that two strings have +An evident risk when comparing hash values is +a \key{collision}, which means that two strings have different contents but equal hash values. -In this case, based on the hash values it seems that -the strings are equal, but in reality they aren't, +In this case, an algorithm that relies on +the hash values concludes that the strings are equal, +but in reality they are not, and the algorithm may give incorrect results. Collisions are always possible, @@ -310,49 +332,41 @@ because the number of different strings is larger than the number of different hash values. However, the probability of a collision is small if the constants $A$ and $B$ are carefully chosen. 
-There are two goals: the hash values should be -evenly distributed for the strings, -and the number of different hash values should -be large enough. - -A good solution is to use large random numbers -as constants. -A usual way is to choose constants that are -near $10^9$, for example +A usual way is to choose random constants +near $10^9$, for example as follows: \[ \begin{array}{lcl} A & = & 911382323 \\ B & = & 972663749 \\ \end{array} \] -This choice ensures that the hash values -are distributed evenly enough in the range $0 \ldots B-1$. -The benefit in $10^9$ is that -the \texttt{long long} type can be used -for calculating the hash values, -because the products $AB$ and $BB$ fit in \texttt{long long}. -But is it enough to have $10^9$ different hash values? -Let's consider three scenarios where hashing can be used: +Using such constants, +the \texttt{long long} type can be used +when calculating the hash values, +because the products $AB$ and $BB$ will fit in \texttt{long long}. +But is it enough to have about $10^9$ different hash values? + +Let us consider three scenarios where hashing can be used: \textit{Scenario 1:} Strings $x$ and $y$ are compared with each other. The probability of a collision is $1/B$ assuming that all hash values are equally probable. -\textit{Tapaus 2:} A string $x$ is compared with strings +\textit{Scenario 2:} A string $x$ is compared with strings $y_1,y_2,\ldots,y_n$. -The probability for one or more collisions is +The probability of one or more collisions is -\[1-(1-1/B)^n.\] +\[1-(1-\frac{1}{B})^n.\] -\textit{Tapaus 3:} Strings $x_1,x_2,\ldots,x_n$ +\textit{Scenario 3:} Strings $x_1,x_2,\ldots,x_n$ are compared with each other. 
-The probability for one or more collisions is +The probability of one or more collisions is \[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\] The following table shows the collision probabilities -when the value of $B$ varies and $n=10^6$: +when $n=10^6$ and the value of $B$ varies: \begin{center} \begin{tabular}{rrrr} @@ -384,12 +398,12 @@ in a room, the probability that some two people have the same birthday is large even if $n$ is quite small. In hashing, correspondingly, when all hash values are compared with each other, the probability that some two -hash values are the same is large. +hash values are equal is large. -A good way to make the probability of a collision -smaller is to calculate \emph{multiple} hash values +We can make the probability of a collision +smaller by calculating \emph{multiple} hash values using different parameters. -It is very unlikely that a collision would occur +It is unlikely that a collision would occur in all hash values at the same time. For example, two hash values with parameter $B \approx 10^9$ correspond to one hash @@ -401,37 +415,25 @@ which is convenient, because operations with 32 and 64 bit integers are calculated modulo $2^{32}$ and $2^{64}$. However, this is not a good choice, because it is possible to construct inputs that always generate collisions when -constants of the form $2^x$ are used\footnote{ -J. Pachocki and Jakub Radoszweski: -''Where to use and how not to use polynomial string hashing''. -\textit{Olympiads in Informatics}, 2013. -}. +constants of the form $2^x$ are used. +% \footnote{ +% J. Pachocki and Jakub Radoszweski: +% ''Where to use and how not to use polynomial string hashing''. +% \textit{Olympiads in Informatics}, 2013. +% }. 
\section{Z-algorithm} \index{Z-algorithm} \index{Z-array} -The \key{Z-algorithm} generates a \key{Z-array} -for the string, that contains for each index $k$ -in the string the length of the longest substring -that begins at index $k$ and is a prefix of the string. -Many string problems can be efficiently solved -using the Z-algorithm. +The \key{Z-array} of a string +contains for each position $k$ in the string +the length of the longest substring +that begins at position $k$ and is a prefix of the string. +Such an array can be efficiently constructed +using the \key{Z-algorithm}. -It is often a matter of taste whether to use -the Z-algorithm or string hashing. -Unlike hashing, the Z-algorithm always works -and there is no risk for collisions. -On the other hand, the Z-algorithm is more difficult -to implement and some problems can only be solved -using hashing. - -\subsubsection*{Description} - -The Z-algorithm constructs a Z-array that -indicates for each position the length of the -longest substring that is also a prefix of the string. For example, the Z-array for the string \texttt{ACBACDACBACBACDA} is as follows: @@ -494,45 +496,50 @@ For example, the Z-array for the string \end{tikzpicture} \end{center} -For example, the position 7 contains the value 5, +For example, the value at position 7 in the +above Z-array is 5, because the substring \texttt{ACBAC} of length 5 is a prefix of the string, but the substring \texttt{ACBACB} of length 6 is not a prefix of the string. -The Z-algorithm scans the string from the left -to the right, and calculates for each position +It is often a matter of taste whether to use +string hashing or the Z-algorithm. +Unlike hashing, the Z-algorithm always works +and there is no risk of collisions. +On the other hand, the Z-algorithm is more difficult +to implement and some problems can only be solved +using hashing. 
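To make the definition of the Z-array concrete, here is a minimal C++ sketch that computes it directly from the definition by comparing characters one by one at every position. The function name \texttt{z\_naive} is hypothetical, indices are 0-based (so position $k$ in the text corresponds to index $k-1$), and the entry for the first position is simply left as 0. This is only the straightforward quadratic approach, not the Z-algorithm itself.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: Z-array computed directly from the definition.
// z[k] = length of the longest substring starting at index k (0-based)
// that is also a prefix of s. Worst-case time is O(n^2).
std::vector<int> z_naive(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n, 0);
    for (int k = 1; k < n; k++) {
        // Extend the match between the prefix and the substring at k.
        while (k + z[k] < n && s[z[k]] == s[k + z[k]]) z[k]++;
    }
    return z;
}
```

For the string \texttt{ACBACDACBACBACDA}, index 6 (position 7) receives the value 5, as in the example above.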
+ +\subsubsection*{Algorithm description} + +The Z-algorithm scans the string from left +to right, and calculates for each position the length of the longest substring that is a prefix of the string. -The algorithm compares the first characters -of the string -and the active substring with each other to -find the length of the common prefix. - -A straightforward implementation would yield -an algorithm with time complexity $O(n^2)$ -because the common prefixes may be long. -However, the Z-algorithm has one important +A straightforward algorithm +would have a time complexity of $O(n^2)$, +but the Z-algorithm has an important optimization which ensures that the time complexity is only $O(n)$. + The idea is to maintain a range $[x,y]$ such that the substring from $x$ to $y$ is a prefix of the string and $y$ is as large as possible. Since the Z-array already contains information about the characters in the range $[x,y]$, -it is not needed to process them again later in the algorithm. +we can use this information to calculate +values for elements in the range $[x,y]$. The time complexity of the Z-algorithm is $O(n)$, -because the algorithm always compares substrings -character by character only from index $y+1$. +because the algorithm always compares strings +character by character starting at position $y+1$. If the characters match, the value of $y$ increases, -and it is not needed to inspect the character again, +and it is not needed to compare the character at +position $y$ again, but the information in the Z-array can be used. -\subsubsection*{Example} - -Let's construct the following Z-array using -the Z-algorithm: +For example, let us construct the following Z-array: \begin{center} \begin{tikzpicture}[scale=0.7] @@ -595,7 +602,8 @@ the Z-algorithm: The first interesting position is 7 where the length of the common prefix is 5. 
-The corresponding range in the string is $[7,11]$: +After calculating this value, +the current $[x,y]$ range will be $[7,11]$: \begin{center} \begin{tikzpicture}[scale=0.7] @@ -663,14 +671,17 @@ The corresponding range in the string is $[7,11]$: \end{tikzpicture} \end{center} -The benefit in the range $[7,11]$ is that the -algorithm can calculate the subsequent values -for the Z-array more efficiently. -Since the ranges $[1,5]$ and $[7,11]$ contain -the same characters, also the Z-array will -contain similar values. -First, the values at indices 8 and 9 -correspond to the values at indices 2 and 3: +Now, it is possible to calculate the +subsequent values for the Z-array +more efficiently, +because we know that +the ranges $[1,5]$ and $[7,11]$ +contain the same characters. +First, since the values at +positions 2 and 3 are 0, +we immediately know that +the values at positions 8 and 9 +are also 0: \begin{center} \begin{tikzpicture}[scale=0.7] @@ -742,13 +753,9 @@ correspond to the values at indices 2 and 3: \end{tikzpicture} \end{center} -After this, the value for index 10 can be -calculated using the value at index 4. -The value at index 4 is 2, -so the first two characters -in the substring match the beginning of the string. -However, the characters after index $y=11$ have -not been inspected yet. +After this, we know that the value +at position 10 will be at least 2, +because the value at position 4 is 2: \begin{center} \begin{tikzpicture}[scale=0.7] @@ -817,13 +824,85 @@ not been inspected yet. \end{tikzpicture} \end{center} -The algorithm compares the substring -beginning at index $y+1=12$ character by character. -The previous values in the Z-array cannot be used, -because this is the first time the characters -after index 11 are inspected. 
+Since we have no information about the characters +after position 11, we have to begin to compare the strings +character by character: + +\begin{center} +\begin{tikzpicture}[scale=0.7] +\fill[color=lightgray] (9,0) rectangle (10,1); +\fill[color=lightgray] (2,1) rectangle (7,2); +\fill[color=lightgray] (11,1) rectangle (16,2); + + +\draw (0,0) grid (16,2); + +\node at (0.5, 1.5) {A}; +\node at (1.5, 1.5) {C}; +\node at (2.5, 1.5) {B}; +\node at (3.5, 1.5) {A}; +\node at (4.5, 1.5) {C}; +\node at (5.5, 1.5) {D}; +\node at (6.5, 1.5) {A}; +\node at (7.5, 1.5) {C}; +\node at (8.5, 1.5) {B}; +\node at (9.5, 1.5) {A}; +\node at (10.5, 1.5) {C}; +\node at (11.5, 1.5) {B}; +\node at (12.5, 1.5) {A}; +\node at (13.5, 1.5) {C}; +\node at (14.5, 1.5) {D}; +\node at (15.5, 1.5) {A}; + +\node at (0.5, 0.5) {--}; +\node at (1.5, 0.5) {0}; +\node at (2.5, 0.5) {0}; +\node at (3.5, 0.5) {2}; +\node at (4.5, 0.5) {0}; +\node at (5.5, 0.5) {0}; +\node at (6.5, 0.5) {5}; +\node at (7.5, 0.5) {0}; +\node at (8.5, 0.5) {0}; +\node at (9.5, 0.5) {?}; +\node at (10.5, 0.5) {?}; +\node at (11.5, 0.5) {?}; +\node at (12.5, 0.5) {?}; +\node at (13.5, 0.5) {?}; +\node at (14.5, 0.5) {?}; +\node at (15.5, 0.5) {?}; + +\draw [decoration={brace}, decorate, line width=0.5mm] (6,3.00) -- (11,3.00); + +\node at (6.5,3.50) {$x$}; +\node at (10.5,3.50) {$y$}; + + +\footnotesize +\node at (0.5, 2.5) {1}; +\node at (1.5, 2.5) {2}; +\node at (2.5, 2.5) {3}; +\node at (3.5, 2.5) {4}; +\node at (4.5, 2.5) {5}; +\node at (5.5, 2.5) {6}; +\node at (6.5, 2.5) {7}; +\node at (7.5, 2.5) {8}; +\node at (8.5, 2.5) {9}; +\node at (9.5, 2.5) {10}; +\node at (10.5, 2.5) {11}; +\node at (11.5, 2.5) {12}; +\node at (12.5, 2.5) {13}; +\node at (13.5, 2.5) {14}; +\node at (14.5, 2.5) {15}; +\node at (15.5, 2.5) {16}; + +%\draw[thick,<->] (11.5,-0.25) .. controls (11,-1.25) and (3,-1.25) .. 
(2.5,-0.25); +\end{tikzpicture} +\end{center} + + It turns out that the length of the common -prefix is 7, and the range $[x,y]$ will be updated: +prefix at position 10 is 7, +and thus the new range $[x,y]$ is $[10,16]$: \begin{center} \begin{tikzpicture}[scale=0.7] @@ -892,9 +971,9 @@ prefix is 7, and the range $[x,y]$ will be updated: \end{tikzpicture} \end{center} -After this, all subsequent values in the Z-array -can be calculated using the information in -the range $[x,y]$. All the remaining values can be +After this, all subsequent values for the Z-array +can be calculated using the values already +stored in the array. All the remaining values can be directly retrieved from the beginning of the Z-array: \begin{center} @@ -964,29 +1043,26 @@ directly retrieved from the beginning of the Z-array: \subsubsection{Using the Z-array} -As an example, let's solve a problem -where our task is to calculate -the number of times a string $p$ -occurs as a substring in a string $s$. -Previously, we solved this problem +As an example, let us once again consider +the pattern matching problem, +where our task is to find the positions +where a pattern $p$ occurs in a string $s$. +We already solved this problem efficiently using string hashing, but the Z-algorithm provides another way to solve the problem. -A usual idea when using the Z-algorithm -is to construct a string that consists of -several strings separated by special characters. +A usual idea in string processing is to +construct a string that consists of +multiple strings separated by special characters. In this problem, we can construct a string $p$\texttt{\#}$s$, where $p$ and $s$ are separated by a special -character \texttt{\#} that doesn't occur +character \texttt{\#} that does not occur in the strings. -After this, the Z-array for the string -$p$\texttt{\#}$s$ indicates the positions -where $p$ occurs in $s$. -Such positions are those positions in the Z-array -that contain the value $p$. 
+The Z-array of $p$\texttt{\#}$s$ indicates the positions +where $p$ occurs in $s$, +because such positions contain the length of $p$. -\begin{samepage} For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT}, the Z-array is as follows: @@ -1041,12 +1117,12 @@ the Z-array is as follows: \node at (13.5, 2.5) {14}; \end{tikzpicture} \end{center} -\end{samepage} + The positions 6 and 11 contain the value 3, -which means that the substring \texttt{ATT} +which means that the pattern \texttt{ATT} occurs in the corresponding positions in the string \texttt{HATTIVATTI}. The time complexity of the resulting algorithm -is $O(n)$, because it suffices to construct and -go through the Z-array. \ No newline at end of file +is $O(n)$, because it suffices to construct +the Z-array and go through its values. \ No newline at end of file
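The ideas in this section can be sketched in C++ roughly as follows: a linear-time Z-array construction that maintains the range $[x,y]$, followed by pattern matching on the string $p$\texttt{\#}$s$. The function names \texttt{z\_array} and \texttt{find\_occurrences} are hypothetical, indices inside the code are 0-based, and the sketch assumes that the character \texttt{\#} occurs in neither $p$ nor $s$. It is one possible way to write the algorithm, not necessarily the form used elsewhere in the book.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: Z-array in O(n) time, 0-based indices.
// [x, y] is the range with the largest y found so far such that
// s[x..y] is a prefix of s.
std::vector<int> z_array(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n, 0);
    int x = 0, y = 0;
    for (int i = 1; i < n; i++) {
        // Inside [x, y] we can reuse the value computed for index i - x,
        // capped so that it does not reach beyond y.
        if (i <= y) z[i] = std::min(z[i - x], y - i + 1);
        // Continue comparing character by character.
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        // Update the range if the new match extends further right.
        if (i + z[i] - 1 > y) { x = i; y = i + z[i] - 1; }
    }
    return z;
}

// Hypothetical helper: 1-indexed positions in s where p occurs,
// assuming '#' appears in neither string.
std::vector<int> find_occurrences(const std::string& p, const std::string& s) {
    std::string t = p + "#" + s;
    std::vector<int> z = z_array(t);
    std::vector<int> result;
    int m = (int)p.size();
    for (int i = m + 1; i < (int)t.size(); i++) {
        if (z[i] == m) result.push_back(i - m);  // index in t -> position in s
    }
    return result;
}
```

For $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT}, \texttt{find\_occurrences} reports positions 2 and 7 in $s$, which correspond to positions 6 and 11 in the combined string.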