Small improvements
parent 60d09f8199
commit 35d6a58004
@@ -114,7 +114,7 @@ $x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
 
 A \key{trie} is a tree structure that
 maintains a set of strings.
-Each string in a trie corresponds to
+Each string is stored as
 a chain of characters starting at
 the root node.
 If two strings have a common prefix,
@@ -157,9 +157,9 @@ This trie corresponds to the set
 $\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$.
 The character * in a node means that
 one of the strings in the set ends at the node.
-This character is needed, because a string
+Such a character is needed, because a string
 may be a prefix of another string.
-For example, in this trie, \texttt{THE}
+For example, in the above trie, \texttt{THE}
 is a prefix of \texttt{THERE}.
 
 We can check if a trie contains a string
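The trie described in the two hunks above can be sketched in C++ roughly as follows. This is only an illustration: the names `Trie`, `Node`, `insert` and `contains` are hypothetical, and the children are kept in a `std::map` for brevity, which is not necessarily how the chapter stores them. The `ends` flag plays the role of the * marker.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One trie node: outgoing edges plus an end-of-string marker.
struct Node {
    std::map<char, int> next; // child index for each character
    bool ends = false;        // corresponds to the * marker in the text
};

struct Trie {
    std::vector<Node> nodes;  // nodes[0] is the root
    Trie() : nodes(1) {}

    // Follows the chain of characters from the root, creating missing nodes.
    void insert(const std::string& s) {
        int cur = 0;
        for (char c : s) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) {
                int id = static_cast<int>(nodes.size());
                nodes[cur].next[c] = id; // record the edge before growing the vector
                nodes.emplace_back();
                cur = id;
            } else {
                cur = it->second;
            }
        }
        nodes[cur].ends = true;
    }

    // A string is in the set only if its chain exists and ends at a marked node.
    bool contains(const std::string& s) const {
        int cur = 0;
        for (char c : s) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) return false;
            cur = it->second;
        }
        return nodes[cur].ends;
    }
};
```

After inserting `CANAL`, `CANDY`, `THE` and `THERE`, `contains("THE")` is true while `contains("THER")` is false, which is exactly the case the end marker is needed for.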
@@ -196,11 +196,11 @@ from node $s$ using character $c$.
 
 \key{String hashing} is a technique that
 allows us to efficiently check whether two
-substrings in a string are equal\footnote{The technique
+strings are equal\footnote{The technique
 was popularized by the Karp–Rabin pattern matching
 algorithm \cite{kar87}.}.
 The idea is to compare the hash values of the
-substrings instead of their individual characters.
+strings instead of their individual characters.
 
 \subsubsection*{Calculating hash values}
 
@@ -216,7 +216,7 @@ which makes it possible to compare strings
 based on their hash values.
 
 A usual way to implement string hashing
-is polynomial hashing, which means
+is \key{polynomial hashing}, which means
 that the hash value is calculated using the formula
 \[(\texttt{s}[0] A^{n-1} + \texttt{s}[1] A^{n-2} + \cdots + \texttt{s}[n-1] A^0) \bmod B ,\]
 where \texttt{s} is a string of length $n$
@@ -246,16 +246,14 @@ in the string \texttt{ALLEY} are:
 \end{center}
 
 Thus, if $A=3$ and $B=97$, the hash value
-for the string \texttt{ALLEY} is
-
+of the string \texttt{ALLEY} is
 \[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
 
 \subsubsection*{Preprocessing}
 
-To efficiently calculate hash values of substrings,
-we need to preprocess the string.
 It turns out that using polynomial hashing,
 we can calculate the hash value of any substring
+of a string
 in $O(1)$ time after an $O(n)$ time preprocessing.
 
 The idea is to construct an array $h$ such that
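The worked value for \texttt{ALLEY} in the hunk above can be checked with a few lines of C++. The function name `poly_hash` is hypothetical, and the sketch evaluates the polynomial with Horner's rule, which gives the same result as the formula but is an implementation choice not taken from the text.

```cpp
#include <cassert>
#include <string>

// Polynomial hash (s[0]*A^(n-1) + ... + s[n-1]*A^0) mod B.
// Horner's rule: multiply the running value by A, add the next character code,
// and reduce modulo B at every step so the intermediate values stay small.
long long poly_hash(const std::string& s, long long A, long long B) {
    long long h = 0;
    for (unsigned char c : s) {
        h = (h * A + c) % B;
    }
    return h;
}
```

With $A=3$ and $B=97$, `poly_hash("ALLEY", 3, 97)` evaluates to 52, matching the equation.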
@@ -305,10 +303,10 @@ character by character.
 The time complexity of such an algorithm is $O(n^2)$.
 
 We can make the brute force algorithm more efficient
-using hashing, because the algorithm compares
+by using hashing, because the algorithm compares
 substrings of strings.
 Using hashing, each comparison only takes $O(1)$ time,
-because only hash values of the strings are compared.
+because only hash values of substrings are compared.
 This results in an algorithm with time complexity $O(n)$,
 which is the best possible time complexity for this problem.
 
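As a rough illustration of the speed-up described in this hunk, the sketch below assumes the problem in question is pattern matching and that prefix hashes and powers of $A$ are precomputed in $O(n)$ time. The names `find_by_hash`, `h` and `pw` are hypothetical, and the constants are passed in by the caller. Equal hash values are treated as equal substrings, so a collision could in principle report a false match.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef long long ll;

// Returns the start positions where p occurs in s, comparing hash values only.
// Assumes A < B and that the products A*B and B*B fit in long long.
std::vector<int> find_by_hash(const std::string& s, const std::string& p, ll A, ll B) {
    int n = static_cast<int>(s.size());
    int m = static_cast<int>(p.size());
    std::vector<int> result;
    if (m == 0 || m > n) return result;

    // h[k] = hash of the prefix s[0..k-1], pw[k] = A^k mod B
    std::vector<ll> h(n + 1, 0), pw(n + 1, 1);
    for (int i = 0; i < n; i++) {
        h[i + 1] = (h[i] * A + static_cast<unsigned char>(s[i])) % B;
        pw[i + 1] = (pw[i] * A) % B;
    }

    // Hash of the whole pattern.
    ll hp = 0;
    for (unsigned char c : p) hp = (hp * A + c) % B;

    for (int i = 0; i + m <= n; i++) {
        // Hash of s[i..i+m-1] in O(1): remove the contribution of the prefix s[0..i-1].
        ll cur = (h[i + m] - h[i] * pw[m] % B + B) % B;
        if (cur == hp) result.push_back(i);
    }
    return result;
}
```

For example, `find_by_hash("HATTIVATTI", "ATT", 257, 972663749)` returns the positions 1 and 6. The base 257 is only a small illustrative value.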
@@ -349,7 +347,7 @@ B & = & 972663749 \\
 
 Using such constants,
 the \texttt{long long} type can be used
-when calculating the hash values,
+when calculating hash values,
 because the products $AB$ and $BB$ will fit in \texttt{long long}.
 But is it enough to have about $10^9$ different hash values?
 
@@ -429,16 +427,16 @@ constants of the form $2^x$ are used \cite{pac13}.
 \index{Z-array}
 
 The \key{Z-array} of a string
-gives for each position $k$ in the string
+contains for each position of the string
 the length of the longest substring
-that begins at position $k$ and is a prefix of the string.
+that begins at that position and is a prefix of the string.
 Such an array can be efficiently constructed
 using the \key{Z-algorithm}\footnote{The Z-algorithm
 was presented in \cite{gus97} as the simplest known
 method for linear-time pattern matching, and the original idea
 was attributed to \cite{mai84}.}.
 
-For example, the Z-array for the string
+For example, the Z-array of the string
 \texttt{ACBACDACBACBACDA} is as follows:
 
 \begin{center}
@@ -500,7 +498,7 @@ For example, the Z-array for the string
 \end{tikzpicture}
 \end{center}
 
-For example, the value at position 7 in the
+For example, the value at position 6 of the
 above Z-array is 5,
 because the substring \texttt{ACBAC} of length 5
 is a prefix of the string,
@@ -530,10 +528,10 @@ is only $O(n)$.
 The idea is to maintain a range $[x,y]$ such that
 the substring from $x$ to $y$ is a prefix of
 the string and $y$ is as large as possible.
-Since the Z-array already contains information
-about the characters in the range $[x,y]$,
+Since the characters in the ranges $[0,y-x]$
+and $[x,y]$ are the same,
 we can use this information to calculate
-values for elements in the range $[x,y]$.
+the Z-array values in the range $[x,y]$.
 
 The time complexity of the Z-algorithm is $O(n)$,
 because the algorithm only compares strings
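The range idea in this hunk can be sketched as follows. This is one common way to write the Z-algorithm and may differ from the chapter's own implementation. The function name `z_array` is hypothetical, and the value at position 0 is set to 0 by convention.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Computes the Z-array of s using 0-indexed positions.
std::vector<int> z_array(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<int> z(n, 0);
    int x = 0, y = 0; // s[x..y] equals the prefix s[0..y-x], with y as large as seen so far
    for (int i = 1; i < n; i++) {
        // Inside the range, position i mirrors position i-x of the prefix,
        // so z[i-x] can be reused, capped at the end of the range.
        if (i <= y) z[i] = std::min(z[i - x], y - i + 1);
        // Extend the match character by character beyond what is already known.
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        // Move the range if the new match reaches further to the right.
        if (i + z[i] - 1 > y) {
            x = i;
            y = i + z[i] - 1;
        }
    }
    return z;
}
```

For the string \texttt{ACBACDACBACBACDA}, `z_array` gives the value 5 at position 6, as stated in the text. Every successful comparison in the `while` loop moves $y$ to the right, which is why the total work stays linear.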
@@ -1047,7 +1045,7 @@ directly retrieved from the beginning of the Z-array:
 
 \subsubsection{Using the Z-array}
 
-As an example, let us once again consider
+As an example, let us consider again
 the pattern matching problem,
 where our task is to find the positions
 where a pattern $p$ occurs in a string $s$.
@@ -1065,7 +1063,7 @@ character \texttt{\#} that does not occur
 in the strings.
 The Z-array of $p$\texttt{\#}$s$ tells us the positions
 where $p$ occurs in $s$,
-because such positions contain the value $p$.
+because such positions contain the length of $p$.
 
 For example, if $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT},
 the Z-array is as follows:
@@ -1128,7 +1126,7 @@ occurs in the corresponding positions
 in the string \texttt{HATTIVATTI}.
 
 The time complexity of the resulting algorithm
-is $O(n)$, because it suffices to construct
+is linear, because it suffices to construct
 the Z-array and go through its values.
 
 \subsubsection{Implementation}
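A minimal sketch of the pattern matching approach from the last two hunks might look like this. It assumes that the character `#` occurs in neither string, as the text requires. The names `z_array` and `find_by_z` are hypothetical, and the Z-array routine is included only so that the block compiles on its own.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Z-array of s (0-indexed, z[0] = 0 by convention).
std::vector<int> z_array(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<int> z(n, 0);
    int x = 0, y = 0;
    for (int i = 1; i < n; i++) {
        if (i <= y) z[i] = std::min(z[i - x], y - i + 1);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        if (i + z[i] - 1 > y) { x = i; y = i + z[i] - 1; }
    }
    return z;
}

// Returns the positions in s where p occurs, using the Z-array of p#s.
std::vector<int> find_by_z(const std::string& s, const std::string& p) {
    std::vector<int> result;
    int m = static_cast<int>(p.size());
    if (m == 0) return result;
    std::string t = p + "#" + s;          // the separator stops matches from running past p
    std::vector<int> z = z_array(t);
    for (int i = m + 1; i < static_cast<int>(t.size()); i++) {
        // A Z-value equal to the length of p marks an occurrence;
        // i - m - 1 converts the index in p#s back to an index in s.
        if (z[i] == m) result.push_back(i - m - 1);
    }
    return result;
}
```

For $s=$\texttt{HATTIVATTI} and $p=$\texttt{ATT}, `find_by_z` returns the positions 1 and 6. The work is one Z-array construction plus one pass over its values, in line with the linear bound stated above.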