String hashing

2017-01-24 21:59:20 +02:00 · 2017-01-24 21:59:20 +02:00 · e15acf135f
parent e03e9906a8
commit e15acf135f
1 changed files with 165 additions and 173 deletions
--- a/luku26.tex
+++ b/luku26.tex
@@ -147,11 +147,11 @@ starting at the root node and following the
 chain of characters that appear in the string.
 If needed, new nodes will be added to the trie.
-Trie can be used for searching both strings
+Tries can be used for searching both strings
 and prefixes of strings.
-In addition, we can keep track of the number
+In addition, it is possible to calculate the number
-of strings that have each prefix,
+of strings that correspond to each prefix,
-that can be useful in some applications.
+which can be useful in some applications.
 A trie can be stored as an array
 \begin{lstlisting}
@@ -165,203 +165,198 @@ $1,2,3,\ldots$ so that the number of the root is 1,
 and $\texttt{t}[s][c]$ is the next node in chain
 from node $s$ using character $c$.
-\section{Merkkijonohajautus}
+\section{String hashing}
-\index{hajautus@hajautus}
+\index{hashing}
-\index{merkkijonohajautus@merkkijonohajautus}
+\index{string hashing}
-\key{Merkkijonohajautus}
+\key{String hashing} is a technique that
-on tekniikka, jonka avulla voi esikäsittelyn
+allows us to efficiently check whether two
-jälkeen tarkastaa tehokkaasti, ovatko
+substrings in a string are equal.
-kaksi merkkijonon osajonoa samat.
+The idea is to compare hash values of the
-Ideana on verrata toisiinsa
+substrings instead of their individual characters.
-osajonojen hajautusarvoja,
-mikä on tehokkaampaa kuin osajonojen
-vertaaminen merkki kerrallaan.
-\subsubsection*{Hajautusarvon laskeminen}
+\subsubsection*{Calculating hash values}
-\index{hajautusarvo@hajautusarvo}
+\index{hash value}
-\index{polynominen hajautus@polynominen hajautus}
+\index{polynomial hashing}
-Merkkijonon \key{hajautusarvo}
+A \key{hash value} of a string is
-on luku, joka lasketaan merkkijonon merkeistä
+a number that is calculated from the characters
-etukäteen valitulla tavalla.
+of the string.
-Jos kaksi merkkijonoa ovat samat,
+If two strings are the same,
-myös niiden hajautusarvot ovat samat,
+their hash values are also the same,
-minkä ansiosta merkkijonoja voi vertailla
+which makes it possible to compare strings
-niiden hajautusarvojen kautta.
+based on their hash values.
-Tavallinen tapa toteuttaa merkkijonohajautus
+A usual way to implement string hashing
-on käyttää polynomista hajautusta.
+is to use polynomial hashing, which means
-Siinä hajautusarvo lasketaan kaavalla
+that the hash value is calculated using the formula
 \[(c[1] A^{n-1} + c[2] A^{n-2} + \cdots + c[n] A^0) \bmod B  ,\]
-missä merkkijonon merkkien koodit ovat
+where $c[1],c[2],\ldots,c[n]$
-$c[1],c[2],\ldots,c[n]$ ja $A$ ja $B$ ovat etukäteen
+are the codes of the characters in the string,
-valitut vakiot.
+and $A$ and $B$ are pre-chosen constants.
-Esimerkiksi merkkijonon \texttt{KISSA} merkkien koodit ovat:
+For example, the codes of the characters
+in the string \texttt{ALLEY} are:
 \begin{center}
 \begin{tikzpicture}[scale=0.7]
 \draw (0,0) grid (5,2);
-\node at (0.5, 1.5) {\texttt{K}};
+\node at (0.5, 1.5) {\texttt{A}};
-\node at (1.5, 1.5) {\texttt{I}};
+\node at (1.5, 1.5) {\texttt{L}};
-\node at (2.5, 1.5) {\texttt{S}};
+\node at (2.5, 1.5) {\texttt{L}};
-\node at (3.5, 1.5) {\texttt{S}};
+\node at (3.5, 1.5) {\texttt{E}};
-\node at (4.5, 1.5) {\texttt{A}};
+\node at (4.5, 1.5) {\texttt{Y}};
-\node at (0.5, 0.5) {75};
+\node at (0.5, 0.5) {65};
-\node at (1.5, 0.5) {73};
+\node at (1.5, 0.5) {76};
-\node at (2.5, 0.5) {83};
+\node at (2.5, 0.5) {76};
-\node at (3.5, 0.5) {83};
+\node at (3.5, 0.5) {69};
-\node at (4.5, 0.5) {65};
+\node at (4.5, 0.5) {89};
 \end{tikzpicture}
 \end{center}
-Jos $A=3$ ja $B=97$, merkkijonon \texttt{KISSA} hajautusarvoksi tulee
+If $A=3$ and $B=97$, the hash value
+for the string \texttt{ALLEY} is
-\[(75 \cdot 3^4 + 73 \cdot 3^3 + 83 \cdot 3^2 + 83 \cdot 3^1 + 65 \cdot 3^0) \bmod 97 = 59.\]
+\[(65 \cdot 3^4 + 76 \cdot 3^3 + 76 \cdot 3^2 + 69 \cdot 3^1 + 89 \cdot 3^0) \bmod 97 = 52.\]
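The formula above can be evaluated with Horner's rule, one character at a time. The following is a minimal sketch, not code from the book: the function name \texttt{polyHash} is hypothetical, and it assumes that $A$ and $B$ are both below $2^{31}$ so that the intermediate product fits in \texttt{long long}.

```cpp
#include <string>

// Polynomial hash (c[1]*A^(n-1) + ... + c[n]*A^0) mod B,
// evaluated left to right with Horner's rule.
// Assumption: A and B are below 2^31, so value * A + c cannot overflow.
long long polyHash(const std::string& s, long long A, long long B) {
    long long value = 0;
    for (unsigned char c : s) {
        value = (value * A + c) % B;
    }
    return value;
}
```

With $A=3$ and $B=97$, calling \texttt{polyHash("ALLEY", 3, 97)} reproduces the value 52 from the worked example.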
-\subsubsection*{Esikäsittely}
+\subsubsection*{Preprocessing}
-Merkkijonohajautuksen esikäsittely
+To efficiently calculate hash values of substrings,
-muodostaa tietoa, jonka avulla 
+we need to preprocess the string.
-voi laskea tehokkaasti merkkijonon
+It turns out that using polynomial hashing,
-osajonojen hajautusarvoja.
+we can calculate the hash value of any substring
-Osoittautuu, että polynomisessa hajautuksessa
+in $O(1)$ time after $O(n)$-time preprocessing.
-$O(n)$-aikaisen esikäsittelyn jälkeen voi laskea
-minkä tahansa osajonon hajautusarvon
-ajassa $O(1)$.
-Ideana on muodostaa taulukko $h$,
+The idea is to construct an array $h$ such that
-jossa $h[k]$ on hajautusarvo merkkijonon
+$h[k]$ contains the hash value for the prefix
-alkuosalle kohtaan $k$ asti.
+of the string that ends at index $k$.
-Taulukon voi muodostaa rekursiolla seuraavasti:
+The array values can be recursively calculated as follows:
 \[
 \begin{array}{lcl}
 h[0] & = & 0 \\
 h[k] & = & (h[k-1] A + c[k]) \bmod B \\
 \end{array}
 \]
-Lisäksi muodostetaan taulukko $p$,
+In addition, we construct an array $p$
-jossa $p[k]=A^k \bmod B$:
+where $p[k]=A^k \bmod B$:
 \[
 \begin{array}{lcl}
 p[0] & = & 1 \\
 p[k] & = & (p[k-1] A) \bmod B. \\
 \end{array}
 \]
-Näiden taulukoiden muodostaminen vie aikaa $O(n)$.
+Constructing these arrays takes $O(n)$ time.
-Tämän jälkeen hajautusarvo merkkijonon osajonolle,
+After this, the hash value for a substring
-joka alkaa kohdasta $a$ ja päättyy kohtaan $b$,
+of the string
-voidaan laskea $O(1)$-ajassa kaavalla
+that begins at index $a$ and ends at index $b$
+can be calculated in $O(1)$ time using the formula
 \[(h[b]-h[a-1] p[b-a+1]) \bmod B.\]
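The preprocessing and the substring formula can be sketched as follows. This is an illustrative sketch under stated assumptions: the struct name \texttt{PrefixHasher} and its members are hypothetical, positions are 1-based as in the text, and $A$ and $B$ are assumed to be below $2^{31}$. Note that in C++ the \texttt{\%} operator can return a negative result when the left operand is negative, so the sketch adds $B$ back in that case.

```cpp
#include <string>
#include <vector>

struct PrefixHasher {
    long long A, B;
    std::vector<long long> h;  // h[k]: hash of the prefix ending at position k
    std::vector<long long> p;  // p[k]: A^k mod B

    // Builds both arrays in O(n) time. Assumption: A and B are below 2^31.
    PrefixHasher(const std::string& s, long long A, long long B)
        : A(A), B(B), h(s.size() + 1, 0), p(s.size() + 1, 1) {
        for (std::size_t k = 1; k <= s.size(); k++) {
            h[k] = (h[k - 1] * A + (unsigned char)s[k - 1]) % B;
            p[k] = (p[k - 1] * A) % B;
        }
    }

    // Hash of the substring at 1-based positions a..b (inclusive), in O(1) time.
    long long substringHash(int a, int b) const {
        long long value = (h[b] - h[a - 1] * p[b - a + 1]) % B;
        if (value < 0) value += B;  // % may yield a negative remainder in C++
        return value;
    }
};
```

For the string \texttt{ALLEY} with $A=3$ and $B=97$, \texttt{substringHash(1, 5)} gives the hash of the whole string, and the two occurrences of \texttt{L} at positions 2 and 3 receive the same value.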
-\subsubsection*{Hajautuksen käyttö}
+\subsubsection*{Using hash values}
-Hajautusarvot tarjoavat nopean tavan merkkijonojen
+We can efficiently compare strings using hash values.
-vertailemiseen.
+Instead of comparing the real contents of the strings,
-Ideana on vertailla merkkijonojen koko sisällön
+the idea is to compare their hash values.
-sijasta niiden hajautusarvoja.
+If the hash values are equal,
-Jos hajautusarvot ovat samat,
+the strings are \emph{probably} equal,
-myös merkkijonot ovat \textit{todennäköisesti} samat,
+and if the hash values are different,
-ja jos taas hajautusarvot eivät ole samat,
+the strings are \emph{certainly} different.
-merkkijonot eivät \textit{varmasti} ole samat.
-Hajautuksen avulla voi usein tehostaa
+Using hashing, we can often make a brute force
-raa'an voiman algoritmia niin, että siitä tulee tehokas.
+algorithm efficient.
-Tarkastellaan esimerkkinä
+As an example, let's consider a brute force
-raa'an voiman algoritmia, joka laskee,
+algorithm that calculates how many times
-montako kertaa merkkijono $p$
+a string $p$ occurs as a substring in
-esiintyy osajonona merkkijonossa $s$.
+a string $s$.
-Algoritmi käy läpi kaikki kohdat,
+The algorithm goes through all locations
-joissa $p$ voi esiintyä,
+where $p$ can occur, and compares the strings
-ja vertailee merkkijonoja merkki merkiltä.
+character by character.
-Tällaisen algoritmin aikavaativuus on $O(n^2)$.
+The time complexity of such an algorithm is $O(n^2)$.
-Voimme kuitenkin tehostaa algoritmia hajautuksen avulla,
+However, we can make the algorithm more efficient
-koska algoritmissa vertaillaan merkkijonojen osajonoja.
+using hashing, because the algorithm compares
-Hajautusta käyttäen kukin vertailu vie aikaa vain $O(1)$,
+substrings of strings.
-koska vertailua ei tehdä merkki merkiltä
+Using hashing, each comparison only takes $O(1)$ time,
-vaan suoraan hajautusarvon perusteella.
+because only hash values of the strings are compared.
-Tuloksena on algoritmi, jonka aikavaativuus on $O(n)$,
+This results in an algorithm with time complexity $O(n)$,
-joka on paras mahdollinen aikavaativuus tehtävään.
+which is the best possible time complexity for this problem.
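As a sketch of this idea, the function below counts the occurrences of a pattern by comparing hash values only. The name \texttt{countOccurrences} is hypothetical, and the constants are the example values near $10^9$ that this section suggests. Because equal hash values only mean that the strings are probably equal, the result can in principle be wrong if a collision occurs.

```cpp
#include <string>
#include <vector>

// Counts how many times pattern occurs as a substring of text
// by comparing hash values instead of characters.
// Assumption: a hash match is treated as a real match (collisions are ignored).
int countOccurrences(const std::string& text, const std::string& pattern) {
    const long long A = 911382323, B = 972663749;
    std::size_t n = text.size(), m = pattern.size();
    if (m == 0 || m > n) return 0;

    // Hash of the whole pattern
    long long target = 0;
    for (unsigned char c : pattern) target = (target * A + c) % B;

    // Prefix hashes h and powers p for the text
    std::vector<long long> h(n + 1, 0), p(n + 1, 1);
    for (std::size_t k = 1; k <= n; k++) {
        h[k] = (h[k - 1] * A + (unsigned char)text[k - 1]) % B;
        p[k] = (p[k - 1] * A) % B;
    }

    int count = 0;
    for (std::size_t a = 1; a + m - 1 <= n; a++) {
        std::size_t b = a + m - 1;
        long long value = (h[b] - h[a - 1] * p[m]) % B;
        if (value < 0) value += B;
        if (value == target) count++;  // O(1) comparison per position
    }
    return count;
}
```

Each candidate position costs $O(1)$ after the linear preprocessing, so the total running time is linear in the combined length of the two strings.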
-Yhdistämällä hajautus ja \emph{binäärihaku} on mahdollista
+By combining hashing and \emph{binary search},
-myös selvittää logaritmisessa ajassa,
+it is also possible to check the lexicographic order of
-kumpi kahdesta osajonosta on suurempi
+two strings in logarithmic time.
-aakkosjärjestyksessä.
+This can be done by finding out the length
-Tämä onnistuu tutkimalla ensin binäärihaulla,
+of the common prefix of the strings using binary search.
-kuinka pitkä on merkkijonojen yhteinen alkuosa,
+Once we know the common prefix,
-minkä jälkeen yhteisen alkuosan jälkeinen merkki
+the next character after the prefix
-kertoo, kumpi merkkijono on suurempi.
+indicates the order of the strings.
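The comparison can be sketched as follows for two substrings of the same string. The struct name \texttt{SubstringComparer} and its interface (0-based start positions and lengths) are assumptions made for this illustration, and the constants are the example values near $10^9$ from this section. The binary search relies on the fact that if the first $k$ characters agree, all shorter prefixes agree as well.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct SubstringComparer {
    const long long A = 911382323, B = 972663749;
    std::string s;
    std::vector<long long> h, p;  // prefix hashes and powers of A

    explicit SubstringComparer(const std::string& str)
        : s(str), h(str.size() + 1, 0), p(str.size() + 1, 1) {
        for (std::size_t k = 1; k <= s.size(); k++) {
            h[k] = (h[k - 1] * A + (unsigned char)s[k - 1]) % B;
            p[k] = (p[k - 1] * A) % B;
        }
    }

    // Hash of the len characters starting at 0-based position start
    long long hashOf(int start, int len) const {
        long long value = (h[start + len] - h[start] * p[len]) % B;
        return value < 0 ? value + B : value;
    }

    // Compares s[a..a+lenA) with s[b..b+lenB) lexicographically.
    // Returns a negative, zero, or positive value. Uses O(log n) hash comparisons.
    int compare(int a, int lenA, int b, int lenB) const {
        int lo = 0, hi = std::min(lenA, lenB);  // common prefix length lies in [lo, hi]
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (hashOf(a, mid) == hashOf(b, mid)) lo = mid;
            else hi = mid - 1;
        }
        // One substring is a prefix of the other: the shorter one comes first
        if (lo == lenA || lo == lenB) return lenA - lenB;
        // Otherwise the first differing character decides the order
        return (unsigned char)s[a + lo] - (unsigned char)s[b + lo];
    }
};
```

As with all hash-based comparisons, the result is correct only if no collision occurs during the binary search.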
-\subsubsection*{Törmäykset ja parametrit}
+\subsubsection*{Collisions and parameters}
-\index{tzzmxys@törmäys}
+\index{collision}
-Ilmeinen riski hajautusarvojen vertailussa
+An evident risk in comparing hash values is
-on \key{törmäys}, joka tarkoittaa, että kahdessa merkkijonossa on
+\key{collision}, which means that two strings have
-eri sisältö mutta niiden hajautusarvot ovat samat.
+different contents but equal hash values.
-Tällöin hajautusarvojen perusteella merkkijonot
+In this case, based on the hash values it seems that
-näyttävät samalta, vaikka todellisuudessa ne eivät ole samat,
+the strings are equal, but in reality they aren't,
-ja algoritmi voi toimia väärin.
+and the algorithm may give incorrect results.
-Törmäyksen riski on aina olemassa,
+Collisions are always possible,
-koska erilaisia merkkijonoja on enemmän kuin
+because the number of different strings is larger
-erilaisia hajautusarvoja.
+than the number of different hash values.
-Riskin saa kuitenkin pieneksi valitsemalla
+However, the probability of a collision is small
-hajautuksen vakiot $A$ ja $B$ huolellisesti.
+if the constants $A$ and $B$ are carefully chosen.
-Vakioiden valinnassa on kaksi tavoitetta:
+There are two goals: the hash values should be
-hajautusarvojen tulisi
+evenly distributed for the strings,
-jakautua tasaisesti merkkijonoille
+and the number of different hash values should
-ja
+be large enough.
-erilaisten hajautusarvojen määrän tulisi
-olla riittävän suuri.
-Hyvä ratkaisu on valita vakioiksi suuria
+A good solution is to use large random numbers
-satunnaislukuja. Tavallinen tapa on valita vakiot
+as constants.
-läheltä lukua $10^9$, esimerkiksi
+A usual way is to choose constants that are
+near $10^9$, for example
 \[
 \begin{array}{lcl}
 A & = & 911382323 \\
 B & = & 972663749 \\
 \end{array}
 \]
-Tällainen valinta takaa sen,
+This choice ensures that the hash values
-että hajautusarvot jakautuvat
+are distributed evenly enough in the range $0 \ldots B-1$.
-riittävän tasaisesti välille $0 \ldots B-1$.
+The benefit of the magnitude $10^9$ is that
-Suuruusluokan $10^9$ etuna on,
+the \texttt{long long} type can be used
-että \texttt{long long} -tyyppi riittää
+for calculating the hash values,
-hajautusarvojen käsittelyyn koodissa,
+because the products $AB$ and $BB$ fit in \texttt{long long}.
-koska tulot $AB$ ja $BB$ mahtuvat \texttt{long long} -tyyppiin.
+But is it enough to have $10^9$ different hash values?
-Mutta onko $10^9$ riittävä määrä hajautusarvoja?
-Tarkastellaan nyt kolmea hajautuksen käyttötapaa:
+Let's consider three scenarios where hashing can be used:
-\textit{Tapaus 1:} Merkkijonoja $x$ ja $y$ verrataan toisiinsa.
+\textit{Scenario 1:} Strings $x$ and $y$ are compared with
-Törmäyksen todennäköisyys on $1/B$ olettaen,
+each other.
-että kaikki hajautusarvot esiintyvät yhtä usein.
+The probability of a collision is $1/B$ assuming that
+all hash values are equally probable.
-\textit{Tapaus 2:} Merkkijonoa $x$ verrataan merkkijonoihin
+\textit{Scenario 2:} A string $x$ is compared with strings
 $y_1,y_2,\ldots,y_n$.
-Yhden tai useamman törmäyksen todennäköisyys on
+The probability of one or more collisions is
 \[1-(1-1/B)^n.\]
-\textit{Tapaus 3:} Merkkijonoja $x_1,x_2,\ldots,x_n$
+\textit{Scenario 3:} Strings $x_1,x_2,\ldots,x_n$
-verrataan kaikkia keskenään.
+are compared with each other.
-Yhden tai useamman törmäyksen todennäköisyys on
+The probability of one or more collisions is
 \[ 1 - \frac{B \cdot (B-1) \cdot (B-2) \cdots (B-n+1)}{B^n}.\]
-Seuraava taulukko sisältää törmäyksen todennäköisyydet,
+The following table shows the collision probabilities
-kun vakion $B$ arvo vaihtelee ja $n=10^6$:
+when the value of $B$ varies and $n=10^6$:
 \begin{center}
 \begin{tabular}{rrrr}
-vakio $B$ & tapaus 1 & tapaus 2 & tapaus 3 \\
+constant $B$ & scenario 1 & scenario 2 & scenario 3 \\
 \hline
 $10^3$ & $0.001000$ & $1.000000$ & $1.000000$ \\
 $10^6$ & $0.000001$ & $0.632121$ & $1.000000$ \\
@@ -372,44 +367,41 @@ $10^{18}$ & $0.000000$ & $0.000000$ & $0.000001$ \\
 \end{tabular}
 \end{center}
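The three probabilities can be evaluated numerically as a rough check of the table. This is a sketch with hypothetical function names. It assumes, as the text does, that all hash values are equally probable. It works with logarithms (\texttt{std::log1p} and \texttt{std::expm1}) because the direct products would lose precision for large $B$.

```cpp
#include <cmath>

// Scenario 1: two strings are compared; collision probability 1/B.
double collisionOnePair(double B) {
    return 1.0 / B;
}

// Scenario 2: one string against n strings; 1 - (1 - 1/B)^n.
double collisionOneVsMany(double B, double n) {
    return -std::expm1(n * std::log1p(-1.0 / B));
}

// Scenario 3: n strings compared with each other;
// 1 - B(B-1)(B-2)...(B-n+1) / B^n, computed as 1 - exp(sum of log(1 - i/B)).
double collisionAllPairs(double B, long long n) {
    double logNoCollision = 0.0;
    for (long long i = 1; i < n; i++) {
        if (i >= B) return 1.0;  // more strings than hash values: collision is certain
        logNoCollision += std::log1p(-i / B);
    }
    return -std::expm1(logNoCollision);
}
```

The third function also covers the classic birthday setting: with $B=365$ and $n=23$ the probability already exceeds one half.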
-Taulukosta näkee, että tapauksessa 1
+The table shows that in scenario 1,
-törmäyksen riski on olematon
+the probability of a collision is negligible
-valinnalla $B \approx 10^9$.
+when $B \approx 10^9$.
-Tapauksessa 2 riski on olemassa, mutta se on silti edelleen vähäinen.
+In scenario 2, a collision is possible but the
-Tapauksessa 3 tilanne on kuitenkin täysin toinen:
+probability is still quite small.
-törmäys tapahtuu käytännössä varmasti
+However, in scenario 3 the situation is very different:
-vielä valinnalla $B \approx 10^9$.
+a collision will almost always happen when
+$B \approx 10^9$.
-\index{syntymxpxivxparadoksi@syntymäpäiväparadoksi}
+\index{birthday paradox}
-Tapauksen 3 ilmiö tunnetaan nimellä
+The phenomenon in scenario 3 is known as the
-\key{syntymäpäiväparadoksi}:
+\key{birthday paradox}: if there are $n$ people
-jos huoneessa on $n$ henkilöä, on suuri
+in a room, the probability that some two people
-todennäköisyys, että jollain kahdella
+have the same birthday is large even if $n$ is quite small.
-henkilöllä on sama syntymäpäivä, vaikka
+In hashing, correspondingly, when all hash values are compared
-$n$ olisi melko pieni.
+with each other, the probability that some two
-Vastaavasti hajautuksessa kun kaikkia
+hash values are the same is large.
-hajautusarvoja verrataan keskenään,
-käy helposti niin, että jotkin
-kaksi ovat sattumalta samoja.
-Hyvä tapa pienentää törmäyksen riskiä on laskea
+A good way to make the probability of a collision
-\emph{useita} hajautusarvoja eri parametreilla
+smaller is to calculate \emph{multiple} hash values
-ja verrata niitä kaikkia.
+using different parameters.
-On hyvin pieni todennäköisyys,
+It is very unlikely that a collision would occur
-että törmäys tapahtuisi samaan aikaan
+in all hash values at the same time.
-kaikissa hajautusarvoissa.
+For example, two hash values with parameter
-Esimerkiksi kaksi hajautusarvoa parametrilla
+$B \approx 10^9$ correspond to one hash
-$B \approx 10^9$ vastaa yhtä hajautusarvoa
+value with parameter $B \approx 10^{18}$,
-parametrilla $B \approx 10^{18}$,
+which makes the probability of a collision very small.
-mikä takaa hyvän suojan törmäyksiltä.
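A minimal sketch of using multiple hash values is to compute two polynomial hashes with different parameters and compare them as a pair. The function name \texttt{doubleHash} is hypothetical. The first parameter pair is the example from this section, and the second pair ($A=131$, $B=10^9+7$) is an assumption chosen only for illustration.

```cpp
#include <string>
#include <utility>

// Returns two polynomial hash values of s computed with different parameters.
// Two strings are considered equal only if both components match.
std::pair<long long, long long> doubleHash(const std::string& s) {
    const long long A1 = 911382323, B1 = 972663749;  // example constants from the text
    const long long A2 = 131, B2 = 1000000007;       // assumed second parameter pair
    long long first = 0, second = 0;
    for (unsigned char c : s) {
        first = (first * A1 + c) % B1;
        second = (second * A2 + c) % B2;
    }
    return {first, second};
}
```

Since \texttt{std::pair} supports \texttt{==} and \texttt{<}, the result can be compared directly or stored in a sorted container.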
-Jotkut käyttävät hajautuksessa vakioita $B=2^{32}$ tai $B=2^{64}$,
+Some people use constants $B=2^{32}$ or $B=2^{64}$,
-jolloin modulo $B$ tulee laskettua
+which is convenient, because operations on 32-bit and 64-bit
-automaattisesti, kun muuttujan arvo pyörähtää ympäri.
+integers are calculated modulo $2^{32}$ and $2^{64}$.
-Tämä ei ole kuitenkaan hyvä valinta,
+However, this is not a good choice, because it is possible
-koska muotoa $2^x$ olevaa moduloa vastaan
+to construct inputs that always generate collisions when
-pystyy tekemään testisyötteen, joka aiheuttaa varmasti törmäyksen\footnote{
+constants of the form $2^x$ are used\footnote{
 J. Pachocki and J. Radoszewski:
 ''Where to use and how not to use polynomial string hashing''.
 \textit{Olympiads in Informatics}, 2013.