Definitions and trie

2017-01-22 13:15:41 +02:00 · e03e9906a8
parent 0203944cc2
commit e03e9906a8
1 changed files with 119 additions and 119 deletions
--- a/luku26.tex
+++ b/luku26.tex
@@ -1,108 +1,107 @@
 \chapter{String algorithms}

-\index{merkkijono@merkkijono}
-\index{aakkosto@aakkosto}
+\index{string}
+\index{alphabet}

-Merkkijonon $s$ merkit ovat $s[1],s[2],\ldots,s[n]$,
-missä $n$ on merkkijonon pituus.
+A string $s$ of length $n$
+is a sequence of characters
+$s[1],s[2],\ldots,s[n]$.

-\key{Aakkosto} sisältää ne merkit,
-joita merkkijonossa voi esiintyä.
-Esimerkiksi aakkosto $\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
-sisältää englannin kielen suuret kirjaimet.
+An \key{alphabet} is a set of characters
+that may appear in strings.
+For example, the alphabet
+$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
+consists of the capital letters of English.

-\index{osajono@osajono}
+\index{substring}

-\key{Osajono}
-sisältää merkkijonon merkit
-yhtenäiseltä väliltä.
-Merkkijonon osa\-jonojen määrä on $n(n+1)/2$.
-Esimerkiksi merkkijonon \texttt{ALGORITMI}
-yksi osajono on \texttt{ORITM},
-joka muodostuu valitsemalla välin \texttt{ALG\underline{ORITM}I}.
+A \key{substring} is a sequence of consecutive
+characters in a string.
+A string of length $n$ has $n(n+1)/2$ substrings, counted by position.
+For example, \texttt{ORITH} is a substring
+of \texttt{ALGORITHM}, and it corresponds
+to \texttt{ALG\underline{ORITH}M}.

-\index{alijono@alijono}
+\index{subsequence}

-\key{Alijono}
-on osajoukko merkkijonon merkeistä.
-Merkkijonon alijonojen määrä on $2^n-1$.
-Esimerkiksi merkkijonon \texttt{ALGORITMI}
-yksi alijono on \texttt{LGRMI}, joka muodostuu
-valitsemalla merkit \texttt{A\underline{LG}O\underline{R}IT\underline{MI}}.
+A \key{subsequence} is a sequence of (not necessarily consecutive)
+characters of a string in their original order.
+A string of length $n$ has $2^n-1$ subsequences, counted by position.
+For example, \texttt{LGRHM} is a subsequence
+of \texttt{ALGORITHM}, and it corresponds
+to \texttt{A\underline{LG}O\underline{R}IT\underline{HM}}.

-\index{alkuosa@alkuosa}
-\index{loppuosa@loppuosa}
-\index{prefiksi@prefiksi}
-\index{suffiksi@suffiksi}
+\index{prefix}
+\index{suffix}

-\key{Alkuosa} on merkkijonon
-alusta alkava osajono,
-ja \key{loppuosa} on merkkijonon
-loppuun päättyvä osajono.
-Esimerkiksi merkkijonon \texttt{KISSA}
-alkuosat ovat \texttt{K}, \texttt{KI},
-\texttt{KIS}, \texttt{KISS} ja \texttt{KISSA}
-ja loppuosat ovat \texttt{A}, \texttt{SA},
-\texttt{SSA}, \texttt{ISSA} ja \texttt{KISSA}.
-Alkuosa tai loppuosa on \key{aito},
-jos se ei ole koko merkkijono.
+A \key{prefix} is a substring that contains the first
+character of a string,
+and a \key{suffix} is a substring that contains the last character.
+For example, the prefixes of
+\texttt{STORY} are \texttt{S}, \texttt{ST},
+\texttt{STO}, \texttt{STOR} and \texttt{STORY},
+and the suffixes are \texttt{Y}, \texttt{RY},
+\texttt{ORY}, \texttt{TORY} and \texttt{STORY}.
+A prefix or a suffix is \key{proper}
+if it is not the whole string.

-\index{kierto@kierto}
+\index{rotation}

-\key{Kierto} syntyy
-siirtämällä jokin alkuosa merkkijonon loppuun
-tai jokin loppuosa merkkijonon alkuun.
-Esimerkiksi merkkijonon \texttt{APILA}
-kierrot ovat
-\texttt{APILA},
-\texttt{PILAA},
-\texttt{ILAAP},
-\texttt{LAAPI} ja
-\texttt{AAPIL}.
+A \key{rotation} is generated by moving
+characters one by one from the beginning of a string
+to its end (or vice versa).
+For example, the rotations of \texttt{STORY} are
+\texttt{STORY},
+\texttt{TORYS},
+\texttt{ORYST},
+\texttt{RYSTO} and
+\texttt{YSTOR}.
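
The rotations listed above can be generated with a short C++ sketch. The function name \texttt{rotations} is an assumption made for this illustration, and the code uses 0-based indexing, unlike the 1-based notation in the text.

```cpp
#include <string>
#include <vector>

// Returns all rotations of s, starting with s itself.
// Rotation k moves the first k characters of s to the end.
std::vector<std::string> rotations(const std::string& s) {
    std::vector<std::string> result;
    for (std::size_t k = 0; k < s.size(); k++) {
        // s.substr(k) is the suffix starting at position k,
        // s.substr(0, k) is the prefix of length k.
        result.push_back(s.substr(k) + s.substr(0, k));
    }
    return result;
}
```

Calling \texttt{rotations("STORY")} yields the five strings of the example in the same order. Building every rotation explicitly takes $O(n^2)$ time and memory, which is acceptable only for short strings.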

-\index{jakso@jakso}
+\index{period}

-\key{Jakso} on alkuosa,
-jota toistamalla merkkijono muodostuu.
-Jakson viimeinen toistokerta voi olla osittainen
-niin, että siinä on vain jakson alkuosa.
-Usein on kiinnostavaa selvittää, mikä on merkkijonon
-\key{lyhin jakso}.
-Esimerkiksi merkkijonon \texttt{ABCABCA} lyhin jakso on \texttt{ABC}.
-Tässä tapauksessa merkkijono syntyy toistamalla jaksoa ensin kahdesti kokonaan
-ja sitten kerran osittain.
+A \key{period} is a prefix of a string such that
+we can construct the string by repeating the period.
+The last repetition may be partial and contain
+only a prefix of the period.
+Often it is interesting to find the \key{shortest period}
+of a string.
+For example, the shortest period of
+\texttt{ABCABCA} is \texttt{ABC}.
+In this case, the period is first repeated twice in full
+and then once partially.
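
As an illustration of the definition, the following brute-force sketch finds the length of the shortest period. The function name \texttt{shortestPeriod} is hypothetical, and the code uses 0-based indexing. It relies on the observation that a prefix of length $p$ is a period exactly when every character equals the character $p$ positions before it.

```cpp
#include <string>

// Length of the shortest period of s: the smallest p >= 1 such that
// s[i] == s[i - p] for every i >= p (0-based indexing).
// Brute force: O(n^2) time in the worst case.
int shortestPeriod(const std::string& s) {
    int n = s.size();
    for (int p = 1; p < n; p++) {
        bool ok = true;
        for (int i = p; i < n && ok; i++) {
            if (s[i] != s[i - p]) ok = false;
        }
        if (ok) return p;
    }
    return n; // the whole string is always a period of itself
}
```

For \texttt{ABCABCA} the function returns 3, matching the period \texttt{ABC}. This is only a direct check of the definition, not an efficient algorithm.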

-\key{Reuna} on
-merkkijono, joka on sekä
-alkuosa että loppuosa.
-Esimerkiksi merkkijonon \texttt{ABADABA}
-reunat ovat \texttt{A}, \texttt{ABA} ja
-\texttt{ABADABA}.
-Usein halutaan etsiä \key{pisin reuna},
-joka ei ole koko merkkijono.
+\index{border}

-\index{leksikografinen jxrjestys@leksikografinen järjestys}
+A \key{border} is a string that is both
+a prefix and a suffix of a string.
+For example, the borders of \texttt{ABADABA}
+are \texttt{A}, \texttt{ABA} and \texttt{ABADABA}.
+Often we want to find the \key{longest border}
+that is not the whole string.
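
A minimal sketch of finding the longest border that is not the whole string is shown below. The function name \texttt{longestProperBorder} is an assumption for this example. It simply tries every candidate length from the longest to the shortest and compares the prefix and the suffix of that length.

```cpp
#include <string>

// Length of the longest border of s that is not the whole string,
// or 0 if no such border exists. Brute force: O(n^2) time.
int longestProperBorder(const std::string& s) {
    int n = s.size();
    for (int len = n - 1; len >= 1; len--) {
        // Compare the prefix s[0..len-1] with the suffix s[n-len..n-1].
        if (s.compare(0, len, s, n - len, len) == 0) return len;
    }
    return 0;
}
```

For \texttt{ABADABA} the function returns 3, which corresponds to the border \texttt{ABA}.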

-Merkkijonojen vertailussa käytössä on yleensä
-\key{leksikografinen järjestys}, joka vastaa aakkosjärjestystä.
-Siinä $x<y$, jos joko $x$ on $y$:n aito alkuosa
-tai on olemassa kohta $k$ niin,
-että $x[i]=y[i]$, kun $i<k$, ja $x[k]<y[k]$.
+\index{lexicographical order}

-\section{Trie-rakenne}
+Usually we compare strings using the \key{lexicographical order},
+which corresponds to alphabetical order.
+It means that $x<y$ if either $x$ is a proper prefix of $y$,
+or there is an index $k$ such that
+$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
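
The definition can be written out directly as a comparison function. The sketch below uses the hypothetical name \texttt{lexLess} and 0-based indexing. For strings of plain ASCII characters it should agree with the built-in \texttt{<} operator of \texttt{std::string}.

```cpp
#include <string>

// Returns true if x < y in lexicographical order: either x is a proper
// prefix of y, or at the first position k where the strings differ
// we have x[k] < y[k].
bool lexLess(const std::string& x, const std::string& y) {
    std::size_t k = 0;
    // Skip the common prefix of x and y.
    while (k < x.size() && k < y.size() && x[k] == y[k]) k++;
    if (k == x.size()) return k < y.size(); // x is a prefix of y; proper only if shorter
    if (k == y.size()) return false;        // y is a proper prefix of x
    return x[k] < y[k];
}
```

For example, \texttt{lexLess("THE", "THERE")} is true because \texttt{THE} is a proper prefix of \texttt{THERE}, and \texttt{lexLess("CANAL", "CANDY")} is true because the strings first differ at \texttt{A} and \texttt{D}.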

-\index{trie@trie}
+\section{Trie structure}

-\key{Trie} on puurakenne,
-joka pitää yllä joukkoa merkkijonoja.
-Merkkijonot tallennetaan puuhun
-juuresta lähtevinä merkkien ketjuina.
-Jos useammalla merkkijonolla on sama alkuosa,
-niiden ketjun alkuosa on yhteinen.
+\index{trie}

-Esimerkiksi joukkoa
-$\{\texttt{APILA},\texttt{APINA},\texttt{SUU},\texttt{SUURI}\}$
-vastaa seuraava trie:
+A \key{trie} is a tree structure that
+maintains a set of strings.
+Strings are stored in a trie as chains
+of characters that start at the root
+of the tree.
+If two strings have a common prefix,
+they also share a chain in the tree.
+
+For example, the following trie corresponds
+to the set
+$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$:

 \begin{center}
 \begin{tikzpicture}[scale=0.9]
@@ -120,50 +119,51 @@ vastaa seuraava trie:
 \node[draw, circle] (12) at (1.5,14.5) {$\phantom{1}$};
 \node[draw, circle] (13) at (1.5,13) {*};

-\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{A}] {} (2);
-\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{S}] {} (3);
-\path[draw,thick,->] (2) -- node[font=\small,label=left:\texttt{P}] {} (4);
-\path[draw,thick,->] (4) -- node[font=\small,label=left:\texttt{I}] {} (5);
-\path[draw,thick,->] (5) -- node[font=\small,label=left:\texttt{L}] {} (6);
-\path[draw,thick,->] (5) -- node[font=\small,label=right:\texttt{N}] {} (7);
-\path[draw,thick,->] (6) -- node[font=\small,label=left:\texttt{A}] {}(8);
-\path[draw,thick,->] (7) -- node[font=\small,label=right:\texttt{A}] {} (9);
-\path[draw,thick,->] (3) -- node[font=\small,label=right:\texttt{U}] {} (10);
-\path[draw,thick,->] (10) -- node[font=\small,label=right:\texttt{U}] {} (11);
+\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{C}] {} (2);
+\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{T}] {} (3);
+\path[draw,thick,->] (2) -- node[font=\small,label=left:\texttt{A}] {} (4);
+\path[draw,thick,->] (4) -- node[font=\small,label=left:\texttt{N}] {} (5);
+\path[draw,thick,->] (5) -- node[font=\small,label=left:\texttt{A}] {} (6);
+\path[draw,thick,->] (5) -- node[font=\small,label=right:\texttt{D}] {} (7);
+\path[draw,thick,->] (6) -- node[font=\small,label=left:\texttt{L}] {}(8);
+\path[draw,thick,->] (7) -- node[font=\small,label=right:\texttt{Y}] {} (9);
+\path[draw,thick,->] (3) -- node[font=\small,label=right:\texttt{H}] {} (10);
+\path[draw,thick,->] (10) -- node[font=\small,label=right:\texttt{E}] {} (11);
 \path[draw,thick,->] (11) -- node[font=\small,label=right:\texttt{R}] {} (12);
-\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{I}] {} (13);
+\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13);
 \end{tikzpicture}
 \end{center}
-Merkki * solmussa tarkoittaa,
-että jokin merkkijono päättyy kyseiseen solmuun.
-Tämä merkki on tarpeen,
-koska merkkijono voi olla toisen merkkijonon alkuosa,
-kuten tässä puussa merkkijono \texttt{SUU} 
-on merkkijonon \texttt{SUURI} alkuosa.
+The character * in a node means that
+a string ends at the node.
+This character is needed because a string
+may be a prefix of another string.
+For example, in this trie, \texttt{THE}
+is a prefix of \texttt{THERE}.

-Triessä merkkijonon lisääminen ja hakeminen
-vievät aikaa $O(n)$, kun $n$ on merkkijonon pituus.
-Molemmat operaatiot voi toteuttaa lähtemällä liikkeelle juuresta
-ja kulkemalla alaspäin ketjua merkkien mukaisesti.
-Tarvittaessa puuhun lisätään uusia solmuja.
+Inserting a string into a trie and searching for a string both take $O(n)$ time,
+where $n$ is the length of the string.
+Both operations can be implemented by
+starting at the root node and following the
+chain of characters that appear in the string.
+During insertion, new nodes are added to the trie as needed.

-Triestä on mahdollista etsiä
-sekä merkkijonoja että merkkijonojen alkuosia.
-Lisäksi puun solmuissa voi pitää kirjaa,
-monessako merkkijonossa on solmua vastaava alkuosa,
-mikä lisää trien käyttömahdollisuuksia.
+A trie can be used to search for both strings
+and prefixes of strings.
+In addition, each node can store the number
+of strings that have the corresponding prefix,
+which can be useful in some applications.

-Trie on kätevää tallentaa taulukkona
+A trie can be stored as an array
 \begin{lstlisting}
 int t[N][A];
 \end{lstlisting}
-missä $N$ on solmujen suurin mahdollinen määrä
-(eli tallennettavien merkkijonojen yhteispituus)
-ja $A$ on aakkoston koko.
-Trien solmut numeroidaan $1,2,3,\ldots$ niin,
-että juuren numero on 1,
-ja taulukon kohta $\texttt{t}[s][c]$ kertoo,
-mihin solmuun solmusta $s$ pääsee merkillä $c$.
+where $N$ is the maximum number of nodes
+(the total length of the strings to be stored)
+and $A$ is the size of the alphabet.
+The nodes of a trie are numbered
+$1,2,3,\ldots$ so that the number of the root is 1,
+and $\texttt{t}[s][c]$ is the node reached
+from node $s$ by following character $c$.
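
One possible way to implement insertion and search on top of this array is sketched below. Everything except the array \texttt{t} is an assumption made for illustration: the helper arrays \texttt{isEnd} and \texttt{cnt}, the counter \texttt{nodes}, the function names, the concrete bound on $N$, and the restriction to the capital letters \texttt{A}--\texttt{Z}. The value 0 in \texttt{t} is taken to mean that no such node exists, which works because the nodes are numbered from 1.

```cpp
#include <string>

const int N = 1000; // assumed bound: total length of the strings plus 2
const int A = 26;   // assumed alphabet: capital letters A..Z

int t[N][A];        // t[s][c] = node reached from s by character c, 0 if none
bool isEnd[N];      // isEnd[s] = some string ends at node s (the * marker)
int cnt[N];         // cnt[s] = number of inserted strings passing through s
int nodes = 1;      // node 1 is the root, index 0 is unused

// Adds w to the trie, creating new nodes when needed. O(n) time.
void trieInsert(const std::string& w) {
    int s = 1;
    for (char ch : w) {
        int c = ch - 'A';
        if (t[s][c] == 0) t[s][c] = ++nodes;
        s = t[s][c];
        cnt[s]++;
    }
    isEnd[s] = true;
}

// Follows the chain of w from the root; returns the last node, or 0.
int trieWalk(const std::string& w) {
    int s = 1;
    for (char ch : w) {
        s = t[s][ch - 'A'];
        if (s == 0) return 0;
    }
    return s;
}

// True if w itself was inserted (not only as a prefix of another string).
bool trieContains(const std::string& w) {
    int s = trieWalk(w);
    return s != 0 && isEnd[s];
}

// Number of inserted strings that have the nonempty prefix p.
int trieCountPrefix(const std::string& p) {
    int s = trieWalk(p);
    return s == 0 ? 0 : cnt[s];
}
```

After inserting \texttt{CANAL}, \texttt{CANDY}, \texttt{THE} and \texttt{THERE}, the trie has 13 nodes as in the picture, \texttt{trieContains("THE")} is true and \texttt{trieContains("THER")} is false, because no string ends at that node. In this sketch a string inserted twice is also counted twice in \texttt{cnt}.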

 \section{Merkkijonohajautus}