Definitions and trie
parent
0203944cc2
commit
e03e9906a8
238
luku26.tex
|
@ -1,108 +1,107 @@
|
|||
\chapter{String algorithms}
|
||||
|
||||
\index{merkkijono@merkkijono}
|
||||
\index{aakkosto@aakkosto}
|
||||
\index{string}
|
||||
\index{alphabet}
|
||||
|
||||
Merkkijonon $s$ merkit ovat $s[1],s[2],\ldots,s[n]$,
|
||||
missä $n$ on merkkijonon pituus.
|
||||
A string $s$ of length $n$
|
||||
is a sequence of characters
|
||||
$s[1],s[2],\ldots,s[n]$.
|
||||
|
||||
\key{Aakkosto} sisältää ne merkit,
|
||||
joita merkkijonossa voi esiintyä.
|
||||
Esimerkiksi aakkosto $\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
|
||||
sisältää englannin kielen suuret kirjaimet.
|
||||
An \key{alphabet} is a set of characters
|
||||
that may appear in strings.
|
||||
For example, the alphabet
|
||||
$\{\texttt{A},\texttt{B},\ldots,\texttt{Z}\}$
|
||||
consists of the capital letters of English.
|
||||
|
||||
\index{osajono@osajono}
|
||||
\index{substring}
|
||||
|
||||
\key{Osajono}
|
||||
sisältää merkkijonon merkit
|
||||
yhtenäiseltä väliltä.
|
||||
Merkkijonon osa\-jonojen määrä on $n(n+1)/2$.
|
||||
Esimerkiksi merkkijonon \texttt{ALGORITMI}
|
||||
yksi osajono on \texttt{ORITM},
|
||||
joka muodostuu valitsemalla välin \texttt{ALG\underline{ORITM}I}.
|
||||
A \key{substring} consists of consecutive
|
||||
characters in a string.
|
||||
The number of substrings in a string is $n(n+1)/2$.
|
||||
For example, \texttt{ORITH} is a substring
|
||||
in \texttt{ALGORITHM}, and it corresponds
|
||||
to \texttt{ALG\underline{ORITH}M}.
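The count $n(n+1)/2$ can be checked by listing the substrings directly. The following is a minimal sketch, not part of the original text: the function name \texttt{allSubstrings} is an assumption, and the code uses 0-based indices, as is usual in C++, while the text numbers characters from 1. Substrings are counted by position, so the same text may appear more than once.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Lists every substring of s by choosing a start position and a length.
// Substrings are counted by position, so equal texts may be repeated.
std::vector<std::string> allSubstrings(const std::string& s) {
    std::vector<std::string> result;
    for (std::size_t start = 0; start < s.size(); start++) {
        for (std::size_t len = 1; start + len <= s.size(); len++) {
            result.push_back(s.substr(start, len));
        }
    }
    return result;
}
```

For \texttt{ALGORITHM}, where $n=9$, the list has $9 \cdot 10 / 2 = 45$ entries, one of which is \texttt{ORITH}.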
|
||||
|
||||
\index{alijono@alijono}
|
||||
\index{subsequence}
|
||||
|
||||
\key{Alijono}
|
||||
on osajoukko merkkijonon merkeistä.
|
||||
Merkkijonon alijonojen määrä on $2^n-1$.
|
||||
Esimerkiksi merkkijonon \texttt{ALGORITMI}
|
||||
yksi alijono on \texttt{LGRMI}, joka muodostuu
|
||||
valitsemalla merkit \texttt{A\underline{LG}O\underline{R}IT\underline{MI}}.
|
||||
A \key{subsequence} is a subset of characters
|
||||
in a string in their original order.
|
||||
The number of subsequences in a string is $2^n-1$.
|
||||
For example, \texttt{LGRHM} is a subsequence
|
||||
in \texttt{ALGORITHM}, and it corresponds
|
||||
to \texttt{A\underline{LG}O\underline{R}IT\underline{HM}}.
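The count $2^n-1$ comes from choosing a nonempty set of positions. One way to illustrate this is with bitmasks, as in the sketch below. The function name \texttt{allSubsequences} is an assumption, and the code is only practical for short strings because the output grows exponentially.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Lists every nonempty subsequence of s, one per nonempty set of positions.
// Assumes s is short (fewer than 64 characters, in practice about 20 or less).
std::vector<std::string> allSubsequences(const std::string& s) {
    std::vector<std::string> result;
    const std::size_t n = s.size();
    for (unsigned long long mask = 1; mask < (1ULL << n); mask++) {
        std::string t;
        for (std::size_t i = 0; i < n; i++) {
            // Bit i of mask tells whether position i is selected.
            if (mask & (1ULL << i)) t += s[i];
        }
        result.push_back(t);
    }
    return result;
}
```

Because the positions are scanned from left to right, the selected characters keep their original order.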
|
||||
|
||||
\index{alkuosa@alkuosa}
|
||||
\index{loppuosa@loppuosa}
|
||||
\index{prefiksi@prefiksi}
|
||||
\index{suffiksi@suffiksi}
|
||||
\index{prefix}
|
||||
\index{suffix}
|
||||
|
||||
\key{Alkuosa} on merkkijonon
|
||||
alusta alkava osajono,
|
||||
ja \key{loppuosa} on merkkijonon
|
||||
loppuun päättyvä osajono.
|
||||
Esimerkiksi merkkijonon \texttt{KISSA}
|
||||
alkuosat ovat \texttt{K}, \texttt{KI},
|
||||
\texttt{KIS}, \texttt{KISS} ja \texttt{KISSA}
|
||||
ja loppuosat ovat \texttt{A}, \texttt{SA},
|
||||
\texttt{SSA}, \texttt{ISSA} ja \texttt{KISSA}.
|
||||
Alkuosa tai loppuosa on \key{aito},
|
||||
jos se ei ole koko merkkijono.
|
||||
A \key{prefix} is a substring that contains the first
|
||||
character of a string,
|
||||
and a \key{suffix} is a substring that contains the last character.
|
||||
For example, the prefixes of
|
||||
\texttt{STORY} are \texttt{S}, \texttt{ST},
|
||||
\texttt{STO}, \texttt{STOR} and \texttt{STORY},
|
||||
and the suffixes are \texttt{Y}, \texttt{RY},
|
||||
\texttt{ORY}, \texttt{TORY} and \texttt{STORY}.
|
||||
A prefix or a suffix is \key{proper}
|
||||
if it is not the whole string.
|
||||
|
||||
\index{kierto@kierto}
|
||||
\index{rotation}
|
||||
|
||||
\key{Kierto} syntyy
|
||||
siirtämällä jokin alkuosa merkkijonon loppuun
|
||||
tai jokin loppuosa merkkijonon alkuun.
|
||||
Esimerkiksi merkkijonon \texttt{APILA}
|
||||
kierrot ovat
|
||||
\texttt{APILA},
|
||||
\texttt{PILAA},
|
||||
\texttt{ILAAP},
|
||||
\texttt{LAAPI} ja
|
||||
\texttt{AAPIL}.
|
||||
A \key{rotation} can be generated by moving
|
||||
characters one by one from the beginning to the end
|
||||
in a string (or vice versa).
|
||||
For example, the rotations of \texttt{STORY} are
|
||||
\texttt{STORY},
|
||||
\texttt{TORYS},
|
||||
\texttt{ORYST},
|
||||
\texttt{RYSTO} and
|
||||
\texttt{YSTOR}.
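Moving the first $k$ characters to the end is the same as concatenating the suffix that starts at position $k$ with the prefix of length $k$. The sketch below lists all rotations this way. The function name \texttt{rotations} is an assumption, and indices are 0-based.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the rotations of s, obtained by moving the first k characters
// to the end for k = 0, 1, ..., n-1.
std::vector<std::string> rotations(const std::string& s) {
    std::vector<std::string> result;
    for (std::size_t k = 0; k < s.size(); k++) {
        result.push_back(s.substr(k) + s.substr(0, k));
    }
    return result;
}
```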
|
||||
|
||||
\index{jakso@jakso}
|
||||
\index{period}
|
||||
|
||||
\key{Jakso} on alkuosa,
|
||||
jota toistamalla merkkijono muodostuu.
|
||||
Jakson viimeinen toistokerta voi olla osittainen
|
||||
niin, että siinä on vain jakson alkuosa.
|
||||
Usein on kiinnostavaa selvittää, mikä on merkkijonon
|
||||
\key{lyhin jakso}.
|
||||
Esimerkiksi merkkijonon \texttt{ABCABCA} lyhin jakso on \texttt{ABC}.
|
||||
Tässä tapauksessa merkkijono syntyy toistamalla jaksoa ensin kahdesti kokonaan
|
||||
ja sitten kerran osittain.
|
||||
A \key{period} is a prefix of a string such that
|
||||
we can construct the string by repeating the period.
|
||||
The last repetition may be partial and contain
|
||||
only a prefix of the period.
|
||||
Often it is interesting to find the \key{shortest period}
|
||||
of a string.
|
||||
For example, the shortest period of
|
||||
\texttt{ABCABCA} is \texttt{ABC}.
|
||||
In this case, we first repeat the period twice
|
||||
and then once more partially.
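A prefix of length $p$ is a period exactly when every character equals the character $p$ positions before it. The sketch below uses this condition in a straightforward $O(n^2)$ search. The function name \texttt{shortestPeriod} is an assumption, and faster methods exist that are not shown here.

```cpp
#include <cstddef>
#include <string>

// Returns the length of the shortest period of s.
// A length p is a period if s[i] == s[i-p] for every i >= p (0-based).
// Simple O(n^2) check; returns n when only the whole string works.
std::size_t shortestPeriod(const std::string& s) {
    const std::size_t n = s.size();
    for (std::size_t p = 1; p < n; p++) {
        bool ok = true;
        for (std::size_t i = p; i < n; i++) {
            if (s[i] != s[i - p]) {
                ok = false;
                break;
            }
        }
        if (ok) return p;
    }
    return n;
}
```

For \texttt{ABCABCA} the function returns 3, which corresponds to the period \texttt{ABC}.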
|
||||
|
||||
\key{Reuna} on
|
||||
merkkijono, joka on sekä
|
||||
alkuosa että loppuosa.
|
||||
Esimerkiksi merkkijonon \texttt{ABADABA}
|
||||
reunat ovat \texttt{A}, \texttt{ABA} ja
|
||||
\texttt{ABADABA}.
|
||||
Usein halutaan etsiä \key{pisin reuna},
|
||||
joka ei ole koko merkkijono.
|
||||
\index{border}
|
||||
|
||||
\index{leksikografinen jxrjestys@leksikografinen järjestys}
|
||||
A \key{border} is a string that is both
|
||||
a prefix and a suffix of a string.
|
||||
For example, the borders of \texttt{ABADABA}
|
||||
are \texttt{A}, \texttt{ABA} and \texttt{ABADABA}.
|
||||
Often we want to find the \key{longest border}
|
||||
that is not the whole string.
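The definition of a border can be tested directly by comparing the prefix and the suffix of each length. The following sketch does so in $O(n^2)$ time. The function name \texttt{borders} is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns all borders of s in order of increasing length.
// A length len gives a border if the first len characters
// equal the last len characters.
std::vector<std::string> borders(const std::string& s) {
    std::vector<std::string> result;
    const std::size_t n = s.size();
    for (std::size_t len = 1; len <= n; len++) {
        if (s.compare(0, len, s, n - len, len) == 0) {
            result.push_back(s.substr(0, len));
        }
    }
    return result;
}
```

The longest border that is not the whole string is the second-to-last element of the result, if it exists.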
|
||||
|
||||
Merkkijonojen vertailussa käytössä on yleensä
|
||||
\key{leksikografinen järjestys}, joka vastaa aakkosjärjestystä.
|
||||
Siinä $x<y$, jos joko $x$ on $y$:n aito alkuosa
|
||||
tai on olemassa kohta $k$ niin,
|
||||
että $x[i]=y[i]$, kun $i<k$, ja $x[k]<y[k]$.
|
||||
\index{lexicographical order}
|
||||
|
||||
\section{Trie-rakenne}
|
||||
Usually we compare strings using the \key{lexicographical order}
|
||||
that corresponds to the alphabetical order.
|
||||
It means that $x<y$ if either $x$ is a proper prefix of $y$,
|
||||
or there is an index $k$ such that
|
||||
$x[i]=y[i]$ when $i<k$ and $x[k]<y[k]$.
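The definition translates almost directly into code. The sketch below is one possible reading, where the function name \texttt{lexLess} is an assumption. In C++, \texttt{std::string} already provides this order through the \texttt{<} operator, so the function is only meant to show the two cases of the definition.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Returns true if x < y in lexicographical order.
bool lexLess(const std::string& x, const std::string& y) {
    const std::size_t m = std::min(x.size(), y.size());
    for (std::size_t k = 0; k < m; k++) {
        if (x[k] != y[k]) {
            // First index where the strings differ decides the order.
            return static_cast<unsigned char>(x[k]) <
                   static_cast<unsigned char>(y[k]);
        }
    }
    // No difference found: x < y only if x is a proper prefix of y.
    return x.size() < y.size();
}
```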
|
||||
|
||||
\index{trie@trie}
|
||||
\section{Trie structure}
|
||||
|
||||
\key{Trie} on puurakenne,
|
||||
joka pitää yllä joukkoa merkkijonoja.
|
||||
Merkkijonot tallennetaan puuhun
|
||||
juuresta lähtevinä merkkien ketjuina.
|
||||
Jos useammalla merkkijonolla on sama alkuosa,
|
||||
niiden ketjun alkuosa on yhteinen.
|
||||
\index{trie}
|
||||
|
||||
Esimerkiksi joukkoa
|
||||
$\{\texttt{APILA},\texttt{APINA},\texttt{SUU},\texttt{SUURI}\}$
|
||||
vastaa seuraava trie:
|
||||
A \key{trie} is a tree structure that
|
||||
maintains a set of strings.
|
||||
Strings are stored in a trie as chains
|
||||
of characters that start at the root
|
||||
of the tree.
|
||||
If two strings have a common prefix,
|
||||
they also share a chain in the tree.
|
||||
|
||||
For example, the following trie corresponds
|
||||
to the set
|
||||
$\{\texttt{CANAL},\texttt{CANDY},\texttt{THE},\texttt{THERE}\}$:
|
||||
|
||||
\begin{center}
|
||||
\begin{tikzpicture}[scale=0.9]
|
||||
|
@ -120,50 +119,51 @@ vastaa seuraava trie:
|
|||
\node[draw, circle] (12) at (1.5,14.5) {$\phantom{1}$};
|
||||
\node[draw, circle] (13) at (1.5,13) {*};
|
||||
|
||||
\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{A}] {} (2);
|
||||
\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{S}] {} (3);
|
||||
\path[draw,thick,->] (2) -- node[font=\small,label=left:\texttt{P}] {} (4);
|
||||
\path[draw,thick,->] (4) -- node[font=\small,label=left:\texttt{I}] {} (5);
|
||||
\path[draw,thick,->] (5) -- node[font=\small,label=left:\texttt{L}] {} (6);
|
||||
\path[draw,thick,->] (5) -- node[font=\small,label=right:\texttt{N}] {} (7);
|
||||
\path[draw,thick,->] (6) -- node[font=\small,label=left:\texttt{A}] {}(8);
|
||||
\path[draw,thick,->] (7) -- node[font=\small,label=right:\texttt{A}] {} (9);
|
||||
\path[draw,thick,->] (3) -- node[font=\small,label=right:\texttt{U}] {} (10);
|
||||
\path[draw,thick,->] (10) -- node[font=\small,label=right:\texttt{U}] {} (11);
|
||||
\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{C}] {} (2);
|
||||
\path[draw,thick,->] (1) -- node[font=\small,label=\texttt{T}] {} (3);
|
||||
\path[draw,thick,->] (2) -- node[font=\small,label=left:\texttt{A}] {} (4);
|
||||
\path[draw,thick,->] (4) -- node[font=\small,label=left:\texttt{N}] {} (5);
|
||||
\path[draw,thick,->] (5) -- node[font=\small,label=left:\texttt{A}] {} (6);
|
||||
\path[draw,thick,->] (5) -- node[font=\small,label=right:\texttt{D}] {} (7);
|
||||
\path[draw,thick,->] (6) -- node[font=\small,label=left:\texttt{L}] {}(8);
|
||||
\path[draw,thick,->] (7) -- node[font=\small,label=right:\texttt{Y}] {} (9);
|
||||
\path[draw,thick,->] (3) -- node[font=\small,label=right:\texttt{H}] {} (10);
|
||||
\path[draw,thick,->] (10) -- node[font=\small,label=right:\texttt{E}] {} (11);
|
||||
\path[draw,thick,->] (11) -- node[font=\small,label=right:\texttt{R}] {} (12);
|
||||
\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{I}] {} (13);
|
||||
\path[draw,thick,->] (12) -- node[font=\small,label=right:\texttt{E}] {} (13);
|
||||
\end{tikzpicture}
|
||||
\end{center}
|
||||
Merkki * solmussa tarkoittaa,
|
||||
että jokin merkkijono päättyy kyseiseen solmuun.
|
||||
Tämä merkki on tarpeen,
|
||||
koska merkkijono voi olla toisen merkkijonon alkuosa,
|
||||
kuten tässä puussa merkkijono \texttt{SUU}
|
||||
on merkkijonon \texttt{SUURI} alkuosa.
|
||||
The character * in a node means that
|
||||
a string ends at the node.
|
||||
This character is needed because a string
|
||||
may be a prefix of another string.
|
||||
For example, in this trie, \texttt{THE}
|
||||
is a prefix of \texttt{THERE}.
|
||||
|
||||
Triessä merkkijonon lisääminen ja hakeminen
|
||||
vievät aikaa $O(n)$, kun $n$ on merkkijonon pituus.
|
||||
Molemmat operaatiot voi toteuttaa lähtemällä liikkeelle juuresta
|
||||
ja kulkemalla alaspäin ketjua merkkien mukaisesti.
|
||||
Tarvittaessa puuhun lisätään uusia solmuja.
|
||||
Inserting a string into a trie and searching for a string both take $O(n)$ time
|
||||
where $n$ is the length of the string.
|
||||
Both operations can be implemented by
|
||||
starting at the root node and following the
|
||||
chain of characters that appear in the string.
|
||||
If needed, new nodes will be added to the trie.
|
||||
|
||||
Triestä on mahdollista etsiä
|
||||
sekä merkkijonoja että merkkijonojen alkuosia.
|
||||
Lisäksi puun solmuissa voi pitää kirjaa,
|
||||
monessako merkkijonossa on solmua vastaava alkuosa,
|
||||
mikä lisää trien käyttömahdollisuuksia.
|
||||
A trie can be used for searching both strings
|
||||
and prefixes of strings.
|
||||
In addition, we can keep track of the number
|
||||
of strings that have each prefix,
|
||||
which can be useful in some applications.
|
||||
|
||||
Trie on kätevää tallentaa taulukkona
|
||||
A trie can be stored as an array
|
||||
\begin{lstlisting}
|
||||
int t[N][A];
|
||||
\end{lstlisting}
|
||||
missä $N$ on solmujen suurin mahdollinen määrä
|
||||
(eli tallennettavien merkkijonojen yhteispituus)
|
||||
ja $A$ on aakkoston koko.
|
||||
Trien solmut numeroidaan $1,2,3,\ldots$ niin,
|
||||
että juuren numero on 1,
|
||||
ja taulukon kohta $\texttt{t}[s][c]$ kertoo,
|
||||
mihin solmuun solmusta $s$ pääsee merkillä $c$.
|
||||
where $N$ is the maximum number of nodes
|
||||
(the total length of the strings to be stored)
|
||||
and $A$ is the size of the alphabet.
|
||||
The nodes of a trie are numbered
|
||||
$1,2,3,\ldots$ so that the number of the root is 1,
|
||||
and $\texttt{t}[s][c]$ is the next node in the chain
|
||||
from node $s$ using character $c$.
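The array representation can be turned into insertion and search routines along the following lines. This is a minimal sketch under several assumptions that are not in the text: the alphabet is \texttt{A}--\texttt{Z}, the value 0 in \texttt{t} means that a node is missing, and the names \texttt{isEnd}, \texttt{prefixCount}, \texttt{nodeCount}, \texttt{walk}, \texttt{insert}, \texttt{contains} and \texttt{countWithPrefix} are all hypothetical. The array \texttt{isEnd} plays the role of the character * in the figure, and \texttt{prefixCount} keeps track of how many strings have each prefix.

```cpp
#include <string>

const int N = 1000;  // assumed capacity: total length of all strings must be below N-1
const int A = 26;    // alphabet size, assuming only the capital letters A..Z

int t[N][A];         // t[s][c]: node reached from node s with character c, 0 if none
bool isEnd[N];       // hypothetical: true if some stored string ends at the node
int prefixCount[N];  // hypothetical: number of stored strings that pass through the node
int nodeCount = 1;   // node 1 is the root; index 0 is reserved for "no node"

// Follows the chain for s from the root; returns the final node or 0.
int walk(const std::string& s) {
    int node = 1;
    for (char ch : s) {
        node = t[node][ch - 'A'];
        if (node == 0) return 0;
    }
    return node;
}

// Returns true if s has been inserted into the trie.
bool contains(const std::string& s) {
    int node = walk(s);
    return node != 0 && isEnd[node];
}

// Inserts s, creating new nodes when the chain does not exist yet.
void insert(const std::string& s) {
    if (contains(s)) return;  // the trie maintains a set, so skip duplicates
    int node = 1;
    prefixCount[node]++;
    for (char ch : s) {
        int c = ch - 'A';
        if (t[node][c] == 0) t[node][c] = ++nodeCount;
        node = t[node][c];
        prefixCount[node]++;
    }
    isEnd[node] = true;
}

// Returns the number of stored strings that have the prefix p.
int countWithPrefix(const std::string& p) {
    int node = walk(p);
    return node == 0 ? 0 : prefixCount[node];
}
```

After inserting \texttt{CANAL}, \texttt{CANDY}, \texttt{THE} and \texttt{THERE}, the trie has 13 nodes as in the figure, \texttt{contains("THE")} is true, \texttt{contains("THER")} is false, and \texttt{countWithPrefix("CAN")} is 2. Each operation visits one node per character, which matches the $O(n)$ time bound.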
|
||||
|
||||
\section{Merkkijonohajautus}
|
||||
|
||||
|
|