Edit distance
parent
62ef5d9f93
commit
69458aab08
146
luku07.tex
|
@ -710,85 +710,88 @@ The efficiency of solution 1 depends on the weights
|
||||||
of the objects, while the efficiency of solution 2
|
of the objects, while the efficiency of solution 2
|
||||||
depends on the values of the objects.
|
depends on the values of the objects.
|
||||||
|
|
||||||
\section{Editointietäisyys}
|
\section{Edit distance}
|
||||||
|
|
||||||
\index{editointietxisyys@editointietäisyys}
|
\index{edit distance}
|
||||||
\index{Levenšteinin etäisyys}
|
\index{Levenshtein distance}
|
||||||
|
|
||||||
\key{Editointietäisyys} eli
|
The \key{edit distance},
|
||||||
\key{Levenšteinin etäisyys}
|
also known as the \key{Levenshtein distance},
|
||||||
kuvaa, kuinka kaukana kaksi merkkijonoa ovat toisistaan.
|
measures how far apart two strings are.
|
||||||
Se on pienin määrä editointioperaatioita,
|
It is the minimum number of editing operations
|
||||||
joilla ensimmäisen merkkijonon saa muutettua toiseksi.
|
needed for transforming the first string
|
||||||
Sallitut operaatiot ovat:
|
into the second string.
|
||||||
|
The allowed editing operations are as follows:
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\item merkin lisäys (esim. \texttt{ABC} $\rightarrow$ \texttt{ABCA})
|
\item insert a character (e.g. \texttt{ABC} $\rightarrow$ \texttt{ABCA})
|
||||||
\item merkin poisto (esim. \texttt{ABC} $\rightarrow$ \texttt{AC})
|
\item remove a character (e.g. \texttt{ABC} $\rightarrow$ \texttt{AC})
|
||||||
\item merkin muutos (esim. \texttt{ABC} $\rightarrow$ \texttt{ADC})
|
\item change a character (e.g. \texttt{ABC} $\rightarrow$ \texttt{ADC})
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
Esimerkiksi merkkijonojen \texttt{TALO} ja \texttt{PALLO}
|
For example, the edit distance between
|
||||||
editointietäisyys on 2, koska voimme tehdä ensin
|
\texttt{LOVE} and \texttt{MOVIE} is 2
|
||||||
operaation \texttt{TALO} $\rightarrow$ \texttt{TALLO}
|
because we can first perform the operation
|
||||||
(merkin lisäys) ja sen jälkeen operaation
|
\texttt{LOVE} $\rightarrow$ \texttt{MOVE}
|
||||||
\texttt{TALLO} $\rightarrow$ \texttt{PALLO}
|
(change) and then the operation
|
||||||
(merkin muutos).
|
\texttt{MOVE} $\rightarrow$ \texttt{MOVIE}
|
||||||
Tämä on pienin mahdollinen määrä operaatioita, koska
|
(insertion).
|
||||||
selvästikään yksi operaatio ei riitä.
|
This is the smallest possible number of operations
|
||||||
|
because it is clear that one operation is not enough.
|
||||||
|
|
||||||
Oletetaan, että annettuna on merkkijonot
|
Suppose we are given strings
|
||||||
\texttt{x} (pituus $n$ merkkiä) ja
|
\texttt{x} of $n$ characters and
|
||||||
\texttt{y} (pituus $m$ merkkiä),
|
\texttt{y} of $m$ characters,
|
||||||
ja haluamme laskea niiden editointietäisyyden.
|
and we want to calculate the edit distance
|
||||||
Tämä onnistuu tehokkaasti dynaamisella
|
between them.
|
||||||
ohjelmoinnilla ajassa $O(nm)$.
|
This can be efficiently done using
|
||||||
Merkitään funktiolla $f(a,b)$
|
dynamic programming in $O(nm)$ time.
|
||||||
editointietäisyyttä \texttt{x}:n $a$
|
Let $f(a,b)$ denote the edit distance
|
||||||
ensimmäisen merkin sekä
|
between the first $a$ characters of \texttt{x}
|
||||||
\texttt{y}:n $b$:n ensimmäisen merkin välillä.
|
and the first $b$ characters of \texttt{y}.
|
||||||
Tätä funktiota käyttäen
|
Using this function, the edit distance between
|
||||||
merkkijonojen
|
\texttt{x} and \texttt{y} is $f(n,m)$,
|
||||||
\texttt{x} ja \texttt{y} editointietäisyys
|
and the function also determines
|
||||||
on $f(n,m)$, ja funktio kertoo myös tarvittavat
|
the editing operations needed.
|
||||||
editointioperaatiot.
|
|
||||||
|
|
||||||
Funktion pohjatapaukset ovat
|
The base cases for the function are
|
||||||
\[
|
\[
|
||||||
\begin{array}{lcl}
|
\begin{array}{lcl}
|
||||||
f(0,b) & = & b \\
|
f(0,b) & = & b \\
|
||||||
f(a,0) & = & a \\
|
f(a,0) & = & a \\
|
||||||
\end{array}
|
\end{array}
|
||||||
\]
|
\]
|
||||||
ja yleisessä tapauksessa pätee kaava
|
and in the general case the formula is
|
||||||
\[ f(a,b) = \min(f(a,b-1)+1,f(a-1,b)+1,f(a-1,b-1)+c),\]
|
\[ f(a,b) = \min(f(a,b-1)+1,f(a-1,b)+1,f(a-1,b-1)+c),\]
|
||||||
missä $c=0$, jos \texttt{x}:n merkki $a$
|
where $c=0$ if the $a$th character of \texttt{x}
|
||||||
ja \texttt{y}:n merkki $b$ ovat samat,
|
equals the $b$th character of \texttt{y},
|
||||||
ja muussa tapauksessa $c=1$.
|
and otherwise $c=1$.
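As a small worked illustration of the formula (using the example strings
$\texttt{x}=\texttt{LOVE}$ and $\texttt{y}=\texttt{MOVIE}$ from above, and
relying only on the base cases), the first two diagonal values can be
computed by hand:
\[
\begin{array}{lcl}
f(1,1) & = & \min(f(1,0)+1,\ f(0,1)+1,\ f(0,0)+1) = \min(2,2,1) = 1 \\
f(2,2) & = & \min(f(2,1)+1,\ f(1,2)+1,\ f(1,1)+0) = \min(3,3,1) = 1 \\
\end{array}
\]
In the first line $c=1$ because \texttt{L} and \texttt{M} differ,
and in the second line $c=0$ because both characters are \texttt{O}.
Here $f(2,1)=f(1,2)=2$ follow from the same formula.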
|
||||||
Kaava käy läpi mahdollisuudet lyhentää merkkijonoja:
|
The formula covers all ways to shorten the strings:
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\item $f(a,b-1)$ tarkoittaa, että $x$:ään lisätään merkki
|
\item $f(a,b-1)$ means that a character is inserted into \texttt{x}
|
||||||
\item $f(a-1,b)$ tarkoittaa, että $x$:stä poistetaan merkki
|
\item $f(a-1,b)$ means that a character is removed from \texttt{x}
|
||||||
\item $f(a-1,b-1)$ tarkoittaa, että $x$:ssä ja $y$:ssä on
|
\item $f(a-1,b-1)$ means that \texttt{x} and \texttt{y} contain
|
||||||
sama merkki ($c=0$) tai $x$:n merkki muutetaan $y$:n merkiksi ($c=1$)
|
the same character ($c=0$),
|
||||||
|
or a character in \texttt{x} is transformed into
|
||||||
|
a character in \texttt{y} ($c=1$)
|
||||||
\end{itemize}
|
\end{itemize}
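The recurrence and its base cases can be sketched in code roughly as follows.
This is only an illustrative sketch, not code from the book: the function name
\texttt{edit\_distance} and the table name \texttt{f} are assumptions chosen to
mirror the notation above, and strings are indexed from 0, so the $a$th
character of \texttt{x} is \texttt{x[a - 1]}.

```python
def edit_distance(x: str, y: str) -> int:
    """Return the edit distance between x and y in O(nm) time."""
    n, m = len(x), len(y)

    # f[a][b] = edit distance between the first a characters of x
    # and the first b characters of y.
    f = [[0] * (m + 1) for _ in range(n + 1)]

    # Base cases: f(a,0) = a and f(0,b) = b.
    for a in range(n + 1):
        f[a][0] = a
    for b in range(m + 1):
        f[0][b] = b

    # General case: try insertion, removal, and match/change.
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            c = 0 if x[a - 1] == y[b - 1] else 1
            f[a][b] = min(
                f[a][b - 1] + 1,      # insert a character into x
                f[a - 1][b] + 1,      # remove a character from x
                f[a - 1][b - 1] + c,  # same character (c=0) or change (c=1)
            )

    return f[n][m]


if __name__ == "__main__":
    print(edit_distance("LOVE", "MOVIE"))
```

The sketch stores the whole $(n+1) \times (m+1)$ table, which keeps it close
to the table in the text; keeping only two rows would be enough if just the
distance is needed.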
|
||||||
Seuraava taulukko sisältää funktion $f$ arvot
|
The following table shows the values of $f$
|
||||||
esimerkin tapauksessa:
|
in the example case:
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=.65]
|
\begin{tikzpicture}[scale=.65]
|
||||||
\begin{scope}
|
\begin{scope}
|
||||||
%\fill [color=lightgray] (5, -3) rectangle (6, -4);
|
%\fill [color=lightgray] (5, -3) rectangle (6, -4);
|
||||||
\draw (1, -1) grid (7, -6);
|
\draw (1, -1) grid (7, -6);
|
||||||
|
|
||||||
\node at (0.5,-2.5) {\texttt{T}};
|
\node at (0.5,-2.5) {\texttt{L}};
|
||||||
\node at (0.5,-3.5) {\texttt{A}};
|
\node at (0.5,-3.5) {\texttt{O}};
|
||||||
\node at (0.5,-4.5) {\texttt{L}};
|
\node at (0.5,-4.5) {\texttt{V}};
|
||||||
\node at (0.5,-5.5) {\texttt{O}};
|
\node at (0.5,-5.5) {\texttt{E}};
|
||||||
|
|
||||||
\node at (2.5,-0.5) {\texttt{P}};
|
\node at (2.5,-0.5) {\texttt{M}};
|
||||||
\node at (3.5,-0.5) {\texttt{A}};
|
\node at (3.5,-0.5) {\texttt{O}};
|
||||||
\node at (4.5,-0.5) {\texttt{L}};
|
\node at (4.5,-0.5) {\texttt{V}};
|
||||||
\node at (5.5,-0.5) {\texttt{L}};
|
\node at (5.5,-0.5) {\texttt{I}};
|
||||||
\node at (6.5,-0.5) {\texttt{O}};
|
\node at (6.5,-0.5) {\texttt{E}};
|
||||||
|
|
||||||
\node at (1.5,-1.5) {$0$};
|
\node at (1.5,-1.5) {$0$};
|
||||||
\node at (1.5,-2.5) {$1$};
|
\node at (1.5,-2.5) {$1$};
|
||||||
|
@ -824,29 +827,28 @@ esimerkin tapauksessa:
|
||||||
\end{tikzpicture}
|
\end{tikzpicture}
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
Taulukon oikean alanurkan ruutu
|
The lower-right corner of the table
|
||||||
kertoo, että merkkijonojen \texttt{TALO}
|
indicates that the edit distance between
|
||||||
ja \texttt{PALLO} editointietäisyys on 2.
|
\texttt{LOVE} and \texttt{MOVIE} is 2.
|
||||||
Taulukosta pystyy myös
|
The table also shows how to construct
|
||||||
lukemaan, miten pienimmän editointietäisyyden
|
the shortest sequence of editing operations.
|
||||||
voi saavuttaa.
|
In this case the path is as follows:
|
||||||
Tässä tapauksessa polku on seuraava:
|
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tikzpicture}[scale=.65]
|
\begin{tikzpicture}[scale=.65]
|
||||||
\begin{scope}
|
\begin{scope}
|
||||||
\draw (1, -1) grid (7, -6);
|
\draw (1, -1) grid (7, -6);
|
||||||
|
|
||||||
\node at (0.5,-2.5) {\texttt{T}};
|
\node at (0.5,-2.5) {\texttt{L}};
|
||||||
\node at (0.5,-3.5) {\texttt{A}};
|
\node at (0.5,-3.5) {\texttt{O}};
|
||||||
\node at (0.5,-4.5) {\texttt{L}};
|
\node at (0.5,-4.5) {\texttt{V}};
|
||||||
\node at (0.5,-5.5) {\texttt{O}};
|
\node at (0.5,-5.5) {\texttt{E}};
|
||||||
|
|
||||||
\node at (2.5,-0.5) {\texttt{P}};
|
\node at (2.5,-0.5) {\texttt{M}};
|
||||||
\node at (3.5,-0.5) {\texttt{A}};
|
\node at (3.5,-0.5) {\texttt{O}};
|
||||||
\node at (4.5,-0.5) {\texttt{L}};
|
\node at (4.5,-0.5) {\texttt{V}};
|
||||||
\node at (5.5,-0.5) {\texttt{L}};
|
\node at (5.5,-0.5) {\texttt{I}};
|
||||||
\node at (6.5,-0.5) {\texttt{O}};
|
\node at (6.5,-0.5) {\texttt{E}};
|
||||||
|
|
||||||
\node at (1.5,-1.5) {$0$};
|
\node at (1.5,-1.5) {$0$};
|
||||||
\node at (1.5,-2.5) {$1$};
|
\node at (1.5,-2.5) {$1$};
|
||||||
|
|