Edit Distance

LA home
Computing
 Algorithms
 Bioinformatics
 FP,  λ
 Logic,  π
 MML
 Prog.Langs

FP
 Haskell
  Haskell98

paper

"Lazy-evaluation in a functional language is exploited to make the simple dynamic-programming algorithm for the edit-distance problem run quickly on similar strings: being lazy can be fast." [Inf. Proc. Lett., 43(4), pp207-212, Sept' 1992].

a = acgtacgtacgt
b = acatacttgtact

a = acgtac  gtacgt
    || |||  |||| |
b = acatacttgtac t
      ^   ^^    ^
      |   ||    |
      |   ||    delete
      |   ||
      |   insert*2
      |
      change

d a b = 4

The algorithm below organises the edit distance "matrix" by diagonals, not by rows or columns. Its time-complexity is O(n×d) where n is the length of the sequences and d is the edit distance between them, i.e. it is fast if the sequences are similar, d<<n. The original program was in Lazy ML; it is given in Haskell 98 here:


module Edit_IPL_V43_N4 (d) where
-- compute the edit distance of sequences a and b.
d a b =
 let 
   -- diagonal from the top-left element
   mainDiag = oneDiag  a b (head uppers) ( -1 : (head lowers))

   -- diagonals above the mainDiag
   uppers   = eachDiag a b (mainDiag : uppers)

   -- diagonals below the mainDiag
   lowers   = eachDiag b a (mainDiag : lowers)  -- !

   oneDiag a b diagAbove diagBelow =  -- \   \   \
    let                               --  \   \   \
      doDiag [] b nw n w = []         --   \  nw   n
      doDiag a [] nw n w = []         --    \   \
      doDiag (a:as) (b:bs) nw n w =   --      w   me
       let me = if a==b then nw       -- dynamic programming DPA
                else 1+min3 (head w) nw (head n)
       in  me : (doDiag as bs me (tail n) (tail w))

      firstelt = 1+(head diagBelow)
      thisdiag =
        firstelt:(doDiag a b firstelt diagAbove (tail diagBelow))

    in thisdiag

   min3 x y z =
   -- see L. Allison, Lazy Dynamic-Programming can be Eager
   --     Inf. Proc. Letters 43(4) pp207-212, Sept' 1992
     if x < y then x else min y z   -- makes it O(|a|*D(a,b))
     -- min x (min y z)             -- makes it O(|a|*|b|)
   -- the fast one does not always evaluate all three values.

   eachDiag a [] diags = []
   eachDiag a (b:bs) (lastDiag:diags) =
    let nextDiag = head(tail diags)
    in  (oneDiag a bs nextDiag lastDiag):(eachDiag a bs diags)

   -- which is the diagonal containing the bottom R.H. elt?
   lab = (length a) - (length b)

 in last( if      lab == 0 then mainDiag
          else if lab > 0 then lowers !! ( lab-1)
          else                 uppers !! (-lab-1) )

 -- module under Gnu `copyleft' GPL General Public Licence.

The code above calculates the value of the edit distance between the sequences a and b only. The "matrix" does contain enough information to recover an alignment of a and b that achieves this value, but this is left as an exercise.

dynamic programming algorithm DPA in O(n*d) time
-- Evaluating O(n×d) entries --
window on the wide world:

Computer Science Education Week

Linux
 Ubuntu
free op. sys.
OpenOffice
free office suite,
ver 3.4+

The GIMP
~ free photoshop
Firefox
web browser
FlashBlock
like it says!

Haskell:
(:) cons
[x1,...] list
[ ]list
(++) append
\ λ :-)
:: has type
Compared

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Friday, 18-Apr-2014 10:32:42 EST.