[2nd>info.>]

Variations on Sequence Alignment.

  1. basic DPA,
  2. generic DPA,
  3. information theory interpretation,
  4. alphabet,
  5. mutation costs,
  6. saving space-1,
  7. checkpointing, saving space-2,
  8. bit-vectors, LCS 2-seqs., fast, e.g., ×32,
  9. O(n.d)-time, edit-distance 2-seqs., fast,
  10. Substring problem, with wild-card, fast,
  11. multiple alignment,
  12. 3-sequences, fast & small,
  13. modelling alignment, for non-random sequences, and
  14. references.
Lloyd Allison
www.csse.monash.edu.au/~lloyd/Seminars/2007-DPA/index.shtml  
Basic DPA:
 
s1:
s2:
[dpa-basic.js] -- global alignment, (optimal), minimise simple {0, 1} costs, ascii, 2-sequences.
 
some variations
question objective α  
local-al.
global-al.
(optimal) min-
max-
cost
score
int
real
DNA
A.A.
Struct
# seqs
related? sum probability
. . .
 
A global (local) alignment is a hypothesis about how all (part) of one seq. is related to all (part) of another seq..
 
We can ask how probable is such a hypothesis?
 
Can add all their probabilities.
 
Generic, 1-state, 2-sequence, dynamic programming algorithm (DPA)
 
Boundary conditions:
m[0, 0] = identity value of g(,),
m[i, 0]  = g( m[i-1, 0], c( s1[i], '-' ) ),
m[0, j]  = g( m[0, j-1], c( '-', s2[j] ) ),
 
General step:
m[i, j] = f( g( m[i-1, j-1], c( s1[i], s2[j] ) ),
  g( m[i-1, j], c( s1[i], '-' ) ),
  g( m[i, j-1], c( '-', s2[j] ) ) )
 
1 ≤ i ≤ |s1|,   1 ≤ j ≤ |s2|
. . .
the DPA's general step,   <i, j>th cell.
. . . instantiations
 
Longest Common Subsequence (LCS)
f = max
g = +
c( x, x ) = 1,
c( x, y ) = c( '-', x ) = c( x, '-' ) = 0
 
Levenshtein "edit-distance" [Lev66]
f = min
g = +
c( x, x ) = 0,
c( x, y ) = c( '-', x ) = c( x, '-' ) = 1
 

Maximising good (matches) is not quite equivalent to minimising bad (changes).

 
. . . instantiations, an observation
 
the rank-order of (min,+) alignments is unchanged by:
 
multiplying costs, c(.,.), by a +ve constant ( - ve const flips min/max), or by
 
adding a constant, per character, to each cost
(note indel ~ 1 char., match/mismatch ~ 2 chars.)
 

Using this fact, the equivalent indel- and change-probabilities can be recovered from ad hoc scores or cost penalties [All93].

 
. . . instantiations
 
Most probable alignment
f = max
g = ×
c( x, x ) = pr( x, x ),
c( x, y ) = pr( x, y ),
c( '-', x ) = pr( '-', x ), c( x, '-' ) = pr( x, '-' )
or
f = min
g = +
c( x, x ) = - log pr( x, x ),
c( x, y ) = - log pr( x, y ),
c( '-', x ) = - log pr( '-', x ), c( x, '-' ) = - log pr( x, '-' )
. . . instantiations
 
sum of probabilities
f = +
g = ×
c( x, x ) = pr( x, x ),
c( x, y ) = pr( x, x ),
c( '-', x ) = pr( '-', x ), c( x, '-' ) = pr( x, '-' )
or
f = logPlus
g = +
c( x, x ) = - log pr( x, x ),
c( x, y ) = - log pr( x, x ),
c( '-', x ) = - log pr( '-', x ), c( x, '-' ) = - log pr( x, '-' )
. . ., e.g., Alignment probability density plot
 
. . . alignment probability
 
 
Estimates of pr(=), pr(≠), pr(insert), pr(delete) from an optimal alignment are biased, but
 
estimates from sum-of-all alignments are not [YA93]
 
(and the latter works for v. distantly related seqs.).
 
An information theory interpretation
 
Now, - log2(1/4) = 2, so
a DNA base is worth about 2 bits, usually.
 
Two unrelated DNA seqs. take ~ 2-bits/base,
two related seqs. take 1 to 2-bits/base, &
two identical seqs. take ~ 1-bit/base to encode together.
This gives a hypothesis test [AWY92].
 
In Plasmodium falciparum
pr(A) = pr(T) = 0.4,
- log2(0.4) = 1+ bits,
pr(C) = pr(G) = 0.1,
- log2(0.1) = 3+ bits.
 
(Note, for protein, - log2(1/20) ~ 4.3-bits.)
. . . information
 
An alignment is a hypothesis about how 2 seqs. are related, if they are related.
 
It answers the question, "how did the seqs. evolve, if they are related?"
This is more detailed than the Q., "are the seqs. related?"
 
The sum-of-probabilities of all alignments answers the latter.
 
NB. In the "grey zone", 2 distant seqs. may be related even if no one alignment is an acceptable hypothesis.
 
. . . information and local alignment
s1 = α β γ
s2 = δ β' ε,   α γ δ ε about 2bits/base (DNA),
β ~ β',     β+β' about 1bit/base if v.similar.
Must state lengths |α|, |β|, etc..
. . . information and sub-sequence matching
s2 = α s1' β,   α β about 2bits/base (DNA),
s1 ~ s1',   s1+s1' about 1bit/base if v.similar.
. . . information and overlaps
s1 = α β
s2 = γ α',   β γ about 2bits/base (DNA),
α ~ α',       α+α' about 1bit/base if v.similar.
. . . information and parameters
 
A 5-state mutation model [AWY92] is more complex than a 3-state model which is more complex than a 1-state model [BT86].
 
All parameter values are part of any hypothesis that two, or more, sequences are related.
 
They must be paid for to prevent overfitting. [MML] shows how to do this.
. . . information and lengths
 
Must state all lengths that are part of any hypothesis, unless they are common knowledge --
many plausible prob. distributions for such [integer≥0] values.
 
Null hypothesis (seqs. unrelated):
State |s1| and |s2|, or
equivalently |s1|+|s2| and |s1|-|s2|,
the last is expected to be small +ve or -ve, mean 0, (e.g., can use a code based on the binomial).
 
Global alignment (hence seqs. related):
State |alignment|     (typically > |s1|, |s2|).
 
Local alignment, sub-sequence, and overlap problems have more lengths to state.
 
Alphabet
 
Small alphabets, DNA, RNA,
many "chance" symbol matches, e.g., [Dek83] found lower & upper bounds of 0.5454 & 0.7181 on the (fractional) |LCS(s1,s2)| for 2 unrelated seqs. and an alphabet of size 4.
Can reduce this effect by "k-tupling" symbols of a sequence, e.g., [WL83], also see [HS77] and [Epp92].
 
Amino acids (20)
skew & bio-chem. properties, a 20×20 substitution matrix is necessary.
Because of "similar" amino acids, k × = may be too strong a condition.
 
Unbounded alphabets, ints, reals
 
3-D structure . . .
 
. . . alphabet, 3-D structure
 
An orientation-independent representation of structure helps:
 
  1. Protein tertiary structure as set of points [NW91], O(|s|4)-time, although can reduce to O(|s|3) with some loss of info.,
    adapts geometric hashing of sets of points.
     
  2. Protein tertiary structure as sequence of points [TO89],
    DPA, need a   c(s1[i], s2[j]) =
    (i)  ∑m=-n,+n f(|s1[i]-s1[i+m]|, |s2[j]-s2[j+m]|),
    takes O(|s1|.|s2|.n)-time,
    (ii) DPAi',j'(f(|s1[i]-s1[i']|, |s2[j]-s2[j']|))   --!
    the straightforward implementation takes O(|s1|2.|s2|2)-time.
 
Mutation costs
 
Integers,
particularly small integers as in LCS, edit-distance,
allow bit-vector techniques [AD86], and
a "contour" representation [Ukk83],
-> fast algorithms for two or more sequences.
 
Reals,
particularly ( - log) probabilities which allow sum over all alignments.
 
Indels with "memory",
linear (affine) indel (gap-) costs [Got82] ~ 3-state automaton, and
piece-wise linear costs ~ 5-state automaton [AWY92].
 
O( |s1| . |s2| . states2 )-time.
 
((P)FSA sometimes called HMM.)
. . . mutation costs, indels with "memory"
3-state HMM machine or automaton or model of mutation or sequence generation
3-state model (automaton, machine), horizontal, diagonal, vertical travel.
 
O( |s1| . |s2| . states2 )-time.
 
Saving space
s1:
s2:
al:
tr:
[dpa-dc.js], after [Hir75].
. . . saving space
 
If the sequences are "small", solve directly.
 
Otherwise, divide s1 in half.
 
Compute s1A:S2 forward and s1B:s2 in reverse, and
 
find cheapest corresponding split of s2.
 
Solve s1A : s2A and s1B : s2B recursively.
. . . saving space
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
O(|s2|)-space  for a small, fixed # of "rows", and
 
|s1|×|s2|×(1+1/2+1/4+...) ~ 2×|s1|×|s2| time, i.e. remains O(|s1|×|s2|)-time.
 
Checkpointing: Saving space another way [PAD99].
 
Run algorithm forward (using minimum space) to completion,
save a checkpoint (row|plane) when alg. reaches half-way stage.
 
Augment each "cell" to store optimal crossing location of the checkpoint.
 
Recursively divide and conquer by,
(i) running alg. from start to crossing location, and
(ii) from crossing location to completion.
 
Reduces space to O(|s|k-1) for k≥2 sequences.
Can trade space for time by saving >1 checkpoint.
Much easier programming than [Hir75] for complex multi-state models.
Can readily apply to the U[,] matrix of fast [Ukk82]-style algorithms & even [3 seqs fast].
 
Bit-vectors, 2 sequences, LCS, fast [AD86].
 
bit string or vector representation of LCSS problem
NB. origin is bottom-right.
. . . bit-vectors
 
A row is computed from the previous row using operations on |s2|-bit   l o n g - i n t e g e r s.
 
A `1' marks an increase in |LCS| along the row, from the right.
 
Potential to use vector operations on a suitable computer architecture (inc. graphics cards).
 
Relies on {0, 1}-costs, LCS above, and can be applied to edit-distance.
 
Time-complexity remains O(|s1|.|s2|) but
 
the speed-up factor ~ word-length, e.g., 32× or 64×.
 
2 sequences, edit-distance, fast [Ukk82]
. . . 2 seqs, fast
 
Note, contours in D[,] do not cross.
 
U[,] gives positions of contours in D[,].
 
The two representations, U[,] and D[,] are equivalent;
 
can use one or the other, so do not need D[,].
 
Relies on small integer costs, here {0, 1}.
 
The occupied area of U[,] is ~ d2, small if s1 close to s2.
. . . 2 seqs, fast
 
base case:
U[0, 0] = 0
 
general case:
U[dg, c] = max(
U[dg+1, c-1],
U[dg, c-1] + 1,
U[dg-1, c-1] + 1 ) ;
while S1[ U[dg, c] ] = S2[ U[dg, c] - dg ] do
U[ dg, c ] ++
end_while
 
Iterate over diagonal `dg' within iterating over cost `c'.
NB. |dg| ≤ c.
Watch out for boundary conditions.
Edit distance = min c such that U[|S1|-|S2|, c] = |S1|.
. . . 2 seqs, fast
 
O( |s|×d )-time worst-case, but
O( |s| + d2 )-time on average!
 
Fast if the sequences similar, d<<|s|.
 
O(d)-space if only need d,
O(d2) if need the alignment, but...
 
can also use [PAD99]'s checkpointing technique to reduce the space to O(d) and still get the alignment.
 
Substring problem with wild-card, fast [FP74]
 
recall "school" long-multiplication, e.g.,
    1 2 3
x   4 5 6
---------
4 9 2
  6 1 5
+   7 3 8
---------
5 6 0 8 8
---------
 
O(n2)-time to × n-digit numbers.
In fact can do integer × in O(n . log(n) . log(log(n)))-time [SS71].
 
Similar ideas, but replacing (×,+) with (=,+), solve substring problem in O(log|alphabet|.n.log(n)2.(log(log(n)))-time [FP74].
 
Multiple Alignment of k sequences,  k≥3.
 
All pairs:
Alignment {cost | score} is summed over all 2-sequence projections.
 
Star based:
Alignment {cost | score} is summed from the central hypothetical "ancestor" to each given sequence.
 
Tree based:
Alignment {cost | score} is summed over edges of {given | inferred} tree. Given sequences are leaves of tree. Internal nodes ~ hypothetical ancestors. (Probabilities are summed over all possible values at internal nodes.)
. . .
 
O( |s|k . states2 )-time in general.

NB. k=3 is a useful special case -- each internal node in a phylogenetic-tree has 3 neighbours.

. . . multiple, complexity of,
 
|s|k = 1,000,000,000, say,
 
k

2
3
4
6
   |s|

32,000
1,000
180
32
 
The brute-force algorithm is impractical unless k is small, or |s| is small, or both.
 
2-way alignments may somewhat limit the k-dimensional space to be searched[AL89] but very quickly heuristics and/or stochastic methods are required.
. . . multiple, tree-based multiple alignment is based on evolution,
 
internal sequences are unknown, hypothetical, but
pr(symbols) can be estimated, e.g., fig.4 [AW94]
 A                               A 
  .                             .
   .                           .
    . A:0.43           A:0.40 .
     .C:0.56           C:0.59.
      +---------------------+
     .                       .
    .                         .
   .                           .
  .                             .
 C                               C
for certain
[edge parameters]
Probable hypothetical characters. Tuple ACAC's probability = 0.00027   (cost= - log...)
(Semantics of "gaps" require careful thought.)
. . . multiple, tree based,
 
a phylogenetic tree and a multiple alignment are a chicken and an egg:
 
Can search for a tree given a multiple alignment (hard), or
 
can search for a multiple alignment given a tree (hard), or
 
in principle, can search for both together (v.hard).
 
3 sequences, linear costs (ints), fast [PAD00].
 
Tree and star alignment are equivalent for k=3 seqs..
Linear gap costs == 3-state machines.
3 machines,   mi:ancestor->si.
Their composition has 27 = 33 states
(aka "algorithm of doom".-).
 
Time and space complexities would be O(|s1|.|s2|.|s3|) naively but using small integer costs and contours
time becomes O(|s|.d2) in the worst-case and only O(|s|+d3) on average and,
with checkpointing, space is only O(d2).
 
Can use, iterated, for tree-based multiple alignment of k>3 seqs..
 
Non-random (compressible) sequences [PAD04].
 
Non-random <=> compressible (<2 bits DNA, <4.2 bits protein), e.g., Plasmodium falciparum ~ 1.6 bits/base,
 
causes false +ve matches.
 
Standard fix:
  1. align[*], get cost or score,
  2. repeat: shuffle, align, get cost or score,
  3. if #1 ~ #2, match due to bias only, reject.
[*]based on a false assumption!
Better solution [PAD04]:
Build a population-model into the DPA.
Changes rank-order of alignments[+], as it should.
Gives a natural hypothesis test.
Can do this for almost any population-model, and for local- & global-alignment, 1-, 3- 5-state mutation models, and optimal & sum-of-probabilities.
[+]false assumption removed.
. . . non-random seqs, the standard fix, e.g., PRSS (FASTA package) from NIH:
 
  1. align sequences, get cost or score,
  2. repeat: shuffle, align, get cost or score,
  3. if #1 ~ #2, match due to non-randomness only, reject.
 
Step #1 gets an "optimal" alignment under the false(!) assumption that the sequences are uniform random.
 
Step #2, shuffling preserves 0-order stats. of sequences
(can modify to preserve 1-order or even codon stats., but hard to see how to preserve stats. of an arbitrary population).
 
Step #3 may reject a bad alignment but was there a better one assuming non-randomness? (exactly why we were concerned)!
. . .
. . . non-random seqs, actually ∃ another standard fix:
 
mask-out (delete) low-information subsequences, which
 
assumes they contain zero information (not true), and
 
may leave nothing at all!
 
 
 
. . .
. . . non-random seqs, Modelling-alignment (M-alignment) [PAD04] the better solution:
 
Note that not all symbols are equal, and
not all instances of a symbol are equal in all contexts.
 
A compression algorithm gives a code-length for a symbol, e.g.,
C or G worth 3 × A or T in P.falciparum.
 
So modify the DPA to account for the code-lengths of symbols in context.
 
Results in
  1. rank-order of ||s may change,
  2. built-in hypothesis test,
  3. fewer false +ves and fewer false  -ves.
. . .
. . . M-alignment, an [example]
 
draw two unrelated seqs from
 
    A    C    G    T
 .-------------------- P(S[i]|S[i-1])
A| 1/12 1/12 1/12 9/12
 |
C| 9/20 1/20 1/20 9/10
 |
G| 9/20 1/20 1/20 9/10
 |
T| 9/12 1/12 1/12 1/12
 
-- AT-rich, and tending to alternate A/T
 
. . .
. . . M-alignment, an example
 
assuming "uniform random", they appear related
> GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT
> || || || ||  ||| || ||||| |||| |||   || ||||||| ||
> GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT

> ATA-TACTAATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT-
> | | || | ||||  | ||| |||||| | || |||| | |||| ||||
> AGACTA-TCATATATTTATA-ATATATCACATATATA-TGATATACTATG

> ATA-C-AT
> ||| | ||
> ATATCTAT
null 400 bits;   opt.align 390 bits
 
but with an order-1 population model:
> GCTATAGTAATGCTATAATGATATA-TTATATATCTATATATATATTAT-
> $$ || $| ||  ||| || ||||| |$|| |||   ||| |||| $||
> GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTAT-TATA-TATC

> ATATACTA--ATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT
> ||| |$||  $|||  | ||| $||||| | || |||| | |||| $|||
> ATAGACTATCATATATTTATA-ATATATCACATATATA-TGATATACTAT

> -ATA-C-AT
>  ||| $ ||
> GATATCTAT
null: 283.1* bits;   opt.align 339.2 bits
are correctly seen to be unrelated*.
. . . M-alignment,
 
consider 
pr(insert) &
pr(s2[j] | s2[1..j-1]);
pr(delete) &
pr(s1[i]|s1[1..i-1]);
pr(match) &
pr(s1[i]|s1[1..i-1]) & pr(s1[i]|s2[1..j-1]);
pr(change) &
pr(s1[i]|s1[1..i-1]) & pr(s2[j]|s2[1..j-1]), s1[i]≠s2[j].
. . .
. . . M-alignment
 
weight {indel, change, match} as you prefer, but for the symbol(s) involved . . .
 
for insert & delete, one symbol involved,
weight symbol according to code-length given its preceding seq.,
 
for change, two different symbols,
weight each symbol according to code-length given its preceding seq.,
 
for match, two instances of one symbol,
blend weights for symbol considering | s1[1..i-1] and | s2[1..j-1].
. . .
. . . M-alignment
 
the above decouples {copy,change,ins,del} from pr(s1[i] | s1[1..i-1]) and pr(s2[i] | s2[1..i-1])
 
-- might consider it a simplifying approximation but
 
-- reduces algorithmic complexity (want to keep |states| small).
 
Time complexity is max of compression alg.'s, and the DPA's. Usually the latter dominates. . . .
 
. . .
. . . M-alignment
 
Above idea was implemented [PAD04] for
  1. global and local matches,
  2. optimal-alignment and sum-of-all-probabilities,
  3. linear gap costs (3-state mutation model),
  4. various population models, (pr(s[i]|s[1..i-1]), including low order Markov models and mixtures of models.
 
It was compared to PRSS (FASTA package) on
  1. artificial data, and
  2. real data from [P.falciparum].
 
Modelling-alignment method performed better as shown by ROC curves . . .
(Printer warning: May lose graph lines if page is shrunk or ¬colour.)
. . . M-alignment (non-random seqs,  although 1st test is random seqs.)
ROC CURVE for Markov based alignment edit distance model algorithm v. Smith Waterman DPA, PRSS, significance
Easiest problem, new method ~ shuffling (bottom-right is "good")
green: PRSS p-value (blue: S-W raw score);
red: (summed) M-align (Markov=1).
Uniform random pop'n (2 bits/base):
All methods should do well, & they do.
. . . M-alignment (non-random seqs)
ROC CURVE compressible 0-order DNA sequences Markov DPA alignment compared to Smith Waterman, PRSS, significance
Easy problem, new method ~ shuffling (bottom-right is "good")

green: PRSS p-value (blue: S-W raw score);
red: (summed) M-align (Markov=1).

0-order data, biased composition:
PRSS good, M-align best.
. . . M-alignment (non-random seqs)
ROC CURVE, mixed pop'n, generalized pair hidden machine, Hidden Markov Model, GPHMM HMM PFSA PFSM
Harder problem, M-alignment method (red) is best (bottom-right is "good")
green: PRSS p-value (blue: S-W raw score);
red: (summed) M-align (Markov=1).
Mixed pop'n, high entropy 0-order & low entropy 0-order seq's.
. . . M-alignment (non-random seqs)
ROC CURVE, population of mixed sequences, generalised pair hidden Markov model alignment machine GPHMM HMM HFSA HFSM PFSA PFSM
Hard problem, M-alignment method (red & purple) is best (bottom-right is "good")
green: PRSS p-value; (blue: S-W raw score);
red: (summed) M-align (Markov=1);
purple: M-align (blended seq' model).
Pop'n of mixed seq's: high (2-bit/base) &
low entropy 1st-order regions.
 

Thanks to Trevor Dix, David Powell, Chris Wallace, and Chut N. Yee.

References
 
[All93] L. Allison, Normalisation of affine gap costs used in optimal sequence alignment.
J. Theor. Biol., 161(2), pp.263-269, March 1993, doi:10.1006/jtbi.1993.1054, [more].
Recover the equivalent probabilities of mutations, {change, insert, delete}, from ad hoc costs or scores.
[AD86] L. Allison, T. I. Dix, A bit-string longest-common-subsequence algorithm.
Inf. Proc. Lett., 23, pp.305-310, 1986, doi:10.1016/0020-0190(86)90091-8, [html] & [C-code].
Uses bit-vector techniques to get a speed-up proportional to the computer word-length, e.g., 32× or 64×.
[APD99] L. Allison, D. Powell, T. I. Dix, Compression and approximate matching,
Computer Journal, 42(1), pp.1-10, 1999, doi:10.1093/comjnl/42.1.1 [more]
Introduced idea of modelling alignment (here of a 1-state edit machine); also see [PAD04] which generalized the idea.
[AW94] L. Allison, C. S. Wallace, The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimisation of multiple alignments.
J. Molec. Evol., 39(4), pp.418-430, 1994, doi:10.1007/BF00160274, [code].
Samples alignments from their posterior probability distribution. Firstly applied to the estimation of the edges of a given evolutionary tree over several sequences. Secondly, used in conjunction with simulated annealing, it gives a stochastic search method for an optimal multiple alignment.
[AWY92] L. Allison, C. S. Wallace, C. N. Yee, Finite-state models in the alignment of macro-molecules.
J. Mol. Evol., 35(1), pp.77-89, 1992, doi:10.1007/BF00160262, [.ps].
Sum probabilities of all alignments for 1-, 3-, and 5-state FSAs (HMMs); alignment density plots; parameter costs by MML prevent overfitting.
[AL89] S. F. Altschul, D. J. Lipman, Trees, stars and multiple biological sequence alignment.
SIAM J. Appl. Math., 49(1), pp.197-209, Feb. 1989.
Uses pairwise alignments to place an upper bound on the projections of the optimal multi-alignment onto each pair thus restricting the volume in the k-dimensional lattice that need be searched for that optimal k-way alignment.
 
more . . .
 
. . . references
 
[Bel57] R. E. Bellman, Dynamic Programming.
Princeton University Press, 1957.
D.P. in general, not for bioinformatics, e.g., the DP paradigm has been used for shortest-paths and minimum spanning tree problems in graphs, segmentation of time-series, polygon fitting, and optimal text layout.
[BT86] M. J. Bishop, E. A. Thompson, Maximum likelihood alignment of DNA sequences.
J. Mol. Biol., 190, pp.159-165, 1986.
Probability based, 1-state.
[Dek83] J. Deken, Probabilistic behaviour of longest-common-subsequence length.
In Time Warps, String Edits and Macromolecules, Addison-Wesley, pp.359-362, 1983.
Analysed effect of alphabet size on chance symbol matches between two sequences.
[Epp92] D. Eppstein, Z. Galil, R. Giancarlo, G. F. Italiano. Sparse dynamic programming I: linear cost functions.
Jrnl A.C.M., 39(3), pp.519-545, July 1992, doi:10.1145/146637.146650.
Includes alignment in O(|s1|+|s2|+r.log(log min(r,|s1|.|s2|/r)))-time, where r≤|s1|.|s2| is # of matching fragments found (and log always +ve). Fast if r is small. Also see [HS77].
Also part II: convex and concave cost functions, pp.546-567.
[FP74] M. J. Fischer, M. S. Paterson, String matching and other products.
MIT, technical report MAC TM 41, 1974, pdf@mit.
Sub-string matching with a don't care (wildcard) symbol, O(log(|alphabet|).n.log(n)2.log(log(n)))-time, i.e. nearly linear. Related to fast integer multiplication.
[Got82] O. Gotoh, An improved algorithm for matching biological sequences.
J. Mol. Biol., 162, pp.705-708, 1982.
Linear "gap costs".
[Hir75] D. S. Hirschberg, A linear space algorithm for computing maximal common subsequences.
Comm. ACM, 18(6), pp.341-343, 1975, doi:10.1145/360825.360861
Reduces space to linear in length of one sequence; framed for LCS but can be applied to edit distance.
[Hir77] D. S. Hirschberg, Algorithms for the longest common subsequence problem.
Jrnl A.C.M., 24(4), pp.664-675, 1977, doi:10.1145/322033.322044.
Fast algorithms for certain situations: (i) O(|LCS|.|s|+|s|.log|s|)-time, fast of |LCS|<<|s|. (ii) O(|LCS|.(|s1|+1-|LCS|).log(|s2|))-time.
 
more . . .
 
. . . references
 
[HS77] J. Hunt, T. Szymanski, A fast algorithm for computing longest common subsequences.
Comm. A.C.M., 20(5), pp.350-353, 1977, doi:10.1145/359581.359603.
O((r+|s|)log|s|)-time where r = # of matching pairs from s1 & s2, fast if r<<|si|, often so for alarge alphabet. Also see [Epp92].
[Lev66] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals.
Soviet Physics Doklady. 10(8), pp.707-710, Feb. 1966.
The original edit-distance.
[MP83] W. J. Masek, M. S. Paterson, How to compute string-edit distances quickly.
In Time Warps, String Edits and Macromolecules,
Addison Wesley, pp.337-349, 1983.
For a finite alphabet, O(n*n/log(n))-time alg., beats O(n2) if n>200,000. Of theoretical interest only.
[NW91] R. Nussinow, H. J. Wolfson, Efficient detection of three-dimensional structural motifs in biological macromolecules by computer vision techniques.
Proc. Natl. Acad. Sci. USA, 88, pp.10495-10499, Dec. 1991.
Related to geometric hashing: Any set of 3 residues defines a coordinate system (CS). For each candidate seq., for each such CS, hash (bucket) every residue. Count the hash "votes."
[PAD99] D. R. Powell, L. Allison, T. I. Dix, A versatile divide and conquer technique for optimal string alignment.
Inf. Proc. Lett., 70(3), pp.127-139, 1999, doi:10.1016/S0020-0190(99)00053-8, [more].
Reduces space complexity to linear in string length for 2 sequences, quadratic for 3 sequences, etc., easy to use with complex indel/gap-costs. Can also be used with fast [Ukk82]-style algorithms.
[PAD00] D. R. Powell, L. Allison, T. I. Dix, Fast, optimal alignment of three sequences using linear gap costs.
J. Theor. Biol., 207(3), pp.325-336, Dec. 2000, doi:10.1006/jtbi.2000.2177, [code].
O(n.d2)-time worst case, O(n+d3) on average. Can use the algorithm iterated -- for tree-based multiple-alignment of more than 3 sequences.
 
more . . .
 
. . . references
 
[PAD04] D. R. Powell, L. Allison, T. I. Dix, Modelling alignment for non-random sequences.
Springer, LNCS 3339, pp.203-214, 2004, [pdf@springer], [code].
Aligns non-random sequences, builds seq. model into DPA, global and local-alignment, optimal and sum of probabilities, inc. linear gap costs. (Also see [APD99].)
[TO89] W. R. Taylor, C. A. Orengo, Protein structure alignment.
J. Mol. Bio., 208(1), pp.1-22, July 1989, doi:10.1016/0022-2836(89)90084-3.
Compares protein structures based on distance plots.
[Ukk83] E. Ukkonen, On approximate string matching.
Proc. Int. Conf. on Foundations of Computation Theory, LNCS 158, pp.487-495, Aug. 1983, doi:10.1007/3-540-12689-9_129.
Edit dist., fast, O(d×n)-time worst case, O(n+d2) on average.
[WL83] W. J. Wilbur, D. J. Lipman, Rapid similarity searches of nucleic acid and protein banks.
Proc. Natl, Acad. Sci. USA, 80, pp.726-730, Feb. 1983, @pnas
Makes k-tuples whuch increases the effective alphabet size by a power of k. This gives a small number of cross-sequence matches, so a faster algorithm (~Hirschberg?) can be used. Tupling, aka "word", "k-mers" etc., later used in BLAST and other such programs.
[YA93] C. N. Yee, L. Allison, Reconstruction of strings past.
J. Bioinformatics (was Comp. Appl. BioSci, CABIOS), 9(1), pp.1-7, Feb. 1993, [more].
Use of single optimal alignment gives biased estimates of the evolutionary "distance" between two strings but the r-theory, average over all alignments, recovers accurate estimates over a very wide range of similarity.

© Lloyd Allison Faculty of Information Technology (Clayton), Monash University, Australia 3800.
Created with "vi (Linux & Solaris)",   charset=iso-8859-1