Alignment of Low to Medium Information Content Sequences

LA home
Computing
 Algorithms
 Bioinformatics
 FP,  λ
 Logic,  π
 MML
 Prog.Langs

Bioinformatics
 Compression
  +alignment

Also see
 #Alignment
 software
    A    C    G    T
 .-------------------- P(S[i]|S[i-1])
A| 1/12 1/12 1/12 9/12
 |
C| 9/20 1/20 1/20 9/10
 |
G| 9/20 1/20 1/20 9/10
 |
T| 9/12 1/12 1/12 1/12

MMg: an AT-rich, order-1 Markov model.

The two sequences initially in the FORM below are generated from the AT-rich model MMg (right). They are completely unrelated, apart from both being generated from MMg, and thus sharing some general statistical biases.

If the sequences are aligned under the usual uniform model (try `Run'), i.e. under the assumption that all characters are equally likely then you will conclude that they are related by an acceptable optimal alignment, but the high number of matches is only due to their both coming from MMg. If they are aligned under the correct model (here, MMg) or under a model that can be fitted to their biases (e.g. an order-1 Markov model), then it is revealed that they are not related by an acceptable alignment.

You can replace the sequences with others of similar length and have them aligned. The most interesting behaviour comes with sequences that are distantly related and of medium or low information content. In such cases, aligning with knowledge of the true model (of the sequences) or at least with knowledge of the model class, gives more reliable results.

Demonstration program requires 2 seqs. separated by at least one blank line:
S1:
 
 
 
S2:

NB. no leading blank lines!
[precomputed]

NB. It is assumed that the lengths of the sequences are "common knowledge", otherwise the lengths should also be included in the encodings, using your favourite prior for a pair of lengths.
The formula used to estimate the cost of a Hypothesis (H) and its parameters assumes that any sequence is much longer than the number of parameters (12 for an order-1 MM over DNA).
This demonstration is based on the simple point-mutation model of sequence mutation; it is inappropriate to use it if the sequence lengths differ greatly, for example. But note that a similar treatment can be given to linear (affine) gap-costs, piecewise linear gap costs, global alignment, local alignment, optimal aligment, summed alignment, etc. [Powell et al 2004].

Limits

The web-demonstration above is limited to sequences of up to a couple of hundred characters, this is only because it is run on our web-server.

Models

The new alignment method can in general incorporate almost any model of a "sequence" -- see Allison et al (1998, 1999), Powell et al (2004) for details. It is not limited to (hidden) Markov models (HMM).

References

window on the wide world:

Computer Science Education Week

Linux
 Ubuntu
free op. sys.
OpenOffice
free office suite,
ver 3.4+

The GIMP
~ free photoshop
Firefox
web browser
FlashBlock
like it says!

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Wednesday, 01-Oct-2014 05:56:33 EST.