Alignment of Low to Medium Information Content Sequences

home1 home2
 Bib
 Algorithms
 Bioinfo
 FP
 Logic
 MML
 Prog.Lang
and the
 Book

Bioinformatics
 Compression
  +alignment

Also see
 #Alignment
 software
    A    C    G    T
 .-------------------- P(S[i]|S[i-1])
A| 1/12 1/12 1/12 9/12
 |
C| 9/20 1/20 1/20 9/10
 |
G| 9/20 1/20 1/20 9/10
 |
T| 9/12 1/12 1/12 1/12

MMg: an AT-rich, order-1 Markov model.

The two sequences initially in the FORM below are generated from the AT-rich model MMg (right). They are completely unrelated, apart from both being generated from MMg, and thus sharing some general statistical biases.

If the sequences are aligned under the usual uniform model (try `Run'), i.e. under the assumption that all characters are equally likely then you will conclude that they are related by an acceptable optimal alignment, but the high number of matches is only due to their both coming from MMg. If they are aligned under the correct model (here, MMg) or under a model that can be fitted to their biases (e.g. an order-1 Markov model), then it is revealed that they are not related by an acceptable alignment.

You can replace the sequences with others of similar length and have them aligned. The most interesting behaviour comes with sequences that are distantly related and of medium or low information content. In such cases, aligning with knowledge of the true model (of the sequences) or at least with knowledge of the model class, gives more reliable results.

Demonstration program requires 2 seqs. separated by at least one blank line:
S1:
 
 
 
S2:

NB. no leading blank lines!
[precomputed]

NB. It is assumed that the lengths of the sequences are "common knowledge", otherwise the lengths should also be included in the encodings, using your favourite prior for a pair of lengths.
The formula used to estimate the cost of a Hypothesis (H) and its parameters assumes that any sequence is much longer than the number of parameters (12 for an order-1 MM over DNA).
This demonstration is based on the simple point-mutation model of sequence mutation; it is inappropriate to use it if the sequence lengths differ greatly, for example. But note that a similar treatment can be given to linear (affine) gap-costs, piecewise linear gap costs, global alignment, local alignment, optimal aligment, summed alignment, etc. [Powell et al 2004].

Limits

The web-demonstration above is limited to sequences of up to a couple of hundred characters, this is only because it is run on our web-server.

Models

The new alignment method can in general incorporate almost any model of a "sequence" -- see Allison et al (1998, 1999), Powell et al (2004) for details. It is not limited to (hidden) Markov models (HMM).

References

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

Linux
 Ubuntu
free op. sys.
OpenOffice
free office suite
The GIMP
~ free photoshop
Firefox
web browser

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Tuesday, 19-Mar-2024 18:54:01 AEDT.