Modelling v. Shuffling

David R. Powell, Lloyd Allison & Trevor I. Dix, Computer Science and Software Engineering, Monash University, Australia 3800.[*]

A seminar to the Bioinformatics Group & Department of Computer Science, University of Wales Aberystwyth, B20, 4pm, Wednesday, 11 August 2004.

We Define:
Global and local, optimal and summed, affine (linear) gap-costs (i.e. 3-state), modelling-alignment (M-alignment).
Compare it to:
Standard Smith-Waterman local-alignment with
significance-test based on shuffling and re-alignment.
Results:
Both solve easy problems equally well.
M-alignment performs better on compressible populations, mixed populations, and populations of mixed sequences.
This file is at www.csse.monash.edu.au/~lloyd/Seminars/200408-Align/index.shtml and includes links to extra material.

Some Different Questions on Sequences S1 & S2

1. Globally Related?
I.e. Is (all of) S1 related to (all of) S2?
Sum probabilities of all alignments   (an alignment ~ a sed script).
 
2. Global optimal-alignment:
I.e. How is (all of) S1 best related to (all of) S2?
 
3. Locally Related?
I.e. Is part of S1 related to part of S2?
Sum prob's of all possible ways this can be so.
 
4. Local optimal-alignment:
I.e. How is part of S1 best related to part of S2?

Compression

S1 and S2 are related if S1 tells us something new & useful about S2, and v.v., i.e. if
 
Pr(S1&S2 | unrelated) = Pr(S1) . Pr(S2) < Pr(S1&S2 | related) = Pr(S1) . Pr(S2 | S1&related) = Pr(S2) . Pr(S1 | S2&related)
 
---> logs;  think compression;  use [MML].
 
`Compressible', `non-random', & `low to medium information content' all mean the same thing.
local modeling-alignment pair hidden Markov Model PHMM HMM
Local v. global.
3-state pair hidden Markov Model PHMM HMM or FSA FSMPFSA PFSM for linear/ affine gap costs edit distance
3-State model (machine): Affine (linear) gap-costs, summed- or optimal-alignment

e.g. LCS ~ max;   Edit D' ~ min;   summed ~ Σ probs.


Modelling-, M-Alignment

Characters (bases, amino acids) are not all equal in all contexts, hence model in context (red):

m[0,0] = z
 
m[i,0] = f(m[i-1,0],
 c(a[i],`_')),  i=1..|a|
 
m[0,j] = f(m[0,j-1],
 c(`_',b[j])),  j=1..|b|
 
m[i,j] = g(f(m[i-1,j-1], c(a[i],b[j])),
 f(m[i-1,j ], c(a[i],`_')),
 f(m[i, j-1], c(`_',b[j]))),
 i=1..|a|, j=1..|b|
g = max
f = ×
c(x, y) = Pr(x, y)
z = 1
r
e
a
d
 
o
f
f
l
i
n
e
At m[i,j]:
c(x,x) = Pr(match) ×  Pr(x | a[1..i-1], b[1..j-1])
c(x,y) = Pr(mismatch) ×  Pr(x,y | x<>y, a[1..i-1], b[1..j-1])
c(x,`_') = Pr(delete) ×  Pr(x | a[1..i-1])
c(`_',y) = Pr(insert) ×  Pr(y | b[1..j-1])
generic (global, 1-state) DPA (global, optimal) M-alignment

Generalizes to k-states, local & global, optimal & summed.


Receiver Operator Characteristics (ROC)

ROC curves often used in this area.


Want high coverage (few false -ves) and few errors (few false +ves), i.e. near bottom-right of graph is "good".


Data Generation

  (Printer warning: May lose graph lines if page is shrunk or ¬colour.)
ROC CURVE for Markov based alignment edit distance modeling-alignment algorithm v. Smith Waterman DPA, PRSS, significance
Easiest problem, new method ~ shuffling.
green: PRSS p-value (blue: S-W raw score);
red: (summed) M-align (Markov=1).
Uniform random pop'n (2 bits/base):
All methods should do well, & they do.
(Printer warning: May lose graph lines if page is shrunk or ¬colour.)
ROC CURVE compressible 0-order DNA sequences Markov DPA modeling-alignment compared to Smith Waterman, PRSS, significance
Easy problem, new method ~ shuffling

green: PRSS p-value (blue: S-W raw score);
red: (summed) M-align (Markov=1).

0-order data, biased composition:
PRSS good, M-align best.
(Printer warning: May lose graph lines if page is shrunk or ¬colour.)
ROC CURVE, mixed pop'n, generalized pair hidden machine, Hidden Markov Model, GPHMM HMM PFSA PFSM
Harder problem, new M-alignment method (red) best
green: PRSS p-value (blue: S-W raw score);
red: (summed) M-align (Markov=1).
Mixed pop'n, high entropy 0-order seq's and low entropy 0-order seq's
(Printer warning: May lose graph lines if page is shrunk or ¬colour.)
ROC CURVE, population of mixed sequences, generalised pair hidden Markov modeling-alignment machine GPHMM HMM HFSA HFSM PFSA PFSM
Hard problem, new M-alignment method (red & purple) best
green: PRSS p-value; (blue: S-W raw score);
red: (summed) M-align (Markov=1);
purple: M-align (blended seq' model).
Pop'n of mixed seq's of high (2-bit/base) & low entropy 1st-order regions

Conclusion

Reading:

[*] Like many things out of Computer Science, Monash, this work owes a debt to Chris Wallace (1933 - 7/8/2004).


© D. Powell, L. Allison, T. I. Dix, School of Computer Science and Software Engineering, Monash University, Australia 3800.