e.g., unrelated

P(S[i] | S[i-1]); rows are S[i-1], columns are S[i], and each row sums to 1:

      A     C     G     T
  .------------------------
A | 1/12  1/12  1/12  9/12
C | 9/20  1/20  1/20  9/20
G | 9/20  1/20  1/20  9/20
T | 9/12  1/12  1/12  1/12

MMg: an AT-rich, order-1 Markov model.

S1 and S2 are two unrelated sequences drawn from a population modelled by an order-1 Markov model (MMg, above). (The model is just an example but it is not implausible -- the genome of Plasmodium falciparum is 80% AT, and AT-rich regions appear in other genomes.)
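To make the model concrete, here is a minimal Python sketch of drawing a sequence from MMg and costing a sequence under it. It is an illustration, not the program that produced the output below. The names `MMG`, `sample` and `code_length_bits` are hypothetical, the first base is assumed to be uniform (2 bits), and the C and G rows are assumed to end in 9/20 so that every row sums to 1.

```python
import math
import random
from fractions import Fraction as F

# Transition probabilities P(S[i] | S[i-1]); the outer key is S[i-1].
# Assumption: the C and G rows end in 9/20 so that each row sums to 1.
MMG = {
    "A": {"A": F(1, 12), "C": F(1, 12), "G": F(1, 12), "T": F(9, 12)},
    "C": {"A": F(9, 20), "C": F(1, 20), "G": F(1, 20), "T": F(9, 20)},
    "G": {"A": F(9, 20), "C": F(1, 20), "G": F(1, 20), "T": F(9, 20)},
    "T": {"A": F(9, 12), "C": F(1, 12), "G": F(1, 12), "T": F(1, 12)},
}


def sample(n, rng):
    """Draw n bases from MMG; the first base is uniform (an assumption)."""
    if n <= 0:
        return ""
    seq = [rng.choice("ACGT")]
    while len(seq) < n:
        row = MMG[seq[-1]]
        weights = [float(row[b]) for b in "ACGT"]
        seq.append(rng.choices("ACGT", weights=weights)[0])
    return "".join(seq)


def code_length_bits(seq):
    """-log2 P(seq) under MMG, charging 2 bits for the first base."""
    if not seq:
        return 0.0
    bits = 2.0
    for prev, cur in zip(seq, seq[1:]):
        bits -= math.log2(MMG[prev][cur])
    return bits


if __name__ == "__main__":
    s = sample(100, random.Random(1))
    print(s)
    print(f"{code_length_bits(s):.1f} bits under MMg, vs 200.0 bits uniform")
```

Here the model parameters are known in advance; the program output below instead learns its order-0 and order-1 parameters from the data, so its message lengths also carry a cost for stating them.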

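Assuming a uniform random population, the sequences appear to be related: the alignment hypothesis costs 390 bits against 400 bits for the null (unrelated) hypothesis. Assuming an order-0 or (better) an order-1 population model, whose parameters are learned from the data, they are correctly seen to be unrelated: under the order-1 model the null hypothesis costs 283 bits against 339 bits for the alignment.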
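The "model implies" lines in the output below follow from simple arithmetic on the two message lengths: the difference in bits is the log-odds, base 2, of the two hypotheses. A small sketch of that arithmetic, using the message lengths reported in the output (the function name `log_odds_bits` is hypothetical):

```python
def log_odds_bits(null_bits, aligned_bits):
    """Bits saved by the alignment hypothesis; positive favours 'related'."""
    return null_bits - aligned_bits


# (null, S1~S2) message lengths in bits, copied from the program output.
results = {
    "uniform": (400.0, 390.0),
    "order-0": (339.4, 370.9),
    "order-1": (283.1, 339.2),
}

for name, (null_bits, aligned_bits) in results.items():
    d = log_odds_bits(null_bits, aligned_bits)
    verdict = "related" if d > 0 else "unrelated"
    print(f"{name}: odds of 2^{abs(d):.1f} to 1 in favour of {verdict}")
```

Only the uniform model yields a positive saving, which is why it alone is misled into reporting the sequences as related.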

> Align Compressible Sequences:

> S1:
     1 GCTATAGTAA TGCTATAATG ATATATTATA TATCTATATA TATATTATAT
    51 ATACTAATAT GATAATATAT ATATATATCT ATAGTCATAT CTATATACAT  100

> S2:
     1 GCATGTATAT TATATATATA CTTATGTATG ATTATTATAT ATCATAGACT
    51 ATCATATATT TATAATATAT CACATATATA TGATATACTA TGATATCTAT  100

> Models: 2 x Uniform:
> msgLen  null = 400.0 bits = 200.0{S1} + 200.0{S2} = 2.0000 b/ch
> msgLen S1~S2 = 390.0 bits = 9.7+0.0+0.0{H} + 380.3{S1~S2|H} = 1.9498 b/ch
> GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT
> || || || ||  ||| || ||||| |||| |||   || ||||||| ||
> GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT

> ATA-TACTAATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT-
> | | || | ||||  | ||| |||||| | || |||| | |||| |||| 
> AGACTA-TCATATATTTATA-ATATATCACATATATA-TGATATACTATG

> ATA-C-AT
> ||| | ||
> ATATCTAT

> [Frequencies =:77.0 ~:15.0 i:8.0 d:8.0 tot:108.0]
> model implies  ALIGNMENT:unrelated = 2^10.0 : 1  +/- a pinch of salt
> ---

> Models: 2 x Order-0 Markov:
> msgLen  null = 339.4 bits = 167.4{S1} + 172.0{S2} = 1.6969 b/ch
> msgLen S1~S2 = 370.9 bits = 9.7+9.2+9.0{H} + 343.0{S1~S2|H} = 1.8545 b/ch
> GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT
> $$ || $| ||  ||| || ||||| |||| |||   || ||||||| ||
> GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT

> ATA-TA-CTAATA-TGATAATATATATATATATCTATAGTCATATCTAT-
> | | || $  ||| | |||||||||  | |||| ||| $  ||| $||| 
> AGACTATCATATATTTATAATATAT-CACATATATAT-GATATA-CTATG

> ATA-C-AT
> ||| $ ||
> ATATCTAT

> [Frequencies =:76.0 ~:16.0 i:8.0 d:8.0 tot:108.0]
> model implies  alignment:UNRELATED = 1 : 2^31.5  +/- a pinch of salt
> ---

> Models: 2 x Order-1 Markov:
> msgLen  null = 283.1 bits = 135.6{S1} + 147.4{S2} = 1.4153 b/ch
> msgLen S1~S2 = 339.2 bits = 9.7+26.4+26.1{H} + 277.0{S1~S2|H} = 1.6959 b/ch
> GCTATAGTAATGCTATAATGATATA-TTATATATCTATATATATATTAT-
> $$ || $| ||  ||| || ||||| |$|| |||   ||| |||| $|| 
> GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTAT-TATA-TATC

> ATATACTA--ATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT
> ||| |$||  $|||  | ||| $||||| | || |||| | |||| $|||
> ATAGACTATCATATATTTATA-ATATATCACATATATA-TGATATACTAT

> -ATA-C-AT
>  ||| $ ||
> GATATCTAT

> [Frequencies =:78.0 ~:13.0 i:9.0 d:9.0 tot:109.0]
> model implies  alignment:UNRELATED = 1 : 2^56.1  +/- a pinch of salt
> --- end ---
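The gain from dropping the uniform assumption can be checked roughly by hand. The sketch below computes the AT fraction of S1 and S2 and an order-0 code length from each sequence's own base frequencies. It is an assumption-laden approximation: it ignores the cost of stating the frequencies, so it is not expected to reproduce the program's order-0 figures exactly, and the names `order0_bits` and `at_fraction` are hypothetical.

```python
import math
from collections import Counter

# S1 and S2 as listed above, with spaces and position numbers removed.
S1 = ("GCTATAGTAA" "TGCTATAATG" "ATATATTATA" "TATCTATATA" "TATATTATAT"
      "ATACTAATAT" "GATAATATAT" "ATATATATCT" "ATAGTCATAT" "CTATATACAT")
S2 = ("GCATGTATAT" "TATATATATA" "CTTATGTATG" "ATTATTATAT" "ATCATAGACT"
      "ATCATATATT" "TATAATATAT" "CACATATATA" "TGATATACTA" "TGATATCTAT")


def at_fraction(seq):
    """Fraction of bases that are A or T."""
    return (seq.count("A") + seq.count("T")) / len(seq)


def order0_bits(seq):
    """Code length of seq using its own base frequencies (parameter cost ignored)."""
    n = len(seq)
    return -sum(c * math.log2(c / n) for c in Counter(seq).values())


if __name__ == "__main__":
    for name, seq in (("S1", S1), ("S2", S2)):
        print(f"{name}: AT fraction {at_fraction(seq):.2f}, "
              f"order-0 {order0_bits(seq):.1f} bits vs {2 * len(seq)} bits uniform")
```

Because both sequences are strongly AT-rich, each costs well under 2 bits per character even at order 0, and much of the apparent similarity that the uniform model credits to the alignment is explained by composition alone.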

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Tuesday, 24-Nov-2009 21:03:35 EST.