# DNA Sequence Compression Example

```
> Compress: L.Allison, Computer Science, Monash University 4/1998
1 TGATAGGTGA TAGATAGATT GATAGATGAT AGAAGATTGA TAGATGATAG
51 ATACATAGGT GATAGTAGAT GTAAGATGAT AGATGATAGA TAGATAGATG
101 ATAGACAGAT TGATAGATGA TAGAGAGA  128

> order-0 Markov Model
>                          .                         .           |   4.0 +
>                                                                |   3.5 b
>                                                                |   3.0 b
>                                                                |   2.5 b
>+..+.....+...+..-.+...+...-..+..+..+-.+..........+..-..+........|-  2.0 b
> .. ..... ... ..+. ... ...... .. .. +. .......... ..... ........|   1.5 b
>                                                                |   1.0 b
>                                                                |   0.5 b
>                                                                |   0.0 b
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =10.0 bits
> data:      (D|H) =211.0 bits, =1.6487 b/ch
> total: (H)+(D|H) =221.0 bits, =1.7269 b/ch
> ran 00/01/21  from 15:32:55  to 15:32:55

> order-1 Markov Model
>   .            .         .  .      .               .           |   4.0 +
>         .        .                                    .        |   3.5 b
>   .                         .  .  .                            |   3.0 b
>                                                                |   2.5 b
>----------------------------------------------------------------|-  2.0 b
>. . . . . . .. . . . .. .   . . .. . .. . . . . .. . . . . . ...|   1.5 b
>...  + + . + ...  . + .......  + .. . .... + + + ...  . ... +   |   1.0 b
> .  . . . . . . .. . . . . .  .   .  . . .. . . . . ... . .. ...|   0.5 b
>                                                                |   0.0 b
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =30.0 bits
> data:      (D|H) =152.1 bits, =1.1886 b/ch
> total: (H)+(D|H) =182.2 bits, =1.4233 b/ch
> ran 00/01/21  from 15:32:55  to 15:32:55

> AED fwd approx repeats
> [Frequencies B:58.6 R:3.2 C:68.2 E:3.2 =:66.4 ~:2.0 i:1.0 d:2.1 tot:204.8]
> [Frequencies B:41.0 R:4.2 C:86.0 E:4.2 =:83.6 ~:2.0 i:1.4 d:3.3 tot:225.8]
> [Frequencies B:37.3 R:4.8 C:89.3 E:4.8 =:87.4 ~:1.7 i:1.5 d:3.6 tot:230.6]
>         .      .         .     .                   .           |   4.0 +
>   +                     .   .      .                         . |   3.5 b
>                                   +          .                 |   3.0 b
>             .                                         .        |   2.5 b
>------.----------------------------------.-------.-------------.|-  2.0 b
>. . .     .                  .      ..    .    .     .          |   1.5 b
>...    +.  .   . ..       ..+         .      .  +               |   1.0 b
> .  .+. ....+.+....   .    .  .  +   ..++ .+.... .++..+..+..  ..|   0.5 b
>                   +++.++.    .+. +      .  .           . ..++  |   0.0 b
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =49.5 bits
> data:      (D|H) =128.5 bits, =1.0040 b/ch
> total: (H)+(D|H) =178.0 bits, =1.3906 b/ch
> ran 00/01/21  from 15:32:55  to 15:32:58
> --- end ---
```

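Note that the approximate repeats (AED) model, the most complex of the three, gives the best compression of this sequence (178.0 bits in total, against 182.2 bits for the order-1 and 221.0 bits for the order-0 Markov model), even when the cost of stating the model's parameters is included. All three totals are below the 256 bits (2 bits per base) that a uniform encoding of the 128 bases would need.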
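The two-part totals for the Markov models can be approximated with a short Python sketch. It is not the original program: it assumes the data cost (D|H) is the negative log-likelihood under maximum-likelihood transition frequencies, and it charges 2 bits for each leading symbol that has no full context. The hypothesis costs (H) are copied from the output rather than recomputed, because the listing does not show how they are derived. The names `data_cost_bits`, `two_part_total`, and `REPORTED_H_BITS` are illustrative only.

```python
from collections import Counter
from math import log2

# The 128-base sequence from the listing, with position numbers and spaces removed.
SEQ = (
    "TGATAGGTGATAGATAGATTGATAGATGATAGAAGATTGATAGATGATAG"
    "ATACATAGGTGATAGTAGATGTAAGATGATAGATGATAGATAGATAGATG"
    "ATAGACAGATTGATAGATGATAGAGAGA"
)


def data_cost_bits(seq, order, alphabet_size=4):
    """Return -log2 likelihood of seq under an ML-fitted order-`order` Markov model.

    Assumption: the first `order` symbols, which lack a full context,
    are charged log2(alphabet_size) bits each.
    """
    context_counts = Counter()
    pair_counts = Counter()
    for i in range(order, len(seq)):
        ctx = seq[i - order:i]          # empty string when order == 0
        context_counts[ctx] += 1
        pair_counts[ctx, seq[i]] += 1
    bits = order * log2(alphabet_size)
    for (ctx, _sym), n in pair_counts.items():
        # Each of the n occurrences costs -log2(n / count of its context).
        bits -= n * log2(n / context_counts[ctx])
    return bits


# Hypothesis costs (H) in bits, copied from the program output; the way the
# program computes them is not shown, so they are treated as given.
REPORTED_H_BITS = {0: 10.0, 1: 30.0}


def two_part_total(seq, order):
    """Total message length: (H) + (D|H), in bits."""
    return REPORTED_H_BITS[order] + data_cost_bits(seq, order)


if __name__ == "__main__":
    for k in (0, 1):
        d = data_cost_bits(SEQ, k)
        t = two_part_total(SEQ, k)
        print(f"order-{k}: (D|H)={d:.1f} bits, "
              f"total={t:.1f} bits, {t / len(SEQ):.4f} b/ch")
```

Under these assumptions the order-0 data cost should land close to the reported 211.0 bits. The order-1 figure need not match the reported 152.1 bits exactly, since the program's parameter estimator and its handling of the first base are not stated. The AED model is not reproduced here; its figures depend on an approximate-repeats alignment that the output only summarises.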