DNA Compression and DNA Complexity

LA home
Computing
 Algorithms
 Bioinformatics
 FP,  λ
 Logic,  π
 MML
 Prog.Langs

Bioinformatics
 glossary
 Compression
  +alignment

also see
compression(2)

Other software:
 approx.repeats
 other software

References

Demonstration

Below is (i) a repetitive or low-information-content DNA sequence and (ii) a random or high information content sequence.

When analysed the complexity (aka information content, entropy) of a sequence is computed under

  1. a zero-order Markov model,
  2. a first-order Markov model, and
  3. the [AED] model (see refs)

The parameters of a particular model (zero-order MM etc.) are fitted to the given sequence and the calculations include an estimate of the cost, in bits, required to state the parameters of the hypothetical model (H) to optimum accuracy. The cost of encoding the data given the model (D|H) is also given. The sum, H+(D|H), is the total cost of encoding the sequence under a particular model. The model that gives the lowest total cost has the greatest posterior probability. Costing H prevents over-fitting: although a more complex model may fit the data better (D|H), it has to "pay" for its own great complexity, H. There is a natural null-hypothesis which is that the sequence simply consists of random characters and in this case (H)=0 and (D|H)=2 bits/base, for DNA.

Limitations

The demonstration program is a cut-down version. It implements forward approximate repeats only, and is limited to sequences of a few hundred characters because it is run on our web server for demonstration purposes only. The full program implements both forward and reverse complementary approximate repeats and can analyse sequences of hundreds of thousands of characters.

A later algorithm [CDA07 (click)] can run on complete genomes.

Low Information Content Sequence:

Here is a low-information content sequence:


(or precomputed)

NB. It is assumed that the length of the sequence is "common knowledge", otherwise the length should also be included in the encoding, using your favourite prior for lengths.
The formula used to estimate the cost of the Hypothesis (H) and its parameters assumes that the sequence is much longer than the number of parameters (12 for an first-order MM over DNA).

Random Sequence:

On the other hand, this is a sequence of 100 bases, generated uniformly and independently at random:


(or precomputed)

The more complex models find some chance patterns in the sequence and give a figure of less than 2.0 bits per character for (D|H), but when the models' complexities are included their totals are greater than 2.0 bits per character and they are shown to be unacceptable hypotheses: The sequence is seen to be uniform random after all.

Long Sequences

The full algorithm (see ref' above) also implements reverse complementary repeats and, having certain speed-up techniques, it can be run on sequences containing tens or hundreds of thousands of bases (below).

information content per nucleotide along HUMHBB by the HMM / PFSA
HUMHBB

Later

A later [algorithm (click)] can be run on complete genomes.


Also See:

window on the wide world:

Computer Science Education Week

Linux
 Ubuntu
free op. sys.
OpenOffice
free office suite,
ver 3.4+

The GIMP
~ free photoshop
Firefox
web browser
FlashBlock
like it says!

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Thursday, 23-Oct-2014 11:40:36 EST.