A Simple Statistical Algorithm for Biological Sequence Compression, IEEE Data Compression Conference 2007 (DCC'07)


Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears


Abstract: This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
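The panel-of-experts idea can be illustrated with a small sketch. The abstract does not say what the experts are or how their probabilities are combined, so everything below is an assumption made for illustration: the experts are fixed-order context models with add-one smoothing (the hypothetical `OrderKExpert` class), they are combined by Bayesian averaging with weights proportional to each expert's past predictive success, and the arithmetic coder is replaced by the ideal code length of -log2(p) bits per symbol. It is not the authors' implementation.

```python
import math

ALPHABET = "ACGT"  # assumed DNA alphabet


class OrderKExpert:
    """Hypothetical expert: an order-k context model with add-one smoothing."""

    def __init__(self, k):
        self.k = k
        self.counts = {}  # context string -> {symbol: count}

    def _context(self, history):
        # history[-0:] would return the whole string, so handle k == 0 explicitly.
        return history[-self.k:] if self.k else ""

    def predict(self, history):
        seen = self.counts.get(self._context(history), {})
        total = sum(seen.values()) + len(ALPHABET)
        return {s: (seen.get(s, 0) + 1) / total for s in ALPHABET}

    def update(self, history, symbol):
        seen = self.counts.setdefault(self._context(history), {})
        seen[symbol] = seen.get(symbol, 0) + 1


def information_sequence(seq, experts):
    """Return the information content (in bits) of each symbol of seq."""
    weights = [1.0 / len(experts)] * len(experts)
    info = []
    for i, sym in enumerate(seq):
        history = seq[:i]
        preds = [e.predict(history) for e in experts]
        # Combine the expert distributions into one distribution (assumed rule:
        # weighted average of the experts' probabilities).
        mix = {s: sum(w * p[s] for w, p in zip(weights, preds)) for s in ALPHABET}
        # Ideal code length an arithmetic coder would approach for this symbol.
        info.append(-math.log2(mix[sym]))
        # Re-weight experts by how well they predicted the symbol just seen.
        weights = [w * p[sym] for w, p in zip(weights, preds)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for e in experts:
            e.update(history, sym)
    return info


seq = "ACGT" * 50
info = information_sequence(seq, [OrderKExpert(0), OrderKExpert(2)])
print(f"{sum(info) / len(seq):.3f} bits per symbol")
```

In this sketch, the per-symbol values in `info` play the role of the information sequence mentioned in the abstract: stretches with low values are well predicted by the panel, and high values mark symbols that are surprising given the preceding context.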
Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears.
A Simple Statistical Algorithm for Biological Sequence Compression.
IEEE Data Compression Conference (DCC), pp.43-52, 2007
[doi:10.1109/DCC.2007.7]
[preprint.pdf]
Also see: [BMC Bioinformatics '07], [Compression], [Bioinformatics],
and a related [seminar]


© L. Allison http://www.allisons.org/ll/ (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
