Elementary Probability
- A sample-space
(e.g. S={a,c,g,t}) is the set of
possible outcomes of some experiment.
- Events A, B, C,
..., H, ... An event is a subset
(possibly a singleton) of the sample space, e.g. Purine={a,g}.
- Events have
probabilities P(A), P(B), etc.
- Random variables
X, Y, Z, ... A random variable X takes
values, with certain probabilities, from the sample space.
- We may write
P(X=a), P(a) or P({a}) for the probability
that X=a.
Thomas Bayes
(1702-1761)
Thomas
Bayes
made an early study of probability and games of chance.
Bayes' Theorem
If B1, B2,
..., Bk
is a partition of a set B (of causes) then
P(A|Bi) P(Bi)
P(Bi|A) = ------------------- i=1, 2, ..., k
k
SUM P(A|Bj) P(Bj)
j=1
One and only one
of the Bi must occur
because they are a partition if B.
Inference
Bayes theorem is
relevant to inference because
we may be entertaining a number of exclusive and exhaustive hypotheses
H1, H2, ..., Hk, and
wish to know which is the besti|D) is called the posterior
probability
of Hi, "posterior" because
it is the probability after the data has been observed.
explanation of some observed
data D.
In that case P(Hi|D) is called the posterior probability
of Hi, "posterior" because
it is the probability after the data has been observed.
-
P(A|Bi) P(Bi)
P(Bi|A) = ------------------- i=1, 2, ..., k
k
SUM P(A|Bj) P(Bj)
j=1
Note that the Hi
can
even be an infinite enumerable set.
P(Hi) is
called the prior
probability
of Hi, "prior" because
it is the probability before D is known.
Notes
- T. Bayes. An
essay towards
solving a problem in the doctrine of chance. Phil. Trans. of the
Royal Soc. of London, 53,
pp370-418, 1763. Reprinted in Biometrika 45 pp296-315, 1958.
The probability of B given A is written P(B|A).
It is the probability of B provided that A is true;
we do not care, either way, if A is false.
Conditional probability is defined by:
-
P(A&B) = P(A).P(B|A) = P(B).P(A|B)
P(A|B) = P(A&B)/P(B)
P(B|A) = P(A&B)/P(A)
These rules are a special case of Bayes' theorem with k=2.
There are four combinations for two Boolean variables:
-
| |
A |
not A |
margin |
| B |
A & B |
not A & B |
(A or not A)& B = B |
| not B |
A & not B |
not A & not B |
(A or not A)& not B = not B |
| margin |
A = A&(B or not B) |
not A = not A &(B or not B) |
-L.Allison- |
We can still ask what is the probability of A, say, alone
-
P(A) = P(A & B) + P(A & not B)
P(B) = P(A & B) + P(not A & B)
Independence
A and B are said to be independent
if the probability of A does not depend on B and v.v..
In that case P(A|B)=P(A) and P(B|A)=P(B) so
-
P(A&B) = P(A).P(B)
P(A & not B) = P(A).P(not B)
P(not A & B) = P(not A).P(B)
P(not A & not B) = P(not A).P(not B)
A Puzzle
I have a dice (made it myself, so it might be "tricky")
which has 1, 2, 3, 4, 5 & 6 on different faces.
Opposite faces sum to 7.
The results of rolling the dice 100 times (good vigorous rolls on
carpet) were:
1- 20: 3 1 1 3 3 5 1 4 4 2
3 4 3 1 2 4 6 6 6 6 21- 40: 3 3 5 1 3 1 5 3 6 5
1 6 2 4 1 2 2 4 5 5 41- 60: 1 1 1 1 6 6 5 5 3 5
4 3 3 3 4 3 2 2 2 3 61- 80: 5 1 3 3 2 2 2 2 1 2
4 4 1 4 1 5 4 1 4 2 81-100: 5 5 6 4 4 6 6 4 6 6
6 3 1 1 1 6 6 2 4 5
We are interested in noiseless coding,
i.e. in efficient codes that can be used to send
data over a communication line that does not introduce noise (errors).
Such codes can also be used to compress data for storage purposes.
Here efficiency refers only to the code itself,
not to the amount of time or space required
to do the encoding and decoding processes.
source_symbols* -------> (0|1)* -------> source_symbols*
encoding decoding
The average code-word length (message length),
for events v from S, is
SUMv in S {P(v) |code(v)|}
This cannot be less than the entropy, H.
This lower bound is achieved if
|code(v)| = -log2(P(v)).
Decoding
A code is uniquely decodable if every finite string of bits
represents at most one sequence of source text.
i.e. There is no possibility of two different source texts
giving the same coded string.
Synchronizable Codes
A code is comma-free if it is impossible
to get one out-of-frame code-word in the middle of an encoded string.
In the early days of molecular biology it
was briefly suspected that the genetic code
(DNA3->amino acids)
was comma-free, but this turned out not to be the case.
A code has a stagger limit of k
if the synchronisation of all windows of length k+1 or greater
is apparent without any more context.
B. Neveln, Comma-Free and Synchnronizable Codes,
J. Theor. Biol 144 pp209-212, 1990,
suggests synchronizability, not comma free ness,
was the requirement of an early genetic code.
The genetic code is very nearly a Gray code,
i.e. an error in the (en|de)coding (DNA) will
probably either be silent (code for the same amino acid),
or code for a chemically similar amino acid.
The code seems to have evolved to minimise
the effect of mutations and misreading:
R. Swanson, A Unifying Concept for the Amino Acid Code,
Bull. Math. Biol. 46(2) pp187-203, 1984.
Prefix Property
Another desirable property for a code to have is
the prefix property:
No code word is a prefix of any other code word. A code having
the prefix property is uniquely decodable, but not necessarily v.v..
The prefix property allows us to identify the end of the first
(and second etc.) code word in a data stream immediately, and
is required to be able to decode a stream of data unambiguously
without look-ahead.
While the prefix property seems to be quite strict,
some fundamental results show that such codes are
all that we need to consider for the purposes of inference.
Example
Recalling the biased four sided dice, P(a)=1/2, P(c)=1/4, P(g)=P(t)=1/8,
the following is an efficient code for data generated by the dice:
-
| a:0 |
1? |
| c:10 |
11? |
| g:110 |
t:111 |
| 1/2 |
1/4 |
1/8 |
1/8 |
This is a particularly simple example of a
Huffman code (see below).
You will notice that the probabilities 1/2, 1/4, 1/8, 1/8,
were deliberately chosen so that their -log2s correspond exactly
to convenient code word lengths; things do not always work out so easily.
When they do, code words can be allocated by using a binary tree.
Codes and Entropy
Suppose we have events with probabilites
p1, p2 ..., pN
and there is a code for these events with code-word length
l1, l2 ..., lN,
where SUMi=1..N{2-li}<=1.
The average code-word length will be
N
SUM pi li
i=1
We saw under entropy, that this expression is minimised
when li = -log2(pi).
Such values will not be integers in general,
but if we had a code with code-word lengths of
ceiling(-log2(pi)),
then it would have an average code-word length less then the entropy plus one,
< H+1.
Kraft Inequality
There is a prefix code with code-words of lengths
l1, l2, ..., lN iff
SUMi=l..N{2-li} <= 1.
- Kraft (1949).
Proof (a) code exists => inequality:
There must be a maximum code-word length, say max.
Consider the full binary tree of depth max with edges labelled (0|1),
e.g. below if max=3 (root at the left):
-
| |
1 | 2 | 3 |
| root |
1 |
11 |
111 |
| 110 |
| 10 |
101 |
| 100 |
| 0 |
01 |
011 |
| 010 |
| 00 |
001 |
| 000 |
Every code-word must appear somewhere in the tree and
because of the prefix property no other
code-word can be a descendant or an ancestor of it in the tree.
There are 2k-max.
The probability of a shorter bit-string is the sum of the probabilities
of the max-bit strings below it.
The sum of the probabilities of the code-words cannot exceed 1.0.
bit-strings of length k.
A bit-string of length max, has probability 2
Proof (b) inequality => code exits.
Sort the lengths so that
l1<=l2<= ... <=lN.
- If lN = 1 then N=1 or N=2
and in the latter case {'0', '1'} will do as our prefix code.
-
Otherwise lN > 1.
Let S = SUMi=1..N{2-li}.
- If S <= 1/2, note that either
- l1=1, and in that case N=1, and {'0'} is a satisfactory code.
- l1>=2
Now S <= 1/2,
so 2S = SUMi=1..N{2-(li-1)} <= 1
We have shrunk the max' length, so by induction there is a code
with word lengths l1-1, ..., lN-1.
Prepend 0 to each code-word to get one with the required word lengths.
- Otherwise S > 1/2
Remember l1<=l2<= ... <=lN.
So 2-l1 >= 2-l2 >= ...
Let m be the smallest value for which
Sm
= SUMi=1..m{2-li} >= 1/2.
In fact Sm = 1/2,
because the "gap",
1/2-Sk for k<m,
must be a multiple of 2-lm.
Now deal with {l1, ..., lm}
and {lm+1, ..., lN}
separately, prefixing the code words for the former with 0
and the code words for the latter with 1, ... done.
Theorem
There exists a prefix code with code-words of length
l1, l2, ..., lN,
iff there is a uniquely decodable code with these lengths
G.F.'s notes give Welsh Codes and Cryptography OUP 1988
as a reference.
So it is reasonable to insist on the use of prefix codes
because if there is any uniquely decodable code
then there is a prefix code.
Shannon (1948)
Noiseless coding theorem for sources without memory:
Let S be a source with entropy H.
- Any uniquely decodable code for S has average length >= H.
- There exists a uniquely decodable code (in fact a prefix code)
for S with average length <=H+1.
See G.Farr (1999) p9 for (1).
Proof (2):
Let li = ceiling(-log2(pi)).
Clearly
SUMi=1..N{2-li} <= 1.
By the Kraft inequality there is a prefix code
with these code-word lengths.
SUMi=1..N{pi li}
<= SUMi=1..N{pi(-log2(pi) + 1)}
= H+1
Huffman (1952)
Algorithm for the construction of minimum redundancy codes.
Huffman's algorithm
starts with a list of probabilities for events,
p1, p2, .., pN.
It forms a binary tree with these probabilities
at leaves of the tree:
- Part 1:
- Find the two smallest values, x, y, in the list.
- Replace them by the value x+y associated with a
node in the tree fork(x,y).
- Repeat until the list contains one entry.
- Now allocate code-words from the top of the tree
downwards, as implied by the
example at the top of this web page.
Huffman's algorithm does give an optimal
code if one is restricted to transmiting symbols one at a time.
It may be possible to do better by grouping symbols together.
(Arithmetic coding effectively takes this to the extreme.)
(Huffman's algorithm is bottom-up - in the initial phase.
The other strategy that one might think of is
"top-down" but this does not
lead to efficient codes in general.)
Arithmetic Coding
Arithmetic coding works by dividing the interval [0, 1),
i.e. {x | x>=0 & x <1},
into smaller and smaller intervals which represent
the successive symbols of an input sequence.
It can be thought of as an extension of prefix coding
where the interval boundaries lie neatly between adjacent powers of 1/2.
e.g. The (trivial) prefix code for {F,T}*
where P(F)=P(T)=1/2:
- e.g. TTFT... where P(F)=1/2, P(T)=1/2
| 1/2 F |
1/2 T
|
1 |
| 1/4 F |
1/4 T
|
2 |
|
1/8 F
|
1/8 T |
3 |
1/16
F |
1/16
T
|
4 |
| 0 |
|
|
|
|
|
|
|
1/2 |
|
|
|
|
|
|
1 |
|
Note that
1/2 = 0.510 = 0.12,
1/4 = 0.2510 = 0.012,
1/8 = 0.12510 = 0.0012,
etc.
Langdon (1984)
gives an excellent tutorial introduction to arithmetic coding
[But beware:
fig 3 does not correspond to table 3 and the text:
a and b seem to have got switched.],
crediting Pasco (1976) and independently Martin (1979), and Jones (1981)
with discovering a successful FIFO arithmetic code.
In arithmetic coding, probabilities of events need not be neat powers of 1/2.
e.g. he gives a 2-symbol example where P(F)=1/8, P(T)=7/8 in position one
and P(F)=P(T)=1/2 thereafter.
(The probabilities are allowed to vary with position in the sequence,
so long as encoder and decoder both know them or can calculate them.)
For input "TTFT..."
the successive interval widths would ideally be
7/8 (T), 7/16 (T), 7/32 (F), 7/64 (T).
In general, the precision required to state an interval width could
increase without limit.
Arithmetic coding places a limit on the accuracy required;
the price is a small loss in coding efficiency
and Langdon (1984) states "less than 4%" for the
choices in that paper.
e.g. TTFT... where
1: P(F)=1/8, P(T)=7/8, and then
2, 3, 4: P(F)=P(T)=1/2
-
| 1/8 F |
7/8 T
|
P(F)=1/8 |
| 1/4 F |
5/8 T
|
P(F)=1/2 |
|
1/4 F
|
3/8 T |
P(F)=1/2 |
1/8
F |
1/8
T
|
P(F)=1/2 |
| 0 |
|
|
|
|
|
|
|
1/2 |
|
|
|
|
|
|
1 |
|
Note that the second-level partition is
1/4:5/8 not the ideal 7/16:7/16.
Successive intervals [C, C+A), of left-hand end C and width A,
are encoded by C:
"0...",
"0...",
"0...",
"1...", ... .
The end of the output is padded out by sufficient zeros.
Decoding
The decoder follows a similar set of actions
to the encoder which enables the input source
string to be recovered.
Precision
The left-hand end, C, of the current interval
is a binary number.
The method guarantees that it (and, A, the interval width)
can be represented with bounded precision,
i.e. of the form
0.000...000{0 | 1}k,
and can be represented by a fixed-point number
1.{0 | 1}k-1 (and exponent).
This removes any problems of numerical accuracy.
It is required that P(F)<=P(T) and P(F)
is approximated by the smallest power of 1/2 no larger than it.
P(T) is approximated by "what's left" of the current interval.
If P(F)>P(T), and note that both
the encoder and decoder must know this,
they can switch their roles and carry on as before.
Carry-Over
It can happen that when additions are made
to C, to move the left-hand end of the interval,
a carry can propagate.
This requires keeping the output string in a FIFO buffer,
potentially of arbitrary length.
The length of the buffer can be limited
(e.g. to 16 bits Langdon (1984))
in a technique called bit-stuffing (Langdon & Rissanen 1981)
by forcing a 0 out if 16 1s have been output.
The decoder can detect this situation.
Multi-symbol Alphabets
Larger input symbol-sets can be handled
either by converting the symbols to binary strings
which are then encoded as above, or by more
direct extensions of the method to deal with
multiple symbols and their individual probabilities.
Entropy
As described, arithmetic coding is very efficient
and extensions of it can get arbitrarily close
to the entropy of the data.
In some sense it has "solved" the coding problem,
making it clear that
compression = data modelling + coding.
The (predicted) probabilities of the events can come from anywhere
provided that the encoder and decoder agree on them;
they can for example be fixed, or
come from an order-k Markov model, or
be estimated from statistics gathered
from the data previously (en|de)coded,
etc.
... probability ... predict ... model ... inference ...
Notes
- S. E. Shannon. A Mathematical Theory of Communication.
Bell Syst. Technical Jrnl. 27 pp379-423, pp623-656, 1948
- S. E. Shannon & W. Weaver.
A Mathematical Theory of Communication.
U. of Illinois Press, 1949
- D. A. Huffman.
A Method for the Construction of Minimum-Redundancy Codes.
Proc. Inst. Radio Eng. 40(9) pp1098-1101, Sept. 1952
- R. Pasco. Source Coding Algorithms for Fast Data Compression.
PhD Thesis, Dept. Elec. Eng., Stanford University, 1976
Arithmetic coding
- G. N. N. Martin. Range Encoding: an Algorithm for Removing Redundancy
from a Digitized Message.
Video and Data Recording Conf., Southampton, 1979
Arithmetic coding
- C. B. Jones. An Efficient Coding System for Long Source Sequences.
IEEE Trans. Info Theory. IT-27 pp280-291, 1981
Arithmetic coding
- G. G. Langdon jr. & J. Rissanen.
A Simple general Binary Source Code.
IEEE Trans. Inf. Theory IT-28 pp800-803, 1982
Bit stuffing
- G. G. Langdon jr. An Introduction to Arithmetic Coding.
IBM J. Res. and Dev. 28(2) pp135-149, 1984
- I. H. Witten, R. M. Neal & J. G. Cleary.
Arithmetic Coding for Data Compression.
CACM 30(6) pp520-540, 1987