2-State and Binomial Distributions

e.g. the sequence HTTHTTTHTT has n=10 tosses, h=3 heads, t=7 tails.

`A' is an event for which P(A)=p and hence P(not A)=1-p. Consider `n' independent trials (also known as Bernoulli trials). Define the random variable X to be the number of times that A occurs; then X has a binomial distribution with parameters n and p.
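As a quick illustrative sketch (not from the paper; the function name is an assumption), the binomial probability mass function can be computed directly:

```python
from math import comb

def binomial_pmf(n, p, x):
    # P(X = x) = C(n, x) . p^x . (1-p)^(n-x)
    return comb(n, x) * (p ** x) * ((1 - p) ** (n - x))
```

e.g. binomial_pmf(10, 0.5, 3) is C(10,3)/2^10 = 120/1024, and the probabilities for x = 0..n sum to 1.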

The following is taken from Wallace & Boulton (1968) to follow the historical development of MML. The general MML form, explicitly based on the [Fisher] information, is given later.

Coding

The problem is to transmit a sequence of `n' coin-tosses containing `h' heads; this is a multi-state distribution with m=2 states.

A uniform prior is assumed for P(H). The value of `n' is either common knowledge or is transmitted by standard methods. The receiver does not know h or t.

  h     = #heads,  t = #tails,  n = h+t
  r(H)  = h/n = frequency of heads,  r(T) = ...
  P(H)  = probability of a head,  P(T) =...
  P'(H) = estimate of P(H),  P'(T) =...

We describe three methods to transmit the sequence of coin-tosses, from Wallace & Boulton (1968):


Method 1

(i) Transmit h, 0≤h≤n; this requires log(n+1) nits.
(ii) Transmit which one of the C(n,h) possible sequences appeared;
this requires log(n!/(h!(n-h)!)) nits.

Total msg len = log(n+1) + log(n!/(h!(n-h)!)) = log( (n+1)!/(h!(n-h)!) ).
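The method 1 total is easy to evaluate with log-factorials; this is a sketch (the function names are illustrative), working in nits, i.e. natural logs:

```python
from math import lgamma, log

def lfact(x):
    # natural log of x!
    return lgamma(x + 1)

def method1_nits(n, h):
    # log(n+1) to state h, plus log C(n,h) to say which sequence occurred
    t = n - h
    return log(n + 1) + lfact(n) - lfact(h) - lfact(t)
```

For the running example, n=10 and h=3, this is log((n+1)!/(h! t!)) = log(1320) ≈ 7.185 nits.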


Method 2

Use an `adaptive code'.
H and T running counts are maintained, initialised to one so that there are no zero probabilities.
The code for each H or T is based on the counts at that point.

eg.           H     T     T     H     T     T     T     H     T     T
  H count:    1     2     2     2     3     3     3     3     4     4
  T count:    1     1     2     3     3     4     5     6     6     7
  sum    :    2     3     4     5     6     7     8     9    10    11

    count     1     1     2     2     3     4     5     3     6     7
p = -----  =  -     -     -     -     -     -     -     -    --    --
     sum      2     3     4     5     6     7     8     9    10    11


The product of the individual symbol probabilities is

      p = h! . (n-h)! / (n+1)!

  msg len = -log(p) = log( (n+1)!/(h!(n-h)!) )

So method 1 and method 2 require the same message length.
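The adaptive code is easy to simulate; this sketch (illustrative name, natural logs) reproduces the running counts in the table above:

```python
from math import log

def adaptive_nits(seq):
    # running counts start at 1 so that no symbol ever has probability zero
    count = {'H': 1, 'T': 1}
    total = 2
    msglen = 0.0
    for c in seq:
        msglen += -log(count[c] / total)  # code length for this symbol
        count[c] += 1
        total += 1
    return msglen
```

For "HTTHTTTHTT" this gives log((n+1)!/(h! t!)) = log(1320), the same total as method 1.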


Method 3

From the appendix of Wallace & Boulton (1968), for two states, m=2:
(i) Transmit an estimate P'(H) of P(H), to some optimal accuracy; P'(T) = 1-P'(H).
(ii) Transmit the sequence of Hs and Ts in a code based on that estimate; this requires -h.log(P'(H)) - t.log(P'(T)) nits.

Total msg len = msglen(P'(H)) - h.log(P'(H)) - t.log(P'(T))
Part (ii) is minimised if P'(H)=r(H) but then part (i) is large.

    ^ part (ii) cost
    | .                         .
    |  .                       .
    |    .                   .
 D{-|        .           .
  {-|              .
    |
    +---------------------------------> P'(H)
    |        |     |              |
    0        |     r(H)           1
             P'(H)
Let r(H) = P'(H)+a(H) etc. where a(H)+a(T)=0.
Consider that P'( ) is fixed, and r( ) and a( ) vary. The extra discrepancy in the cost of part (ii) from using P'( ) instead of r( ) is:
D = -n{ Σ[m:HT] (P'(m)+a(m)).log(P'(m))       -- using estimates P'()
       -Σ[m:HT] (P'(m)+a(m)).log(P'(m)+a(m))} -- using the r()'s

  = n.Σ[m:HT] (P'(m)+a(m)).log(1+a(m)/P'(m))
              -- assume a(m)/P'(m) is small and expand log(1+a(m)/P'(m))
              -- to the 2nd term;  NB. log(1+x) = x - x²/2 + x³/3 - ...

  ~ n.Σ[m:HT] (P'(m)+a(m)).(a(m)/P'(m) - a(m)²/2P'(m)²)

  = n.Σ[m:HT] { a(m) - a(m)²/2P'(m) + a(m)²/P'(m) - a(m)³/2P'(m)² }
                                                       NB. Σ a(m) = 0

  ~ (n/2).Σ[m:HT] a(m)²/P'(m)
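The quadratic approximation can be checked numerically; this sketch (not from the paper, names illustrative) treats the 2-state case with a = a(H), so a(T) = -a:

```python
from math import log

def D_exact(n, p_est, a):
    # D = n . Σ[m] r(m).log(r(m)/P'(m)),  where r(m) = P'(m) + a(m)
    total = 0.0
    for p, d in ((p_est, a), (1 - p_est, -a)):   # a(H) + a(T) = 0
        r = p + d
        total += r * log(r / p)
    return n * total

def D_approx(n, p_est, a):
    # (n/2) . Σ[m] a(m)² / P'(m)
    return (n / 2) * (a * a / p_est + a * a / (1 - p_est))
```

For n=100, P'(H)=0.3, a=0.01 the exact and approximate values of D agree closely.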

The value of P(H) ranges uniformly over [0,1], and a particular estimate P'(H) will be used over some range of r(H) values; this range will depend on n and on P'(H).

    r(T)^               Σ[m:HT] r(m) = 1
       1|.              r(m)≥0, m:HT
        |  .
        |    .
        |      .
        |        .
        |          .
       0.------------->r(H)
        0          1

 For 2 states, {H,T}, since a(H) = -a(T) and P'(H) = 1-P'(T),
    D = (n/2).{ a(H)²/P'(H) + a(T)²/P'(T) }
      = (n/2).{ a(H)²/P'(H) + a(H)²/P'(T) }
      = (n/2). a(H)² / (P'(H).(1-P'(H)))

Note that for a given a(H), D is large for P'(H) close to 0 or to 1, and also that D is symmetric for a(H) +ve or -ve, so P'(H) should be at the centre of the range over which it is used.

The average value of a(H)² over a range [P'(H)-x/2, P'(H)+x/2] of length x is

      = (1/x).∫[-x/2,x/2] y² dy

      = (1/x).{y³/3}[-x/2,x/2]

      = (1/x).(x³/12)

      = x²/12
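The x²/12 term is just the mean square of a uniform deviation over a width-x interval; a quick numerical check (an illustrative sketch using the midpoint rule):

```python
x = 0.4
m = 10000
# midpoint-rule average of y² over [-x/2, x/2]
avg = sum((-x / 2 + x * (i + 0.5) / m) ** 2 for i in range(m)) / m
# avg is very close to x*x/12
```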
So the average D over the range is

      avgD = n . x² / (24 . P'(H).(1-P'(H)))
The message length to state the interval [P'(H)-x/2, P'(H)+x/2] and thus P'(H) is

      -log(x)          -- because of the uniform prior on P(H)
The total average cost to transmit the interval, and also the D due to the mismatch of P'(H) and r(H), is

      -log(x) + n . x² / (24 . P'(H).(1-P'(H)))

To minimise this average increase, differentiate w.r.t. x and set the derivative to zero:

      -1/x + n . x / (12 . P'(H).(1-P'(H))) = 0
      x² = 12 . P'(H) . (1-P'(H)) / n
So the minimum increase is

      1/2 log(n/12) - 1/2 {log(P'(H)) + log(1-P'(H))}
        + n . 12.P'(H).(1-P'(H))/n / (24.P'(H).(1-P'(H)))

    = 1/2 {log(n/12) - log(P'(H)) - log(1-P'(H)) + 1}
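Putting the last two steps together (a sketch; function names are illustrative, logs are natural):

```python
from math import log, sqrt

def optimal_width(n, p_est):
    # x² = 12 . P'(H).(1-P'(H)) / n
    return sqrt(12 * p_est * (1 - p_est) / n)

def stating_cost_nits(n, p_est):
    # 1/2 {log(n/12) - log(P'(H)) - log(1-P'(H)) + 1}
    return 0.5 * (log(n / 12) - log(p_est) - log(1 - p_est) + 1)
```

Note that stating_cost_nits equals -log(x) + 1/2 at the optimal x, i.e. the cost of stating the interval plus the average D of 1/2 a nit.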
If we take P'(H) ~ r(H) and do averaging around r(H), total message length to transmit data by method 3 is approximately:


      1/2 {log(n/12) + 1} -  Σ[m:HT] (n.r(m) + 1/2).log(r(m))
    = 1/2 {log(n/12) + 1 - Σ[m:HT]log(r(m))} - h.log(h/n) - t.log(t/n)
                                               ^^^^^^^^^^^^^^^^^^^^^
                                         part (ii) in "optimal" code
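The method 3 total is then straightforward to evaluate (a sketch, natural logs, illustrative name):

```python
from math import log

def method3_nits(n, h):
    # 1/2{log(n/12) + 1} - Σ[m] (n.r(m) + 1/2).log(r(m))
    t = n - h
    rh, rt = h / n, t / n
    return (0.5 * (log(n / 12) + 1)
            - (n * rh + 0.5) * log(rh)
            - (n * rt + 0.5) * log(rt))
```

For the running example, n=10 and h=3, this gives about 7.298 nits, slightly more than the 7.185 nits of methods 1 and 2.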
 

To compare Method 3 with methods 1 and 2:

Stirling's approx: log(x!) = x.log(x) - x + 1/2 log(x) + 1/2log(2 pi) + ...


method 1 or 2:
   log( (n+1)!/(h! t!) )

 = log(n+1) + log(n!/(h! t!))

 = log(n+1) + n log n  - n + 1/2 log n  + 1/2 log(2 pi)
            - h log h  + h - 1/2 log h  - 1/2 log(2 pi)
            - t log t  + t - 1/2 log t  - 1/2 log(2 pi)

 = log(n+1) - h log(h/n) - t log(t/n) + 1/2(log n -log h -log t -log(2 pi))

 = 1/2(log((n+1)² . n / (h t)) - log(2 pi)) - h.log(h/n) - t.log(t/n)

 = 1/2{log((n+1)²/n) - log(h/n) - log(t/n) - log(2 pi)} - h.log(h/n) - t.log(t/n)

 = 1/2{log((n+1)²/n) - log(2 pi) - log(r(H)) - log(r(T))} - h.log(h/n) - t.log(t/n)

method 3 - method 1
 = 1/2{log(n/12) + 1 - Σ[m:HT]log(r(m))} - h.log(h/n) - t.log(t/n)
 - {1/2{log((n+1)²/n) - log(2 pi) - log(r(H)) - log(r(T))} - h.log(h/n) - t.log(t/n)}

 = 1/2{-2.log(1+1/n) - log(12) + 1 + log(2 pi)}

 ~ 1/2 { log(pi/6) + 1 }                                    -- for large n

 = 1/2 log(pi.e/6)

 = 0.176 nits
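This limit can be checked numerically; the gap between method 3 and methods 1/2 approaches 1/2 log(pi.e/6) ≈ 0.176 nits for large n (a sketch, illustrative names):

```python
from math import e, lgamma, log, pi

def method1_nits(n, h):
    # methods 1 and 2: log((n+1)!/(h! t!))
    t = n - h
    return log(n + 1) + lgamma(n + 1) - lgamma(h + 1) - lgamma(t + 1)

def method3_nits(n, h):
    # 1/2{log(n/12) + 1} - Σ[m] (n.r(m) + 1/2).log(r(m))
    rh, rt = h / n, (n - h) / n
    return (0.5 * (log(n / 12) + 1)
            - (n * rh + 0.5) * log(rh)
            - (n * rt + 0.5) * log(rt))

n, h = 10_000, 3_000
gap = method3_nits(n, h) - method1_nits(n, h)
# gap is close to 0.5 * log(pi * e / 6) ~ 0.176 nits
```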

So method 3 is just marginally(!) worse than methods 1 and 2 - but then it states an estimate P'( ) of P( ) which methods 1 and 2 do not.


Demonstration

Use the HTML FORM below to generate a sequence of coin tosses for a specified pr(H) and length N. The `code' button calculates the code length for a fixed code (nominate an estimate p' for P(H)), the combinatorial code, the adaptive code, and a code which states pr(H) to an optimal level of accuracy. NB. the approximations used may break down for very small values of N.


Notes

This page is directly based on the appendix of the paper by Wallace & Boulton (1968); any errors are L.A.'s.

  • C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal, 11(2), pp.185-194, Aug. 1968.
    (see the appendix) and ...
  • D. M. Boulton & C. S. Wallace. The Information Content of a Multistate Distribution. J. Theor. Biol., 23, pp.269-278, 1969, [doi:10.1016/0022-5193(69)90041-1].
    The multi-state distribution generalises the 2-state distribution.
© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1.