Mixtures

Also see: More Mixtures.

The HTML form below allows the probability density functions of two normal distributions, N(μ1,σ1) and N(μ2,σ2), scaled by p and 1-p respectively, to be plotted together with their mixture in the ratio p:(1-p).

You can vary μ1, σ1, μ2, σ2, and p (and also the bounds on the axes of the graph) and press the "plot" button to see the result. (Beware: there is no error checking.)

[Interactive plot: "Gaussian Mixture Model" (Java applet, © L. Allison), with inputs μ1, σ1, p; μ2, σ2, q=(1-p); and vertical axis min and max.]
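
For readers without the applet, here is a minimal sketch in Python of the quantity being plotted, using illustrative parameter values: each component density is scaled by its mixing proportion, and the mixture is their sum.

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

    # Illustrative values; the applet lets you vary these.
    mu1, sigma1, mu2, sigma2, p = -1.0, 1.0, 1.0, 1.0, 0.5

    def mixture_pdf(x):
        return p * normal_pdf(x, mu1, sigma1) + (1 - p) * normal_pdf(x, mu2, sigma2)

    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x,
              round(p * normal_pdf(x, mu1, sigma1), 4),        # scaled component 1
              round((1 - p) * normal_pdf(x, mu2, sigma2), 4),  # scaled component 2
              round(mixture_pdf(x), 4))                        # their mixture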

The above is a very simple example. It is possible to have mixtures of more than two component classes, mixtures of multi-variate distributions, mixtures of different kinds of distribution, and so on.

The Model

Given S things, each thing having D attributes (measurements), a mixture model attempts to describe the things as coming from a mixture of T classes (clusters). Model and data are described together by stating:

  1. The number of classes, T,
  2. a dictionary of class names,
  3. the distribution function for each class,
  4. for each thing, the class to which it belongs (although see fractional assignment later),
  5. for each thing, the attribute values using a code based on the class to which the thing belongs.

The following are assumed to be common knowledge: The number of things, S, the number of attributes, D, the nature of each attribute.
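
As a rough illustration of how the five parts above add up, here is a toy accounting in Python; the bit counts are made up for the example and this is not a real MML calculation.

    from math import log2

    # Toy accounting of the five message parts for T = 2 classes and S = 4 things.
    len_T      = 3.0                      # 1. state T with some integer code
    len_dict   = 6.0                      # 2. state the class dictionary (abundances below)
    p_class    = [0.75, 0.25]             #    ...giving each class a code word of -log2(p) bits
    len_params = [20.0, 20.0]             # 3. state each class's distribution parameters
    assignment = [0, 0, 0, 1]             # 4. the class of each thing (definite assignment)
    len_attrs  = [3.1, 2.8, 3.5, 4.0]     # 5. each thing's attributes, coded via its class

    total = (len_T + len_dict + sum(len_params)
             + sum(-log2(p_class[c]) for c in assignment)
             + sum(len_attrs))
    print(round(total, 2), "bits")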

Number of Classes

The number of classes can be coded in any suitable code for (smallish) integers.
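
One example of such a code (chosen here for illustration, and not necessarily the code Snob uses) is the Elias gamma code, a prefix code over the positive integers whose code words grow slowly with n.

    from math import floor, log2

    def elias_gamma_length(n: int) -> int:
        # Length in bits of the Elias gamma code word for an integer n >= 1.
        return 2 * floor(log2(n)) + 1

    print([elias_gamma_length(t) for t in range(1, 9)])   # [1, 3, 3, 5, 5, 5, 5, 7]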

Class Dictionary

This specifies a code word for each class; the choice of class for a thing is described by a multistate distribution.
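
As a small sketch, with assumed relative abundances for three classes, the class code-word lengths follow directly from the multistate distribution: a class of probability p gets a code word of -log2(p) bits.

    from math import log2

    # Assumed relative abundances of three classes (illustrative values).
    abundance = {"c0": 0.5, "c1": 0.3, "c2": 0.2}
    code_len = {c: round(-log2(p), 2) for c, p in abundance.items()}
    print(code_len)   # c0: 1.0 bit, c1: ~1.74 bits, c2: ~2.32 bits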

Class Distributions

Each class distribution is defined by the distribution parameters for those attributes that are important to it, e.g. the mean and standard deviation for a normal distribution, or the value probabilities for a multistate distribution. A class need not specify all attributes; it can optionally fall back on "general population" class properties.
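
A sketch of the cost of stating one continuous attribute value under a class's normal distribution, assuming the value is measured to some accuracy eps; the parameter values and eps below are illustrative only.

    from math import exp, log2, pi, sqrt

    def normal_pdf(x, mu, sigma):
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

    # Approximate cost, in bits, of stating x to accuracy eps under N(mu, sigma):
    # -log2(pdf(x) * eps).
    x, eps = 1.3, 0.01
    print(round(-log2(normal_pdf(x, 1.0, 1.0) * eps), 2), "bits")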

Fractional Assignment

The discussion above assumed that each thing was assigned wholly to one class or another. If two or more classes overlap strongly, e.g. if their means are not several standard deviations apart, such a 'definite assignment' description and code are not optimal.

Nuisance Parameters in Definite Assignment

For example, consider a 50:50 mixture of C0 = N(-1,1) and C1 = N(1,1), and a thing ti = 0.0. Now ti could be in either class, and we do not care which. However, with definite assignment, as above, we are forced to specify that ti is in class C0, or that it is in class C1, at a cost of 1 bit. Because of the position of ti, the subsequent cost of stating its one attribute is the same in either case.

There are actually two alternative (sub-)hypotheses here: H0, that ti is in C0, and H1, that ti is in C1. Since we do not care about H0 v. H1, we should add their probabilities together. This shortens the message length, and similar considerations apply to every thing, tj, that could be in more than one class.

Class memberships have the typical characteristics of nuisance parameters: their number increases in proportion to the amount of data. If classes are close enough then, regardless of the amount of data, an inference method that uses definite assignment (such as the 1968 Snob) will not be able to detect the separate classes. A second problem is that estimates of class parameters are biased under definite assignment because the model, in effect, becomes a mixture of non-overlapping, truncated distributions.

No-Nuisance Coding

There are two ways to look at a method of coding the things efficiently.

The first view is to "borrow" bits from later in the message. The transmitter considers the code for things ti+1, ... . If this starts with a '0', ti is coded as being in class C0, otherwise in C1. Either way, the receiver decodes ti, then considers the fact that the transmitter placed it in Ci, where i = 0 or 1, and therefore understands i to be the first bit (which need not therefore be transmitted explicitly) of the rest of the message. Thus a bit is saved.

The second view of the matter is to consider the distributions for C0, C1, and their mixture. Thing ti has some probability, p, under class C0. Because of the symmetry of this example, ti also has probability p under C1. It therefore has probability p + p = 2p under the mixture; considering code lengths (logs to base 2, giving bits), -log2(2p) = -log2(p) - 1, a saving of one bit.
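
A quick numeric check of the example in Python; here p is the joint probability of a sub-hypothesis (class code word plus attribute), and eps is an assumed accuracy of measurement which cancels out of the saving.

    from math import exp, log2, pi, sqrt

    def normal_pdf(x, mu, sigma):
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

    x, eps = 0.0, 0.01                     # eps: assumed accuracy of measurement
    # Probability of ti under H0: 1-bit class code word times the attribute probability.
    p = 0.5 * normal_pdf(x, -1.0, 1.0) * eps
    # By symmetry, ti has the same probability p under H1 (class C1 = N(1,1)).

    definite   = -log2(p)                  # forced to pick one class
    fractional = -log2(p + p)              # add the probabilities of H0 and H1
    print(definite - fractional)           # -> 1.0 bit saved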

Generalisation

We have been using an example where thing ti has equal probability of coming from C0 and C1. This was only to keep the arithmetic simple. Similar considerations apply when a thing is not exactly mid-way between classes, when there are two or more attributes, three or more classes, and so on.
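
In general the coding cost of a thing is -log2 of its total, weighted probability over all classes. A sketch, with made-up weights and parameters for three one-dimensional normal classes and an assumed measurement accuracy eps:

    from math import exp, log2, pi, sqrt

    def normal_pdf(x, mu, sigma):
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

    def coding_cost(x, classes, eps=0.01):
        # Bits to state one attribute value x under a mixture;
        # 'classes' is a list of (weight, mu, sigma) triples.
        return -log2(sum(w * normal_pdf(x, mu, sigma) * eps
                         for (w, mu, sigma) in classes))

    # Three classes, and a thing that is not mid-way between any pair:
    print(round(coding_cost(0.3, [(0.5, -1.0, 1.0), (0.3, 1.0, 1.0), (0.2, 4.0, 2.0)]), 2))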

Benefits of Fractional Assignment

Using fractional assignment of things to classes, and given enough data, it is possible to distinguish, i.e. infer, classes that are arbitrarily close together, and even classes that have the same mean but different variances. Inferred class distribution parameters are also unbiased.
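
The sketch below is a plain maximum-likelihood EM fit, using fractional assignment (responsibilities), for a two-class, one-dimensional normal mixture; it is not Snob's full MML calculation, but it illustrates how fractional assignment lets heavily overlapping classes, here with means only two standard deviations apart, be separated given enough data.

    import random
    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

    def em_two_normals(xs, iters=200):
        # Fit a two-class, 1-D normal mixture by EM with fractional assignment.
        mu = [min(xs), max(xs)]            # crude initial guesses
        sigma = [1.0, 1.0]
        w = [0.5, 0.5]                     # class relative abundances
        for _ in range(iters):
            # E-step: fractional membership of each thing in each class.
            r = []
            for x in xs:
                joint = [w[c] * normal_pdf(x, mu[c], sigma[c]) for c in (0, 1)]
                total = sum(joint)
                r.append([j / total for j in joint])
            # M-step: re-estimate parameters from the fractional memberships.
            for c in (0, 1):
                n_c = sum(ri[c] for ri in r)
                w[c] = n_c / len(xs)
                mu[c] = sum(ri[c] * x for ri, x in zip(r, xs)) / n_c
                var = sum(ri[c] * (x - mu[c]) ** 2 for ri, x in zip(r, xs)) / n_c
                sigma[c] = sqrt(max(var, 1e-9))
        return w, mu, sigma

    random.seed(1)
    data = ([random.gauss(-1.0, 1.0) for _ in range(2000)]
            + [random.gauss(1.0, 1.0) for _ in range(2000)])
    # Estimates should move towards weights ~[0.5, 0.5], means ~[-1, 1], s.d.s ~[1, 1].
    print(em_two_normals(data))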

Notes

  • C. S. Wallace and D. M. Boulton (1968). An information measure for classification. Computer Journal, 11(2), August 1968, pp.185-194.
    Introduced the minimum message length (MML) theory of unsupervised classification (mixture modelling, clustering) and a resulting computer program SNOB, perhaps the first concrete application of Shannon's theory of communication to inductive inference. The first version (only) of Snob used definite assignment.
  • C. S. Wallace and P. R. Freeman. Estimation and Inference by Compact Encoding. J. R. Stat. Soc. B, 49, pp.240-265, 1987.
  • T. Edgoose & L. Allison. Minimum Message Length Hidden Markov Modelling. IEEE Data Compression Conf., pp.169-178, 1998.
    An extension to sequences, or time series, of multi-variate data.
  • L. Allison. Models for machine learning and data mining in functional programming, J. Functional Programming, 15(1), pp.15-32, Jan. 2005.
    Includes a generalised expectation-maximization (EM) algorithm for mixture-modelling over arbitrary probability distributions.
