Case Study: Mixture of Markov Models

18 December 2003

LA home
Computing
 Algorithms
 Bioinformatics
 FP,  λ
 Logic,  π
 MML
 Prog.Langs

FP
 II
  Ver'1.1
   mix. MM

A colleague mentioned that he was interested in mixture models over Markov models: Given a set of sequences, S=[s0, s1,...], over a finite alphabet, each sequence is believed to come from a Markov model but it is not known how many different components (Markov models) are involved, nor what their parameters are.

The [200309/] Haskell code for inductive inference includes a basic estimator for mixture models (unsupervised classification, intrinsic classification, clustering), so it seemed that it should be easy to create a prototype estimator for mixtures of Markov models.

The first requirement is for a weighted[*] estimator for a Markov model. A Markov model of order k is a time-series model; it predicts the value of the next element in a sequence given the k previous values. It can be based on a finite function-model from the k previous values (the context) to the next value. For simplicity, only Markov models of order k=1 are considered in the prototype.

A weighted estimator, estFiniteFunctionWeighted, for a finite function-model already exists. It can be used by feeding it the inputs [s0,0, s0,1, ..., s0,|s0|-2, s1,0, ...] with the corresponding outputs [s0,1, s0,2, ..., s0,|s0|-1, s1,1, ...] and the corresponding weights [w0, w0, ..., w0, w1, ...] where wi is the weight of sequence si. This is the kind of "massaging" of the training data that functional programming is very good at to get the inputs, ips, outputs, ops, and weights, ws.

The finite function-model, ff, leads to a predictor function for a finite list function-model, flf. flf is converted into a time-series model by functionModel2timeSeries and thence into a model of sequences by timeSeries2model:


module Main where
import StatisticalModels
-- Estimate a mixture of Markov models over a set of sequences. 

est_MM1_wtd trainingSeqs weights =                             -- our estimator
 let scanOneSeq xs w rest =
       let scan (i:is) (o:os) = (i,o,w) : (scan is os)
           scan   _      _    = rest
       in scan xs (tail xs)               -- MM of order 1 has context length 1
       -- match [x0, x1, ...] with [x1, x2, ...]

     scanManySeqs (s:ss) (w:ws) = scanOneSeq s w (scanManySeqs ss ws)
     scanManySeqs  _     _      = []   -- done

     (ips,ops,ws) = unzip3(scanManySeqs trainingSeqs weights)   -- massage data

     ff =  estFiniteFunctionWeighted ips ops ws        -- finite function-model

     predictor (x:_) = condModel ff x         -- prediction for context [x,...]
     predictor  []   = uniform (elt minBound) (elt maxBound)      -- [] context
     elt x = x `asTypeOf` (trainingSeqs!!0!!0)

     flf = FM (msg1 ff) predictor (\() ->show ff)  --finite list function-model

 in (timeSeries2model.functionModel2timeSeries) flf           -- model of seq's

The estimator can be used with estMixture to investigate mixtures of two (or more) components, e.g.


mix2 = estMixture [est_MM1_wtd, est_MM1_wtd] trainSet

It can be seen that there would be no great difficulty in considering Markov models of order k=2 or more. In fact unweighted estimators for finite list function-models and for Markov models of arbitrary order k already exist in [200309/] and it would not be hard to generalize them to weighted data.

Note that in the above statistical model, each sequence is considered to have come from a single Markov model. This is quite different from a situation where an individual sequence may be heterogeneous, i.e. coming from two of more models, in parts.

The exercise suggests that it might be more convenient if it were standard for an estimator for a time-series model to work from a set (list) of data-series, rather than from a single data-series.


Also see
[TR 2004 153], and
[more refs].
 
[*] Fractional assignment of data to components is necessary for unbiased estimation of mixture models -- hence "weighted" data and estimators.
window on the wide world:

Computer Science Education Week

Linux
 Ubuntu
free op. sys.
OpenOffice
free office suite,
ver 3.4+

The GIMP
~ free photoshop
Firefox
web browser
FlashBlock
like it says!

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Sunday, 26-Oct-2014 03:47:11 EST.