Problem:
Given a data-set from a data-space,
E.g. from
data CTreeType inSpace opSpace = CTleaf (ModelType opSpace) | CTfork MessageLength (Splitter inSpace) [CTreeType inSpace opSpace] | ... -- can also consider other non-traditional options
A `Splitter
' is used to partition the input space.
instance SuperModel (CTreeType inSpace opSpace) where -- NB. For simplicity only, this costs the -- structure at 1-bit per node. This is -- only optimal for binary trees. msg1 (CTleaf leafModel) = log2 + msg1 leafModel msg1 (CTfork fnLen f dts) = log2 + fnLen + (foldl (+) 0 (map msg1 dts))
Message length of a tree is that of the node plus those of the sub-trees, if any.
instance FunctionModel CTreeType where condModel (CTleaf leafModel) i = leafModel condModel (CTfork fnLen f dts) i = condModel (dts !! (applySplitter f i)) i
It is convenient to have an inner,
local ``search
'' function:
estCTree estLeafMdl splits ipSet opSet = let search ipSet opSet = let . . .
Simplest possible tree is a single
leaf
node:
leaf = CTleaf leafMdl leafMdl = estLeafMdl opSet leafMsg = msg (functionModel2model leaf) (zip ipSet opSet)
simple recursive search for the best (1- or) 2-level tree;
base case:
alternatives [] bestML bestCTree bestIpParts bestOpParts = (bestCTree, bestIpParts, bestOpParts) -- done
and...
...general case:
alternatives (sp:sps) bestML bestCTree bestIpParts bestOpParts = let -- NB. the `1' below is acceptable but not optimal theTree = CTfork (log (fromIntegral (length splts))) sp leaves leaves = map CTleaf leafMdls -- one leaf ... leafMdls = map estLeafMdl opParts -- ... per part partNums = map (applySplitter sp) ipSet ipParts = partition (aritySplitter sp) partNums ipSet opParts = partition (aritySplitter sp) partNums opSet msgLen = msg (functionModel2model theTree) (zip ipSet opSet) in if msgLen < bestML then alternatives sps msgLen theTree ipParts opParts else alternatives sps bestML bestCTree bestIpParts bestOpParts
splts = splits ipSet -- partitions of input space in case alternatives splts leafMsg leaf [ipSet] [opSet] of -- subtrees? search continues... ((CTfork msgLen pf leaves), ipParts, opParts) -> CTfork msgLen pf (zipWith search ipParts opParts); -- the single leaf wins? Search is over! (t, _, _) -> t in search ipSet opSet
estFunctionModel2estModel estFn ipOpPairs = functionModel2model (uncurry estFn (unzip ipOpPairs)) ft = estCTree (estFunctionModel2estModel estFnMdl) splits trainingIp trainingOp