^CSE454^
[01]
>>
Classification- (Decision) TreeSee
L. Allison,
Models for Machine Learning and Data Mining in Functional Programming,
CSE454
2005
:
This document is online at
http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/
and contains hyper-links to other resources
|
^CSE454^
<<
[02]
>>
Partitioning on input attributes
|
^CSE454^
<<
[03]
>>
data CTreeType inSpace opSpace =
CTleaf (ModelType opSpace) |
CTfork MessageLength
(Splitter inSpace)
[CTreeType inSpace opSpace]
| ...
-- can also consider other non-traditional options
A `Splitter' can be used to partition the input space. |
^CSE454^
<<
[04]
>>
instance SuperModel (CTreeType inSpace opSpace) where -- NB. For simplicity only, this costs the -- structure at 1-bit per node. This is -- only optimal for binary trees. msg1 (CTleaf leafModel) = log2 + msg1 leafModel msg1 (CTfork fnLen f dts) = log2 + fnLen + (foldl (+) 0 (map msg1 dts)) Message length of a tree is that of the node plus those of the sub-trees, if any. |
| ^CSE454^
<<
[05]
>>
Make a Classification Tree an instance of a FunctionModel:
instance FunctionModel CTreeType where
condModel (CTleaf leafModel) i = leafModel
condModel (CTfork fnLen f dts) i
= condModel (dts !! (applySplitter f i)) i
|
^CSE454^
<<
[06]
>>
estCTree : Estimate a Classification TreeIt is convenient to have an inner, local ``search'' function:
estCTree estLeafMdl splits ipSet opSet =
let -- NB. estLeafMdl must be able to
-- handle an empty data set
search ipSet opSet =
let
. . .
|
| ^CSE454^
<<
[07]
>>
Simplest possible tree is a single leaf node:
leaf = CTleaf leafMdl
leafMdl = estLeafMdl opSet
leafMsg = msg (functionModel2model leaf)
(zip ipSet opSet)
|
| ^CSE454^
<<
[08]
>>
simple recursive search for the best (1- or) 2-level tree; base case: alternatives [] bestML bestCTree bestIpParts bestOpParts = (bestCTree, bestIpParts, bestOpParts) -- done and... |
| ^CSE454^
<<
[09]
>>
...general case:
alternatives (sp:sps)
bestML bestCTree bestIpParts bestOpParts =
let
-- NB. the `1' below is acceptable but not optimal
theTree = CTfork (log (fromIntegral (length splts)))
sp leaves
leaves = map CTleaf leafMdls -- one leaf ...
leafMdls = map estLeafMdl opParts -- ... per part
partNums = map (applySplitter sp) ipSet
ipParts = partition (aritySplitter sp) partNums ipSet
opParts = partition (aritySplitter sp) partNums opSet
msgLen = msg (functionModel2model theTree)
(zip ipSet opSet)
in
if msgLen < bestML
then alternatives sps msgLen theTree ipParts opParts
else alternatives sps bestML bestCTree bestIpParts bestOpParts
|
^CSE454^
<<
[10]
>>
splts = splits ipSet
in
case alternatives splts leafMsg leaf [ipSet] [opSet] of
-- subtrees?
((CTfork msgLen pf leaves), ipParts, opParts) ->
CTfork msgLen pf (zipWith search ipParts opParts);
-- the single leaf wins?
(t, _, _) -> t
in search ipSet opSet
-- ----------9/2002--6/2003--L.Allison--CSSE--Monash--.au--
|
^CSE454^
<<
[11]
>>
Summary
... FunctionModel- (Regression-) TreesUse a conversion function estFunctionModel2estModel:
estFunctionModel2estModel estFn ipOpPairs =
functionModel2model (uncurry estFn (unzip ipOpPairs))
ft = estCTree (estFunctionModel2estModel estFnMdl)
splits trainIp trainOp
L. Allison.
Models for Machine Learning and Data Mining in Functional Programming,
© 2005 L. Allison, School of Computer Science and Software Engineering, |