^II/Working/^ [01] >>

Classification- (Decision-) Trees

See L. Allison, (i) TR 2003/148, and (ii) Types and Classes of Machine Learning and Data Mining, Twenty-Sixth Australasian Computer Science Conference (ACSC2003) pp207-215, Adelaide, Australia, 4-7 February 2003. Conferences in Research and Practice in Information Technology, Vol.16. Michael Oudshoorn, Ed.

^II/Working/^ << [02] >>

Partitioning on input attributes

Discrete attribute:
k-way split where k is the `arity' of the attribute
 
Continuous attribute:
Try median, left-quartile, right-quartile, octiles, ...
NB. depends on current (part of) data set
 
Multivariate:
Merge the splitting options for all attributes.
^II/Working/^ << [03] >>
data CTreeType inSpace opSpace =

  CTleaf (ModelType opSpace)   |

  CTfork MessageLength
         (Splitter inSpace)
         [CTreeType inSpace opSpace]

  | ...

  -- can also consider other non-traditional options

A `Splitter' can be used to partition the input space.

^II/Working/^ << [04] >>
instance SuperModel (CTreeType inSpace opSpace) where

 -- NB. For simplicity only, this costs the
 -- structure at 1-bit per node.  This is
 -- valid but only optimal for binary trees.

 msg1 (CTleaf leafModel)
   = log2 + msg1 leafModel

 msg1 (CTfork fnLen f dts)
   = log2 + fnLen + (foldl (+) 0 (map msg1 dts))

Message length of a tree is that of the node plus those of the sub-trees, if any.

^II/Working/^ << [05] >>

Make a Classification Tree an instance of a FunctionModel:

instance FunctionModel CTreeType where

  condModel (CTleaf leafModel) i = leafModel

  condModel (CTfork fnLen f dts) i
    = condModel (dts !! (applySplitter f i)) i
^II/Working/^ << [06] >>

estCTree : Estimate a Classification Tree

It is convenient to have an inner, local ``search'' function:

estCTree  estLeafMdl splits  ipSet opSet =
 let  -- NB. estLeafMdl must be able to handle an empty data set

  search ipSet opSet =
   let
       . . .
^II/Working/^ << [07] >>

Simplest possible tree is a single leaf node:

leaf    = CTleaf leafMdl

leafMdl = estLeafMdl opSet

leafMsg = msg (functionModel2model leaf)
              (zip ipSet opSet)
^II/Working/^ << [08] >>

recursive search for the best (1- or) 2-level tree

alternatives  []
 bestML bestCTree bestIpParts bestOpParts
 = (bestCTree, bestIpParts, bestOpParts)  -- done

alternatives ((arity,pFn):pFns)
 bestML bestCTree bestIpParts bestOpParts =
 let
  -- NB. the `1' below is acceptable but not optimal
  theTree  = CTfork (log (fromIntegral (length splts)))
                    sp leaves
  leaves   = map CTleaf leafMdls     -- one leaf ...
  leafMdls = map estLeafMdl opParts  -- ... per part
  partNums = map (applySplitter sp) ipSet
  ipParts  = partition (aritySplitter sp) partNums ipSet
  opParts  = partition (aritySplitter sp) partNums opSet
  msgLen   = msg (functionModel2model theTree)
                 (zip ipSet opSet)
 in
  if msgLen < bestML
  then alternatives sps msgLen theTree ipParts opParts
  else alternatives sps bestML bestCTree bestIpParts bestOpParts
^II/Working/^ << [09] >>
  splts = splits ipSet

 in
  case  alternatives splts leafMsg leaf [ipSet] [opSet]  of

   -- subtrees?
   ((CTfork msgLen pf leaves), ipParts, opParts) ->
     CTfork msgLen pf (zipWith search ipParts opParts);

   -- the single leaf wins?
   (t, _, _) -> t


in search ipSet opSet

-- ----------9/2002--6/2003--L.Allison--CSSE--Monash--.au--
^II/Working/^ << [10] >>

Summary

Can estimate (fit, infer, learn) a classification-tree given (i) ways of partitioning the data on input attributes and (ii) an estimator for leaf models.
 
Any kind of leaf model can be used -- discrete, continuous, or multivariate;   even...

... FunctionModel- (Regression-) Trees

Use a conversion function estFunctionModel2estModel:

estFunctionModel2estModel estFn ipOpPairs =
  functionModel2model (uncurry estFn (unzip ipOpPairs))

ft = estCTree (estFunctionModel2estModel estFnMdl)
        splits trainIp trainOp

L. Allison. Types and Classes of Machine Learning and Data Mining, 26th Australasian Computer Science Conference (ACSC2003) pp207-215, Adelaide, Australia, 4-7 February 2003. Conferences in Research and Practice in Information Technology, Vol.16. Michael Oudshoorn, Ed.


© 2004 L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168.
Created with "vi (IRIX)",   charset=iso-8859-1