
(Decision) Classification Trees I

Problem:

Given multivariate data, each item having n attributes @1, @2, ..., @n, infer a tree that models the dependent attribute, @n say, given the input attributes @1 ... @n-1.

Model

A tree in which each fork (internal node) tests an input attribute and each leaf holds a probability distribution over the predicted attribute.

Applications

Expert systems: examples are given, e.g. case histories or expert judgements; the learned tree mimics the expert's behaviour.

Consider binary attributes first, n-ary and continuous attributes later.

This document is online at   http://www.csse.monash.edu.au/~lloyd/Archive/2005-08-Tree01/index.shtml   and contains hyper-links to other resources.



e.g. My car won't start.

Engine turns over but will not run. Is it a problem with the ignition system (spark plugs, leads, coil, distributor or equivalent)?

We have a training set of cases, i.e. examples:

 case  @1 engine  @2 clicking  @3 petrol  @4 ignition  note
       fires      noise        smell      problem
   1:  y          n            n          n            fuel
   2:  y          n            n          y            plugs
   3:  y          y            n          y            leads
   4:  n          y            n          y            leads
   5:  y          n            y          n            flooded
   6:  n          n            y          n            leak
   7:  n          n            y          y            plugs
   8:  y          y            n          y
  etc. ...        ...          ...        ...
Prediction (of @4) must be probabilistic; see cases 6 and 7, which have identical inputs but different values of @4.
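
As a concrete illustration (not part of the original notes), the training set might be held as a plain Python list, mapping y/n to True/False; the name cases below is only for this sketch.

    # Training cases: (@1 engine fires, @2 clicking noise, @3 petrol smell,
    #                  @4 ignition problem); @4 is the attribute to predict.
    cases = [
        (True,  False, False, False),  # 1: fuel
        (True,  False, False, True),   # 2: plugs
        (True,  True,  False, True),   # 3: leads
        (False, True,  False, True),   # 4: leads
        (True,  False, True,  False),  # 5: flooded
        (False, False, True,  False),  # 6: leak
        (False, False, True,  True),   # 7: plugs
        (True,  True,  False, True),   # 8
    ]

    # Cases 6 and 7 have identical inputs but different @4,
    # so prediction of @4 can only be probabilistic.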


Simplest Tree:

[figure: the simplest decision tree]
a single leaf node.


A Complex Tree:

[figure: a complex decision tree]
How can we compare a simple tree v. a complex tree fairly?


Model (i.e. Tree) Selection

Bayes:
P(Data&Tree)      = P(Tree)      × P(Data|Tree)

MsgLen(Data&Tree) = MsgLen(Tree) + MsgLen(Data|Tree)

A simple tree has a high prior probability, i.e. a short message length.

A complex tree has a low prior probability, i.e. a long message length.

These quantities are commensurable with MsgLen(Data|Tree), i.e. the fit of the tree to the data, giving a trade-off between model complexity and fit.
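
The equivalence of the two statements above can be checked numerically; the probabilities in this sketch are illustrative only, using MsgLen(x) = -log2 P(x) with lengths measured in bits.

    import math

    def msg_len_bits(p):
        # An event of probability p costs -log2(p) bits.
        return -math.log2(p)

    # P(Data & Tree)      = P(Tree) * P(Data | Tree)
    # MsgLen(Data & Tree) = MsgLen(Tree) + MsgLen(Data | Tree)
    p_tree, p_data_given_tree = 0.01, 0.25   # illustrative values only
    lhs = msg_len_bits(p_tree * p_data_given_tree)
    rhs = msg_len_bits(p_tree) + msg_len_bits(p_data_given_tree)
    assert abs(lhs - rhs) < 1e-9             # the log turns the product into a sum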


Complexity, i.e. Message Length, of a Tree

Deal with this in 3 parts

  1. topology,

  2. for each fork: split attribute,

  3. for each leaf: distribution parameters.


1. Tree Topology:

[figure: the example decision tree's topology]
Traverse Tree in prefix-order, write `F' for a fork (split) and `L' for a leaf:
F F L F L L F L L

NB. Every binary tree has exactly one more leaf node than it has forks: starting from a single leaf, each split replaces a leaf by a fork with two leaves, adding one of each.

Set P(F) = P(L) = 0.5

e.g. The given tree topology, with 4 forks and 5 leaves, costs 9 bits.

Can detect the end of the code when #L = #F+1
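
A small sketch of the topology code, assuming (purely for illustration) that a tree is represented as the string 'leaf' or as a pair (left, right) of subtrees:

    def topology_code(tree):
        # Prefix-order traversal: 'F' for a fork, 'L' for a leaf.
        if tree == 'leaf':
            return 'L'
        left, right = tree
        return 'F' + topology_code(left) + topology_code(right)

    # The example tree: F F L F L L F L L
    example = (('leaf', ('leaf', 'leaf')), ('leaf', 'leaf'))
    code = topology_code(example)      # 'FFLFLLFLL'
    assert len(code) == 9              # P(F) = P(L) = 0.5, so 1 bit per symbol: 9 bits

    def topology_code_length(symbols):
        # A decoder can find the end of the code: stop when #L = #F + 1.
        forks = leaves = 0
        for i, s in enumerate(symbols):
            forks += (s == 'F')
            leaves += (s == 'L')
            if leaves == forks + 1:
                return i + 1
        raise ValueError('incomplete topology code')

    assert topology_code_length(code + 'anything after') == 9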


2. Split Attributes:

It takes   -log2(1/#a) = log2(#a) bits   to specify one of #a attributes,
assuming all are equally likely,

e.g. log2(3) bits for the root node of the given tree (3 input attributes).

A split attribute in a fork-node cannot be re-used in a subtree.

e.g. log2(2) = 1 bit for the next split attribute
and log2(1) = 0 bits for a 3rd-level split.

e.g. Total = log2(3) + 2×log2(2) + log2(1) = log2(3) + 2 bits
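
A sketch of the split-attribute cost for the same example, reusing the illustrative tree representation from above and assuming each fork chooses uniformly among the attributes still available on its path:

    import math

    def split_attr_bits(tree, n_available):
        # log2(#available) bits per fork; a split attribute is not
        # re-used in either subtree, so one fewer is available below.
        if tree == 'leaf':
            return 0.0
        left, right = tree
        return (math.log2(n_available)
                + split_attr_bits(left, n_available - 1)
                + split_attr_bits(right, n_available - 1))

    example = (('leaf', ('leaf', 'leaf')), ('leaf', 'leaf'))
    bits = split_attr_bits(example, 3)               # 3 input attributes, @1..@3
    assert abs(bits - (math.log2(3) + 2.0)) < 1e-9   # log2(3) + 2*log2(2) + log2(1)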


3. Leaf Distributions:

A binomial distribution is used for a binary attribute, @predicted.

State one parameter, i.e. P(@predicted = T), to an optimal, finite precision.

3'. An alternative view

Do not code an explicit distribution in a leaf; instead use an adaptive code for the items arriving at the leaf.

This is simpler, and the total message length is a fraction of a bit less per leaf. (The two approaches are essentially equivalent.)
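
A sketch of the adaptive alternative: code each item arriving at a leaf with probabilities based on the counts seen so far, starting both counts at 1/2 (consistent with the MML estimator given below), so no explicit parameter is transmitted. The outcome sequence here is illustrative only.

    import math

    def adaptive_code_len(outcomes):
        # Bits to transmit a sequence of binary outcomes adaptively.
        n_true = n_false = 0.5          # initial pseudo-counts
        bits = 0.0
        for x in outcomes:
            p = (n_true if x else n_false) / (n_true + n_false)
            bits += -math.log2(p)
            if x:
                n_true += 1.0
            else:
                n_false += 1.0
        return bits

    # e.g. five cases reaching a leaf, four of them with @predicted = T:
    print(adaptive_code_len([True, True, False, True, True]))   # about 5.19 bits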


Data Given Tree (Model)

NB. This applies only to the predicted attribute(s); the input attributes are given, i.e. common knowledge.

  (# @predicted = T) × -log2 P'(T)

+ (# @predicted = F) × -log2(1-P'(T))

The MML estimator for a 2-state distribution is:

P'(T) = (#Tobs + 1/2) / (#Tobs + #Fobs + 1)

P'(F) = (#Fobs + 1/2) / (#Tobs + #Fobs + 1)
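
Putting the estimator and the data cost together for a single leaf, as a sketch (the counts 4 and 1 are illustrative):

    import math

    def mml_estimate_p_true(n_true, n_false):
        # P'(T) = (#Tobs + 1/2) / (#Tobs + #Fobs + 1)
        return (n_true + 0.5) / (n_true + n_false + 1.0)

    def leaf_data_bits(n_true, n_false):
        # (# @predicted = T) * -log2 P'(T)  +  (# @predicted = F) * -log2(1 - P'(T))
        p = mml_estimate_p_true(n_true, n_false)
        bits = 0.0
        if n_true:
            bits += n_true * -math.log2(p)
        if n_false:
            bits += n_false * -math.log2(1.0 - p)
        return bits

    print(mml_estimate_p_true(4, 1))   # 0.75
    print(leaf_data_bits(4, 1))        # about 3.66 bits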


Conclusions

The method generalizes to multi-state (>2) and continuous attributes, etc.
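
For instance, one standard multi-state generalization of the 2-state estimator above (a hedged sketch, not taken from these notes) adds 1/2 to each of the k state counts:

    def mml_multistate_estimate(counts):
        # P'(state i) = (n_i + 1/2) / (N + k/2) for a k-state attribute;
        # with k = 2 this reduces to the 2-state estimator above.
        k = len(counts)
        total = sum(counts)
        return [(n + 0.5) / (total + 0.5 * k) for n in counts]

    print(mml_multistate_estimate([4, 1]))     # [0.75, 0.25], as before
    print(mml_multistate_estimate([3, 1, 0]))  # a 3-state example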

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.