David's Webpage : Honours


Home
Honours Project

Algorithms

Four algorithms were examined and experimented upon during the course of this project, each was implemented by myself.

Bayesian

The first algorithm a "Bayesian" algorithm determines the most probable value given a single field's value to infer from. When presented with a database entry that has fields with missing values, it checks the entry's value for its given field and then attempts to determine the most likely replacement. It does this by checking other entries in the database where the given field's value match that for the entry undergoing replacement and observes their values for the field which is missing. It then selects the most popular value for the missing field from its slightly narrowed section of the database.

Decision Trees

The main focus of this experiment is to evaluate some of the decision tree algorithms that were largely overlooked in Quinlan's paper concerning his experiments. Each of these decision tree algorithms use the same method for constructing a decision tree, but they differ greatly when presented with a situation where the entry undergoing replacement has multiple fields that have missing values. These differences come from the situation that arises when the algorithm tries to follow a branch down the decision tree it has created only to find it doesn't know which path to take. Normally the algorithm builds its tree according to a series of measures developed by Quinlan. First, the algorithm will determine the topmost node in its tree (and the field it represents), using an "Information Gain" measure. This measure tries to select a node such that knowing the decision for this node eliminates as much of the database as possible. The tree is built by choosing node after node (or field after field) in turn and added them to the tree, until all fields except that undergoing replacement have been accounted for.

Once this tree is built, there is a path for each possible set of values an entry could have. For each path and its set of values, a probability distribution is formed from all the complete entries in the database, showing how probable a particular replacement value is for an entry that follows that path. This distribution is then used to guess the most likely replacement value. Because each tree is specific to a particular field within the database, a number of decision trees are needed to cover all the fields in the database. Three different variations of decision tree replacement algorithms are looked at in this experiment, they have been unimaginatively dubbed Decision Tree Algorithms A, B and C.

Decision Tree A

Decision Tree Algorithm A, when presented with a situation involving more than one missing value in a single entry will focus on replacing the value for one field at a time. It takes the decision tree produced for that particular field and when it comes to a branching for which the value is missing will decide depending upon the most popular value across the whole database for that field.

Decision Tree B

Decision Tree Algorithm B, takes a completely different approach to that used by A. It will, like A, focus on a single field with a missing value at a time, but before referring to the decision tree for that field it will try to guess the values for the other missing fields. To do this it uses smaller decision trees. When confronted with an entry that has multiple missing values, it focuses on one at a time and fills in the others with temporary replacements. To produce a replacement, it takes the fields that aren't missing and builds a decision tree using only those fields. This smaller decision tree is then used to determine a likely plausible temporary replacement. It then uses these temporary values like Algorithm A uses the most popular values, with the temporary replacements it has enough information to choose a path on one of the regular decision trees. This is the slowest of the algorithms as it requires many more decision trees than other two, each of which must be consulted and perhaps constructed in turn.

Decision Tree C

Decision Tree Algorithm C, takes perhaps the simplest approach, which allows "missing" to be a possible choice on the decision tree. This differs from the other algorithms in creation of its decision tree in that it will always have one more choice per node, but uses all the entries in the database when constructing the tree, rather than just the intact ones. This method produces wider, more complicated decision trees, and turns out to be slightly slower than Decision Tree Algorithm A. However the fact that it does not discard anywhere near as much possibly usefully information when constructing its trees should hold it in good stead when making its predictions

Back to top