David's Webpage : Honours


Home
Honours Project

Excel spreadsheet results

Network A
Network B
Network C

Results By Algorithm

Bayesian Algorithm

Although it was expected that the Bayesian algorithm would perform badly, however the extent to which it did was surprising. This could perhaps be overlooked in Networks B and C, but it was worrisome for the tests conducted on Network A. The failure of this algorithm to produce replacement values that were any better than a random guess suggested that there was something wrong with the algorithm. Fearing that this was the case the algorithm was audited thoroughly and stepped through by hand, line by line. Regrettably this shed no light upon what the problem was, nor on why the algorithm was performing so poorly in a situation that should have produced its best results. Even though no particular fault was found with the implementation of the algorithm, the results it produced were not consistent with the spirit of the algorithm as described earlier. Because of this the results garnered may be considered suspect, but even if the algorithm produced random guesses they can still be used to contrast the efforts of the other algorithms tested.

Decision Tree Algorithm A

This algorithm performed rather well across all the databases staying close to the strength of association, rarely straying more than 5% lower. It was also among the quickest of the algorithms to run on a database, but the results shown here do not seem to bear out the faith Quinlan had in this algorithm when compared to algorithms B and C. It also seems somewhat puzzling that in general as the strength of association increased the greater the difference between the strength of association and the percentage of correct replacements became. Conversely to this, as the strength of association increased the percentage of correct replacements for the other two decision tree algorithms grew closer to the strength of association. Given these two facts one draws the conclusion that it is the aspects by which decision tree algorithm A differs from B and C that is holding it back. Since the same method for constructing decision trees is used throughout it must be the choice of temporary replacement by popularity that is adversely affecting performance.

Decision Tree Algorithm B

This algorithm performed best overall, although it was only best by a small margin over algorithm C. Over the entire set of tests the percentage of correct replacements by algorithm B rarely varied from the strength of association by more than one and a half percent. This shows that even when the pattern by which entry values were chosen was obfuscated, say when the strength of association was only 60%, the algorithm was still able to discern the pattern and choose replacements accordingly. Such strict adherence to the strength of association shows that this method may indeed have a lot of potential and perhaps warrants further and more rigorous testing. Unfortunately this method did have its downside, that being that it required considerably more time and effort to run this algorithm than any other algorithm tested. More decision trees needed to be constructed and stored, and they were referred to more frequently.

Decision Tree Algorithm C

This algorithm performed very well indeed considering the simplicity of the concept. Although the decision trees for this algorithm were larger due to the greater number of possible values for each field (one more value, that of "missing") they were no more difficult to work with. Each missing value required reference to but a single decision tree and no temporary values needed fabrication, so this algorithm proved both quick and accurate in its operation. This accuracy can be attested to by the fact that it was usually less than a percent behind the lead algorithm's percentage of correct replacements. In the end this algorithm was only slightly outperformed by algorithm B while producing its results with many fewer decision trees and a considerably less complicated algorithm. It is for this reason that any further testing undertaken on algorithm B would be advised to include algorithm C to help ascertain under what conditions each performs best, and whether the extra complexity of B is worth the time and trouble when algorithm C is available.

Conclusion

Of those algorithms examined in this series of tests, the most accurate is decision tree algorithm B. However the best algorithm, if resources expended is taken into consideration, could be either decision tree algorithm B or C depending on whether the emphasis is on accuracy or speed. Decision tree algorithm A performed adequately, however it can be confidently said that it was consistently outperformed by both algorithms B and C. Quinlan's faith in this particular method seems to be misplaced when other more accurate methods are available. The Bayesian algorithm performed worst, as was expected, except the magnitude of its failure raised certain doubts about the correctness of its implementation.

Back to top