Missing Data
David McKenzie, 2003


Honours Project

Abstract

Means for handling missing data is an investigation into new and established methods for dealing with situations where missing data cannot easily be re-collected. The methods investigated focus on, but are not necessarily limited to, algorithms that attempt to replace missing values with likely values. Of particular interest are situations where the data still recorded by the system is not indicative of the data that has gone missing. Such situations are common in databases such as political surveys, on which many possibly invalid predictions are based.

Introduction

Many databases these days, especially those holding data about or gathered by people, suffer from missing data. Missing data can seriously hamper the effectiveness of a database. Data can go missing for a number of reasons, such as corruption of a hard disk drive, a computer virus, or simple omission by individuals completing a survey. Whatever the reason, the integrity of the database is compromised, and re-gathering the data often proves impractical or impossible. By examining various techniques for coping with missing or hidden data, this project aims to identify which methods are the most effective.

In this study a series of decision-tree-based algorithms is examined to determine which consistently produce superior results. All the algorithms were conceived by Quinlan and discussed in his publications; however, a number of the ideas were only touched upon briefly and were perhaps not given the scrutiny they deserved. They have therefore been implemented for this study so that their methods and results may be examined more thoroughly.
