Missing Data
David McKenzie, 2003
Abstract
This project is an investigation into new and established methods for dealing
with situations where missing data cannot easily be re-collected.
The methods investigated will focus on, but not necessarily be limited to, those
algorithms that attempt to replace instances of missing data with likely values.
Of particular interest will be situations where the data still recorded by the
system is not indicative of the data that has gone missing. Such situations
are commonly found in databases such as political surveys, upon which many
possibly invalid predictions are made.
Introduction
Many databases, especially those handling data about or gathered by people,
suffer from missing data. Missing data can seriously hamper the effectiveness
of a database and is therefore a serious problem. The data in question can be
missing for a number of different reasons, such as corruption of a hard disk
drive, a computer virus, or simple omission on the part of individuals
completing a survey. Whatever the reason, the integrity of the database is
compromised, and re-gathering the data often proves impractical or impossible.
By examining various techniques for coping with missing or hidden data, this
project aims to identify which methods are the most effective.
In this particular study, a series of decision-tree-based algorithms is examined
to determine which consistently produce superior results. All the algorithms
were conceived by Quinlan and discussed in his publications; however, a number
of the ideas were only touched upon briefly and were perhaps not given the scrutiny
they deserved. They have therefore been implemented for the purposes of this
test so that their methods and results may be examined more thoroughly.