For more details on the conclusions drawn from my work, see the thesis
and final presentation. Here is a summary of the most important
conclusions:
- Support Vector Machines are sensitive to augmented feature vectors.
The added features often lack the discriminative power needed to
improve results, so SVM performance often degrades when cluster
features are added.
- The quality of the added features can be improved by partitioning the
data, which in turn yields more "pure" clusters from which
these features are derived.
- Data sets that differ in size and in feature type may behave
differently and give different results, which suggests there is more
to this story. This is a topic for further research.
- Adding the features in a way that resembles the "cluster
structure" of text can significantly improve results, because the added
features then have the same distribution and spread as the original
features. The original data is therefore not "disturbed" the way
it is with the original way of adding features.
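
The last point can be illustrated with a small sketch. It is not the scheme used in the thesis: it assumes one-hot cluster-membership features and a simple rescaling that matches the pooled mean and standard deviation of the original feature values. The names `one_hot_cluster_features`, `augment`, and `match_scale` are hypothetical and exist only for this example.

```python
import statistics


def one_hot_cluster_features(cluster_id, n_clusters):
    """Encode a cluster assignment as a one-hot vector (an assumed encoding)."""
    return [1.0 if k == cluster_id else 0.0 for k in range(n_clusters)]


def augment(rows, cluster_ids, n_clusters, match_scale=True):
    """Append cluster features to every feature vector in `rows`.

    With match_scale=True the added columns are rescaled so that their
    pooled mean and standard deviation equal those of the original values.
    With match_scale=False the raw 0/1 indicators are appended unchanged.
    """
    added = [one_hot_cluster_features(c, n_clusters) for c in cluster_ids]
    if match_scale:
        orig_values = [v for row in rows for v in row]
        new_values = [v for row in added for v in row]
        mu_o = statistics.fmean(orig_values)
        sd_o = statistics.pstdev(orig_values)
        mu_n = statistics.fmean(new_values)
        sd_n = statistics.pstdev(new_values)
        if sd_n > 0:  # skip rescaling if all added values are identical
            added = [
                [(v - mu_n) / sd_n * sd_o + mu_o for v in row]
                for row in added
            ]
    return [list(row) + extra for row, extra in zip(rows, added)]


# Toy data: three documents with small, sparse feature values.
rows = [[0.01, 0.02, 0.0], [0.03, 0.0, 0.01], [0.0, 0.02, 0.03]]
cluster_ids = [0, 1, 0]  # cluster assignments are assumed to be given

naive = augment(rows, cluster_ids, n_clusters=2, match_scale=False)
matched = augment(rows, cluster_ids, n_clusters=2, match_scale=True)
```

In `naive`, the appended 0/1 values are far larger than the original features and can dominate a distance-based kernel. In `matched`, the appended values sit in the same range as the original ones, which is one plausible reading of not "disturbing" the data.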