Text Classification with Labelled and Unlabelled Data

Conclusion

For more details on the conclusions drawn from my work, see the thesis and final presentation. The most important conclusions are summarised here:
  • Support Vector Machines are sensitive to augmented feature vectors. The added cluster features often lack the discriminative power needed to improve results, so SVM performance often degrades when they are added
  • The quality of the added features can be improved by partitioning the data, which yields purer clusters from which these features are derived
  • Data sets that differ in size and in type of features may behave differently and give different results. This suggests that other factors are at play, which is a topic for further research
  • Adding the features in a way that resembles the "cluster structure" of the text can significantly improve results, because the added features then have the same distribution and spread as the original features. The original data is therefore not "disturbed", as it is with the original way of adding features.
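
To make the ideas of cluster purity and feature augmentation more concrete, here is a minimal sketch in Python. It is not the implementation from the thesis. The function names (`cluster_purity`, `augment`), the one-hot encoding of cluster membership, and the rule of rescaling the added features to the mean and spread of the original ones are all assumptions chosen for illustration. The cluster assignments are taken as given (from any clustering step), and the augmented vectors would then be passed to an SVM.

```python
from collections import Counter
from statistics import mean, pstdev


def cluster_purity(cluster_ids, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(labels)


def one_hot(cluster_id, n_clusters):
    """Encode cluster membership as a 0/1 vector (an illustrative choice)."""
    return [1.0 if k == cluster_id else 0.0 for k in range(n_clusters)]


def augment(vectors, cluster_ids, n_clusters, match_scale=False):
    """Append cluster-membership features to each document vector.

    With match_scale=True the added features are shifted and scaled so that
    their overall mean and standard deviation equal those of the original
    features. This is an assumed stand-in for adding features "with the same
    distribution and spread", not the method used in the thesis.
    """
    added = [one_hot(c, n_clusters) for c in cluster_ids]
    if match_scale:
        original_values = [v for vec in vectors for v in vec]
        added_values = [v for row in added for v in row]
        mu_o, sd_o = mean(original_values), pstdev(original_values)
        mu_a, sd_a = mean(added_values), pstdev(added_values)
        scale = sd_o / sd_a if sd_a > 0 else 0.0
        added = [[mu_o + (v - mu_a) * scale for v in row] for row in added]
    return [vec + extra for vec, extra in zip(vectors, added)]


# Hypothetical toy data: four documents with three term weights each.
docs = [
    [0.10, 0.00, 0.05],
    [0.12, 0.02, 0.00],
    [0.00, 0.11, 0.09],
    [0.01, 0.09, 0.10],
]
cluster_ids = [0, 0, 1, 1]
labels = ["sport", "sport", "politics", "sport"]

purity = cluster_purity(cluster_ids, labels)
naive = augment(docs, cluster_ids, n_clusters=2)
matched = augment(docs, cluster_ids, n_clusters=2, match_scale=True)
```

In `naive`, the appended 0/1 values are much larger than the small term weights and can dominate the original features. In `matched`, the appended values sit on the same scale as the term weights, which is one possible reading of leaving the original data undisturbed.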