Text Classification with Labelled and Unlabelled Data


Project Diary

  • November

The thesis was finalized and submitted on the 5th of November. The night before submitting it, I discovered an interesting property of the "cluster structure" of text (see Conclusion) and how adding features in a "clustered" fashion can significantly improve results. Unfortunately, the late timing meant I could obtain only partial results for this new way of adding features (not much you can do in one night), but it is still a good direction for future research. Well, it's over now, time to get some sleep...

  • October

Spent most of the month finishing my testing and summarizing results and, most importantly, writing up sections of my thesis. The results mostly indicate that adding extra features doesn't always help. The method depends heavily on the quality of the features, i.e. how well separated the clusters from which we derive these features are. It seems that partitioning the data and then clustering those partitions separately, instead of clustering the whole training set at once, improves results. This can be explained by the fact that the clusters created this way are more "pure" than those created when clustering the whole set.

Had my final presentation - not bad, but I have the impression my nerves got the better of me... again... grrrrrrr...

  • September

Met Bhavanni and Adam from Telstra Research labs, who originated the idea of augmenting feature vectors with cluster features and then classifying them with an SVM. This meeting helped clarify a few ideas and questions, but also made me realize that most of the testing I had done was not done the way they had done it. Unfortunately, their research paper leaves out one little fact about the choice of clusters from which we derive the additional features (I won't bore you with the details!). Of course, a big thanks to them for their time and help, it was really appreciated. This made me re-do all the testing, a process which takes really, really, really long... Also, spent some time investigating ways of getting Snob to handle different feature distributions and of working around the limited number of features it can deal with.
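To give a rough picture of what "augmenting feature vectors with cluster features" can look like, here is a minimal sketch. It is only an illustration: the one-indicator-per-cluster encoding, the `weight` parameter and the function name `augment_with_cluster_features` are my assumptions for the example, not the encoding used in the Telstra paper or in the thesis.

```python
from typing import List, Sequence


def augment_with_cluster_features(
    vectors: Sequence[Sequence[float]],
    cluster_ids: Sequence[int],
    num_clusters: int,
    weight: float = 1.0,
) -> List[List[float]]:
    """Append one indicator feature per cluster to each feature vector.

    Assumption: a document's cluster membership is encoded as a one-hot
    block of length num_clusters, scaled by `weight`.
    """
    if len(vectors) != len(cluster_ids):
        raise ValueError("need exactly one cluster id per vector")
    augmented = []
    for vector, cluster_id in zip(vectors, cluster_ids):
        if not 0 <= cluster_id < num_clusters:
            raise ValueError(f"cluster id {cluster_id} out of range")
        indicator = [0.0] * num_clusters
        indicator[cluster_id] = weight
        # Original word features first, cluster features appended at the end.
        augmented.append(list(vector) + indicator)
    return augmented


# Toy example: three documents, two word features, two clusters.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
clusters = [0, 0, 1]
augmented_docs = augment_with_cluster_features(docs, clusters, num_clusters=2)
# The augmented vectors would then be handed to an SVM trainer.
```

The cluster ids would come from whatever clustering step is used (Single-Pass or Snob in this project), and which clusters are allowed to contribute features is exactly the detail that changed my testing.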

 

  • August

Finished and submitted my Literature Review. The implementation of the Single-Pass algorithm is finally done! Spent the rest of the month investigating Snob, its limitations and how to handle them, as well as testing the Single-Pass method combined with SVMs for various splits of the training and test data I had.

 

  • July

Spent most of the month working on my Literature Review - this means a loooot of reading about different methods, evaluating them and writing these thoughts down in my research log book, and finally in my Literature Review. Also, continued my implementation of the Single-Pass algorithm. Had a few problems with segmentation faults initially, but those were resolved. The bigger problem was the long execution time of the program, since it was taking forever to cluster one set of data. Unfortunately, I refused to believe this couldn't be improved, so I wasted a bit of time trying to optimize it. Succeeded in speeding it up a little bit, but overall it is still slow. That is simply the nature of the task at hand: clustering 2000 feature vectors with 10,000 features each cannot be done much faster than it currently is. For more details, refer to the thesis.
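For readers who have not met single-pass clustering before, below is a minimal sketch of the usual textbook form of the idea. It is not my thesis implementation: the cosine similarity measure, the running-mean centroid update, the `threshold` parameter and the function names are all assumptions made for the illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(
    vectors: Sequence[Sequence[float]], threshold: float
) -> Tuple[List[int], List[List[float]]]:
    """Visit each vector once; join the most similar cluster or start a new one."""
    centroids: List[List[float]] = []
    sizes: List[int] = []
    assignments: List[int] = []
    for vector in vectors:
        # Find the existing centroid most similar to this vector.
        best_id, best_sim = -1, -1.0
        for cid, centroid in enumerate(centroids):
            sim = cosine(vector, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id >= 0 and best_sim >= threshold:
            # Similar enough: join the cluster and update its running mean.
            n = sizes[best_id]
            centroids[best_id] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best_id], vector)
            ]
            sizes[best_id] = n + 1
            assignments.append(best_id)
        else:
            # Nothing similar enough: this vector seeds a new cluster.
            centroids.append(list(vector))
            sizes.append(1)
            assignments.append(len(centroids) - 1)
    return assignments, centroids


# Toy example: four documents with two features each.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments, centroids = single_pass_cluster(docs, threshold=0.8)
```

In this form every vector is compared against every existing centroid, and each comparison touches every feature, so the running time grows with vectors × clusters × features. That gives a feel for why 2000 vectors with 10,000 features each take so long.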

 

  • June

Had my mid-year presentation. Being a bit nervous, I didn't mention a few things I wanted to, but overall I was fairly happy with the way it went. Started implementing the Single-Pass algorithm and of course continued doing a lot of reading, both for my Literature Review and for the sake of possibly discovering new ideas that could help me out. Handed in my Literature Review Draft. Feedback from my supervisor suggested that I still had a lot of work to do... :)))

 

  • May

Submitted my Research Proposal. It wasn't really that successful; apparently my methodology section wasn't detailed and clear enough. Well, obviously I must work on it. Did a lot of further reading and started preparations for my interim presentation in June. Also, started writing down sections of my Literature Review Draft. Continued my "research" with SVM Light in order to get reasonably familiar with it and the whole concept of Support Vector Machines. Started investigating ways to implement the Single-Pass algorithm.

 

  • April

This month was nothing but reading, reading and more reading. Tried to get familiar with the different methods I was going to investigate. Also, started writing my Research Proposal.

 

  • March

Mid-way through March I found out that I got my first preference, "Iterative Algorithms for Support Vector Machine Generation", supervised by Dr. David Albrecht (later it sort of drifted away to "Text Classification with Labelled and Unlabelled Data", which can be viewed as a particular case of the original topic). It was time to get to work...