Text Classification with Labelled and Unlabelled Data

Methodology

  • SVM+Cluster

The main idea behind the SVM+Cluster approach is to first cluster both the labelled and unlabelled documents, and then to add features to the original feature vectors of the labelled documents that represent their relationship to the clusters. These added features are usually the similarities to the different cluster centroids (overall centroid, positive centroid and negative centroid). A cluster centroid is simply the average of the feature vectors of the documents belonging to that cluster. So the overall centroid is the average of all feature vectors, the positive centroid is the average of the feature vectors belonging to class A (the positive class), and so on. One binary feature is also added, which indicates which of the clusters is closest to the particular feature vector.
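The augmentation step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: cosine similarity is assumed as the similarity measure, the binary feature is assumed to be 1 when the positive centroid is the closer of the two class centroids, and the function names (`cosine`, `centroid`, `augment`) and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Average feature vector of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def augment(x, overall_c, positive_c, negative_c):
    """Append three centroid similarities and one closest-centroid flag to x."""
    sim_all = cosine(x, overall_c)
    sim_pos = cosine(x, positive_c)
    sim_neg = cosine(x, negative_c)
    # Assumption: the binary feature is 1.0 when the positive centroid
    # is at least as similar to x as the negative centroid, else 0.0.
    closest_is_positive = 1.0 if sim_pos >= sim_neg else 0.0
    return list(x) + [sim_all, sim_pos, sim_neg, closest_is_positive]


# Toy data, made up for illustration: two positive and two negative documents.
positive_docs = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
negative_docs = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

overall_c = centroid(positive_docs + negative_docs)
positive_c = centroid(positive_docs)
negative_c = centroid(negative_docs)

# The augmented vector has the 3 original features followed by 4 added ones.
augmented = augment([1.0, 0.0, 1.0], overall_c, positive_c, negative_c)
```

In this sketch the augmented vectors of the labelled documents would then be passed to the SVM in place of the original vectors. In the real setting the centroids would come from the clusters found over the labelled and unlabelled documents together, rather than from hand-picked lists.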

Note that the clustering techniques investigated in this project were the Single-Pass algorithm and Snob (see thesis).
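For readers unfamiliar with single-pass clustering, the following is a minimal sketch of the textbook form of the algorithm: each document is visited once and either joins the most similar existing cluster or starts a new one. The variant used in the thesis may differ in its details. Cosine similarity, the `threshold` parameter, its value and the function names are assumptions made for this illustration. Snob is not sketched here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass(vectors, threshold):
    """Cluster vectors in one pass; returns (assignments, centroids)."""
    sums = []         # per-cluster running sum of member vectors
    counts = []       # per-cluster number of members
    assignments = []  # cluster index chosen for each input vector
    for x in vectors:
        # Find the existing cluster whose current centroid is most similar to x.
        best, best_sim = None, -1.0
        for k in range(len(sums)):
            current_centroid = [s / counts[k] for s in sums[k]]
            sim = cosine(x, current_centroid)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is not None and best_sim >= threshold:
            # Similar enough: join that cluster and update its running sum.
            sums[best] = [s + a for s, a in zip(sums[best], x)]
            counts[best] += 1
        else:
            # Otherwise start a new cluster seeded with x.
            sums.append(list(x))
            counts.append(1)
            best = len(sums) - 1
        assignments.append(best)
    centroids = [[s / n for s in total] for total, n in zip(sums, counts)]
    return assignments, centroids


# Toy data, made up for illustration.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments, centroids = single_pass(docs, threshold=0.8)
```

Note that the result of a single-pass scheme like this depends on the order in which the documents are presented and on the chosen threshold.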

The SVM+Cluster approach is depicted in the following figure:

Figure 1: SVM+Cluster approach

 

The augmented feature vectors have the following format:

Figure 2: An augmented feature vector