Text Classification with Labelled and Unlabelled Data
Methodology
The main idea behind the SVM+Cluster approach is to first cluster both the labelled and unlabelled documents, and then add features to the original feature vectors of the labelled documents that represent their relationship to the clusters. These added features are usually the similarities to the different cluster centroids (overall centroid, positive centroid and negative centroid). A cluster centroid is simply the average of the feature vectors of the documents belonging to that cluster. So the overall centroid is the average of all feature vectors, the positive centroid is the average of the feature vectors belonging to class A, and so on. One binary feature is also added, indicating which of the clusters is closest to the particular feature vector. The clustering techniques investigated in this project were the Single-Pass algorithm and Snob (see thesis). The SVM+Cluster approach is depicted in the following figure:
Figure 1: SVM+Cluster approach
Note that the augmented feature vectors have the following format:
Figure 2: An augmented feature vector
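The augmentation step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation. It assumes that clustering has already been done and has assigned each document to a positive cluster (1) or a negative cluster (0), that cosine similarity is the similarity measure, and that the added features are appended in the order overall, positive, negative, closest-cluster flag. The function names (centroid, cosine, augment) are hypothetical, and the feature order shown in Figure 2 may differ.

```python
import math

def centroid(vectors):
    """Average feature vector of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity (an assumed choice); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def augment(labelled, all_docs, cluster_ids):
    """Append centroid similarities and a closest-cluster flag to each labelled vector.

    all_docs holds labelled and unlabelled feature vectors together;
    cluster_ids[i] is 1 (positive cluster) or 0 (negative cluster) for all_docs[i],
    assumed to come from a separate clustering step.
    """
    overall = centroid(all_docs)
    positive = centroid([d for d, c in zip(all_docs, cluster_ids) if c == 1])
    negative = centroid([d for d, c in zip(all_docs, cluster_ids) if c == 0])

    augmented = []
    for doc in labelled:
        sims = [cosine(doc, overall), cosine(doc, positive), cosine(doc, negative)]
        # Binary feature: 1.0 if the positive centroid is at least as close as the negative one.
        closest_is_positive = 1.0 if sims[1] >= sims[2] else 0.0
        augmented.append(list(doc) + sims + [closest_is_positive])
    return augmented

# Toy data: four two-dimensional documents, two per cluster; two of them are labelled.
all_docs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
cluster_ids = [1, 1, 0, 0]
labelled = [all_docs[0], all_docs[2]]
rows = augment(labelled, all_docs, cluster_ids)
```

Each row in rows consists of the original features followed by four added features. The rows would then be passed to the SVM in place of the original labelled vectors.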