Text Classification with Labelled and Unlabelled Data


Methodology

  • Feature Vectors

Before documents can be clustered and classified, they must be represented in a suitable form. This is done with feature vectors. A feature vector is a one-dimensional array with as many elements as there are unique words in the training set of documents. Each element corresponds to one particular word, and a document is represented by giving each element a value that depends on how often that word appears in the document. The example in Figure 1 uses a simple word-frequency technique: each element is set to the number of times its word appears in the document. More selective and more complicated representations exist as well (see thesis).

 

 

Figure 1: A feature vector
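
The word-frequency representation described above can be sketched in a few lines of Python. This is only an illustration and not the project's actual code: the function names (tokenize, build_vocabulary, feature_vector), the toy training set, and the tokenisation rule (lowercasing and splitting on whitespace) are all assumptions made for the example.

```python
from collections import Counter


def tokenize(text):
    # Assumption: words are lowercased and separated by whitespace.
    return text.lower().split()


def build_vocabulary(documents):
    # One vector element per unique word in the training set.
    # Sorting gives every word a fixed, reproducible position.
    vocabulary = set()
    for document in documents:
        vocabulary.update(tokenize(document))
    return sorted(vocabulary)


def feature_vector(document, vocabulary):
    # Word-frequency technique: each element holds the number of
    # times the corresponding word appears in this document.
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


# Hypothetical toy training set, used only for illustration.
training_set = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vocabulary = build_vocabulary(training_set)
vectors = [feature_vector(doc, vocabulary) for doc in training_set]

print(vocabulary)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
print(vectors[1])  # [1, 1, 1, 0, 0, 0, 2]
```

Every vector has the same length as the vocabulary, so documents of different lengths become directly comparable. Words that do not occur in a document simply get the value 0.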