Sentence Classifier for Helpdesk Emails

Results

This page discusses the results of the classification experiments. The terms "sentence type" and "class" are used interchangeably.

Bag-of-words vs. Bigram

Table 1 compares the performance of each classifier across different feature sets. The first column names the feature set, the second column gives the number of features in that set, and the last three columns give the F1-measure of Naive Bayes (NB), Decision Trees (DT) and SVM respectively.

Table 1: Unigram vs. Bigram
Features          # of features  NB     DT     SVM
Unigram           1622           0.666  0.829  0.883
Unigram + Bigram  8372           0.397  0.834  0.893
Bigram            6750           0.354  0.688  0.733

Combining bag-of-words (unigram) and bigram features had only a marginal effect on the classification performance of DT and SVM, but caused a significant drop for NB (from 0.666 to 0.397). This is because most bigrams (pairs of adjacent words) occur very rarely in the training set. NB does not handle infrequent features well, so its performance suffers significantly when bigrams are added. DT and SVM appear to be more robust to infrequent features.

It is clear that using bigrams without unigrams produced far inferior performance for all classifiers. One explanation is that some individual words are important features for discriminating between classes because they consistently occur in the same class. In a bigram, however, each word is paired with another word, so a highly discriminative word may be paired with different words in different sentences. This spreads its occurrences over several rarer features and could have reduced its class-discriminating power.
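The dilution effect described above can be illustrated with a minimal sketch. The tokenizer, the feature naming and the three example sentences are assumptions made for illustration; they are not taken from the study's corpus or code.

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def bigrams(tokens):
    """Pair each token with its immediate successor."""
    return [f"{first}_{second}" for first, second in zip(tokens, tokens[1:])]


# Hypothetical helpdesk-style sentences, invented for this example.
sentences = [
    "please reset my password",
    "my password has expired",
    "the password reset failed",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    tokens = tokenize(sentence)
    unigram_counts.update(tokens)
    bigram_counts.update(bigrams(tokens))

# The single unigram feature "password" versus the bigram features containing it.
password_bigrams = {
    name: count
    for name, count in bigram_counts.items()
    if "password" in name.split("_")
}
print(unigram_counts["password"])
print(password_bigrams)
```

In this toy data the word "password" is one feature that occurs in every sentence, whereas as a bigram it is split across several features, most of which occur only once.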

The Effect of Feature Selection

The graph below shows the effect of feature selection on the classifiers' performance. Chi-squared was used as the feature selection method, with bag-of-words as the only type of feature. DT and SVM performed consistently until the number of features was reduced to below 300. NB, on the other hand, improved significantly as features were removed. As explained previously, NB does not handle infrequent words well. Feature selection kept only the useful features and discarded the rest.

The results show that the aim of feature selection was achieved: most features could be removed while maintaining the classification performance of DT and SVM, and even improving that of NB.
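As a rough sketch of chi-squared feature selection, the code below scores each word against each class using the standard 2x2 contingency-table statistic and keeps the top k words. Taking the maximum score over classes, the function names, the class labels and the toy sentences are all assumptions for illustration; the study may have ranked features differently.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/class contingency table.

    a: sentences in the class that contain the term
    b: sentences outside the class that contain the term
    c: sentences in the class that lack the term
    d: sentences outside the class that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def select_features(documents, labels, k):
    """Return the k terms with the highest chi-squared score.

    documents is a list of token sets. Each term is scored by its maximum
    chi-squared value over all classes (an assumed aggregation rule).
    """
    vocabulary = set().union(*documents)
    scores = {}
    for term in vocabulary:
        best = 0.0
        for cls in set(labels):
            pairs = list(zip(documents, labels))
            a = sum(1 for doc, y in pairs if term in doc and y == cls)
            b = sum(1 for doc, y in pairs if term in doc and y != cls)
            c = sum(1 for doc, y in pairs if term not in doc and y == cls)
            d = sum(1 for doc, y in pairs if term not in doc and y != cls)
            best = max(best, chi_squared(a, b, c, d))
        scores[term] = best
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical sentences and class names, invented for this example.
documents = [
    {"please", "reset", "password"},
    {"please", "send", "invoice"},
    {"thanks", "for", "help"},
    {"thanks", "password", "works"},
]
labels = ["REQUEST", "REQUEST", "THANKING", "THANKING"]

selected = select_features(documents, labels, 2)
print(selected)
```

Here "password" appears equally often in both classes and scores zero, so it is among the first features to be discarded, while words that occur in only one class are kept.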

Comparison of Classifiers on a Class-by-Class Basis

The diagram below compares the performance of the classifiers on each class. SVM performed as well as the other classifiers on some classes and better on most. DT outperformed NB on most classes and was outperformed by it on only four. SVM also had the smallest standard deviation of F-measure (not reported on this page) for most of the classes, while NB had the largest.
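For reference, a per-class F1-measure can be computed from gold and predicted labels as in the sketch below. The function name, the class names and the label sequences are hypothetical, and how the per-class values were averaged into the single figures in the tables is not specified on this page.

```python
def per_class_f1(gold, predicted):
    """Return a dict mapping each class to its F1-measure."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for cls in sorted(set(gold) | set(predicted)):
        true_pos = sum(1 for g, p in pairs if g == cls and p == cls)
        false_pos = sum(1 for g, p in pairs if g != cls and p == cls)
        false_neg = sum(1 for g, p in pairs if g == cls and p != cls)
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        if precision + recall == 0:
            scores[cls] = 0.0
        else:
            # F1 is the harmonic mean of precision and recall.
            scores[cls] = 2 * precision * recall / (precision + recall)
    return scores


# Hypothetical labels, invented for this example.
gold = ["question", "question", "statement", "statement", "statement", "request"]
predicted = ["question", "question", "statement", "statement", "request", "request"]

f1_by_class = per_class_f1(gold, predicted)
for cls, f1 in f1_by_class.items():
    print(f"{cls}: {f1:.3f}")
```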

The Effect of Context

The effect of context on the classification performance is shown in Table 2. The values in the table are the F-measures of the classifiers. NB showed an observable improvement, while DT improved only marginally. To make use of context, our implementation needs the class probability distribution for each sentence being classified. NB and DT both provide class probability estimates, but SVM does not.

In general, SVM is a binary classifier and does not produce a probability distribution. To obtain one, logistic regression models were used to map the outputs of the SVM to probabilities, and pairwise coupling was used to handle the multi-class problem. However, this mapping does not preserve the SVM's actual prediction in some cases, which hurts its classification performance. When context was not used and logistic regression models were used to obtain the class probability distribution, SVM's F1-measure dropped from 0.888 to 0.858. As seen in Table 2, the improvement gained by using context with SVM (to 0.864) did not make up for the loss caused by the need for a probability distribution.

Table 2: The effect of context (F-measure)
Classifier      Without Context  With Context
Naive Bayes     0.814            0.844
Decision Trees  0.844            0.846
SVM             0.888            0.864
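The following sketch illustrates how turning pairwise SVM outputs into a class probability distribution can change the predicted class. It is an illustration under stated assumptions, not the procedure used in the experiments: the sigmoid parameters, the decision values and the function names are hypothetical, and the combination step uses a simple closed-form rule (sometimes attributed to Price et al.) in place of whatever pairwise coupling algorithm the study relied on.

```python
import math
from itertools import combinations


def sigmoid_probability(decision_value, a, b):
    """Map a raw SVM output to a probability with a fitted sigmoid.

    This is a one-dimensional logistic regression on the decision value;
    a and b would normally be fitted on held-out data.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def vote(pairwise, n_classes):
    """Count pairwise wins: class i wins pair (i, j) if P(i | i or j) > 0.5."""
    votes = [0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        winner = i if pairwise[(i, j)] > 0.5 else j
        votes[winner] += 1
    return votes


def couple(pairwise, n_classes):
    """Combine pairwise probabilities into one distribution (closed-form rule)."""
    def r(i, j):
        return pairwise[(i, j)] if i < j else 1.0 - pairwise[(j, i)]

    raw = []
    for i in range(n_classes):
        total = sum(1.0 / r(i, j) for j in range(n_classes) if j != i)
        raw.append(1.0 / (total - (n_classes - 2)))
    norm = sum(raw)
    return [p / norm for p in raw]


# Hypothetical decision values of three pairwise SVMs (positive favours the
# first class of the pair) and hypothetical sigmoid parameters.
decision_values = {(0, 1): 0.2, (0, 2): 0.2, (1, 2): 3.0}
pairwise = {
    pair: sigmoid_probability(value, a=-1.0, b=0.0)
    for pair, value in decision_values.items()
}

votes = vote(pairwise, 3)
probabilities = couple(pairwise, 3)
print("votes:", votes)
print("probabilities:", [round(p, 3) for p in probabilities])
```

In this example class 0 narrowly wins both of its pairwise contests and would be chosen by voting, but class 1 receives the highest coupled probability because of its confident win over class 2. Disagreements of this kind are one way a probability mapping can alter a multi-class SVM's prediction.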

Conclusion

In conclusion, feature selection was shown to reduce the number of features significantly while maintaining or even improving classification performance. SVM had the best classification performance in this study, followed by Decision Trees. SVM also performed more consistently, as evidenced by its smaller standard deviation of F1-measure for most classes. The use of context brought a minor improvement for each classifier, although for SVM the gain did not offset the loss incurred by obtaining probability estimates. To assess the effect of context in more detail, we will need a larger set of training sentences. This points to the future directions of this project.

Copyright © Anthony, 2006