Results
This page discusses the results obtained from the classification experiments.
The terms "sentence type" and "class" are used interchangeably.
Bag-of-words vs. Bigram
Table 1 compares the performance of each classifier with different feature sets.
The first column specifies the feature set, the second column gives the number of features in
that set, and the last three columns give the F1-measure of Naive Bayes (NB), Decision
Trees (DT) and SVM respectively.
Table 1: Unigram vs. Bigram
| Features | # of features | NB | DT | SVM |
| --- | --- | --- | --- | --- |
| Unigram | 1622 | 0.666 | 0.829 | 0.883 |
| Unigram + Bigram | 8372 | 0.397 | 0.834 | 0.893 |
| Bigram | 6750 | 0.354 | 0.688 | 0.733 |
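For readers unfamiliar with the F1-measure reported in the table, the sketch below computes it per class as the harmonic mean of precision and recall. It is only an illustration: the labels `question` and `statement` are made-up, the helper name `per_class_f1` is hypothetical, and this page does not state how the per-class values were averaged into the single figures shown above.

```python
from collections import Counter

def per_class_f1(gold, predicted):
    """Return {label: F1} for two equal-length sequences of class labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # an instance of g was missed
    scores = {}
    for label in set(gold) | set(predicted):
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to 2PR / (P + R)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Hypothetical labels, for illustration only.
gold = ["question", "question", "statement", "statement", "statement"]
pred = ["question", "statement", "statement", "statement", "question"]
f1 = per_class_f1(gold, pred)
```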
Combining bag-of-words (unigram) and bigram features had only a marginal effect on the
classification performance of DT and SVM, but caused a significant drop for NB. This is because most
bigrams (pairs of adjacent words) occur very rarely in the training set. NB does not handle
infrequent features well, so its performance is strongly affected by the bigrams. DT and
SVM seem to be better at handling infrequent features.
Using bigrams without unigrams clearly produced far inferior performance for all classifiers.
One explanation is that some individual words serve as important features for discriminating
the classes because they consistently occur in the same class. With bigrams, however, each word
is paired with another word, so a highly discriminative word can be paired with different words
in different sentences. This could have reduced its class-discriminating power.
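The dilution effect described above can be illustrated with a small sketch. The three sentences and the function name `ngram_features` are made up for this example, and whitespace tokenisation is an assumption; the tokeniser used in the experiments is not described on this page.

```python
def ngram_features(sentence, use_unigrams=True, use_bigrams=True):
    """Return the set of unigram and/or bigram features of a sentence."""
    tokens = sentence.lower().split()  # assumed: simple whitespace tokenisation
    features = set()
    if use_unigrams:
        features.update(tokens)
    if use_bigrams:
        # Each bigram is a pair of adjacent tokens.
        features.update(zip(tokens, tokens[1:]))
    return features

# Hypothetical sentences that all belong to the same class.
sentences = [
    "we propose a new method",
    "they propose another approach",
    "we propose to extend this",
]

unigram_sets = [ngram_features(s, use_bigrams=False) for s in sentences]
bigram_sets = [ngram_features(s, use_unigrams=False) for s in sentences]

# Features that occur in every sentence of the class.
shared_unigrams = set.intersection(*unigram_sets)
shared_bigrams = set.intersection(*bigram_sets)
```

In this toy data the word "propose" appears in every sentence, so it is a consistent unigram feature, but no single bigram containing it is shared by all three sentences.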
The Effect of Feature Selection
The graph below shows the effect of feature selection on the classifiers' performance.
Chi-squared was used as the feature selection method, with bag-of-words as the only type of
feature. DT and SVM perform consistently until the number of features is reduced
to below 300. NB, on the other hand, improved significantly as features were removed.
As explained previously, NB does not handle infrequent words well. Feature selection
kept only the useful features and discarded the rest.

The results show that the aim of feature selection was achieved as most features could be
removed while maintaining the classification performance of DT and SVM and even improving that of NB.
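As a rough illustration of chi-squared feature selection, the sketch below scores a word against one class from a 2x2 contingency table and keeps the highest-scoring words. The counts, the words and the helper names are all hypothetical, and how the per-class scores were combined across sentence types in the experiments is not stated on this page.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a word and a class.

    a: sentences in the class that contain the word
    b: sentences outside the class that contain the word
    c: sentences in the class that lack the word
    d: sentences outside the class that lack the word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def select_top_k(scores, k):
    """Return the k feature names with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical counts over 100 sentences, 50 of them in the class.
scores = {
    "propose": chi_squared(40, 5, 10, 45),   # frequent and class-specific
    "the": chi_squared(25, 25, 25, 25),      # frequent but uninformative
    "rare": chi_squared(1, 0, 49, 50),       # class-specific but very rare
}
selected = select_top_k(scores, 2)
```

With these counts the evenly distributed word scores zero and the rare word scores far lower than the frequent, class-specific one, so both are the first to be dropped as the feature set shrinks.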
Comparison of Classifiers on a Class-by-Class Basis
The diagram below compares the performance of the classifiers on each class. SVM
performs as well as the others in some classes and better in most. DT outperformed
NB in most of the classes and was outperformed by NB in only four. SVM also had the smallest
standard deviation of F-measure (not reported on this page) for most of the classes, while NB
had the largest.

The Effect of Context
The effect of context on classification performance is shown in Table 2. The values in
the table are the F-measures of the classifiers. NB showed an observable improvement, while DT improved
only marginally. Our implementation of context requires the class probability
distribution for each classified sentence. NB and DT provide class
probability estimates, but SVM does not.
SVM is in general a binary classifier and does not produce a probability distribution.
To obtain one, logistic regression models were used to map the outputs of SVM to probabilities, and
pairwise coupling was used to handle the multi-class problem. However, the mapping does not preserve the
actual prediction of SVM in some cases, which hurts its classification performance. When context
was not used and logistic regression models were used to obtain the class probability distribution,
SVM's F1-measure dropped from 0.888 to 0.858. As seen in Table 2, the improvement obtained by using
context with SVM did not make up for the loss in performance caused by the need for a probability
distribution.
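The sketch below shows one way a probability-based decision can disagree with the original decision of a multi-class SVM built from pairwise classifiers. It rests on several assumptions that are not confirmed by this page: that the raw SVM prediction is made by majority voting over the pairwise classifiers, that the logistic mapping has already produced the pairwise probabilities given here (the numbers are invented), and that a simplified coupling, which just sums and normalises the pairwise probabilities, stands in for the actual pairwise-coupling procedure.

```python
from itertools import combinations

def vote_prediction(pairwise, n_classes):
    """Predict by majority vote; pairwise[(i, j)] is P(class i | i or j)."""
    votes = [0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        winner = i if pairwise[(i, j)] >= 0.5 else j
        votes[winner] += 1
    return max(range(n_classes), key=votes.__getitem__)

def coupled_distribution(pairwise, n_classes):
    """Simplified coupling (an assumption): sum pairwise probabilities, then normalise."""
    totals = [0.0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        totals[i] += pairwise[(i, j)]
        totals[j] += 1.0 - pairwise[(i, j)]
    norm = sum(totals)
    return [t / norm for t in totals]

# Invented pairwise probabilities for three classes.
# Class 0 narrowly beats classes 1 and 2; class 1 beats class 2 decisively.
pairwise = {(0, 1): 0.55, (0, 2): 0.55, (1, 2): 0.95}

voted = vote_prediction(pairwise, 3)
probs = coupled_distribution(pairwise, 3)
coupled = max(range(3), key=probs.__getitem__)
```

Here voting selects class 0, which wins both of its pairings narrowly, while the coupled distribution puts the most mass on class 1. This is the kind of disagreement that could explain a drop in F1-measure when the probability distribution replaces the original prediction.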
Table 2: The effect of context
| Classifier | F-measure without context | F-measure with context |
| --- | --- | --- |
| Naive Bayes | 0.814 | 0.844 |
| Decision Trees | 0.844 | 0.846 |
| SVM | 0.888 | 0.864 |
Conclusion
In conclusion, feature selection was shown to reduce the number of features significantly while
maintaining or even improving classification performance. SVM had the best classification
performance in this study, followed by Decision Trees. SVM also performed more consistently,
as evidenced by its smaller standard deviation of F1-measures for most classes.
The use of context brought a minor improvement for each classifier. To assess the effect of
context in more detail, we will need a larger set of training sentences. This leads to the future directions of this project.