This paper describes the first step in a project for topic identification in help-desk applications. In this step, we apply a clustering mechanism to identify the topics of newsgroup discussions. We have used newsgroup discussions as our testbed, as they provide a good approximation to our target application, while obviating the need for manual tagging of topics.
We have found that postings by individuals who contribute repeatedly to a newsgroup may lead the clustering process astray: discussions may be grouped by author rather than by topic. To address this problem, we introduce a filtering mechanism and evaluate it by comparing clustering performance with and without filtering.
The paper is available as gzipped PostScript (20 kB) and PDF (32 kB).
It is also available on the Springer-Verlag website.
Alternatively, you can request a copy by e-mailing me.