Comparison of Overlap Detection Techniques
1 School of Computer Science and Software
Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145,
Australia
{Krisztian.Monostori,Arkady.Zaslavsky}@csse.monash.edu.au
2 Computer Science,
raphael@cs.uky.edu
3 Department of
Automation and Applied Informatics,
1111 Budapest,
Goldmann György tér 3. IV.em.433., Hungary,
s8043hod@hszk.bme.hu,
pm205@hszk.bme.hu
Abstract. Easy access to the World Wide Web has raised
concerns about copyright issues and plagiarism. It is easy to copy someone
else’s work and submit it as one’s own. Many systems have targeted this
problem, and they use very similar approaches. This paper compares these
approaches and suggests when particular strategies are more applicable than
others. We also propose alternative approaches that perform better than
previously presented methods. These previous methods share two
common stages: chunking of documents and selection of representative chunks. We
study both stages and propose alternatives that are better in terms of accuracy
and space requirements. The applications of these methods are not limited to
plagiarism detection; they may also target other copy-detection problems.
Finally, we propose a third stage for the comparison, which uses suffix trees
and suffix vectors to identify the overlapping chunks.