Parallel and Distributed Document Overlap Detection on the Web

Krisztián Monostori, Arkady Zaslavsky, Heinz Schmidt

 School of Computer Science and Software Engineering

Monash University, Melbourne, Australia

 

Abstract.
The proliferation of digital libraries and the availability of electronic documents on the Internet have created new challenges for computer science researchers and professionals. Documents are easily copied and redistributed, or used to create plagiarised assignments and conference papers. This paper presents a new two-stage approach for identifying overlapping documents. The first stage identifies a set of candidate documents, which are then compared in the second stage using a matching engine. The matching engine's algorithm is based on suffix trees and modifies the known matching statistics algorithm. Parallel and distributed approaches are discussed for both stages, and performance results are presented.

Keywords: copy-detection, string matching, job-distribution, cluster

 

