Parallel and Distributed Document Overlap
Detection on the Web
{krisztian.monostori,
arkady.zaslavsky, heinz.schmidt}
@infotech.monash.edu.au
Abstract.
The proliferation of digital libraries and the availability of electronic documents
on the Internet have created new challenges for computer science researchers
and professionals. Documents are easily copied and redistributed or used to
create plagiarised assignments and conference papers. This paper presents a new,
two-stage approach for identifying overlapping documents. The first stage
identifies a set of candidate documents, which are compared in the second stage
using a matching engine. The matching engine's algorithm is based on suffix
trees and modifies the known matching statistics algorithm. Parallel
and distributed approaches are discussed at both stages and performance results
are presented.
Keywords: copy-detection, string matching, job-distribution, cluster
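
As a rough illustration of the matching statistics idea mentioned in the abstract, the sketch below computes, for every position in a suspicious text, the length of the longest substring starting there that also occurs in a candidate document. It is not the suffix-tree algorithm of this paper: it uses plain substring search for brevity, and the names `matching_statistics`, `overlap_ratio`, and the `min_len` threshold are assumptions made only for this example.

```python
def matching_statistics(text: str, reference: str) -> list[int]:
    """ms[i] = length of the longest prefix of text[i:] that occurs in reference.

    Naive sketch: substring search stands in for a suffix tree over `reference`.
    """
    ms = []
    length = 0
    for i in range(len(text)):
        # If text[i-1 : i-1+L] occurs in reference, then text[i : i-1+L] does too,
        # so the match at position i is at least L - 1 characters long.
        length = max(length - 1, 0)
        while i + length < len(text) and text[i:i + length + 1] in reference:
            length += 1
        ms.append(length)
    return ms


def overlap_ratio(text: str, reference: str, min_len: int = 4) -> float:
    """Fraction of characters in `text` covered by matches of at least min_len.

    `min_len` is a hypothetical threshold that filters out short, chance matches.
    """
    if not text:
        return 0.0
    covered = [False] * len(text)
    for i, length in enumerate(matching_statistics(text, reference)):
        if length >= min_len:
            for j in range(i, i + length):
                covered[j] = True
    return sum(covered) / len(text)


if __name__ == "__main__":
    print(matching_statistics("abcxab", "zabcy"))
    print(overlap_ratio("abcxab", "zabcy", min_len=3))
```

In a two-stage setting, a function like `overlap_ratio` would only be run on the candidate documents selected in the first stage. A suffix tree built over the reference replaces the repeated substring searches and is what makes the comparison practical for long documents.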