Parallel Overlap and Similarity Detection in
Semi-Structured Document Collections
{krisztian.monostori, arkady.zaslavsky, heinz.schmidt}
@infotech.monash.edu.au
Abstract. The proliferation of digital
libraries and the wide availability of electronic documents on the Internet have
created new challenges for computer science researchers and professionals. This
paper discusses the problems of using parallel and cluster computing systems
for detecting plagiarism in large collections of semi-structured electronic
texts, ranging from software written in formal languages at one end of the
spectrum to natural language texts at the other. The system relies mainly on
string matching algorithms and suffix trees. Implementation and
performance issues are also discussed.