Parallel Overlap and Similarity Detection in Semi-Structured Document Collections

Krisztián Monostori, Arkady Zaslavsky, Heinz Schmidt

School of Computer Science and Software Engineering

Monash University, Melbourne, Australia

Abstract. The proliferation of digital libraries and the wide availability of electronic documents on the Internet have created new challenges for computer science researchers and professionals. This paper discusses the problems that arise when parallel and cluster computing systems are used to detect plagiarism in large collections of semi-structured electronic texts, ranging from software written in formal languages at one end of the spectrum to natural language texts at the other. The core of the system relies on string matching algorithms and suffix trees. Implementation and performance issues are also discussed.

 

