MatchDetectReveal: Finding Overlapping and Similar Digital Documents

 

Krisztián Monostori, Arkady Zaslavsky, Heinz Schmidt

School of Computer Science & Software Engineering

Monash University, 900 Dandenong Road, Caulfield East, VIC 3145, Australia

{krisztian.monostori, arkady.zaslavsky, heinz.schmidt}@infotech.monash.edu.au

 

ABSTRACT

 

The Internet provides easy access to large collections of semi-structured digital documents. WWW browsers, search engines and the "cut & paste" technique make it tempting to substitute simple compilation from suitable digital resources for one's own creative work. This paper discusses the problems of detecting plagiarism in large collections of semi-structured electronic texts. The project focuses on overlap and similarity among digital documents and software code. The conceptual architecture of the MatchDetectReveal system is presented along with possible applications. The main component of the system uses string matching algorithms and a suffix tree representation. Both sequential and parallel cluster-based processing issues are addressed, and implementation and performance issues are also discussed.

 

KEYWORDS: Fast search algorithms, document overlap, information retrieval, plagiarism detection

 

