MatchDetectReveal: Finding Overlapping and Similar Digital Documents
ABSTRACT
The Internet
provides easy access to large collections of semi-structured digital
documents. WWW browsers, search engines,
and the "cut & paste" technique make it tempting to substitute simple
compilation from appropriate digital resources for one's own creativity. This paper
discusses the problems of detecting plagiarism in large collections of semi-structured
electronic texts. The project focuses on overlap and similarity in digital
documents and software code. The conceptual architecture of the
MatchDetectReveal system is presented along with possible applications. The
main component of the system uses string matching algorithms and a
suffix tree representation. Both sequential and parallel cluster-based
processing issues are addressed. The implementation and performance issues are
also discussed.
KEYWORDS: Fast search algorithms, document
overlap, information retrieval, plagiarism detection
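
As an illustration of the kind of overlap detection the abstract refers to, the following minimal sketch finds the longest passage shared by two texts. It is not the MatchDetectReveal implementation: it swaps the suffix tree for a naively built suffix array, which is simpler to write but slower, and the function name longest_overlap and the NUL separator are assumptions made only for this example.

```python
def longest_overlap(doc_a: str, doc_b: str) -> str:
    """Return the longest substring that occurs in both documents.

    Illustrative sketch only: uses a naive suffix array, not a suffix tree.
    """
    sep = "\x00"  # assumption: this character appears in neither document
    text = doc_a + sep + doc_b
    boundary = len(doc_a)  # positions below this index belong to doc_a

    # Naive suffix array: sort all suffix start positions by suffix text.
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])

    best = ""
    for prev, cur in zip(suffixes, suffixes[1:]):
        # Only neighbouring suffixes that start in different documents
        # can witness an overlap between the two documents.
        if (prev < boundary) == (cur < boundary):
            continue
        # Length of the common prefix of the two suffixes.
        n = 0
        while (prev + n < len(text) and cur + n < len(text)
               and text[prev + n] == text[cur + n]
               and text[prev + n] != sep):
            n += 1
        if n > len(best):
            best = text[prev:prev + n]
    return best


if __name__ == "__main__":
    print(repr(longest_overlap("the quick brown fox", "a quick brown dog")))
```

Run as a script, this prints ' quick brown ', the longest passage the two sample sentences share. Sorting whole suffixes takes quadratic time or worse, so the sketch suits short texts only; a suffix tree, as named in the abstract, supports the same query far more efficiently on large collections.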