This page contains information, resources and publications related to our project MatchDetectReveal. This project was made possible through the efforts of Dr Krisztián Monostori (our former PhD student), Prof Heinz Schmidt, A/Prof Arkady Zaslavsky, our colleagues and visiting scholars including Prof Raphael Finkel, Gábor Hodász, Máté Pataki, Alejandro Bia, Prof István Vajk.

MDRLite is the "light" version of the MatchDetectReveal system. It allows you to upload text files to our server and it checks those files against each other for overlap. To try it click here

To try a heavier version click here

This project can be briefly described by the abstract and conceptual architecture figure taken from Krisztián's thesis.

With the rapid growth of the Internet, large collections of digital documents have become available. These documents may be used for various purposes including education, research, entertainment, and many others. Given this diversity of objectives in using these documents, we need tools that are capable of identifying overlap and similarity within a potentially very large set of documents. Applications of such tools include plagiarism detection, search engines, comparative analysis of literary works, and clustering collections of documents.

This project studies different approaches to identifying overlap between electronic documents. Existing approaches are compared based on different attributes including accuracy, performance, and the degree of protection. A novel two-stage approach is proposed in this project. It selects a set of candidate documents in the first phase by applying chunking and indexing methods. The second phase uses exact comparison methods based on suffix trees.

Suffix trees have been identified as an efficient data structure for exact string-matching problems. One of the main arguments against the widespread use of suffix trees is their extensive space-consumption requirements. We propose a new data structure, which has the same versatility as a suffix tree but requires less space than any other representation known to date. This new structure is called a suffix vector because of the way it is organised in memory. We show how this structure can be constructed in linear time and we also prove that this data structure requires the least space among those representations that have the same versatility. Not only does the new representation save space but it is also more time-efficient because it eliminates certain redundancies of a suffix tree. Therefore, some phases of algorithms running on the structure can be eliminated, too.

This project also analyses how existing data structures can be used for document comparison. Sparse suffix trees and directed acyclic graphs generated from suffix trees are discussed in the context of document comparison applications. Some algorithms have been modified to suit the above mentioned data structures.

We have built a prototype system, called MatchDetectReveal (MDR) that implements the algorithms we proposed. The MDR system is capable of efficiently identifying overlapping documents in a large set and uses suffix trees and suffix vectors in its core component - the matching engine.

To speed up the comparison of documents we have developed a parallel algorithm to process documents. We have used a general-purpose cluster of commodity workstations with general-purpose middleware to test our algorithms as well as a special-purpose cluster with purpose-built software based on the message-passing model. Results of these tests are presented in this project and they demonstrate both time- and space-efficiency of the proposed algorithms.

The results of this project have been presented at 13+ international and national conferences in Australia, New Zealand, Europe, and North America.

Here comes the conceptual architecture of MDR.

Users of the MDR system must be able to submit suspicious documents. MDR users will use an Internet-based interface for submitting documents. Submitted documents can be compared to the documents of the local repository or to documents on the Internet or other digital libraries e.g. ACM DL or IEEE DL. Once these documents are identified we can download them and compare to the suspicious document locally. Different indexing technics will be analysed to efficiently find candidate documents. We will also consider 'intelligent' ways of identifying candidate documents by e.g. checking the publications of authors listed in the references.

The matching engine is responsible for comparing the suspicious document to candidate documents. Here we note that the current implementation of the matching engine deals with exact matching and it will be supplemented with a 'Similarity and Overlap Rule Interpreter', which will define how to handle rephrasing, changing the names of localities, substituting synonyms for certain words.

With the constant increase of the processing power of commodity workstations clusters of workstations are widely used for parallel processing. Instead of sequentially analysing suspicious documents we can schedule these tasks on a cluster, which enables us to analyse more documents in parallel resulting in lower response time.

Not only can we use special workstation clusters for processing documents in parallel but we can also utilise idle workstations. When we define the set of candidate documents we can also identify idle workstations in the close proximity of each document. Those workstations are likely to require less time to download documents they are responsible for and response time can again be reduced while workstation utilisation increases.

The visualizer component of the system is responsible for presenting the end user with the parts of the document, which are plagiarised. Currently overlap is defined as positions in the text and the length of the chunks.

Document Generator is a supplementary component of the system, which generates sufficient number of documents to test our algorithm. You can define the size and overall plagiarism content of the document to be generated as well as the number of files to be used for plagiarism and the size of the chunks. It is also possible to define the number of words to be substituted by their synonyms to test more sophisticated ways of plagiarism.

Our Papers and Presentations

Monostori K., Zaslavsky A., Schmidt H. Suffix Vector: Space- and Time-Efficient Alternative To Suffix Trees. Proceedings of the 25th Australasian Computer Science Conference, Monash University, Melbourne, Victoria, 28 January - 1 February, 2002. pp 157-166, 2002.
Finkel R. A., Zaslavsky A., Monostori K., Schmidt H. Signature extraction for overlap detection in documents. Proceedings of the 25th Australasian Computer Science Conference, Monash University, Melbourne, Victoria, 28 January - 1 February, 2002. pp 59-64, 2002.
Monostori K., Finkel R., Zaslavsky A., Hodasz G., Pataki M. Comparison of Overlap Detection Techniques. The 2002 International Conference on Computational Science, Amsterdam, The Netherlands, 21 - 24 April, 2002. (I) pp 51-60, 2002.
Monostori K., Zaslavsky A., Vajk I. Suffix Vector: A Space-Efficient Suffix Tree Representation. Proceedings of the International Symposium on Algorithms and Computation, Christchurch, New Zealand, Dec 19-21, 2001, pp 707-718, 2001.
Zaslavsky A., Bia A., Monostori K. Using copy-detection and text comparison algorithms for cross-referencing multiple editions of literary works. Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries, September 4-9 2001, Darmstadt, Germany, pp 103-114, 2001.
Monostori K., Zaslavsky A., Bia A. Using the MatchDetectReveal System for Comparative Analysis of Texts. Proceedings of the Sixth Australasian Document Computing Symposium (ADCS 2001), Pacific Bay Resort, Coffs Harbour, 7 December, 2001, pp 51-58, 2001.
Monostori K., Zaslavsky A., Schmidt H. Efficiency of Data Structures for Detecting Overlaps in Digital Documents. Proceedings of the 24th Australasian Computer Science Conference, Bond University, Gold Coast, Queensland, 29 January - 2 February, 2001. pp 140-147, 2000.
Monostori K., Zaslavsky A., Schmidt H. Digital Documents in Educational Environment: Misuse, Appropriation, and Detection Issues. Fourth Australasian Computing Education Conference, 4 - 6 December 2000, Monash University, Melbourne. pp 254-255, 2000.
Monostori K., Zaslavsky A., Schmidt H. Parallel and Distributed Document Overlap Detection on the Web. Workshop on Applied Parallel Computing - PARA2000, 18-21 June 2000, Bergen, Norway. pp 206-214, 2000.
Monostori K., Zaslavsky A., Schmidt H. MatchDetectReveal: Finding Overlapping and Similar Digital Documents. Information Resources Management Association International Conference (IRMA2000), 21-24 May, 2000 at Anchorage Hilton Hotel, Anchorage, Alaska, USA. pp 955-957, 2000.
Monostori K., Zaslavsky A., Schmidt H., Document Overlap Detection System for Distributed Digital Libraries. ACM Digital Libraries 2000 (DL00), 2-7 June, 2000 in San Antonio, Texas, USA. pp 226-227, 2000.
Monostori K., Zaslavsky A., Schmidt H., Identifying Overlapping Documents in Semi-Structured Text Collections. Australasian Computer Science Conference, Canberra, Australia, 2000.
PDF 
Monostori K., Zaslavsky A., Schmidt H., MatchDetectReveal: Finding Overlapping and Similar Digital Documents. The 10th Australasian Conference on Information Systems, Wellington, New Zealand, 1999. pp 1223, 1999.
Monostori K., Zaslavsky A., Schmidt H. Parallel Overlap and Similarity Detection in Semi-Structured Document Collections. Proceedings of 6th Annual Australasian Conference on Parallel and Real-Time Systems (PART '99), Melbourne, Australia, 1999. pp 92-103, 1999.

 

Some search engines for papers and publications:

  • Paper Search in Complexity
    Site for search in many search engines (DBLP Advanced Search, Research Index, CORA, ECCC, MOPS, CS bibiliography, MathSciNet)
  • CORA - Computer Science Research Paper Search Engine
    A special-purpose search engine covering computer science research papers. It allows keyword searches over the partial text of Postscript-formatted papers it has found by spidering the Web.
  • Springer Link-Search
    Search in abstracts and bibliographic data. Note that only content available online in LINK will be found. If you are looking for other Springer products, please refer to the Springer Web site www.springer.de
  • NEC Research Index
    A scientific literature digital library that aims to improve the dissemination and feedback of scientific literature, and to provide improvements in functionality, usability, availability, cost, comprehensiveness, efficiency, and timeliness.
  • Northern Light Power Search
    Advanced search engine for any kind of papers, articles, first of all newspaper and journal articles.

Some Related Papers

Related Web Sites about Plagiarism

Maintained by Arkady Zaslavsky. Last updated: 09-Apr-2003 19:13