This page contains information, resources, and publications related to our project MatchDetectReveal. This project was made possible through the efforts of Dr Krisztián Monostori (our former PhD student), Prof Heinz Schmidt, A/Prof Arkady Zaslavsky, and our colleagues and visiting scholars, including Prof Raphael Finkel, Gábor Hodász, Máté Pataki, Alejandro Bia, and Prof István Vajk.
MDRLite is the "light" version of the MatchDetectReveal system. It allows you to upload text files to our server, which checks those files against each other for overlap. To try it, click here.
To try a heavier version, click here.
This project can be briefly described by the abstract and conceptual architecture figure taken from Krisztián's thesis.
With the rapid growth of the Internet, large collections of digital documents have become available. These documents may be used for various purposes including education, research, entertainment, and many others. Given this diversity of objectives in using these documents, we need tools that are capable of identifying overlap and similarity within a potentially very large set of documents. Applications of such tools include plagiarism detection, search engines, comparative analysis of literary works, and clustering collections of documents.
This project studies different approaches to identifying overlap between electronic documents. Existing approaches are compared based on different attributes including accuracy, performance, and the degree of protection. A novel two-stage approach is proposed in this project. It selects a set of candidate documents in the first phase by applying chunking and indexing methods. The second phase uses exact comparison methods based on suffix trees.
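The first, candidate-selection phase can be sketched as follows. This is an illustrative sketch only: word 3-gram chunks, a set-based inverted comparison, and the sharing threshold are invented for the example and are not the exact chunking or indexing scheme used by MDR.

```python
# Sketch of the first (candidate-selection) phase: chop each document
# into overlapping word n-gram chunks and keep repository documents
# that share enough chunks with the suspicious document.
# Chunk size and threshold are illustrative, not MDR's actual values.

def chunks(text, n=3):
    """Split a document into overlapping word n-gram chunks."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def candidates(suspicious, repository, min_shared=2):
    """Return repository documents sharing at least min_shared chunks."""
    susp = chunks(suspicious)
    hits = []
    for name, doc in repository.items():
        shared = len(susp & chunks(doc))
        if shared >= min_shared:
            hits.append((name, shared))
    return sorted(hits, key=lambda t: -t[1])

repo = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "completely unrelated text about suffix trees and vectors",
}
print(candidates("a quick brown fox jumps over a fence", repo))
# → [('a', 3)]
```

Only the documents returned by this phase are passed on to the exact, suffix-tree-based comparison of the second phase.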
Suffix trees have been identified as an efficient data structure for exact string-matching problems. One of the main arguments against the widespread use of suffix trees is their extensive space requirements. We propose a new data structure, which has the same versatility as a suffix tree but requires less space than any other representation known to date. This new structure is called a suffix vector because of the way it is organised in memory. We show how this structure can be constructed in linear time, and we also prove that this data structure requires the least space among those representations that have the same versatility. Not only does the new representation save space, but it is also more time-efficient because it eliminates certain redundancies of a suffix tree. Therefore, some phases of algorithms running on the structure can be eliminated, too.
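The suffix vector itself is not reproduced here. As a much simpler stand-in that illustrates the same family of suffix-based exact matching, the sketch below finds the longest common substring of two documents using a naively sorted list of suffixes (a suffix array built in O(n² log n), versus the linear-time structures described above).

```python
# Stand-in illustration of suffix-based exact matching: find the
# longest common substring of two documents via a naively built
# suffix array over their concatenation. A suffix tree or suffix
# vector supports the same query in linear time.

def longest_common_substring(a, b):
    sep = "\x00"                      # separator absent from both texts
    s = a + sep + b
    # naive suffix array: indices of all suffixes, sorted lexicographically
    suffixes = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    for i, j in zip(suffixes, suffixes[1:]):
        # only compare adjacent suffixes that start in different documents
        if (i < len(a)) == (j < len(a)):
            continue
        # longest common prefix of the two suffixes, not crossing sep
        k = 0
        while (i + k < len(s) and j + k < len(s)
               and s[i + k] == s[j + k] and s[i + k] != sep):
            k += 1
        if k > len(best):
            best = s[i:i + k]
    return best

print(longest_common_substring("identifying overlap", "overlap detection"))
# → overlap
```

The matching engine reports such shared substrings as overlap between the suspicious document and a candidate.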
This project also analyses how existing data structures can be used for document comparison. Sparse suffix trees and directed acyclic graphs generated from suffix trees are discussed in the context of document comparison applications. Some algorithms have been modified to suit the above mentioned data structures.
We have built a prototype system, called MatchDetectReveal (MDR), that implements the algorithms we proposed. The MDR system is capable of efficiently identifying overlapping documents in a large set, and it uses suffix trees and suffix vectors in its core component, the matching engine.
To speed up the comparison of documents we have developed a parallel algorithm to process documents. We have used a general-purpose cluster of commodity workstations with general-purpose middleware to test our algorithms as well as a special-purpose cluster with purpose-built software based on the message-passing model. Results of these tests are presented in this project and they demonstrate both time- and space-efficiency of the proposed algorithms.
The results of this project have been presented at 13+ international and national conferences in Australia, New Zealand, Europe, and North America.
The conceptual architecture of MDR is described below.
Users of the MDR system must be able to submit suspicious documents; they do so through an Internet-based interface. Submitted documents can be compared to the documents of the local repository, or to documents on the Internet or in other digital libraries, e.g. the ACM DL or IEEE DL. Once these documents are identified, we can download them and compare them to the suspicious document locally. Different indexing techniques will be analysed to find candidate documents efficiently. We will also consider 'intelligent' ways of identifying candidate documents, e.g. by checking the publications of the authors listed in the references.
The matching engine is responsible for comparing the suspicious document to the candidate documents. We note that the current implementation of the matching engine deals only with exact matching; it will be supplemented with a 'Similarity and Overlap Rule Interpreter', which will define how to handle rephrasing, changed names of localities, and the substitution of synonyms for certain words.
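One rule such an interpreter could apply is sketched below: mapping words to canonical synonyms before exact matching, so that a rephrased passage still matches. This is a hypothetical illustration of the planned component, and the synonym table is invented for the example.

```python
# Hypothetical sketch of one rule the planned 'Similarity and Overlap
# Rule Interpreter' could apply: replace each word with a canonical
# synonym before exact matching, so rephrased text still matches.
# The synonym table below is invented for illustration.

SYNONYMS = {"large": "big", "huge": "big", "rapid": "fast", "quick": "fast"}

def normalise(text):
    """Replace each word by its canonical synonym, if one is defined."""
    return " ".join(SYNONYMS.get(w, w) for w in text.lower().split())

print(normalise("A rapid analysis of large documents"))
# → a fast analysis of big documents

# both phrases normalise to the same string, so an exact matcher
# would now flag them as overlapping
assert normalise("rapid large files") == normalise("quick huge files")
```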
With the constant increase in the processing power of commodity workstations, clusters of workstations are widely used for parallel processing. Instead of analysing suspicious documents sequentially, we can schedule these tasks on a cluster, which enables us to analyse more documents in parallel, resulting in lower response time.
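The project ran its comparisons on workstation clusters with message-passing software; as a single-machine analogy of the same scheduling idea, the sketch below farms document comparisons out to a pool of workers so several candidates are analysed concurrently. The similarity measure is a placeholder, not MDR's matching engine.

```python
# Single-machine analogy of scheduling comparisons across a cluster:
# each candidate document is compared to the suspicious one on a pool
# of workers. compare() is a placeholder (word-set Jaccard similarity),
# standing in for the real matching engine.
from concurrent.futures import ThreadPoolExecutor

def compare(suspicious, candidate):
    """Placeholder comparison: fraction of shared words (Jaccard)."""
    a, b = set(suspicious.split()), set(candidate.split())
    return len(a & b) / len(a | b)

def compare_all(suspicious, candidates, workers=4):
    """Compare the suspicious document to all candidates in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda doc: compare(suspicious, doc), candidates))

docs = [
    "suffix trees are efficient",
    "unrelated words entirely",
    "suffix trees save space",
]
print(compare_all("suffix trees are space efficient", docs))
# → [0.8, 0.0, 0.5]
```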
Not only can we use dedicated workstation clusters for processing documents in parallel, but we can also utilise idle workstations. When we define the set of candidate documents, we can also identify idle workstations in close proximity to each document. Those workstations are likely to require less time to download the documents they are responsible for, so response time is again reduced while workstation utilisation increases.
The visualizer component of the system is responsible for presenting the end user with the parts of the document that are plagiarised. Currently, overlap is reported as positions in the text together with the lengths of the matching chunks.
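Given overlap in that form, a visualizer can be sketched as follows: each (position, length) pair is wrapped in markers so the plagiarised chunks stand out. The bracket markers and the example spans are illustrative only.

```python
# Sketch of rendering overlap reported as (position, length) pairs:
# wrap each matched chunk in markers so plagiarised text stands out.
# Marker characters and the example spans are illustrative.

def highlight(text, matches, open_mark="[", close_mark="]"):
    """Insert markers around each (position, length) overlap chunk."""
    out, prev = [], 0
    for pos, length in sorted(matches):
        out.append(text[prev:pos])                          # unmatched text
        out.append(open_mark + text[pos:pos + length] + close_mark)
        prev = pos + length
    out.append(text[prev:])                                 # trailing text
    return "".join(out)

print(highlight("plagiarism detection with suffix trees", [(0, 10), (26, 12)]))
# → [plagiarism] detection with [suffix trees]
```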
The Generator is a supplementary component of the system that generates a sufficient number of documents to test our algorithms. You can define the size and overall plagiarism content of the document to be generated, as well as the number of files to be used as plagiarism sources and the size of the chunks. It is also possible to define the number of words to be substituted by their synonyms, to test more sophisticated forms of plagiarism.
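The idea behind the Generator can be sketched as follows. All parameter names and default values here are invented for the illustration; the actual component offers more controls, such as synonym substitution.

```python
# Illustrative sketch of the Generator idea: build a test document of
# a given size in which a chosen fraction of the words is copied, in
# fixed-size chunks, from source files. Parameter names and defaults
# are invented for this example.
import random

def generate(sources, n_words=50, plag_fraction=0.4, chunk_words=5, seed=0):
    """Mix random filler words with chunks copied from source texts."""
    rng = random.Random(seed)
    words = []
    copied, copied_target = 0, int(n_words * plag_fraction)
    while len(words) < n_words:
        if copied < copied_target:
            # copy a chunk of chunk_words consecutive words from a source
            src = rng.choice(sources).split()
            start = rng.randrange(max(1, len(src) - chunk_words))
            chunk = src[start:start + chunk_words]
            words.extend(chunk)
            copied += len(chunk)
        else:
            # pad the rest with original (non-plagiarised) filler words
            words.append("filler%d" % rng.randrange(1000))
    return " ".join(words[:n_words])

src = "the quick brown fox jumps over the lazy dog again and again"
print(generate([src]))
```

The resulting document has a known size and a known plagiarised fraction, so the matching engine's output can be checked against ground truth.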
Our Papers and Presentations
Some search engines for papers and publications:
Some Related Papers
Related Web Sites about Plagiarism
Maintained by Arkady Zaslavsky. Last updated: 09-Apr-2003 19:13