RMIT University

Accurate Discovery of Co-derivative Documents Via Duplicate Text Detection

journal contribution
posted on 2024-11-01, 02:34 authored by Yaniv Bernstein, Justin Zobel
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present SPEX, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe DECO, a prototype package that combines the SPEX algorithm with other optimisations and compressed indexing to produce a flexible and scalable co-derivative discovery system. Our experiments with multi-gigabyte document collections demonstrate the effectiveness of the approach.
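
To make the idea of matching documents on chunk hashes concrete, the following is a minimal sketch of generic document fingerprinting. It is not the SPEX algorithm or the DECO package, whose details are not given in this abstract. The chunk definition (overlapping runs of three lowercase words), the truncated SHA-256 hash, and the names chunks, chunk_hash, and shared_chunk_counts are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every overlapping run of `size` words (an assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Hash a chunk to a short fixed-width value (truncated SHA-256, an assumption)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


def shared_chunk_counts(docs, size=3):
    """Map each pair of document ids to the number of distinct chunk hashes they share."""
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            postings[chunk_hash(chunk)].add(doc_id)

    counts = defaultdict(int)
    for ids in postings.values():
        # Only hashes seen in two or more documents contribute to a pair.
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy collection: "a" and "b" share a passage, "c" is unrelated.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "an unrelated sentence about compressed indexes",
}
print(shared_chunk_counts(docs))
```

In this sketch every chunk is hashed and indexed, so most index entries belong to chunks that occur in only one document and say nothing about co-derivation. That overhead is the kind of problem the abstract attributes to fingerprinting and that extracting only duplicated chunks is meant to address.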

History

Journal

Information Systems

Volume

31

Start page

595

End page

609

Total pages

15

Publisher

Pergamon

Place published

Oxford

Language

English

Copyright

Copyright © 2005 Elsevier B.V. All rights reserved

Former Identifier

2006000082

Esploro creation date

2020-06-22

Fedora creation date

2009-02-27
