RMIT University

A scalable system for identifying co-derivative documents

conference contribution
posted on 2024-10-30, 14:27 authored by Yaniv Bernstein, Justin Zobel
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present SPEX, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe DECO, a prototype system that makes use of SPEX. Our experiments with several document collections demonstrate the effectiveness of the approach.
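
As an illustration of the general fingerprinting idea described in the abstract (matching documents by the hash values of their chunks), the following is a minimal sketch. It is not SPEX or DECO, whose details are not given on this page. The chunk definition (overlapping runs of three words), the use of MD5, and the function names `chunks`, `fingerprint`, and `shared_chunk_counts` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every overlapping run of `size` consecutive words (assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def fingerprint(text, size=3):
    """Return the set of hash values of the document's chunks."""
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks(text, size)}


def shared_chunk_counts(docs, size=3):
    """Map each pair of document ids to the number of chunk hashes they share."""
    index = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for h in fingerprint(text, size):
            index[h].add(doc_id)
    counts = defaultdict(int)
    for ids in index.values():
        # Only chunks that occur in two or more documents contribute a pair.
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps away",
    "c": "completely unrelated text about something else entirely",
}
print(shared_chunk_counts(docs))
```

Pairs with a high shared-chunk count are candidates for being co-derivative. Note that only chunks occurring in more than one document contribute to the result, which is the kind of information the abstract says is useful to isolate.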

Related Materials

  1. ISBN - Is published in 9783540232100 (urn:isbn:9783540232100)

Start page

55

End page

57

Total pages

3

Outlet

String Processing and Information Retrieval: 11th International Conference, SPIRE 2004

Editors

A. Apostolico and M. Melucci

Name of conference

International Conference on String Processing and Information Retrieval

Publisher

Springer

Place published

Berlin, Germany

Start date

2004-12-07

End date

2004-12-07

Language

English

Copyright

© Springer-Verlag Berlin Heidelberg 2004

Former Identifier

2004000512

Esploro creation date

2020-06-22

Fedora creation date

2009-04-08
