RMIT University

Compact features for detection of near duplicates in distributed retrieval

conference contribution
posted on 2024-10-30, 16:51 authored by Yaniv Bernstein, Milad Shokouhi, Justin Zobel
In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system.
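The abstract does not define the grainy hash vector, so the following is only a minimal sketch of the general idea of pruning near-duplicates from a merged result list using a small fixed-size signature per document. It is not the authors' construction. The min-hash-over-word-shingles scheme, the names `signature`, `matching_slots`, and `prune`, and the parameters `NUM_SLOTS`, `BITS`, `SHINGLE`, and `threshold` are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib

# All three parameters are illustrative assumptions, not values from the paper.
NUM_SLOTS = 16  # number of slots in each signature
BITS = 4        # bits kept per slot, so a signature fits in 64 bits
SHINGLE = 3     # words per shingle


def _hash(text: str, seed: int) -> int:
    """Seeded 64-bit hash of a string (one 'hash function' per seed)."""
    data = f"{seed}:{text}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(text: str) -> tuple[int, ...]:
    """Build a compact signature: for each seed, keep the low BITS bits
    of the smallest shingle hash (a min-hash-style selection)."""
    words = text.lower().split()
    count = max(1, len(words) - SHINGLE + 1)
    shingles = {" ".join(words[i:i + SHINGLE]) for i in range(count)}
    mask = (1 << BITS) - 1
    return tuple(
        min(_hash(s, seed) for s in shingles) & mask
        for seed in range(NUM_SLOTS)
    )


def matching_slots(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Count the positions at which two signatures agree."""
    return sum(x == y for x, y in zip(a, b))


def prune(results: list[tuple[str, tuple[int, ...]]], threshold: int = 12) -> list[str]:
    """Walk a merged, ranked result list and drop any document whose
    signature agrees with an already kept one in >= threshold slots."""
    kept: list[tuple[str, tuple[int, ...]]] = []
    for doc_id, sig in results:
        if all(matching_slots(sig, other) < threshold for _, other in kept):
            kept.append((doc_id, sig))
    return [doc_id for doc_id, _ in kept]
```

In a cooperative setting of the kind the abstract describes, each collection would presumably return such a signature alongside each answer, so the merging broker can compare documents without fetching their text. Because only a few bits are kept per slot, unrelated documents agree in some slots by chance, which is why the sketch requires agreement in most slots rather than in any single one.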

Outlet

Proceedings of the 13th International Conference on String Processing and Information Retrieval, SPIRE 2006

Editors

F. Crestani, P. Ferragina & M. Sanderson

Name of conference

International Conference on String Processing and Information Retrieval

Publisher

Springer

Place published

Germany

Start date

2006-09-29

End date

2006-09-29

Language

English

Copyright

© Springer-Verlag Berlin Heidelberg 2006

Former Identifier

2006001981

Esploro creation date

2020-06-22

Fedora creation date

2009-04-08
