RMIT University
Collection selection for managed distributed document databases

journal contribution
posted on 2024-10-31, 23:40 authored by Daryl D'Souza, James Thom, Justin Zobel
In a distributed document database system, a query is processed by passing it to a set of individual collections and collating the responses. For a system with many such collections, it is attractive to first identify a small subset of collections likely to hold documents of interest, and then to interrogate only this subset in more detail. A widely investigated method for choosing collections is the use of a selection index, which captures broad information about each collection and its documents. In this paper, we re-evaluate several techniques for collection selection. We have constructed new sets of test data that reflect one way in which distributed collections would be used in practice, in contrast to the more artificial division into collections reported in much previous work. On these managed collections, collection ranking based on document surrogates is more effective than techniques such as CORI that are based on collection lexicons. Moreover, these experiments demonstrate that conclusions drawn from artificial collections are of questionable validity.

Related Materials

  1. ISSN - Is published in 0306-4573

Journal

Information Processing and Management

Volume

40

Issue

3

Start page

527

End page

546

Total pages

20

Publisher

Pergamon

Place published

UK

Language

English

Copyright

Copyright © 2003 Elsevier Ltd. All rights reserved.

Former Identifier

2004001696

Esploro creation date

2020-06-22

Fedora creation date

2009-02-27
