RMIT University

Sample sizes for query probing in uncooperative distributed information retrieval

conference contribution
posted on 2024-10-30, 16:50 authored by Milad Shokouhi, Falk Scholer, Justin Zobel
The goal of distributed information retrieval is to support effective searching over multiple document collections. For efficiency, queries should be routed to only those collections that are likely to contain relevant documents, so it is necessary to first obtain information about the content of the target collections. In an uncooperative environment, query probing, in which randomly chosen queries are used to retrieve a sample of the documents and thus of the lexicon, has been proposed as a technique for estimating statistical term distributions. In this paper we rebut the claim that a sample of 300 documents is sufficient to provide good coverage of collection terms. We propose a novel sampling strategy and experimentally demonstrate that sample size needs to vary from collection to collection, that our methods achieve good coverage based on variable-sized samples, and that we can use the results of a probe to determine when to stop sampling.
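The idea of query probing with a data-driven stopping point can be illustrated with a minimal sketch. This is not the authors' sampling strategy, which the abstract does not specify; it is a toy under stated assumptions. The `search` function is a hypothetical stand-in for a collection's search interface, and the stopping rule (halt when the last few probes contribute almost no unseen terms) along with the parameters `k`, `window`, `min_new_terms`, and `max_probes` are illustrative choices only.

```python
import random


def search(collection, query, k=4):
    """Hypothetical search interface: return up to k ids of documents
    that contain the query term (documents are whitespace-tokenised)."""
    hits = [doc_id for doc_id, text in enumerate(collection)
            if query in text.split()]
    return hits[:k]


def probe(collection, seed_term, k=4, window=3, min_new_terms=1,
          max_probes=100, rng=None):
    """Sample documents by issuing single-term queries.

    Each new query term is drawn at random from the vocabulary seen so far.
    Assumed stopping rule: stop once the last `window` probes together
    added fewer than `min_new_terms` previously unseen terms.
    """
    rng = rng or random.Random(0)   # fixed seed keeps the sketch repeatable
    sampled = set()                 # ids of documents retrieved so far
    vocab = set()                   # terms observed in the sampled documents
    recent_gains = []               # unseen terms contributed by each probe
    query = seed_term
    for _ in range(max_probes):
        gain = 0
        for doc_id in search(collection, query, k):
            if doc_id not in sampled:
                sampled.add(doc_id)
                terms = set(collection[doc_id].split())
                gain += len(terms - vocab)
                vocab |= terms
        recent_gains.append(gain)
        if len(recent_gains) >= window and sum(recent_gains[-window:]) < min_new_terms:
            break                   # vocabulary growth has levelled off
        if not vocab:
            break                   # seed term matched nothing, cannot continue
        query = rng.choice(sorted(vocab))
    return sampled, vocab


if __name__ == "__main__":
    docs = [
        "distributed information retrieval",
        "query probing samples documents",
        "retrieval of documents by query",
        "term statistics for collection selection",
    ]
    sampled_ids, seen_terms = probe(docs, "retrieval")
    print(len(sampled_ids), "documents sampled,", len(seen_terms), "terms seen")
```

Because the stopping point depends on how quickly new terms stop appearing, the number of documents sampled differs from collection to collection rather than being fixed in advance, which is the behaviour the abstract argues for.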

Related Materials

  1. ISBN - Is published in 9783540311423 (urn:isbn:9783540311423)

Start page

63

End page

75

Total pages

13

Outlet

Proceedings of the 8th Asia-Pacific Web Conference (APWeb 2006)

Editors

X. Zhou, J. Li, H. T. Shen, M. Kitsuregawa and Y. Zhang

Name of conference

Asia-Pacific Web Conference

Publisher

Springer

Place published

Berlin, Germany

Start date

2006-01-16

End date

2006-01-18

Language

English

Copyright

© Springer-Verlag Berlin Heidelberg 2006

Former Identifier

2006001977

Esploro creation date

2020-06-22

Fedora creation date

2009-04-08