RMIT University

Fewer topics? A million topics? Both?! On topics subsets in test collections

journal contribution
posted on 2024-11-02, 11:03 authored by Kevin Roitero, Shane Culpepper, Mark Sanderson, Falk Scholer, Stefano Mizzaro
When evaluating IR run effectiveness using a test collection, a key question is: what search topics should be used? We explore what happens to measurement accuracy when the number of topics in a test collection is reduced, using the Million Query 2007, TeraByte 2006, and Robust 2004 TREC collections, which all feature more than 50 topics, a setting that has not been examined in past work. Our analysis finds that a subset of topics can be found that ranks runs as accurately as the full topic set. Further, we show that the size of the subset, relative to the full topic set, can be substantially smaller than past work suggested. We also study topic subsets in the context of the power of statistical significance tests. We find a trade-off in using such subsets: significant results may be missed, but the loss of statistical significance is much smaller than when selecting random subsets. We also find topic subsets that result in a low-accuracy test collection, even when the number of queries in the subset is quite large. These negatively correlated subsets suggest we still lack methodologies that provide stability guarantees for topic selection in new collections. Finally, we examine whether clustering of topics is an appropriate strategy for finding and characterizing good topic subsets. Our results contribute to the understanding of information retrieval effectiveness evaluation, and offer insights for the construction of test collections.
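The central measurement in the abstract can be sketched in a few lines: rank systems by their mean effectiveness over a topic subset, rank them again over the full topic set, and compare the two rankings with a rank correlation such as Kendall's tau. The sketch below is hypothetical and uses synthetic scores; the paper evaluates real TREC runs, and the system count, topic count, and subset size chosen here are illustrative assumptions, not values from the study.

```python
# Hypothetical sketch: how closely does the system ranking produced by a
# topic subset agree with the ranking produced by the full topic set?
# All scores here are synthetic; the paper uses real TREC collections.
import random

random.seed(0)
n_systems, n_topics = 20, 100  # illustrative sizes, not from the paper

# Synthetic per-topic effectiveness scores (e.g. AP) for each system.
scores = [[random.random() for _ in range(n_topics)] for _ in range(n_systems)]

def mean_scores(topic_ids):
    """Mean effectiveness per system over the given topics."""
    topic_ids = list(topic_ids)
    return [sum(row[t] for t in topic_ids) / len(topic_ids) for row in scores]

def kendall_tau(a, b):
    """Kendall's tau-a rank correlation between two score vectors."""
    concordant = discordant = 0
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

full = mean_scores(range(n_topics))
subset = mean_scores(random.sample(range(n_topics), 20))  # a 20-topic subset

tau = kendall_tau(full, subset)
print(f"Kendall's tau between subset and full-set rankings: {tau:.3f}")
```

A tau near 1 means the subset ranks systems almost exactly as the full set does; the paper's "negatively correlated subsets" correspond to subsets where this correlation is low even though the subset is large.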

Funding

Trajectory data processing: Spatial computing meets information retrieval

Australian Research Council


History

Related Materials

  1. DOI (is published in): 10.1007/s10791-019-09357-w
  2. ISSN (is published in): 1386-4564

Journal

Information Retrieval Journal

Volume

23

Start page

49

End page

85

Total pages

37

Publisher

Springer Science+Business Media

Place published

Germany

Language

English

Copyright

© Springer Nature B.V. 2019

Former Identifier

2006092066

Esploro creation date

2020-06-22

Fedora creation date

2020-04-21
