RMIT University

On the cost of extracting proximity features for term-dependency models

conference contribution
posted on 2024-10-31, 18:43 authored by Xiao Lu Lu, Alistair Moffat, Shane Culpepper
Sophisticated ranking mechanisms make use of term dependency features in order to compute similarity scores for documents. These features often include exact phrase occurrences and term proximity estimates. Both build on the intuition that if multiple query terms appear near each other, the document is more likely to be relevant to the query. In this paper we examine the processes used to compute these statistics. Two distinct input structures can be used: inverted files and direct files. Inverted files must store the position offsets of the terms, while direct files represent each document as a sequence of preprocessed term identifiers. Based on these two input modalities, a number of algorithms can be used to compute proximity statistics. Until now, these algorithms have been described in terms of a single set of query terms. But similarity computations such as the Full Dependency Model compute proximity statistics for a collection of related term sets. We present a new approach in which such collections are processed holistically, in much less time than would be needed if each subquery were evaluated independently. The benefits of the new method are demonstrated by a comprehensive experimental study.
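The two input structures and the per-subquery baseline mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the method presented in the paper. It assumes one simple proximity statistic (the number of sliding windows of `width` consecutive positions that contain every term of a set at least once), and all function names (`positions_from_direct`, `windows_direct`, `windows_inverted`, `naive_subquery_counts`) are hypothetical. The last function evaluates every term subset of size two or more independently, which is the kind of repeated work a holistic approach would aim to avoid.

```python
from bisect import bisect_left
from collections import Counter, defaultdict
from itertools import combinations


def positions_from_direct(doc, terms):
    """Scan a direct-file document (list of term ids) and build the
    per-term position lists that a positional inverted file would store."""
    wanted = set(terms)
    postings = defaultdict(list)
    for pos, term in enumerate(doc):
        if term in wanted:
            postings[term].append(pos)
    return postings


def windows_direct(doc, terms, width):
    """Count sliding windows of `width` positions that contain every term,
    using one left-to-right pass over the direct file.
    Assumed statistic; documents shorter than `width` yield zero."""
    wanted = set(terms)
    in_window = Counter()   # occurrences of each wanted term in the window
    covered = 0             # number of distinct wanted terms in the window
    total = 0
    for right, term in enumerate(doc):
        if term in wanted:
            in_window[term] += 1
            if in_window[term] == 1:
                covered += 1
        left = right - width  # position that has just left the window
        if left >= 0 and doc[left] in wanted:
            in_window[doc[left]] -= 1
            if in_window[doc[left]] == 0:
                covered -= 1
        if right >= width - 1 and covered == len(wanted):
            total += 1
    return total


def windows_inverted(postings, terms, width, doc_len):
    """Compute the same statistic from sorted position lists: for each
    window start, binary-search each term's list for a position inside."""
    total = 0
    for start in range(doc_len - width + 1):
        for term in set(terms):
            plist = postings.get(term, [])
            i = bisect_left(plist, start)
            if i == len(plist) or plist[i] >= start + width:
                break  # this term does not occur inside the window
        else:
            total += 1
    return total


def naive_subquery_counts(doc, query, width):
    """Baseline: evaluate every term subset of size >= 2 independently."""
    return {
        subset: windows_direct(doc, subset, width)
        for k in range(2, len(query) + 1)
        for subset in combinations(query, k)
    }


doc = [1, 2, 3, 1, 2, 4, 3]   # toy document as a sequence of term ids
query = (1, 2, 3)
postings = positions_from_direct(doc, query)
print(windows_direct(doc, query, 3))
print(windows_inverted(postings, query, 3, len(doc)))
print(naive_subquery_counts(doc, query, 3))
```

In this sketch the direct-file scan touches every position of the document once per subquery, whereas the inverted-file variant touches only the stored positions of the query terms but must probe them for each window. A query of n terms produces 2^n - n - 1 subsets of size two or more, so the independent evaluation repeats much of the same scanning.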

Start page

293

End page

302

Total pages

10

Outlet

Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015)

Name of conference

CIKM 2015

Publisher

Association for Computing Machinery

Place published

New York, United States

Start date

2015-10-19

End date

2015-10-23

Language

English

Copyright

© ACM 2015

Former Identifier

2006055838

Esploro creation date

2020-06-22

Fedora creation date

2015-11-11
