RMIT University

Inferential risk measures in information retrieval

thesis
posted on 2024-11-24, 04:54 authored by Rodger Benham
When the effectiveness of a search ranker is evaluated, an effectiveness metric is computed by averaging how well the ranker retrieved relevant documents over a set of topics in a test collection.  If a search practitioner wishes to test the hypothesis that one ranking algorithm is more effective than another, a null hypothesis statistical test is traditionally used.  Rankers are typically tested one pair at a time, yet many effective alternative rankers are often available to provide background context on the likelihood of one ranker outperforming another.  However, current approaches for inferential testing over the outcomes of many rankers tend to reduce statistical power, so much so that prior work has questioned whether correcting for this family-wise error renders inferential analyses ineffectual with the available IR test collections.  Regardless, the demand for multiple comparison correction continues to grow in scientific venues, to avoid conclusions being drawn under false pretenses.

Another recent development in IR evaluation is the improved awareness that retrieval effectiveness varies substantially over topics for different rankers: a challenger system that performs better on average may still be risky to adopt in place of a champion system because of outlying topics.  Since seminal works in economic theory show that people are more sensitive to losses than to gains, users may perceive that a system selected for improved mean effectiveness is less effective overall if rankings that were previously effective no longer are.  Risk overlays aim to support replacing rankers with the joint goal of a net improvement without unacceptable drops in effectiveness.  Current inferential risk overlays evaluate pairs of systems; this thesis extends them towards a multiple-system testing approach.
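To make the loss-aversion idea concrete: one widely used family of risk overlays weights per-topic losses against a champion more heavily than gains. The following is a minimal sketch in that style (the function name, `alpha` parameter, and example scores are illustrative, not taken from the thesis):

```python
def risk_adjusted_score(challenger, champion, alpha=1.0):
    """Risk-sensitive overlay over per-topic effectiveness scores.

    challenger, champion: per-topic metric values (e.g. AP or nDCG)
    for the same topic set.  Per-topic gains count as-is; losses are
    amplified by a factor of (1 + alpha), reflecting loss aversion.
    """
    assert len(challenger) == len(champion)
    total = 0.0
    for c, b in zip(challenger, champion):
        delta = c - b
        if delta >= 0:
            total += delta                 # gain: counted once
        else:
            total += (1 + alpha) * delta   # loss: amplified
    return total / len(challenger)
```

With `alpha=0` this collapses to the plain mean difference; larger `alpha` penalizes a challenger whose average gain hides occasional severe per-topic losses.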
The thesis begins by investigating the shape of risk-adjusted score distributions, with the insight that their skewness may violate the parametric assumptions of common inferential risk testing approaches.  Secondly, it explores, for the first time in an IR context, how amenable Bayesian inference is to the problem of multiple comparison correction, factoring in the above need to handle skewness appropriately in the case of risk analysis.  A novel Bayesian hierarchical modeling approach is applied to IR scores, combining the ability to use many systems as background information with the skewness properties of risk-adjusted scores.  Finally, having found that directly modeling risk-adjusted scores resulted in low statistical power, the thesis explores modeling IR scores directly and evaluating risk as an inferential summary statistic over the posterior predictive distribution.  The results indicate that this method improves the discriminative capacity of risk inference over many systems while retaining the corrective properties of Bayesian hierarchical modeling.
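The final step above, treating risk as a summary statistic over the posterior predictive distribution rather than modeling risk-adjusted scores directly, can be sketched as follows. This assumes posterior predictive draws of per-topic scores are already available as arrays (how they are obtained, and the exact risk formulation used in the thesis, are not specified here; the loss-amplified statistic below is an illustrative stand-in):

```python
import numpy as np

def posterior_risk_summary(pred_challenger, pred_champion, alpha=1.0):
    """Evaluate a risk-sensitive statistic on posterior predictive draws.

    pred_challenger, pred_champion: arrays of shape (n_draws, n_topics),
    one row per posterior predictive sample of per-topic scores.
    Returns the risk value for each draw, from which credible intervals
    or P(risk > 0) can be read off directly.
    """
    delta = pred_challenger - pred_champion        # (n_draws, n_topics)
    gains = np.maximum(delta, 0.0)                 # positive part
    losses = np.minimum(delta, 0.0)                # negative part
    return (gains + (1 + alpha) * losses).mean(axis=1)

# e.g. the posterior probability the challenger is risk-safe:
# (posterior_risk_summary(draws_a, draws_b) > 0).mean()
```

Because the skewness-inducing transformation is applied after sampling, the underlying score model stays on the raw metric scale, which is consistent with the power gains the abstract reports.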

History

Degree Type

Doctorate by Research

Imprint Date

2023-01-01

School name

School of Computing Technologies, RMIT University

Former Identifier

9922258311901341

Open access

  • Yes