Posted on 2024-10-31, 18:44. Authored by Ruey-Cheng Chen, Chia-Jung Lee, W. Bruce Croft
We study the problem of static index pruning in a well-known divergence minimization framework, using a range of divergence measures, such as f-divergence and Rényi divergence, as the objective. We show that many well-known divergence measures are convex in the pruning decisions and can therefore be exactly minimized with an efficient algorithm. Our approach allows postings to be prioritized according to the amount of information they contribute to the index; specifying a different divergence measure models that contribution on a different returns curve. In our experiments on GOV2 data, Rényi divergence of order infinity appears the most effective. This divergence measure significantly outperforms many standard methods and achieves retrieval effectiveness identical to that of the full index using only 50% of the postings. When top-k precision is the only concern, 10% of the data is sufficient to achieve the accuracy one would usually expect from a full index.
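To make the idea concrete, here is a minimal sketch of contribution-based static pruning. It scores each posting by an illustrative KL-style information contribution (how strongly the term distinguishes its document's language model from the collection model) and keeps the top fraction; the scoring function and the `prune_index` helper are assumptions for illustration, not the paper's exact divergence objective or algorithm.

```python
import math

def prune_index(postings, keep_fraction=0.5):
    """Keep the postings with the highest information contribution.

    postings: list of (term, doc, tf) tuples. The score below is an
    illustrative KL-style contribution, not the paper's exact
    divergence-minimization objective.
    """
    total = sum(tf for _, _, tf in postings)
    cf, doc_len = {}, {}
    for term, doc, tf in postings:
        cf[term] = cf.get(term, 0) + tf         # collection frequency of term
        doc_len[doc] = doc_len.get(doc, 0) + tf  # document length

    def contribution(entry):
        term, doc, tf = entry
        p = tf / doc_len[doc]   # term probability in the document model
        q = cf[term] / total    # term probability in the collection model
        return p * math.log(p / q)

    ranked = sorted(postings, key=contribution, reverse=True)
    return ranked[:int(len(ranked) * keep_fraction)]
```

With `keep_fraction=0.5` the sketch mirrors the headline result above: half the postings are retained, prioritized by how much information each contributes relative to the collection background.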