Posted on 2024-11-24, 03:12. Authored by Ameer Tawfik Abdullah Al Bahem.
Many real-world searches involve complex information needs, such as exploratory, comparative, or survey-oriented searches. In these scenarios, users engage interactively with dynamic search systems to address their information needs, and they provide various forms of feedback about the relevance of search results, such as explicit relevance ratings, text highlighting, clicks, and dwell times. User satisfaction in such search tasks depends on several properties of search quality, including topical relevance, coverage of subtopics (diversity), and minimal user effort.
In the pursuit of more effective search systems, it is essential to assess the suitability of evaluation metrics for a given interactive search scenario and to estimate the contribution of the search process components to search effectiveness. In interactive (dynamic) search, the soundness of metrics and their ability to capture desirable properties of search quality have received comparatively little attention. Furthermore, although many interactive search systems have used different approaches to exploit various types and structures of user feedback, the contributions of individual components, such as user feedback and the dynamic search strategy, to overall effectiveness are still not well understood. This thesis focuses on addressing these research gaps.
In evaluating dynamic search systems, researchers have utilized metrics that model some or all properties of search quality. However, some metrics can provide contradictory and counter-intuitive results. For instance, we evaluated runs submitted to the TREC 2016 Dynamic Domain track using various metrics and found that the Cube Test (CT) and alpha-nDCG metrics gave opposite results: CT preferred a baseline method over a manual run with perfect precision, whereas alpha-nDCG preferred the manual run. As a result, in this thesis, we developed a case-analysis framework to define and study fundamental properties that seem integral to any evaluation metric. An example of a simple property is that a ranking with only one non-relevant document should never score lower than a ranking with two non-relevant documents. The framework facilitates quantifying the ability of metrics to satisfy properties, both separately and simultaneously, and identifying the cases where metrics violate the properties. Our analysis shows that the Average Cube Test and Intent-Aware Average Precision metrics produce counter-intuitive results with respect to some of these properties.
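To make this style of case analysis concrete, the sketch below (not the thesis framework itself) checks the example property mechanically: starting from the same relevant documents, adding a second non-relevant document should never raise a metric's score above the version with only one. The toy DCG stand-in and all function names here are illustrative assumptions.

```python
import math

def dcg(ranking):
    """Toy stand-in metric: discounted cumulative gain over relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ranking))

def insert(ranking, pos, grade=0):
    """Return a copy of the ranking with a document of the given grade inserted."""
    return ranking[:pos] + (grade,) + ranking[pos:]

def check_property(metric, base=(1, 1, 1)):
    """Enumerate all pairs (one non-relevant doc inserted, then a second one
    added) and collect the cases where the metric scores the two-non-relevant
    ranking higher, i.e., where it violates the property."""
    violations = []
    for i in range(len(base) + 1):
        one_nonrel = insert(base, i)            # ranking with one non-relevant doc
        for j in range(len(one_nonrel) + 1):
            two_nonrel = insert(one_nonrel, j)  # same ranking plus a second one
            if metric(one_nonrel) < metric(two_nonrel):
                violations.append((one_nonrel, two_nonrel))
    return violations

print(check_property(dcg))  # DCG never rewards the extra non-relevant doc: []
```

The thesis framework operates in this spirit but over richer families of properties and checks them both separately and jointly; substituting a metric such as the Average Cube Test for the toy DCG is how the reported violations would surface.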
To further understand how dynamic search can be evaluated, as the second contribution of this thesis, we studied the ability of metrics to capture three properties that affect user satisfaction: topical relevance, diversity, and effort. We analyzed the metrics that model these properties, such as the Cube Test (CT), Expected Utility (EU), and session Discounted Cumulative Gain (sDCG), alongside diversity metrics such as alpha-nDCG, as well as their normalized versions. In studying these metrics, we adapted two meta-analysis frameworks: the Intuitiveness Test and Metric Unanimity. We also studied how well these two frameworks agree with each other. Our analysis indicated that the normalized Cube Test (nCT) captures all three dimensions better than the other metrics, whereas alpha-nDCG captures topical relevance and diversity better than the rest.
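As a rough illustration of how the Intuitiveness Test operates (a simplified sketch of the general recipe, not the exact adaptation used in the thesis): for every pair of runs on which two candidate metrics disagree about which run is better, we check which of the two sides with a simple gold-standard metric that directly encodes a single property, such as precision for topical relevance. All names and signatures below are assumptions for illustration.

```python
def sign(x):
    """Sign of a score difference: 1, 0, or -1."""
    return (x > 0) - (x < 0)

def intuitiveness(metric_a, metric_b, gold, run_pairs):
    """For run pairs where metric_a and metric_b disagree on which run is
    better, count how often each sides with the gold-standard metric.
    Returns (fraction for metric_a, fraction for metric_b), or None if the
    two candidate metrics never disagree on the given pairs."""
    agree_a = agree_b = disagreements = 0
    for run1, run2 in run_pairs:
        pref_a = sign(metric_a(run1) - metric_a(run2))
        pref_b = sign(metric_b(run1) - metric_b(run2))
        if pref_a == pref_b:          # only disagreement cases are informative
            continue
        disagreements += 1
        pref_gold = sign(gold(run1) - gold(run2))
        agree_a += (pref_a == pref_gold)
        agree_b += (pref_b == pref_gold)
    if disagreements == 0:
        return None
    return agree_a / disagreements, agree_b / disagreements
```

Repeating this comparison with gold-standard metrics that each encode one of the three dimensions yields a per-dimension profile of which candidate metric is the more intuitive; the thesis complements this view with Metric Unanimity and measures the agreement between the two frameworks themselves.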
In seeking better dynamic search systems, some researchers have explored utilizing passage-based feedback instead of document-based feedback, or utilizing initial retrieval models of varying quality. Recently, several studies have compared reinforcement learning approaches, such as multi-armed bandit algorithms that exploit subtopic-based user feedback, against the Rocchio relevance feedback algorithm. However, as this line of research investigated individual components of interactive search in isolated experimental setups, the effects of search process factors such as topics, the granularity of user feedback (passages vs. documents), and the employed dynamic approach (e.g., Rocchio vs. reinforcement learning) on search effectiveness remain an open question. To answer this question, we utilized a methodology based on ANalysis Of VAriance (ANOVA) to estimate the effects of dynamic search components. Using TREC Dynamic Domain data and the metrics recommended by the first and second contributions of this thesis, we built several statistical models that allowed us to estimate these effects. We first built an ANOVA model focused on the effects of the topic and search system factors, and found that both have significant effects on performance. We then decomposed the system factor into three system-related factors: the initial ranker, the dynamic reranker, and the user feedback. The initial ranker generates an initial ranked list of documents, which is then dynamically reranked by the dynamic reranker exploiting the user feedback. In performing the analysis, we relied on a grid of experiments consisting of all interactive search systems that can be formed by combining instantiations of these three factors. Our analysis shows that the initial ranker and dynamic reranker factors have more prominent effects than the granularity of user feedback. This indicates that, before seeking more granular feedback from users, we could first improve the effectiveness of dynamic search systems by experimenting with different initial rankers and dynamic rerankers.
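A minimal sketch of the kind of factorial ANOVA decomposition this methodology entails, using statsmodels on synthetic scores over a full grid of factor combinations. The factor levels, effect sizes, and scores are all fabricated for illustration; only the modelling pattern (score explained by topic plus the three system-related factors) reflects the analysis described above.

```python
import itertools
import random

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

random.seed(0)

# Build the grid: every combination of topic, initial ranker, dynamic
# reranker, and feedback granularity gets one (synthetic) effectiveness score.
rows = []
for topic, initial, reranker, feedback in itertools.product(
        ["t1", "t2", "t3"], ["bm25", "lm"], ["rocchio", "bandit"],
        ["document", "passage"]):
    score = (0.30                                   # fabricated effect sizes
             + 0.05 * (topic == "t2")
             + 0.08 * (initial == "lm")
             + 0.06 * (reranker == "bandit")
             + 0.01 * (feedback == "passage")
             + random.gauss(0, 0.02))
    rows.append(dict(topic=topic, initial=initial,
                     reranker=reranker, feedback=feedback, score=score))
df = pd.DataFrame(rows)

# Main-effects ANOVA: decompose score variance into the topic factor and the
# three system-related factors.
model = smf.ols("score ~ C(topic) + C(initial) + C(reranker) + C(feedback)",
                data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

The relative sums of squares in such a table are what underpin conclusions about which factors dominate; here the initial ranker and dynamic reranker outweigh feedback granularity by construction, whereas the thesis establishes this empirically on the TREC Dynamic Domain data.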
In summary, this thesis focused on tackling complex information needs via interactive search. The work conducted here allows us to select the most appropriate metrics for evaluating dynamic search systems and to focus on the components that have the most prominent effects on search effectiveness. For choosing suitable metrics, we developed an axiomatic framework that allows researchers to assess metrics' ability to compare rankings intuitively, and we performed extensive analysis, using two meta-analysis frameworks, of how well the metrics model desirable properties such as topical relevance, diversity, and user effort. Based on these analyses, we recommend that the normalized Cube Test (nCT) and alpha-nDCG be used to evaluate dynamic search systems, as they provide complementary information about search quality. Using these metrics, we performed a component-based analysis of the factors that affect performance in dynamic search. As a result, we recommend that researchers and practitioners improve the quality of the initial ranker used to generate the initial ranked list and of the dynamic approach adopted to exploit user feedback before undertaking a process to elicit more detailed feedback from users, an option that can be expensive or invasive.