Posted on 2025-11-20, 01:05; authored by Marwah Alaofi
<p dir="ltr">The vision of 'ideal' test collections called for diversity in both documents and queries. Although the creation of test collections, such as those developed in TREC, was largely inspired by this vision, it has not been fully realised. Constructing test collections that capture this variability, and therefore enable user-centred evaluation, is an expensive and labour-intensive process.</p><p dir="ltr">This thesis investigates using LLMs to assist in constructing test collections that reflect query variation by: (1) generating query variants to create diversified document pools that enable system evaluation across user queries; (2) producing relevance judgements for the resulting large document pools; and (3) simulating queries from different user profiles and demonstrating their impact on system evaluation.</p><p dir="ltr">Results show that LLMs can generate query variants that retrieve a set of documents similar to those retrieved using human-generated variants. They can also create variants that represent different user profiles. These variants are used to challenge the traditional view of test collections as mere system-ranking tools, showing that they can instead be used to understand how different users experience search. Our evaluation of LLMs for relevance labelling suggests that they agree with human judgements at levels comparable to human-to-human agreement and produce similar system rankings. However, they are more positive than humans, are likely to be fooled by the presence of query words, and can struggle to distinguish effectively between systems, mainly in recognising meaningful performance improvements.</p><p dir="ltr">This thesis contributes to the realisation of ideal test collections that reflect the diversity in the queries of search users. 
Such collections provide a means to investigate how systems perform for different users, and thereby have the potential to enhance the experience of, or mitigate biases against, particular user groups.</p>