RMIT University

Feature Extraction for Large-Scale Text Collections

conference contribution
posted on 2024-11-03, 13:53 authored by Luke Gallagher, Antonio Mallia, Shane Culpepper, Torsten Suel, Berkant Cambazoglu
Feature engineering is a fundamental but poorly documented component in Learning-to-Rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is a growing interest in the creation of open-access test collections that promote reproducible research. However, there are still few open-source software packages capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often do not expose critical details about the underlying collection or implementation details of the extracted features. Both of these are crucial to collection creation and deployment of a search engine into production. So in this regard, the experiments are rarely reproducible with new features or collections, or helpful for companies wishing to deploy LTR systems. In this paper, we introduce Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can benefit from Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and learn how some static document features and query-specific features are implemented.
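
To make the distinction between static document features and query-specific features concrete, the following is a minimal Python sketch. It does not use Fxt and does not reflect Fxt's API or implementation; the function names (`tokenize`, `collection_stats`, `static_features`, `bm25`), the whitespace tokenizer, and the BM25 parameter defaults are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def collection_stats(docs):
    """Tokenize a small in-memory collection and gather the statistics
    that query-specific features such as BM25 depend on."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    return tokenized, doc_freq, len(tokenized), avg_len


def static_features(doc_tokens):
    """Static (query-independent) features: computed once per document."""
    n = len(doc_tokens)
    return {
        "doc_len": n,
        "unique_terms": len(set(doc_tokens)),
        "avg_term_len": sum(len(t) for t in doc_tokens) / n if n else 0.0,
    }


def bm25(query_tokens, doc_tokens, doc_freq, num_docs, avg_doc_len,
         k1=0.9, b=0.4):
    """Query-specific feature: a common BM25 variant.
    The k1 and b defaults are illustrative assumptions."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        norm = f + k1 * (1 - b + b * len(doc_tokens) / avg_doc_len)
        score += idf * f * (k1 + 1) / norm
    return score


if __name__ == "__main__":
    docs = ["feature extraction for text", "learning to rank"]
    tokenized, doc_freq, num_docs, avg_len = collection_stats(docs)
    query = tokenize("feature extraction")
    for tokens in tokenized:
        row = static_features(tokens)
        row["bm25"] = bm25(query, tokens, doc_freq, num_docs, avg_len)
        print(row)
```

Each printed row is one feature vector for a query-document pair. The static features could be precomputed at indexing time, whereas the BM25 score has to be computed per query.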

Funding

New approaches to interactive sessional search for complex tasks

Australian Research Council

Related Materials

  1. DOI - Is published in 10.1145/3340531.3412773
  2. ISBN - Is published in 9781450368599 (urn:isbn:9781450368599)

Start page

3015

End page

3022

Total pages

8

Outlet

Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM 2020)

Name of conference

CIKM 2020

Publisher

Association for Computing Machinery

Place published

New York, United States

Start date

2020-10-19

End date

2020-10-23

Language

English

Copyright

© 2020 Copyright held by the owner/author(s).

Former Identifier

2006107153

Esploro creation date

2021-06-01
