RMIT University

Beyond AI4DB: A Practical Exploration of Learned and Classical Techniques for Query Optimization in Database Systems

thesis
posted on 2025-10-23, 05:01 authored by Hai Lan
Modern database systems face growing performance challenges due to increasingly complex data and the limitations of hand-crafted heuristics. The emerging field of AI for Databases (AI4DB) investigates how machine learning techniques can be applied to core DBMS components, such as indexing, cardinality estimation, and query optimization, to improve adaptability, accuracy, and efficiency. While these approaches show significant promise, their practical deployment often faces challenges such as integration overhead and limited generalization. To address these limitations, this thesis explores a practical spectrum of techniques, encompassing both learned and classical methods, that are designed to operate effectively in realistic database settings and deliver robust performance.

First, we address the foundational task of data access by revisiting learned indexes in the disk-resident setting. Through systematic evaluation, we show that state-of-the-art in-memory learned indexes fail to outperform traditional B+-trees on disk due to I/O inefficiency, costly structural modifications, and suboptimal storage layouts. To overcome these limitations, we propose AULID, a fully on-disk updatable learned index that combines learned models in inner nodes with B+-tree-style leaf nodes. This hybrid design reduces tree height and lookup I/O while supporting efficient updates and scans, achieving superior performance and storage efficiency across a wide range of workloads.

Second, we develop practical cardinality estimators tailored to complex data types where traditional methods often fall short. For high-dimensional data, we introduce a reference-based framework that estimates query cardinality using a small set of representative reference objects. Within this framework, we propose two complementary methods: one non-learning-based and one learning-based. For string data, we propose a classifier-based estimator that reformulates the cardinality estimation task for LIKE queries as a classification problem. The method integrates a novel stacked filter-based architecture with formal error guarantees, delivering accurate and efficient estimates while significantly reducing construction overhead compared to existing approaches.

Third, we enhance the query optimizer’s plan selection process without modifying the underlying optimizer. We propose a practical two-stage framework. In the first stage, Plan Candidate Generation, we retrieve high-quality execution plans from a precomputed pool using similarity-based search. In the second stage, Plan Ranking, we apply a list-wise neural ranking model to select the most efficient plan from the candidate set. This approach leverages plan reuse and contextualized ranking to improve end-to-end query performance while preserving the compatibility and stability of existing optimization frameworks.

These contributions collectively demonstrate that practical performance improvements can be achieved by thoughtfully applying both learned and classical techniques. By targeting realistic deployment scenarios and emphasizing modularity, robustness, and efficiency, this thesis offers actionable solutions that advance the practical frontiers of AI4DB.
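The two-stage plan selection described in the abstract (similarity-based candidate retrieval from a precomputed pool, followed by list-wise ranking) might be sketched roughly as follows. This is an illustrative toy, not the thesis's implementation: the pool layout, the `query_vec` embeddings, the `est_cost` field, and the functions `generate_candidates` and `rank_plans` are all hypothetical names, and a simple cost-based scoring function stands in for the neural ranking model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def generate_candidates(query_vec, plan_pool, k):
    """Stage 1 (assumed form): return the k pool entries whose stored
    query embedding is most similar to the incoming query's embedding."""
    ranked = sorted(
        plan_pool,
        key=lambda entry: cosine(query_vec, entry["query_vec"]),
        reverse=True,
    )
    return ranked[:k]

def rank_plans(query_vec, candidates, score_fn):
    """Stage 2 (assumed form): score every candidate, normalise the scores
    over the whole list with a softmax, and pick the top plan.
    score_fn is a placeholder for a learned list-wise ranker."""
    scores = [score_fn(query_vec, c) for c in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs

# Hypothetical precomputed pool: each entry pairs a plan with the embedding
# of the query it was originally produced for, plus a made-up cost estimate.
pool = [
    {"plan": "hash_join(A,B)", "query_vec": [1.0, 0.0, 0.2], "est_cost": 12.0},
    {"plan": "nested_loop(A,B)", "query_vec": [0.9, 0.1, 0.3], "est_cost": 40.0},
    {"plan": "index_scan(C)", "query_vec": [0.0, 1.0, 0.0], "est_cost": 3.0},
]

query = [1.0, 0.05, 0.25]
candidates = generate_candidates(query, pool, k=2)
# Stand-in scorer: prefer lower estimated cost (a real ranker would be learned).
best, probs = rank_plans(query, candidates, lambda q, c: -c["est_cost"])
print(best["plan"])
```

Because both stages only consume and return plans, a scheme of this shape can sit beside an existing optimizer rather than inside it, which is the compatibility property the abstract emphasizes.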

History

Degree Type

Doctorate by Research

Imprint Date

2025-07-31

School name

Computing Technologies, RMIT University

Copyright

© Hai Lan 2025
