Modern database systems face growing performance challenges due to increasingly complex
data and the limitations of hand-crafted heuristics. The emerging field of AI for
Databases (AI4DB) investigates how machine learning techniques can be applied to core
DBMS components, such as indexing, cardinality estimation, and query optimization, to
improve adaptability, accuracy, and efficiency. While these approaches show significant
promise, their practical deployment often faces challenges such as integration overhead
and limited generalization. To address these limitations, this thesis explores a practical
spectrum of techniques, spanning both learned and classical methods, designed to operate
effectively in realistic database settings and to deliver robust performance.
First, we address the foundational task of data access by revisiting learned indexes
in the disk-resident setting. Through systematic evaluation, we show that state-of-the-art
in-memory learned indexes fail to outperform traditional B+-trees on disk due to I/O inefficiency,
costly structural modifications, and suboptimal storage layouts. To overcome these
shortcomings, we propose AULID, a fully on-disk, updatable learned index that combines
learned models in inner nodes with B+-tree-style leaf nodes. This hybrid design
reduces tree height and lookup I/O while supporting efficient updates and scans, achieving
superior performance and storage efficiency across a wide range of workloads.
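To make the hybrid design concrete, the sketch below is a minimal, in-memory Python illustration, not AULID's actual on-disk implementation: a simple learned model in the inner level predicts which leaf page a key belongs to, while leaves remain sorted, bounded pages in the B+-tree style, so inserts stay local and never retrain the inner model. The class names, the single linear inner model, and the boundary-key fallback are illustrative assumptions.

```python
import bisect

class LeafPage:
    """B+-tree-style leaf: a sorted, bounded page supporting lookups and local inserts."""
    def __init__(self, keys, capacity=256):
        self.keys = sorted(keys)
        self.capacity = capacity

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

    def insert(self, key):
        if len(self.keys) >= self.capacity:
            return False  # a real implementation would split the page, as a B+-tree does
        bisect.insort(self.keys, key)
        return True

class HybridIndex:
    """Inner level: a linear model mapping a key to a leaf-page slot.
    Leaf level: classic sorted pages, so updates never touch the learned model."""
    def __init__(self, keys, page_capacity=256):
        keys = sorted(keys)  # assumes a non-empty set of numeric keys
        self.pages = [LeafPage(keys[i:i + page_capacity], page_capacity)
                      for i in range(0, len(keys), page_capacity)]
        self.first_keys = [p.keys[0] for p in self.pages]
        lo, hi = keys[0], keys[-1]
        self.slope = (len(self.pages) - 1) / max(hi - lo, 1)
        self.intercept = -self.slope * lo

    def _page_of(self, key):
        guess = int(round(self.slope * key + self.intercept))
        guess = min(max(guess, 0), len(self.pages) - 1)
        # If the model's guess misses, fall back to a search over page boundary keys;
        # an on-disk design would instead bound the probe by the model's error.
        hit = (self.first_keys[guess] <= key and
               (guess == len(self.pages) - 1 or key < self.first_keys[guess + 1]))
        if not hit:
            guess = max(0, bisect.bisect_right(self.first_keys, key) - 1)
        return self.pages[guess]

    def lookup(self, key):
        return self._page_of(key).lookup(key)

    def insert(self, key):
        return self._page_of(key).insert(key)
```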
Second, we develop practical cardinality estimators tailored to complex data types
where traditional methods often fall short. For high-dimensional data, we introduce a
reference-based framework that estimates query cardinality using a small set of representative
reference objects. Within this framework, we propose two complementary methods:
one non-learning-based and one learning-based. For string data, we propose a classifier-based
estimator that reformulates the cardinality estimation task for LIKE queries as a classification problem. The method integrates a novel stacked filter-based architecture
with formal error guarantees, delivering accurate and efficient estimates while significantly
reducing construction overhead compared to existing approaches.
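For the high-dimensional case, the sketch below illustrates the reference-based idea in its simplest non-learning form: each data object is summarized by its distances to a few reference objects (pivots), per-pivot distance histograms are built offline, and at query time the triangle inequality turns a radius query into one 1-D distance interval per pivot, whose estimated selectivities are combined under an independence assumption. The function names and these specific modeling choices are illustrative assumptions, not the exact estimators developed in the thesis.

```python
import numpy as np

def build_pivot_summaries(data, pivots, n_bins=64):
    """Offline: for each reference object (pivot), histogram the distances
    from all data objects to that pivot.  data: (n, d), pivots: (k, d)."""
    summaries = []
    for p in pivots:
        dists = np.linalg.norm(data - p, axis=1)
        hist, edges = np.histogram(dists, bins=n_bins)
        summaries.append((hist / len(data), edges))
    return summaries

def estimate_range_cardinality(query, radius, pivots, summaries, n_total):
    """Online: estimate how many objects lie within `radius` of `query`.
    By the triangle inequality, an object o can qualify only if
    d(q, p) - radius <= d(o, p) <= d(q, p) + radius for every pivot p;
    the per-pivot fractions are read off the histograms and multiplied
    under an independence assumption."""
    selectivity = 1.0
    for p, (probs, edges) in zip(pivots, summaries):
        dq = np.linalg.norm(query - p)
        lo, hi = dq - radius, dq + radius
        centers = (edges[:-1] + edges[1:]) / 2.0
        selectivity *= probs[(centers >= lo) & (centers <= hi)].sum()
    return selectivity * n_total

# Toy usage: 10k points in 16 dimensions, 4 randomly chosen pivots.
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 16))
pivots = data[rng.choice(len(data), size=4, replace=False)]
summaries = build_pivot_summaries(data, pivots)
estimate = estimate_range_cardinality(data[0], 3.0, pivots, summaries, len(data))
```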
Third, we enhance the query optimizer’s plan selection process without modifying the
underlying optimizer. We propose a practical two-stage framework. In the first stage,
Plan Candidate Generation, we retrieve high-quality execution plans from a precomputed
pool using similarity-based search. In the second stage, Plan Ranking, we apply a list-wise
neural ranking model to select the most efficient plan from the candidate set. This approach
leverages plan reuse and contextualized ranking to improve end-to-end query performance
while preserving the compatibility and stability of existing optimization frameworks.
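The sketch below outlines how such a two-stage pipeline can be wired together, assuming a query featurizer, an offline-collected plan pool, and a trained scoring model are available. All names here (retrieve_candidates, rank_and_pick, plan_featurizer, toy_scorer) are hypothetical stand-ins, and the simple joint softmax scorer only gestures at the list-wise neural ranking model described above.

```python
import numpy as np

def retrieve_candidates(query_vec, pool, k=5):
    """Stage 1, Plan Candidate Generation: similarity search over a precomputed
    plan pool, where `pool` is a list of (query_feature_vector, plan) pairs
    collected offline from previously optimized and executed queries."""
    dists = np.array([np.linalg.norm(query_vec - qv) for qv, _ in pool])
    return [pool[i][1] for i in np.argsort(dists)[:k]]

def rank_and_pick(query_vec, candidates, plan_featurizer, scorer):
    """Stage 2, Plan Ranking: featurize each candidate together with the query,
    score the whole candidate set at once, and return the top-ranked plan."""
    feats = np.stack([plan_featurizer(query_vec, plan) for plan in candidates])
    scores = scorer(feats)  # one score per candidate, computed over the full list
    return candidates[int(np.argmax(scores))]

def toy_scorer(feats, w=None):
    """Placeholder list-wise scorer: linear scores normalized with a softmax
    over the candidate list, so each score depends on the whole set."""
    w = np.ones(feats.shape[1]) if w is None else w
    raw = feats @ w
    exp = np.exp(raw - raw.max())
    return exp / exp.sum()
```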
These contributions collectively demonstrate that practical performance improvements
can be achieved by thoughtfully applying both learned and classical techniques. By targeting
realistic deployment scenarios and emphasizing modularity, robustness, and efficiency,
this thesis offers actionable solutions that advance the practical frontiers of AI4DB.