Balance-Aware Distributed String Similarity-Based Query Processing System
conference contribution
posted on 2024-10-31, 22:09 authored by Ji Sun, Zeyuan Shang, Guoliang Li, Dong Deng, Zhifeng Bao
Data analysts spend more than 80% of their time on data cleaning and integration over the whole data analytics process, owing to data errors and inconsistencies. Similarity-based query processing is an important way to tolerate these errors and inconsistencies. However, it is rather costly, and traditional databases cannot meet such expensive requirements. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports four core similarity operations: similarity selection, similarity join, top-k selection, and top-k join. Dima extends SQL so that users can easily invoke these similarity-based operations in their data analysis tasks. To avoid expensive data transmission in a distributed environment, we propose balance-aware signatures, where two records are similar only if they share common signatures, and the signatures can be selected adaptively to balance the workload. Dima builds signature-based global indexes and local indexes to support similarity operations. Since Spark is one of the most widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support complex similarity-based query processing on large-scale datasets. We have conducted extensive experiments on four real-world datasets. Experimental results show that Dima outperforms state-of-the-art studies by 1-3 orders of magnitude and has good scalability.
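The signature idea in the abstract (records can only be similar if they share a signature) can be illustrated with a small single-machine sketch. This is not Dima's balance-aware signature scheme, whose details are not given here. It substitutes classic prefix filtering for Jaccard similarity over word tokens as a stand-in, and the function names (`prefix_signatures`, `similarity_join`) are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_signatures(tokens, threshold, order):
    """Prefix-filter signatures (a stand-in, not Dima's balance-aware ones).

    Tokens are sorted by a global order; if two sets have Jaccard >= threshold,
    their prefixes of length |r| - ceil(threshold * |r|) + 1 share a token.
    """
    ranked = sorted(tokens, key=lambda t: order[t])
    # Small epsilon guards against floating-point overshoot in the product.
    prefix_len = len(ranked) - math.ceil(threshold * len(ranked) - 1e-9) + 1
    return ranked[:prefix_len]


def similarity_join(records, threshold):
    """Self-join: return index pairs (i, j), i < j, with Jaccard >= threshold."""
    token_sets = [set(r.lower().split()) for r in records]

    # Global token order: rare tokens first, so signatures are selective.
    freq = defaultdict(int)
    for tokens in token_sets:
        for t in tokens:
            freq[t] += 1
    ranked_tokens = sorted(freq, key=lambda t: (freq[t], t))
    order = {t: i for i, t in enumerate(ranked_tokens)}

    # Signature index: signature -> ids of records that carry it.
    index = defaultdict(list)
    for rid, tokens in enumerate(token_sets):
        for sig in prefix_signatures(tokens, threshold, order):
            index[sig].append(rid)

    # Filter: only records sharing a signature become candidate pairs.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(rids, 2))

    # Verify: compute the exact similarity for each candidate pair.
    return sorted(
        (i, j)
        for i, j in candidates
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )
```

In a distributed setting the signature would also serve as the partitioning key, so that only records sharing a signature meet on the same worker. Choosing signatures so that the partitions stay evenly loaded is the part this sketch does not attempt.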
Funding
Continuous and summarised search over evolving heterogeneous data
45th International Conference on Very Large Data Bases 2019
Publisher
VLDB Endowment
Place published
New York, USA
Start date
2019-08-26
End date
2019-08-30
Language
English
Copyright
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s).