RMIT University

Balance-Aware Distributed String Similarity-Based Query Processing System

conference contribution
posted on 2024-10-31, 22:09 authored by Ji Sun, Zeyuan Shang, Guoliang Li, Dong Deng, Zhifeng Bao
Data analysts spend more than 80% of their time on data cleaning and integration across the whole data analytics process because of data errors and inconsistencies. Similarity-based query processing is an important way to tolerate these errors and inconsistencies. However, similarity-based query processing is costly, and traditional databases cannot meet such an expensive requirement. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports four core similarity operations: similarity selection, similarity join, top-k selection, and top-k join. Dima extends SQL so that users can easily invoke these similarity-based operations in their data analysis tasks. To avoid expensive data transmission in a distributed environment, we propose balance-aware signatures, where two records are similar only if they share common signatures, and where the signatures can be selected adaptively to balance the workload. Dima builds signature-based global indexes and local indexes to support similarity operations. Since Spark is one of the most widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support complex similarity-based query processing on large-scale datasets. We have conducted extensive experiments on four real-world datasets. Experimental results show that Dima outperforms state-of-the-art approaches by one to three orders of magnitude and scales well.
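The general idea behind signature-based filtering, that similar records must share at least one signature, can be illustrated with a small single-machine sketch. It is not Dima's balance-aware signature scheme, which the abstract does not specify; it substitutes the classic prefix filter for Jaccard similarity over whitespace tokens, and every name in it (`jaccard`, `prefix_signatures`, `similarity_join`) is hypothetical. Records are assumed to be non-empty.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_signatures(tokens, order, threshold):
    """Prefix filter (an assumption, not Dima's scheme): rank tokens by a
    global order and keep the first n - ceil(threshold * n) + 1 of them.
    Two sets with Jaccard >= threshold must share a token in these prefixes."""
    ranked = sorted(tokens, key=lambda t: order[t])
    n = len(ranked)
    # Small epsilon guards against floating-point error pushing ceil() up.
    prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
    return ranked[:prefix_len]


def similarity_join(records, threshold):
    """Return sorted index pairs (i, j), i < j, with Jaccard >= threshold."""
    token_sets = [set(r.lower().split()) for r in records]

    # Global token order: rare tokens first, ties broken alphabetically.
    freq = Counter(t for s in token_sets for t in s)
    order = {t: i for i, t in enumerate(sorted(freq, key=lambda t: (freq[t], t)))}

    # Signature index: signature token -> ids of records carrying it.
    index = defaultdict(list)
    for rid, s in enumerate(token_sets):
        for sig in prefix_signatures(s, order, threshold):
            index[sig].append(rid)

    # Only records that share a signature become candidate pairs.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(rids, 2))

    # Verify each candidate with the exact similarity function.
    return sorted(
        (i, j)
        for i, j in candidates
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )


if __name__ == "__main__":
    records = [
        "balance aware distributed similarity join",
        "balance aware distributed similarity joins",
        "spark query optimization techniques",
    ]
    print(similarity_join(records, 0.6))  # prints [(0, 1)]
```

In a distributed setting, the signature would typically serve as the partitioning key, so that only records sharing a signature are sent to the same worker; how Dima selects signatures to balance that workload is described in the paper itself, not in this sketch.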

Funding

Continuous and summarised search over evolving heterogeneous data

Australian Research Council


Continuous intent tracking for virtual assistance using big contextual data

Australian Research Council


History

Related Materials

  1. DOI - Is published in 10.14778/3329772.3329774
  2. ISSN - Is published in 2150-8097

Start page

961

End page

974

Total pages

14

Outlet

Proceedings of the VLDB Endowment

Name of conference

45th International Conference on Very Large Data Bases 2019

Publisher

VLDB Endowment

Place published

New York, USA

Start date

2019-08-26

End date

2019-08-30

Language

English

Copyright

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s).

Former Identifier

2006094733

Esploro creation date

2020-06-22

Fedora creation date

2019-12-02
