RMIT University

Efficient indexing algorithms for approximate pattern matching in text

conference contribution
posted on 2024-10-31, 17:01 authored by Matthias Petri, Shane Culpepper
Approximate pattern matching is an important computational problem with a wide variety of applications in Information Retrieval. Efficient solutions to approximate pattern matching can be applied to natural language keyword queries with spelling mistakes, OCR scanned text incorporated into indexes, language model ranking algorithms based on term proximity, or DNA databases containing sequencing errors. In this paper, we present a novel approach to constructing text indexes capable of efficiently supporting approximate search queries. Our approach relies on a new variant of the Context Bound Burrows-Wheeler Transform (k-bwt), referred to as the Variable Depth Burrows-Wheeler Transform (v-bwt). First, we describe our new algorithm, and show that it is reversible. Next, we show how to use the transform to support efficient text indexing and approximate pattern matching. Lastly, we empirically evaluate the use of the v-bwt for DNA and English text collections, and show a significant improvement in approximate search efficiency over more traditional q-gram based approximate pattern matching algorithms.
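For readers unfamiliar with the k-bwt that the abstract builds on, the following is a minimal Python sketch of a context-bound transform next to the full Burrows-Wheeler Transform. It is not the paper's v-bwt, whose construction is not described in this record. The function names `bwt` and `k_bwt`, the `$` sentinel, and the tie-breaking rule (rotations with equal k-character prefixes keep their original text order) are assumptions made for illustration.

```python
def bwt(text: str) -> str:
    """Full BWT: sort all rotations, then take the last column."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The last character of the rotation starting at i is text[i - 1].
    return "".join(text[i - 1] for i in order)


def k_bwt(text: str, k: int) -> str:
    """Context-bound sketch: sort rotations by their first k characters only.

    Assumption: ties keep original text order, which Python's stable
    sort provides when the key is just the k-character prefix.
    """
    n = len(text)
    doubled = text + text  # lets a prefix wrap around the end of the text
    order = sorted(range(n), key=lambda i: doubled[i:i + k])
    return "".join(text[i - 1] for i in order)


if __name__ == "__main__":
    sample = "banana$"  # '$' is an assumed unique end-of-text sentinel
    print(bwt(sample))
    print(k_bwt(sample, 1))
    print(k_bwt(sample, len(sample)))
```

When k is at least the text length and the sentinel makes every rotation distinct, the bounded sort sees whole rotations and the sketch produces the same output as the full transform; smaller k sorts on less context.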

Related Materials

  1. ISBN - Is published in 9781450314114 (urn:isbn:9781450314114)

Start page

9

End page

16

Total pages

8

Outlet

The Seventeenth Australasian Document Computing Symposium, ADCS '12, Dunedin, New Zealand, December 5-6, 2012

Editors

Andrew Trotman, Sally Jo Cunningham and Laurianne Sitbon

Name of conference

The Seventeenth Australasian Document Computing Symposium, ADCS '12

Publisher

ACM

Place published

New York

Start date

2012-10-29

End date

2012-11-02

Language

English

Copyright

© ACM 2012

Former Identifier

2006038871

Esploro creation date

2020-06-22

Fedora creation date

2013-01-07
