RMIT University

Search Effectiveness in Nonredundant Sequence Databases: Assessments and Solutions

journal contribution
posted on 2024-11-02, 04:56 authored by Qingyu Chen, Xiuzhen Zhang, Yu Wan, Justin Zobel, Cornelia Verspoor
Duplicate sequence records - that is, records having similar or identical sequences - are a challenge when searching biological sequence databases. They significantly increase database search time and can lead to uninformative search results containing similar sequences. Sequence clustering methods have been used to address this issue by grouping similar sequences into clusters. These clusters form a nonredundant database consisting of representatives (one record per cluster) and members (the remaining records in a cluster). In this approach, for nonredundant database search, users search against representatives first and optionally expand search results by exploring member records from matching clusters. Existing studies used Precision and Recall to assess the search effectiveness of nonredundant databases. However, the use of Precision and Recall does not model user behavior in practice and thus may not reflect practical search effectiveness. In this study, we first propose innovative evaluation metrics to measure search effectiveness. The findings are that (1) the Precision of expanded sets is consistently lower than that of representatives, with a decrease of up to 7% at top ranks; and (2) Recall is uninformative because, for most queries, expanded sets return more records than does a search of the original unclustered databases. Motivated by these findings, we propose a solution that returns a user-specified proportion of top similar records, modeled by a ranking function that aggregates sequence and annotation similarities. In experiments undertaken on UniProtKB/Swiss-Prot, the largest expert-curated protein database, we show that our method dramatically reduces the number of returned sequences, increases Precision by 3%, and does not impact effective search time.
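The proposed solution can be illustrated with a minimal sketch. The abstract does not give the ranking function itself, so everything below is an assumption made for illustration: the `Hit` record, the [0, 1] similarity scales, the weighted mean with weight `alpha`, and the function names are all hypothetical and are not the authors' implementation. The sketch ranks the member records of a matching cluster by an aggregate of sequence and annotation similarity, keeps a user-specified proportion of them, and computes Precision at a rank cutoff.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Hit:
    """A member record of a matching cluster (hypothetical structure)."""
    record_id: str
    seq_sim: float  # sequence similarity to the query, assumed to lie in [0, 1]
    ann_sim: float  # annotation similarity to the query, assumed to lie in [0, 1]


def rank_score(hit, alpha=0.5):
    # Assumed aggregation: a weighted mean of the two similarities.
    # The ranking function used in the paper may differ.
    return alpha * hit.seq_sim + (1 - alpha) * hit.ann_sim


def top_proportion(members, proportion, alpha=0.5):
    """Return the top `proportion` of members, ranked by aggregate similarity."""
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    ranked = sorted(members, key=lambda h: rank_score(h, alpha), reverse=True)
    keep = ceil(proportion * len(ranked))
    return ranked[:keep]


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k returned records that are relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for r in top if r in relevant_ids) / len(top)


# Toy data, invented purely for illustration.
members = [
    Hit("a", 0.9, 0.2),
    Hit("b", 0.8, 0.9),
    Hit("c", 0.5, 0.5),
    Hit("d", 0.95, 0.9),
]
selected = top_proportion(members, proportion=0.5)
selected_ids = [h.record_id for h in selected]
```

With a proportion of 0.5, only half of the member records are returned to the user, which is how a method of this kind would reduce the size of expanded result sets while favouring the records most similar to the query.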

History

Related Materials

  1. DOI - Is published in 10.1089/cmb.2018.0198
  2. ISSN - Is published in 10665277

Journal

Journal of Computational Biology

Volume

26

Issue

6

Start page

605

End page

617

Total pages

13

Publisher

Mary Ann Liebert

Place published

United States

Language

English

Copyright

© 2019, Mary Ann Liebert, Inc

Former Identifier

2006094245

Esploro creation date

2020-06-22

Fedora creation date

2020-04-09
