RMIT University

Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

journal contribution
posted on 2024-11-02, 18:47 authored by Aparna Elangovan, Yuan Li, Douglas Pires, Melissa Davis, Cornelia Verspoor
Motivation: Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, and this annotation is mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation and aid human curation by extracting PPIs along with their pairwise PTM from the literature, using deep learning trained on distantly supervised data.

Method: We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high confidence predictions.

Results and conclusion: The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546,507 unique PTM-PPI triplets), which we filtered to ≈ 5700 (4584 unique) high confidence predictions. Human evaluation on a small randomly sampled subset of the 5700 shows that the precision drops to 33.7% despite confidence calibration, which highlights the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
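
The filtering idea described in the abstract, keeping a prediction only when the ensemble's average confidence is high and the variation across ensemble members is low, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function name `select_high_quality`, the use of the population standard deviation as the measure of variation, and the threshold values `min_confidence` and `max_std` are all assumptions made for the example.

```python
from statistics import mean, pstdev


def select_high_quality(ensemble_probs, min_confidence=0.9, max_std=0.05):
    """Decide whether one example's ensemble prediction is kept.

    ensemble_probs: one list of class probabilities per ensemble member,
    e.g. 10 lists for a 10-model ensemble. The thresholds are hypothetical
    placeholders, not values taken from the paper.
    """
    n_classes = len(ensemble_probs[0])

    # Average each class probability over the ensemble members.
    avg = [mean([m[c] for m in ensemble_probs]) for c in range(n_classes)]

    # The predicted class is the one with the highest average confidence.
    pred = max(range(n_classes), key=lambda c: avg[c])

    # Variation: how much the members disagree on the predicted class.
    spread = pstdev([m[pred] for m in ensemble_probs])

    # Keep only predictions that are both confident and stable.
    keep = avg[pred] >= min_confidence and spread <= max_std
    return pred, avg[pred], spread, keep


# Three hypothetical ensemble members, two classes.
stable = [[0.02, 0.98], [0.03, 0.97], [0.01, 0.99]]
unstable = [[0.0, 1.0], [0.2, 0.8], [0.0, 1.0]]

print(select_high_quality(stable))
print(select_high_quality(unstable))
```

In this sketch the second example is rejected even though its average confidence exceeds the threshold, because the members disagree too much. In practice the thresholds would have to be tuned on held-out data, for instance per PTM class to account for class imbalance.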

Funding

Automated assessment of data quality in biological knowledge resources

Australian Research Council

Biochemical text mining for advancing chemical and pharmaceutical knowledge

Australian Research Council

Related Materials

  1. DOI - Is published in 10.1186/s12859-021-04504-x
  2. ISSN - Is published in 1471-2105

Journal

BMC Bioinformatics

Volume

23

Number

4

Issue

1

Start page

1

End page

23

Total pages

23

Publisher

Springer

Place published

United Kingdom

Language

English

Copyright

© The Author(s), 2021. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License

Former Identifier

2006113604

Esploro creation date

2022-05-17
