RMIT University

“Note Bloat” impacts deep learning-based NLP models for clinical prediction tasks

journal contribution
posted on 2024-11-02, 21:25 authored by Jinghui Liu, Daniel Capurro, Anthony Nguyen, Cornelia Verspoor
One unintended consequence of Electronic Health Record (EHR) implementation is the overuse of content-importing technology, such as copy-and-paste, which creates “bloated” notes containing large amounts of textual redundancy. Despite the rising interest in applying machine learning models to learn from real patient data, it is unclear how note bloat might affect the Natural Language Processing (NLP) models derived from these notes. Therefore, in this work we examine the impact of redundancy on deep learning-based NLP models, considering four clinical prediction tasks using a publicly available EHR database. We applied two deduplication methods to the hospital notes, identifying large quantities of redundancy, and found that removing the redundancy usually has little negative impact on downstream performance and can, in certain circumstances, help models achieve significantly better results. We also showed that it is possible to attack model predictions simply by adding note duplicates, which changes correct predictions made by trained models into wrong ones. In conclusion, we demonstrated that EHR text redundancy substantively affects NLP models for clinical prediction tasks, showing that awareness of clinical contexts and robust modeling methods are important for creating effective and reliable NLP systems in healthcare.
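
The abstract does not specify which two deduplication methods were used, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a patient's notes arrive as a chronologically ordered list of strings and treats a line as redundant when an identical line (after whitespace and case normalization) appeared earlier. The function names `deduplicate_notes` and `add_duplicates` are hypothetical; the second illustrates the kind of duplicate-adding perturbation the abstract describes.

```python
from typing import List


def deduplicate_notes(notes: List[str]) -> List[str]:
    """Drop every line that already appeared earlier in the note sequence.

    Assumption: redundancy is detected by exact match after lowercasing
    and collapsing whitespace. The paper's actual methods may differ.
    """
    seen = set()
    deduped = []
    for note in notes:
        kept = []
        for line in note.splitlines():
            key = " ".join(line.lower().split())
            if not key:
                continue  # skip blank lines
            if key in seen:
                continue  # line was copied forward from earlier text
            seen.add(key)
            kept.append(line.strip())
        deduped.append("\n".join(kept))
    return deduped


def add_duplicates(notes: List[str], copies: int = 1) -> List[str]:
    """Append verbatim repeats of the last note (hypothetical perturbation)."""
    if not notes:
        return []
    return notes + [notes[-1]] * copies


# Toy example with made-up text, not data from the study.
notes = [
    "Patient admitted with chest pain.\nHistory of hypertension.",
    "History of hypertension.\nTroponin negative.",
]
print(deduplicate_notes(notes))
print(add_duplicates(notes, copies=2))
```

Exact line matching is the simplest possible choice here; near-duplicate detection (for example, fuzzy or n-gram matching) would catch lightly edited copies but needs a similarity threshold that this sketch does not attempt to choose.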

Related Materials

  1. DOI - Is published in 10.1016/j.jbi.2022.104149
  2. ISSN - Is published in 15320464

Journal

Journal of Biomedical Informatics

Volume

133

Number

104149

Start page

1

End page

11

Total pages

11

Publisher

Elsevier

Place published

United States

Language

English

Copyright

© 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Former Identifier

2006118702

Esploro creation date

2023-01-30
