RMIT University

The accurate prediction of disordered regions in protein sequences using machine learning approaches

thesis
posted on 2024-11-23, 02:39 authored by Peng Fei Han
A major challenge in the post-genome era is to determine the function of proteins. The traditional structure-function paradigm assumes that the function of a protein is contingent on its folding into a stable three-dimensional structure. However, many proteins contain intrinsically unstructured or Disordered Regions (DRs) under physiological conditions, yet they still carry out important functions. Determining the disordered regions in proteins is therefore an important step towards determining their functions. Traditional experimental approaches are generally time-consuming and expensive. Efficient and cost-effective computer-aided automatic prediction of DRs is thus an attractive alternative. To this end, we propose the novel application of machine learning models and physicochemical features extracted from protein sequences for predicting long, short and global disorder in proteins.

To make disorder prediction easier to interpret, rule-based predictors are proposed, which not only predict DRs but also quantify previously unknown associations between order/disorder status and sequences. The prediction process is transparent and simple to explain.

As DRs of different lengths possess different properties, we propose separate predictors for long, short and global disorder in order to achieve high prediction accuracy. These predictors differ from each other in their features, the machine learning models used, and the methods of prediction. We thoroughly investigate the database of amino acid indices of physicochemical properties and select the indices most correlated with disorder.
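As a rough illustration of selecting the indices most correlated with disorder, the sketch below ranks amino acid scales by the absolute Pearson correlation between per-residue index values and binary disorder labels. It is not the procedure used in the thesis: the choice of Pearson correlation, the two example scales (Kyte-Doolittle hydropathy and a simplified net-charge scale), the function names, and the toy sequence and labels are all assumptions made for this example.

```python
from math import sqrt

# Kyte-Doolittle hydropathy scale (one well-known amino acid index).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified net charge at neutral pH (assumption: histidine treated as neutral).
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_indices(sequence, labels, indices):
    """Rank scales by |correlation| between index values and disorder labels.

    labels: 1 for a disordered residue, 0 for an ordered residue.
    Returns (name, correlation) pairs, strongest association first.
    """
    scores = {}
    for name, scale in indices.items():
        values = [scale[aa] for aa in sequence]
        scores[name] = pearson(values, labels)
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy input, made up for illustration only (not real annotation data).
toy_sequence = "IIVVLLKKEEDD"
toy_labels = [0] * 6 + [1] * 6
ranking = rank_indices(toy_sequence, toy_labels,
                       {"hydropathy": HYDROPATHY, "charge": CHARGE})
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

In practice the correlation would be computed over a labelled dataset of many proteins rather than a single sequence, and a different association measure could be substituted for Pearson correlation.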

Based on these properties, novel feature transforms, including autocorrelation and wavelet transforms (WTs), are applied to DR prediction. According to the results of cross-validation tests, our long DR predictor based on autocorrelation achieves the highest prediction accuracy among long DR predictors, with an AUC (area under the ROC curve) of 89.5%. A short DR predictor based on WTs achieves an AUC of 88.7%, which is comparable to the most accurate short DR predictors. The global DR predictor achieves an AUC of 96.1%, close to the optimal value.
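To give a concrete sense of an autocorrelation feature transform, the following minimal sketch maps a sequence to numeric values through one amino acid index and computes normalised autocorrelation at several lags, both for a whole segment and in a sliding window around each residue. The thesis's actual feature definition is not given in this abstract, so the normalisation, the window size, the maximum lag, the use of the Kyte-Doolittle hydropathy scale, and the function names are all assumptions made for this example.

```python
# Kyte-Doolittle hydropathy scale, used here only as an example index.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation_features(sequence, scale, max_lag=3):
    """Normalised autocorrelation of index values at lags 1..max_lag.

    Each feature is the mean product of centred values `lag` residues apart,
    divided by the variance. Returns 0.0 for a lag when the segment is too
    short or has zero variance.
    """
    values = [scale[aa] for aa in sequence]
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    variance = sum(c * c for c in centred) / n
    features = []
    for lag in range(1, max_lag + 1):
        if variance == 0 or lag >= n:
            features.append(0.0)
            continue
        cov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / (n - lag)
        features.append(cov / variance)
    return features


def per_residue_features(sequence, scale, window=7, max_lag=3):
    """One feature vector per residue, computed on a window centred on it.

    Windows are truncated at the sequence ends rather than padded.
    """
    half = window // 2
    rows = []
    for i in range(len(sequence)):
        segment = sequence[max(0, i - half): i + half + 1]
        rows.append(autocorrelation_features(segment, scale, max_lag))
    return rows


# An alternating hydrophobic/charged pattern gives strongly periodic features.
print(autocorrelation_features("IKIKIKIK", HYDROPATHY))
print(len(per_residue_features("MKTAYIAKQRQISFVK", HYDROPATHY)))
```

The resulting per-residue vectors could then be passed to any classifier; which model the long DR predictor uses is not stated here.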

A major bottleneck in large-scale DR prediction is time efficiency, which is limited by slow feature generation stages and complicated prediction methods. Both our long and short DR predictors are built from simple prediction methods and feature spaces. Our web service for long DR prediction can process an uploaded file containing multiple sequences.

History

Degree Type

Doctorate by Research

Imprint Date

2011-01-01

School name

School of Science, RMIT University

Former Identifier

9921861410201341

Open access

  • Yes
