RMIT University

Towards an efficient unsupervised feature selection methods for high-dimensional data

thesis
posted on 2024-11-23, 15:38 authored by Naif Almusallam
With the proliferation of data, the number of data dimensions has increased significantly, producing what is known as high-dimensional data. This increase introduces redundant and non-representative features, which pose challenges to existing machine learning algorithms. Firstly, they add processing time and therefore lengthen the running time of the learning algorithms. Secondly, they reduce the accuracy of the learning algorithms, which overfit the data on these redundant and non-representative features. Lastly, they require greater storage capacity. This thesis is concerned with reducing the data dimensions for machine learning algorithms in order to improve their accuracy and run-time efficiency. The reduction is carried out by selecting a reduced set of representative and non-redundant features that approximates the original feature space. Three research issues are addressed to achieve this aim. The first research task addresses the accurate selection of representative features from high-dimensional data. An efficient and accurate similarity-based unsupervised feature selection method (called AUFS) is proposed to tackle high dimensionality by selecting representative features without the need for class labels.

The proposed AUFS method extends the k-means clustering algorithm to partition the features into k clusters using different similarity measures. The proposed centroid-based feature selection method is then used to select the representative features.
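The clustering idea described above can be illustrated with a minimal sketch. It is not the AUFS implementation: it assumes plain k-means with Euclidean distance (the thesis uses several similarity measures that are not listed here), and the names `select_features` and `euclidean` are hypothetical. Each feature (column) is treated as a point, the features are clustered, and the feature nearest to each centroid is kept.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_features(data, k, iters=50, seed=0):
    """Cluster the columns of `data` with plain k-means and return the
    index of the column nearest to each cluster centroid.

    Assumption: Euclidean distance stands in for the similarity
    measures used in the thesis.
    """
    # Each feature (column) becomes one point with n_instances coordinates.
    features = [list(col) for col in zip(*data)]
    rng = random.Random(seed)
    start = rng.sample(range(len(features)), k)
    centroids = [list(features[i]) for i in start]

    assign = None
    for _ in range(iters):
        # Assignment step: each feature joins its nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: euclidean(f, centroids[c]))
            for f in features
        ]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]

    # Centroid-based selection: keep the member closest to each centroid.
    selected = []
    for c in range(k):
        members = [j for j, a in enumerate(assign) if a == c]
        if members:
            nearest = min(
                members, key=lambda j: euclidean(features[j], centroids[c])
            )
            selected.append(nearest)
    return sorted(selected)


# Rows are instances, columns are features. Columns 0/1 and 2/3 are
# near-duplicates, so one feature from each pair should be kept.
data = [
    [1.0, 1.1, 10.0, 10.2],
    [2.0, 2.1, 0.0, 0.1],
    [3.0, 3.1, 10.0, 10.1],
    [4.0, 4.1, 0.0, 0.0],
]
selected = select_features(data, k=2)
```

No class labels are used at any point, which is what makes the selection unsupervised. The choice of k and of the distance function are the main design decisions left open by this sketch.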

The second research task is intended to select representative features in streaming-feature applications, where the number of features increases while the number of instances remains fixed. Such applications pose challenges for feature selection methods because they have the following characteristics: a) features are generated sequentially and are processed one by one upon arrival, while the number of instances/points remains fixed; and b) the complete feature space is not known in advance. A new method, known as Unsupervised Feature Selection for Streaming Features (UFSSF), is proposed to select representative features under these characteristics. UFSSF further extends the k-means clustering algorithm to decide incrementally whether to add a newly arrived feature to the existing set of representative features. Features that are not representative are discarded.
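The streaming setting can be made concrete with a small sketch. It is a simplified stand-in, not UFSSF itself: instead of an extended k-means, it assumes a fixed threshold on absolute Pearson correlation to decide whether an arriving feature is already represented. The class `StreamingSelector`, the function `abs_correlation`, and the threshold value are all hypothetical.

```python
import math


def abs_correlation(a, b):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_a * var_b)


class StreamingSelector:
    """Keeps a set of representative features as features arrive one by one.

    Assumption: a feature counts as redundant when its absolute correlation
    with any kept feature reaches `threshold`.
    """

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.selected = {}  # feature name -> values over the fixed instances

    def offer(self, name, values):
        """Process one arriving feature; return True if it was kept."""
        for kept in self.selected.values():
            if abs_correlation(values, kept) >= self.threshold:
                return False  # already represented, so discard it
        self.selected[name] = list(values)
        return True


selector = StreamingSelector(threshold=0.9)
kept_a = selector.offer("a", [1, 2, 3, 4])
kept_b = selector.offer("b", [2, 4, 6, 8.1])   # nearly a scaled copy of "a"
kept_c = selector.offer("c", [1, -1, 1, -1])   # weakly correlated with "a"
```

Each feature is examined once on arrival and the full feature space is never needed, which matches characteristics a) and b) above. A discarded feature is not stored, so memory grows only with the number of kept features.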

The last research task involves reducing the dimensionality of multi-view data, where both the number of features and the number of instances can increase over time. Multi-view learning provides complementary information for machine learning algorithms. However, it results in high dimensionality because the data is considered from different views: each extra view adds extra dimensions. Existing solutions assume that the number of views is static; however, this is not realistic in real applications, where new views can be added. Therefore, an Online Unsupervised Feature Selection for Dynamic Views (OUDVFS) method is proposed. As we are targeting unsupervised learning, we propose a new clustering-based feature selection method that incrementally clusters the views. The set of selected representative features is updated at each clustering step.

History

Degree Type

Doctorate by Research

Imprint Date

2018-01-01

School name

School of Science, RMIT University

Former Identifier

9921863713601341

Open access

  • Yes
