RMIT University

An ensemble learning approach for addressing the class imbalance problem in twitter spam detection

conference contribution
posted on 2024-11-03, 15:09 authored by Shigang Liu, Yu Wang, Chao Chen, Yang Xiang
As an important source of real-time information in recent years, Twitter is inevitably a prime target for spammers. It has been shown that the damage caused by Twitter spam can reach far beyond the social media platform itself. To mitigate the threat, many recent studies use machine learning techniques to classify Twitter spam and report very satisfactory results. However, most of these studies overlook a fundamental issue that is widely seen in real-world Twitter data: the class imbalance problem. In this paper, we show that the unequal distribution between spam and non-spam classes in the data has a great impact on the spam detection rate. To address the problem, we propose an ensemble learning approach that involves three steps. In the first step, we adjust the class distribution in the imbalanced data set using various strategies, including random oversampling, random undersampling and fuzzy-based oversampling. In the next step, a classification model is built on each of the redistributed data sets. In the final step, a majority voting scheme is used to combine all the classification models. Experimental results obtained using real-world Twitter data indicate that the proposed approach can significantly improve the spam detection rate in data sets with imbalanced class distribution.
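
The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the k-nearest-neighbour base classifier and the toy data are all assumptions made for the example, and the interpolation-based oversampler is only a simple stand-in for the paper's fuzzy-based oversampling, whose details are not given here.

```python
import random
from collections import Counter


def split_by_class(X, y):
    """Group feature rows by their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def flatten(groups):
    """Turn {label: rows} back into parallel X and y lists."""
    X, y = [], []
    for label, rows in groups.items():
        X.extend(rows)
        y.extend([label] * len(rows))
    return X, y


def random_oversample(X, y, rng):
    """Duplicate random rows of smaller classes up to the largest class size."""
    groups = split_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    return flatten({
        label: rows + [rng.choice(rows) for _ in range(target - len(rows))]
        for label, rows in groups.items()
    })


def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    groups = split_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    return flatten({label: rng.sample(rows, target) for label, rows in groups.items()})


def interpolation_oversample(X, y, rng):
    """ASSUMPTION: stand-in for fuzzy-based oversampling. Synthetic rows are
    drawn on the line segment between two random rows of the same class."""
    groups = split_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    out = {}
    for label, rows in groups.items():
        synthetic = []
        for _ in range(target - len(rows)):
            a, b, t = rng.choice(rows), rng.choice(rows), rng.random()
            synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
        out[label] = rows + synthetic
    return flatten(out)


def knn_predict(train_X, train_y, row, k=3):
    """Hypothetical base classifier: majority label among the k nearest rows."""
    def sq_dist(pair):
        return sum((a - b) ** 2 for a, b in zip(pair[0], row))
    nearest = sorted(zip(train_X, train_y), key=sq_dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def build_ensemble(X, y, seed=0):
    """Steps 1 and 2: one redistributed training set (one k-NN model) per strategy."""
    rng = random.Random(seed)
    strategies = (random_oversample, random_undersample, interpolation_oversample)
    return [strategy(X, y, rng) for strategy in strategies]


def ensemble_predict(models, row):
    """Step 3: majority vote over the individual model predictions."""
    votes = [knn_predict(mx, my, row) for mx, my in models]
    return Counter(votes).most_common(1)[0][0]


# Toy imbalanced data (made up for illustration): 20 non-spam rows, 4 spam rows.
X = [[i * 0.1, j * 0.1] for i in range(5) for j in range(4)]
X += [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.3]]
y = ["non-spam"] * 20 + ["spam"] * 4

models = build_ensemble(X, y, seed=1)
print(ensemble_predict(models, [5.0, 5.0]))
```

Using an odd number of resampling strategies keeps the two-class majority vote free of ties. In practice the base classifier and the features would be chosen for the Twitter data at hand.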

History

Related Materials

  1. DOI - Is published in 10.1007/978-3-319-40253-6_13
  2. ISBN - Is published in 9783319402529 (urn:isbn:9783319402529)

Start page

215

End page

228

Total pages

14

Outlet

Proceedings of the 21st Australasian Conference on Information Security and Privacy

Name of conference

21st Australasian Conference on Information Security and Privacy

Publisher

Springer

Place published

Cham, Switzerland

Start date

2016-07-04

End date

2016-07-06

Language

English

Former Identifier

2006117997

Esploro creation date

2023-03-30
