RMIT University

Language-independent Twitter classification using character-based convolutional networks

conference contribution
posted on 2024-10-31, 20:59 authored by Shiwei Zhang, Xiuzhen Zhang, Jeffrey Chan
Most research on Twitter classification focuses on English tweets, yet Twitter supports over 40 languages and about 50% of tweets are not in English. To make full use of Twitter content, it is important to develop classifiers that can handle multilingual tweets and mixed-language tweets (for example, tweets mainly in Chinese that contain English words). Translation-based models are the classical approach to multilingual or cross-lingual text classification. Character-based neural models have recently been shown to be effective for text classification, but they are designed for a limited set of European languages and require language identification in order to build an alphabet for encoding and quantizing characters. In this paper, we propose UniCNN (Unicode character Convolutional Networks), a fully language-independent character-based CNN model that classifies tweets in multiple languages and mixed languages without requiring language identification. Specifically, we encode the sequence of characters in a tweet as a sequence of numerical UTF-8 codes and then train a character-based CNN classifier. A character embedding layer placed before the convolutional layer learns distributed character representations. We conducted experiments on Twitter datasets for multilingual sentiment classification in six languages and for mixed-language informativeness classification in over 40 languages. In these experiments, UniCNN mostly outperformed state-of-the-art neural models and traditional feature-based models, without the extra burden of translation or tokenization.
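The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one plausible reading of "numerical UTF-8 codes" (one integer per UTF-8 byte, shifted by one so that 0 can serve as padding), and the names `encode_tweet`, `init_params`, and `conv_features`, along with all sizes, are hypothetical. The weights are random and untrained, so the output only shows the data flow from characters to embedding, convolution, and max-over-time pooling.

```python
import random

PAD_ID = 0          # assumption: id 0 is reserved for padding
VOCAB_SIZE = 257    # 256 possible byte values, shifted by one, plus padding


def encode_tweet(text, max_len=32):
    """Map a tweet to a fixed-length list of integer ids, one per UTF-8 byte."""
    ids = [b + 1 for b in text.encode("utf-8")][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def init_params(embed_dim=4, num_filters=3, kernel_size=3, seed=0):
    """Randomly initialise an embedding table and convolution filters (untrained)."""
    rng = random.Random(seed)
    embedding = [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                 for _ in range(VOCAB_SIZE)]
    filters = [[[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                for _ in range(kernel_size)]
               for _ in range(num_filters)]
    return embedding, filters


def conv_features(ids, embedding, filters):
    """Embed ids, apply a 1-D convolution with ReLU, then max-over-time pooling."""
    vectors = [embedding[i] for i in ids]
    features = []
    for kernel in filters:
        k = len(kernel)
        best = 0.0  # ReLU output is never below zero
        for start in range(len(vectors) - k + 1):
            window = vectors[start:start + k]
            score = sum(w * x
                        for w_row, x_row in zip(kernel, window)
                        for w, x in zip(w_row, x_row))
            best = max(best, score)
        features.append(best)
    return features


# A mixed-language tweet needs no language identification or tokenization.
mixed = "今天天气 good 😀"
ids = encode_tweet(mixed)
embedding, filters = init_params()
print(ids[:8])
print(conv_features(ids, embedding, filters))
```

Because the input alphabet is fixed by the byte range rather than by any one language, the same embedding table serves Chinese, English, and emoji alike. In a real classifier the pooled features would feed a trained output layer.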

Related Materials

  1. DOI - Is published in 10.1007/978-3-319-69179-4_29
  2. ISBN - Is published in 9783319691787 (urn:isbn:9783319691787)

Start page

413

End page

428

Total pages

16

Outlet

Advanced Data Mining and Applications: Proceedings of the 13th International Conference 2017

Editors

G. Cong, W.-C. Peng, W. E. Zhang, C. Li and A. Sun

Name of conference

ADMA 2017: 13th International Conference on Advanced Data Mining and Applications

Publisher

Springer

Place published

Cham, Switzerland

Start date

2017-11-05

End date

2017-11-06

Language

English

Copyright

© Springer International Publishing AG 2017

Former Identifier

2006079845

Esploro creation date

2020-06-22

Fedora creation date

2017-12-04
