A Transformer-Based Approach for Abuse Detection in Code-Mixed Indic Languages

Vibhuti Bansal, Mrinal Tyagi, Rajesh Sharma, Vedika Gupta, Qin Xin

Research output: Contribution to journal › Article › peer-review

Abstract

The growth in the number of online social media platforms has brought active participation from web users globally. This has also led to a corresponding increase in cyberbullying cases online. Such incidents diminish an individual’s reputation or defame a community, and also pose a threat to the privacy of users in cyberspace. Traditionally, manual checks and handling mechanisms have been used to deal with such textual content; however, an automated computer-based approach would provide a far better solution to this problem. Existing approaches to automating this task mostly involve classical machine learning models, which tend to perform poorly on low-resource languages. Owing to the varied backgrounds and languages of web users, cyberspace contains a large amount of multilingual text, so an integrated approach that accommodates multilingual text is needed. This paper explores various methods to detect abusive content in 13 Indic code-mixed languages. First, baseline classical machine learning models are compared with Transformer-based architectures. Second, the paper presents an experimental analysis of four state-of-the-art Transformer-based models, namely XLM-RoBERTa, IndicBERT, MuRIL, and mBERT, of which XLM-RoBERTa with a BiGRU head performs best. Third, the best-performing model, XLM-RoBERTa, is augmented with emoji embeddings, which further improves its overall performance. Finally, the model is trained on the combined dataset of all 13 Indic languages, and its performance is compared with that of the individual per-language models. The combined model surpasses the individual models in terms of F1 score and accuracy, supporting the view that it fits the data better, possibly owing to the code-mixed nature of the text. The combined model reports an F1 score of 0.88 on test data, with a training loss of 0.28, a validation loss of 0.31, and an AUC score of 0.94 for both training and validation.
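For clarity, the sketch below illustrates one plausible realization of the best-performing setup described in the abstract: an XLM-RoBERTa encoder whose token-level hidden states are pooled by a bidirectional GRU before a linear classifier. This is a minimal sketch, not the authors' implementation; the checkpoint name (xlm-roberta-base), the GRU hidden size, the binary label space, and the emoji-to-text preprocessing step are illustrative assumptions, and the paper's emoji embeddings may be realized differently.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class XLMRBiGRUClassifier(nn.Module):
        """Sketch: XLM-RoBERTa encoder + BiGRU pooling + linear classifier."""

        def __init__(self, model_name="xlm-roberta-base", gru_hidden=256, num_labels=2):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            # Bidirectional GRU runs over the encoder's token-level hidden states.
            self.bigru = nn.GRU(
                input_size=self.encoder.config.hidden_size,
                hidden_size=gru_hidden,
                batch_first=True,
                bidirectional=True,
            )
            # Forward and backward final states are concatenated: 2 * gru_hidden.
            self.classifier = nn.Linear(2 * gru_hidden, num_labels)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state                      # (batch, seq_len, hidden)
            _, h_n = self.bigru(hidden)              # h_n: (2, batch, gru_hidden)
            pooled = torch.cat([h_n[0], h_n[1]], dim=-1)
            return self.classifier(pooled)           # (batch, num_labels)

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = XLMRBiGRUClassifier()

    # One plausible way to expose emoji information to the encoder is to map
    # emojis to textual descriptions before tokenization, e.g. with
    # emoji.demojize(); the paper's emoji embeddings may differ.
    text = "यह बहुत bura comment है :enraged_face:"  # code-mixed Hindi-English example
    batch = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    logits = model(batch["input_ids"], batch["attention_mask"])

In this sketch the BiGRU's final hidden states serve as the pooled sentence representation instead of the usual [CLS] vector; fine-tuning would proceed with a standard cross-entropy loss over the abusive/non-abusive labels.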
Original language: English
Number of pages: 12
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
DOIs
Publication status: Published - 1 Nov 2022

Keywords

  • Online social media
  • Transformer-based model
  • Abuse detection
  • Machine learning
