Comparative study of DistilBERT and ELECTRA-Small Models in Spam Email Classification

Ferdy Agusman

doi:10.31294/

Authors

Ferdy Agusman, Ministry of Finance of the Republic of Indonesia

DOI:

https://doi.org/10.31294/

Keywords:

Spam email, Machine learning, Transformer

Abstract

Spam email detection is a challenging task in cybersecurity because spam content is highly variable. This variability makes spam harder to identify and has led researchers to develop a range of detection methods. Among these, Natural Language Processing (NLP) and machine learning techniques have shown strong results in classifying emails as spam or non-spam. Transformer-based models such as BERT have demonstrated high accuracy in text classification tasks. However, their computational requirements make them impractical in resource-limited environments. To mitigate this, smaller and more lightweight models, such as DistilBERT and ELECTRA-Small, have been developed. This paper presents a comparative study of DistilBERT and ELECTRA-Small for spam email classification. The objective is to evaluate the performance and computational efficiency of these two compact transformer architectures. Both models were fine-tuned on an email dataset comprising 5,728 samples. On the primary test set, both models achieved an accuracy of almost 99%. However, when evaluated on a separate external validation set containing 10,000 emails, ELECTRA-Small achieved an accuracy of 86.53%, outperforming DistilBERT's 83.68%. ELECTRA-Small was also more computationally efficient, with a training time of 00:02:00 compared to DistilBERT's 00:04:46. This is one of the few studies to directly compare the performance and computational efficiency of these two models for spam email detection, and it highlights their potential as lightweight and effective solutions for real-world applications.
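The size of the reported differences can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: it uses just the figures quoted in the abstract, the helper `hms_to_seconds` and the variable names are hypothetical, and the email-count estimate assumes the external accuracies were computed over exactly 10,000 emails.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a total number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures as reported in the abstract (external validation set, training time).
reported = {
    "DistilBERT": {"external_accuracy": 83.68, "training_time": "00:04:46"},
    "ELECTRA-Small": {"external_accuracy": 86.53, "training_time": "00:02:00"},
}
external_set_size = 10_000  # assumption: accuracies cover all 10,000 emails

distil_seconds = hms_to_seconds(reported["DistilBERT"]["training_time"])
electra_seconds = hms_to_seconds(reported["ELECTRA-Small"]["training_time"])

# How many times longer DistilBERT took to train than ELECTRA-Small.
training_time_ratio = distil_seconds / electra_seconds

# Accuracy gap on the external validation set, in percentage points.
accuracy_gap = round(
    reported["ELECTRA-Small"]["external_accuracy"]
    - reported["DistilBERT"]["external_accuracy"],
    2,
)

# Approximate number of additional emails ELECTRA-Small classified correctly.
extra_correct = round(external_set_size * accuracy_gap / 100)

print(f"Training time ratio (DistilBERT / ELECTRA-Small): {training_time_ratio:.2f}")
print(f"External accuracy gap: {accuracy_gap} percentage points")
print(f"Approximate additional correct emails: {extra_correct}")
```

Read this way, the reported training times differ by a factor of a little over two, while the external accuracy gap is under three percentage points. Both comparisons rest on a single reported run per model.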


Published by LPPM Universitas Bina Sarana Informatika

Jl. Kramat Raya No.98, Senen, Jakarta Pusat, DKI Jakarta 10450