Exploring a Large Language Model on the ChatGPT Platform for Indonesian Text Preprocessing Tasks


Cici Suhaeni
Sabrina Adnin Kamila
Fani Fahira
Muhammad Yusran
Gerry Alfa Dito

Abstract

Preprocessing is a crucial step in Natural Language Processing, especially for informal text in languages like Indonesian, which combines complex morphology with slang, abbreviations, and non-standard expressions. Traditional rule-based tools such as regular expressions (regex), IndoNLP, and Sastrawi are commonly used but often fall short on noisy, user-generated text. This study explores the capability of a Large Language Model, specifically ChatGPT-o3, to perform four Indonesian text preprocessing tasks (text cleaning, normalization, stopword removal, and stemming/lemmatization) and compares it with conventional rule-based approaches. Both approaches were applied to and evaluated on two datasets: a small example dataset of five manually constructed sentences and a real-world dataset of 100 tweets about the Indonesian “Makan Bergizi Gratis” (Free Nutritious Meals) program. Results show that ChatGPT-o3 performs as well as the rule-based tools in text cleaning and significantly better in normalization. However, rule-based methods like IndoNLP and Sastrawi still outperform ChatGPT-o3 in stopword removal and stemming. These findings indicate that while ChatGPT-o3 demonstrates strong contextual understanding and linguistic flexibility, it may underperform in rigid, token-based operations without fine-tuning. This study provides initial insights into using Large Language Models as an alternative preprocessing engine for Indonesian text and highlights the need for hybrid approaches or improved prompt design in future applications.
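
To make the rule-based side of the comparison concrete, the following is a minimal sketch of three of the four tasks (cleaning, normalization, and stopword removal) using only Python's re module. It is not the pipeline used in the study: the SLANG and STOPWORDS tables, the function names, and the sample tweet are all hypothetical and deliberately tiny, standing in for the much larger resources that IndoNLP and Sastrawi provide. Stemming is left out because it would normally be delegated to a dedicated stemmer such as Sastrawi.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
# Real tools ship far larger slang dictionaries and stopword lists.
SLANG = {"gk": "tidak", "yg": "yang", "bgt": "banget", "utk": "untuk"}
STOPWORDS = {"yang", "untuk", "di", "dan", "ini", "itu"}


def clean(text: str) -> str:
    """Text cleaning: lowercase, then strip URLs, mentions, hashtags,
    and any character that is not a letter or whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions, hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # digits, punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


def normalize(tokens: list[str]) -> list[str]:
    """Normalization: map known slang or abbreviations to standard forms."""
    return [SLANG.get(token, token) for token in tokens]


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Stopword removal: drop tokens found in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def preprocess(text: str) -> list[str]:
    """Run cleaning, normalization, and stopword removal in sequence."""
    tokens = clean(text).split()
    return remove_stopwords(normalize(tokens))


if __name__ == "__main__":
    tweet = "Program MBG ini bagus bgt utk anak yg sekolah! https://t.co/abc @user #MBG"
    print(preprocess(tweet))
    # ['program', 'mbg', 'bagus', 'banget', 'anak', 'sekolah']
```

The dictionary lookup in normalize illustrates why rule-based normalization struggles with noisy text: any slang form missing from the table passes through unchanged, whereas a language model can infer the intended word from context.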


Article Details

How to Cite

Suhaeni C, Kamila SA, Fahira F, Yusran M, Alfa Dito G. Exploring a Large Language Model on the ChatGPT Platform for Indonesian Text Preprocessing Tasks. IJSA [Internet]. 2025 Jun. 24 [cited 2025 Jul. 12];9(1):100-16. Available from: https://journal-stats.ipb.ac.id/index.php/ijsa/article/view/1302

Section: Articles

References

Belal, M., She, J., & Wong, S. (2023). Leveraging ChatGPT As Text Annotation Tool For Sentiment Analysis. https://arxiv.org/pdf/2306.17177

Blüthgen, C. (2025). Technical foundations of large language models [Technische Grundlagen großer Sprachmodelle]. Radiologie, 65(4), 227–234. https://doi.org/10.1007/s00117-025-01427-z

Dong, Y., Xiao, C., & Oyamada, M. (2024). Large Language Models as Data Preprocessors. 3–6.

Hamarashid, H. K., Karim, L. T., & Muhammed, D. A. (2023). ChatGPT and Large Language Models: Unraveling Multifaceted Applications, Hallucinations, and Knowledge Extraction. Indonesian Journal of Curriculum and Educational Technology Studies, 11(2), 60–70. https://doi.org/10.15294/IJCETS.V11I2.75617

Hasanah, U., Astuti, T., Wahyudi, R., Rifai, Z., & Pambudi, R. A. (2018). An experimental study of text preprocessing techniques for automatic short answer grading in Indonesian. Proceedings - 2018 3rd International Conference on Information Technology, Information Systems and Electrical Engineering, ICITISEE 2018, 230–234. https://doi.org/10.1109/ICITISEE.2018.8720957

Hyuto. (n.d.). IndoNLP. https://hyuto.github.io/indo-nlp/

Julianto, I. T., Kurniadi, D., & Jr, B. B. B. (2023). ENHANCING SENTIMENT ANALYSIS WITH CHATBOTS: A COMPARATIVE STUDY OF TEXT PRE-PROCESSING. Jurnal Teknik Informatika (Jutif), 4(6), 1419–1430. https://doi.org/10.52436/1.JUTIF.2023.4.6.1448

Lai, V. D., Ngo, N. T., Veyseh, A. P. Ben, Man, H., Dernoncourt, F., Bui, T., & Nguyen, T. H. (2023). ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning. Findings of the Association for Computational Linguistics: EMNLP 2023, 13171–13189. https://doi.org/10.18653/v1/2023.findings-emnlp.878

Lubis, A. R., Lase, Y. Y., Rahman, D. A., & Witarsyah, D. (2023). Improving Spell Checker Performance for Bahasa Indonesia Using Text Preprocessing Techniques with Deep Learning Models. Ingenierie Des Systemes d’Information, 28(5), 1335–1342. https://doi.org/10.18280/ISI.280522

Nasution, A. H., & Onan, A. (2024). ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks. IEEE Access, 12(April), 71876–71900. https://doi.org/10.1109/ACCESS.2024.3402809

Nugraheni, E., Haekal, F. I., Arisal, A., & Perdana, R. S. (2024). Optimizing Indonesian Tweet Preprocessing on Halal Domain. International Conference on Computer, Control, Informatics and Its Applications, IC3INA, 2024, 434–439. https://doi.org/10.1109/IC3INA64086.2024.10732128

OpenAI. (2025). OpenAI o3 and o4-mini System Card. 1–33.

Purbolaksono, M. D., Reskyadita, F. D., Adiwijaya, Suryani, A. A., & Huda, A. F. (2020). Indonesian Text Classification using Back Propagation and Sastrawi Stemming Analysis with Information Gain for Selection Feature. International Journal on Advanced Science, Engineering and Information Technology, 10(1), 234–238. https://doi.org/10.18517/IJASEIT.10.1.8858

Python Software Foundation. (2024). re — Regular expression operations. https://docs.python.org/3/library/re.html

Rahman, R. A., & Suyanto. (2024). Performance Analysis of ChatGPT for Indonesian Abstractive Text Summarization. Proceedings - International Seminar on Intelligent Technology and Its Applications, ISITIA, 2024, 477–482. https://doi.org/10.1109/ISITIA63062.2024.10668361

Rahman, T., Agustin, F. E. M., & Rozy, N. F. (2019). Normalization of Unstructured Indonesian Tweet Text for Presidential Candidates Sentiment Analysis. 2019 7th International Conference on Cyber and IT Service Management, CITSM 2019, 2019. https://doi.org/10.1109/CITSM47753.2019.8965324

Rianto, Mutiara, A. B., Wibowo, E. P., & Santosa, P. I. (2021). Improving the accuracy of text classification using stemming method, a case of non-formal Indonesian conversation. Journal of Big Data, 8(1), 1–16. https://doi.org/10.1186/S40537-021-00413-1

Rosid, M. A., Fitrani, A. S., Astutik, I. R. I., Mulloh, N. I., & Gozali, H. A. (2020). Improving Text Preprocessing for Student Complaint Document Classification Using Sastrawi. IOP Conference Series: Materials Science and Engineering, 874(1), 012017. https://doi.org/10.1088/1757-899X/874/1/012017

Sastrawi. (n.d.). Sastrawi. https://github.com/sastrawi/sastrawi

Setiabudi, R., Iswari, N. M. S., & Rusli, A. (2021). Enhancing text classification performance by preprocessing misspelled words in Indonesian language. TELKOMNIKA (Telecommunication Computing Electronics and Control), 19(4), 1234–1241. https://doi.org/10.12928/TELKOMNIKA.V19I4.20369
