Exploring a Large Language Model on the ChatGPT Platform for Indonesian Text Preprocessing Tasks
Abstract
Preprocessing is a crucial step in Natural Language Processing, especially for informal Indonesian text, which combines complex morphology with slang, abbreviations, and non-standard expressions. Traditional rule-based tools such as regular expressions (regex), IndoNLP, and Sastrawi are commonly used but often fall short on noisy, user-generated text. This study explores the capability of a Large Language Model, specifically ChatGPT-o3, to perform four Indonesian text preprocessing tasks (text cleaning, normalization, stopword removal, and stemming/lemmatization) and compares it with conventional rule-based approaches. Both preprocessing methods were applied and evaluated on two datasets: a small example set of five manually constructed sentences and a real-world set of 100 tweets about the Indonesian “Makan Bergizi Gratis” (Free Nutritious Meals) program. Results show that ChatGPT-o3 performs equally well in text cleaning and significantly better in normalization. However, rule-based methods such as IndoNLP and Sastrawi still outperform ChatGPT-o3 in stopword removal and stemming. These findings indicate that while ChatGPT-o3 demonstrates strong contextual understanding and linguistic flexibility, it may underperform in rigid, token-based operations without fine-tuning. This study provides initial insights into using Large Language Models as an alternative preprocessing engine for Indonesian text and highlights the need for hybrid approaches or improved prompt design in future applications.
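To make the rule-based side of the comparison concrete, the following is a minimal sketch of three of the four tasks (text cleaning, normalization, and stopword removal) using only Python's standard `re` module. It is not the pipeline used in the study: the slang map, the stopword list, the function names, and the sample tweet are all hypothetical and deliberately tiny. Stemming is omitted because it would normally be delegated to a dedicated library such as Sastrawi.

```python
import re

# Hypothetical slang-to-standard map; a real pipeline would load a much larger lexicon.
SLANG = {"gk": "tidak", "yg": "yang", "bgt": "banget", "tp": "tapi"}

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "dengan"}


def clean(text: str) -> str:
    """Lowercase the text and strip URLs, mentions, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # digits, punctuation, '#', emoji
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


def normalize(text: str) -> str:
    """Replace each token found in the slang map with its standard form."""
    return " ".join(SLANG.get(tok, tok) for tok in text.split())


def remove_stopwords(text: str) -> str:
    """Drop tokens that appear in the stopword list."""
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def preprocess(text: str) -> str:
    """Apply cleaning, normalization, and stopword removal in that order."""
    return remove_stopwords(normalize(clean(text)))


# Invented example tweet, not taken from the study's dataset.
tweet = "Programnya bagus bgt, tp yg di sekolah gk merata @user https://t.co/abc #MakanBergiziGratis"
print(preprocess(tweet))
# programnya bagus banget tapi sekolah tidak merata makanbergizigratis
```

The sketch also hints at why dictionary lookups struggle with normalization: any slang form missing from the map passes through unchanged, whereas a token-level stopword filter is exact for whatever list it is given.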
References
Belal, M., She, J., & Wong, S. (2023). Leveraging ChatGPT As Text Annotation Tool For Sentiment Analysis. https://arxiv.org/pdf/2306.17177
Blüthgen, C. (2025). Technical foundations of large language models [Technische Grundlagen großer Sprachmodelle]. Radiologie, 65(4), 227–234. https://doi.org/10.1007/s00117-025-01427-z
Dong, Y., Xiao, C., & Oyamada, M. (2024). Large Language Models as Data Preprocessors. 3–6.
Hamarashid, H. K., Karim, L. T., & Muhammed, D. A. (2023). ChatGPT and Large Language Models: Unraveling Multifaceted Applications, Hallucinations, and Knowledge Extraction. Indonesian Journal of Curriculum and Educational Technology Studies, 11(2), 60–70. https://doi.org/10.15294/IJCETS.V11I2.75617
Hasanah, U., Astuti, T., Wahyudi, R., Rifai, Z., & Pambudi, R. A. (2018). An experimental study of text preprocessing techniques for automatic short answer grading in Indonesian. Proceedings - 2018 3rd International Conference on Information Technology, Information Systems and Electrical Engineering, ICITISEE 2018, 230–234. https://doi.org/10.1109/ICITISEE.2018.8720957
Hyuto. (n.d.). IndoNLP. https://hyuto.github.io/indo-nlp/
Julianto, I. T., Kurniadi, D., & Jr, B. B. B. (2023). Enhancing Sentiment Analysis with Chatbots: A Comparative Study of Text Pre-Processing. Jurnal Teknik Informatika (Jutif), 4(6), 1419–1430. https://doi.org/10.52436/1.JUTIF.2023.4.6.1448
Lai, V. D., Ngo, N. T., Veyseh, A. P. Ben, Man, H., Dernoncourt, F., Bui, T., & Nguyen, T. H. (2023). ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning. Findings of the Association for Computational Linguistics: EMNLP 2023, 13171–13189. https://doi.org/10.18653/v1/2023.findings-emnlp.878
Lubis, A. R., Lase, Y. Y., Rahman, D. A., & Witarsyah, D. (2023). Improving Spell Checker Performance for Bahasa Indonesia Using Text Preprocessing Techniques with Deep Learning Models. Ingenierie Des Systemes d’Information, 28(5), 1335–1342. https://doi.org/10.18280/ISI.280522
Nasution, A. H., & Onan, A. (2024). ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks. IEEE Access, 12(April), 71876–71900. https://doi.org/10.1109/ACCESS.2024.3402809
Nugraheni, E., Haekal, F. I., Arisal, A., & Perdana, R. S. (2024). Optimizing Indonesian Tweet Preprocessing on Halal Domain. International Conference on Computer, Control, Informatics and Its Applications, IC3INA, 2024, 434–439. https://doi.org/10.1109/IC3INA64086.2024.10732128
OpenAI. (2025). OpenAI o3 and o4-mini System Card. 1–33.
Purbolaksono, M. D., Reskyadita, F. D., Adiwijaya, Suryani, A. A., & Huda, A. F. (2020). Indonesian Text Classification using Back Propagation and Sastrawi Stemming Analysis with Information Gain for Selection Feature. International Journal on Advanced Science, Engineering and Information Technology, 10(1), 234–238. https://doi.org/10.18517/IJASEIT.10.1.8858
Python, S. (2024). re — Regular expression operations. https://docs.python.org/3/library/re.html
Rahman, R. A., & Suyanto. (2024). Performance Analysis of ChatGPT for Indonesian Abstractive Text Summarization. Proceedings - International Seminar on Intelligent Technology and Its Applications, ISITIA, 2024, 477–482. https://doi.org/10.1109/ISITIA63062.2024.10668361
Rahman, T., Agustin, F. E. M., & Rozy, N. F. (2019). Normalization of Unstructured Indonesian Tweet Text for Presidential Candidates Sentiment Analysis. 2019 7th International Conference on Cyber and IT Service Management, CITSM 2019, 2019. https://doi.org/10.1109/CITSM47753.2019.8965324
Rianto, Mutiara, A. B., Wibowo, E. P., & Santosa, P. I. (2021). Improving the accuracy of text classification using stemming method, a case of non-formal Indonesian conversation. Journal of Big Data, 8(1), 1–16. https://doi.org/10.1186/S40537-021-00413-1
Rosid, M. A., Fitrani, A. S., Astutik, I. R. I., Mulloh, N. I., & Gozali, H. A. (2020). Improving Text Preprocessing for Student Complaint Document Classification Using Sastrawi. IOP Conference Series: Materials Science and Engineering, 874(1), 012017. https://doi.org/10.1088/1757-899X/874/1/012017
Sastrawi. (n.d.). Sastrawi. https://github.com/sastrawi/sastrawi
Setiabudi, R., Iswari, N. M. S., & Rusli, A. (2021). Enhancing text classification performance by preprocessing misspelled words in Indonesian language. TELKOMNIKA (Telecommunication Computing Electronics and Control), 19(4), 1234–1241. https://doi.org/10.12928/TELKOMNIKA.V19I4.20369