Classification of Drinking Water Source Suitability in West Java Using XGBoost and Cluster Analysis Based on SHAP Values

Klasifikasi Kelayakan Sumber Air Minum di Jawa Barat Menggunakan XGBoost dan Analisis Klasterisasi Berdasarkan Nilai SHAP

Authors

  • Annisa Permata Sari, IPB University
  • Billy, IPB University
  • Denanda Aufadlan Tsaqif, IPB University
  • Bagus Sartono, IPB University
  • Aulia Rizki Firdawanti, IPB University

DOI:

https://doi.org/10.29244/ijsa.v8i2p202-214

Keywords:

Drinking Water Sources, Feature Selection, Machine Learning, Classification, Water Suitability, XGBoost

Abstract

Water is essential for meeting the basic needs of living organisms. In Indonesia, ensuring safe, high-quality drinking water is crucial for public health. However, in some regions, particularly West Java Province, people still rely on unsuitable water sources, which can harm their health. Water source suitability can be classified using machine learning models such as Extreme Gradient Boosting (XGBoost). XGBoost combined with feature selection is effective at improving prediction accuracy and reducing overfitting. This study evaluates the performance of the XGBoost model in classifying household drinking water sources in West Java and uses the K-Means algorithm to cluster SHAP values in order to identify key characteristics of households with safe drinking water. The XGBoost model achieved an accuracy of 77.43% and an F1-score of 80.17% and classified 4187 households, of which 2349 had safe drinking water and 1838 had unsuitable sources. SHAP value analysis identified water source location, water collection time, and monthly per capita expenditure as significant factors influencing water source suitability. Households with a water source inside the fence of the house, a short water collection time, and high monthly per capita expenditure tend to have safe drinking water sources. Four clusters were formed: clusters 1 and 3 need immediate improvement in the quality of their drinking water sources, while cluster 2 serves as an indicator of success. Cluster 4 consists of households with high expenditure, making it a candidate group for government water quality improvement efforts.
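
The idea of clustering households by their SHAP values can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the scoring function `toy_model` is a hypothetical stand-in for a fitted XGBoost classifier, the feature names, household values, and baseline are invented for illustration, exact Shapley values are computed by enumerating coalitions rather than with the SHAP library, and a small hand-written K-Means with two clusters replaces the four-cluster solution reported above.

```python
from itertools import combinations
from math import factorial

# Hypothetical features: source inside fence (0/1), collection time (minutes),
# monthly per capita expenditure (arbitrary units). All values are invented.
BASELINE = (0, 20, 1.0)
HOUSEHOLDS = [(1, 5, 2.0), (0, 40, 0.5), (1, 10, 1.8), (0, 35, 0.6)]


def toy_model(x):
    """Hypothetical suitability score standing in for a fitted classifier."""
    inside, minutes, spend = x
    return 0.5 * inside - 0.02 * minutes + 0.3 * spend + 0.1 * inside * spend


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, initialised with the first k points."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# One row of Shapley values per household, then cluster the rows.
shap_rows = [shapley_values(toy_model, h, BASELINE) for h in HOUSEHOLDS]
labels, centers = kmeans(shap_rows, k=2)

for household, row, label in zip(HOUSEHOLDS, shap_rows, labels):
    print(household, [round(v, 3) for v in row], "cluster", label)
```

Clustering the Shapley rows rather than the raw features groups households by why the model scores them as it does, not merely by how similar their inputs are. In practice the number of clusters would be chosen with a diagnostic such as the elbow method rather than fixed in advance.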

Published

31-12-2024

How to Cite

Sari, A. P., Billy, Tsaqif, D. A., Sartono, B., & Firdawanti, A. R. (2024). Classification of Drinking Water Source Suitability in West Java Using XGBoost and Cluster Analysis Based on SHAP Values: Klasifikasi Kelayakan Sumber Air Minum di Jawa Barat Menggunakan XGBoost dan Analisis Klasterisasi Berdasarkan Nilai SHAP. Indonesian Journal of Statistics and Its Applications, 8(2), 202–214. https://doi.org/10.29244/ijsa.v8i2p202-214

Section

Articles