Classification of Drinking Water Source Suitability in West Java Using XGBoost and Cluster Analysis Based on SHAP Values
Klasifikasi Kelayakan Sumber Air Minum di Jawa Barat Menggunakan XGBoost dan Analisis Klasterisasi Berdasarkan Nilai SHAP
DOI:
https://doi.org/10.29244/ijsa.v8i2p202-214
Keywords:
Drinking Water Sources, Feature Selection, Machine Learning, Classification, Water Suitability, XGBoost
Abstract
Water is essential for meeting the basic needs of living organisms. In Indonesia, ensuring safe, good-quality drinking water is crucial for public health. However, in some regions, particularly in West Java Province, people still rely on unsuitable water sources, which can harm their health. Water source suitability can be classified using machine learning models such as Extreme Gradient Boosting (XGBoost). Combining XGBoost with feature selection can improve prediction accuracy and reduce overfitting. This study evaluates the performance of the XGBoost model in classifying household drinking water sources in West Java and applies the K-Means algorithm to cluster SHAP values in order to identify key characteristics of households with safe drinking water. The results show that the XGBoost model, with an accuracy of 77.43% and an F1-score of 80.17%, successfully classified 4187 households, of which 2349 have safe drinking water and 1838 have unsuitable sources. SHAP value analysis identified water source location, water collection time, and monthly per capita expenditure as significant factors influencing water source suitability. Households with a water source inside the house's fence, a short water collection time, and high monthly per capita expenditure tend to have safe drinking water sources. Four clusters were formed: clusters 1 and 3 need immediate improvement in the quality of their drinking water sources, while cluster 2 serves as an indicator of success. Cluster 4 consists of households with high expenditure, which makes them potential targets for government water quality improvements.