Grouping Data in Predicting Infant Mortality Using K-Means and Decision Tree

The high infant mortality rate is a major problem that the Indonesian government must prioritize. Given the number of infant deaths that occur, infant mortality data need to be grouped and the causes of infant death predicted, with the aim of reducing infant mortality in Indonesia. To group the infant mortality data, the K-Means method is used; it models the data without supervision, an approach also known as unsupervised learning. The grouped data are then used for prediction with the Decision Tree method, which supports reliable decision making. The choice of initial centroids in the early stage of the K-Means algorithm greatly affects the clusters produced on the infant mortality dataset, since different initial centroids give different results. The clustering produced four labels, which were then tested using the decision tree algorithm, and a very good prediction rate was obtained. The K-Means and Decision Tree results can be used and evaluated by the government or the health department to help prevent infant deaths.


INTRODUCTION
Death cannot be avoided; we do not know where, when or how it will come. Infant death is especially painful for newlywed couples and for those who have been married for a long time without having the child they want (Kohno et al., 2020). The high infant mortality rate is a major problem that the Indonesian government must prioritize. One of the government's efforts to reduce infant mortality is a surveillance program, PWS KIA, which monitors maternal and infant health in each local area. Infant deaths have several causes, including conditions arising during pregnancy, accidents, disasters and diseases, or they are regarded as destiny from God (Salina et al., 2019). For this reason, research has been carried out on classifying infant mortality data (Junaedi et al., 2019). To group infant mortality data, the K-Means method is used; it models the data without supervision, an approach also known as unsupervised learning. The aim is to find out which sub-districts and urban villages in Jakarta, Indonesia can be grouped according to shared factors that affect the infant mortality rate, so that infant mortality can be handled more accurately and efficiently. This requires processing the data to find patterns and to extract the hidden information they contain. The K-Means method partitions N observed objects into groups such that each object belongs to the group with the closest mean (Aditya et al., 2018), grouping the data with a partition system (Santiko et al., 2018). Data with the same characteristics are grouped into one cluster, and data with different characteristics are grouped into other clusters (Suniantara et al., 2020).
The grouped data can then be used to predict infant mortality using the Decision Tree algorithm (Arifin & Herliana, 2020) (Charbuty & Abdulazeez, 2021), which produces a decision tree flexible enough to support good decision making (Syamsu et al., 2019). This research focuses on 2018 infant mortality data for the province of DKI Jakarta, Indonesia, which are grouped using the K-Means method and then modeled using the Decision Tree algorithm. A previous quantitative, cross-sectional study of factors related to infant mortality found that parental occupation and the cost of healthy living affect infant mortality (Lengkong, G.T., Langi, F.L.F.G and Posangi, 2020). Distance has also been found to strongly influence infant death (Fitri et al., 2017). The C4.5 algorithm has been applied successfully to neonatal deaths and can support risk analysis of infant mortality (Junaedi et al., 2019). Existing research on the K-Means clustering method has not used the infant mortality data available from data.go.id and has not included prediction of infant mortality. The novelty and contribution of this research therefore lie in the dataset used. The purpose of this study is to help the community and the government reduce the infant mortality rate in DKI Jakarta province, Indonesia (Wulan Sari et al., 2018).

RESEARCH METHODOLOGY
The non-hierarchical cluster method (Oktavia et al., 2020) starts by determining the desired number of clusters, which is four in this study. Once the number of clusters has been determined, the infant mortality data are assigned to clusters without following a hierarchical process. This study uses sample data taken from the global dataset of 2018 infant mortality data on data.go.id.
The distance between two data points x_1 and x_2, each with p attributes, is the L2 (Euclidean) distance:

$$D_{L_2}(x_2, x_1) = \lVert x_2 - x_1 \rVert_2 = \sqrt{\sum_{i=1}^{p} (x_{2i} - x_{1i})^2} \tag{1}$$

The cluster results are then used to predict infant mortality using a decision tree algorithm. The Decision Tree algorithm is one of the algorithms used in data mining. It builds a decision tree by selecting an attribute as the root node and creating a branch for each of its values, until the records in each branch belong to the same class. Attribute selection is based on Gain and Entropy.
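The Gain and Entropy formulas themselves do not appear in this text. As a reference, the standard definitions used by ID3 and C4.5 style decision trees are given here. The notation is an assumption and is not taken from the original: S is the set of records at a node, p_i is the proportion of records in S belonging to class i out of c classes, A is a candidate attribute, and S_v is the subset of S for which A takes the value v.

```latex
\begin{align*}
\mathrm{Entropy}(S) &= -\sum_{i=1}^{c} p_i \log_2 p_i \\
\mathrm{Gain}(S, A) &= \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
\end{align*}
```

The attribute with the largest Gain is chosen as the root node, and the same calculation is repeated on each branch.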
The research model for infant mortality using K-Means and Decision Tree is shown in Figure 1.

Paradigma, Vol. 24, No. 2, September 2022 P-ISSN 1410-5063, E-ISSN: 2579-3500

Source: (Ridwansyah et al., 2022)
Figure 1. Research Steps

a. The infant mortality dataset was collected by the DKI Jakarta provincial government and uploaded to data.go.id, where it can be downloaded for free; it is then processed with data mining using k-means and decision trees.
b. The next stage is to determine the number of clusters for the infant mortality dataset; the dataset itself does not specify how many there are.
c. Four clusters are selected, with their initial centroids chosen as random points from the infant mortality dataset; these are later used for grouping the data.
d. The data are grouped into four parts: data cluster1, data cluster2, data cluster3 and data cluster4.
e. Update the centroid point values based on the grouped data.
f. Repeat steps d and e until the centroid point values no longer change.
g. The cluster results from the k-means test on the infant mortality dataset are then tested using the decision tree algorithm, with the cluster results as labels.
h. The infant mortality dataset is preprocessed first, for example by normalization.
i. The infant mortality dataset is tested using a decision tree algorithm, with the aim of obtaining high accuracy and AUC results for predicting infant mortality.
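Steps c to f above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`euclidean`, `nearest`, `k_means`), the `seed` and `max_iter` parameters, and the toy records are all hypothetical, and records are assumed to be already encoded as numeric tuples.

```python
import math
import random


def euclidean(p, q):
    """L2 distance between two equal-length numeric tuples (Equation 1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lower index)."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def k_means(points, k=4, seed=0, max_iter=100):
    """Plain k-means: random initial centroids, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step c: k random records as initial centroids
    labels = []
    for _ in range(max_iter):
        # Step d: put every record in the cluster of its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Step e: move each centroid to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep old centroid
        # Step f: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy records with two obvious groups (not the study's data).
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = k_means(toy, k=2, seed=1)
```

Because the initial centroids are drawn at random, a different `seed` can lead to different clusters on real data, which is the sensitivity to initial centroids noted in this study.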

K-Means
The data processed in this study are taken from the global dataset of infant mortality data for 2018. The dataset consists of the attributes year, city name, sub-district name, urban village name, gender and number of deaths. The data to be tested and grouped consist of 403 infant mortality records.

Table 1. Sample Data on infant mortality

Year  City  District  Ward  Sex  Number of Deaths
2018  1     1         1     1    1
2018  1     1         1     2    2
2018  1     1         2     1    2
2018  1     1         2     2    2
2018  1     1         2     2    5
2018  1     1         2     1    8
2018  1     2         3     2    1
2018  1     2         2     1    3
2018  1     2         3     2    3
2018  1     2         3     2    3
2018  1     2         3     1    3
2018  1     2         3     1    4
2018  1     2         2     2    5
2018  1     2         3     1    6
2018  1     2         3     1    8
2018  1     2         3     2    9

Source: (Ridwansyah et al., 2022)

The data were tested and grouped using the K-Means method with four clusters, whose initial centroids were selected randomly, as shown in Table 2.

Table 2. Initial Cluster of infant mortality data
Source: (Ridwansyah et al., 2022)

From the initial clusters, the distance between each data object and each centroid is calculated using the L2 (Euclidean) distance; the calculated distances are shown in Table 3.

[Table 3: only the column headers Group, Cluster and Year survive]
Source: (Ridwansyah et al., 2022)

Each record is placed in its nearest cluster, and the center of each cluster is then recalculated as the average of the members of that cluster. This gives new centroids for cluster 1, cluster 2, cluster 3 and cluster 4. The distances are recalculated with the new cluster centers, and the process is repeated until the cluster assignments are the same as in the previous iteration and no record moves. For the infant mortality data, this occurred at the 10th iteration: the clusters no longer changed and no data moved from one cluster to another, as shown in Table 4.

Source: (Ridwansyah et al., 2022)

From the calculated cluster centers and the new centroids, the final pattern of distances between the data and the cluster centers can be seen.

[Table of final distances: only a fragment of the last rows survives, ending "403  2  5.196152  5.196152  5.196152  5.196152  4"]
Source: (Ridwansyah et al., 2022)

Cluster 1 contains 43 records, cluster 2 contains 126, cluster 3 contains 45 and cluster 4 contains 189, which together account for all 403 records.
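The centroid update described above, in which each new cluster center is the average of its members, can be illustrated with the first six sample records of Table 1. Treating these six records as one cluster is an assumption made only for illustration, and the function name `centroid` is hypothetical. The year column is left out because it is 2018 for every record.

```python
# First six sample records of Table 1 as
# (city, district, ward, sex, number_of_deaths).
members = [
    (1, 1, 1, 1, 1),
    (1, 1, 1, 2, 2),
    (1, 1, 2, 1, 2),
    (1, 1, 2, 2, 2),
    (1, 1, 2, 2, 5),
    (1, 1, 2, 1, 8),
]


def centroid(records):
    """Column-wise mean of a non-empty list of equal-length numeric tuples."""
    count = len(records)
    return tuple(sum(column) / count for column in zip(*records))


# Column sums are 6, 6, 10, 9 and 20, so the means are
# 1, 1, 10/6, 1.5 and 20/6.
new_center = centroid(members)
print(new_center)
```

In a full run this update is applied to each of the four clusters after every assignment pass.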
By these criteria, cluster 4 has the highest infant mortality, cluster 2 ranks second, cluster 3 ranks third and cluster 1 ranks last. The cluster data obtained were then tested with the Decision Tree algorithm to produce accurate predictions of infant mortality in the sub-districts and urban villages of DKI Jakarta province.

Decision Tree
The results of grouping the infant mortality data with K-Means are tested using a decision tree algorithm. Entropy and gain are calculated first to determine which attribute becomes the root node and which attributes become the subsequent nodes. The resulting decision tree achieves an accuracy of 99.75%. These results can be used by the community and the government to address high infant mortality; the urban villages with the highest infant mortality should be considered first.
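The root-node selection described above can be sketched as follows. This is an illustration using the standard entropy and information gain definitions, not the authors' code: the function names, the attribute names `sex` and `deaths_band`, and the four toy rows are hypothetical and are not drawn from the study's dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a non-empty list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting `rows` on the attribute at this index."""
    total = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy rows: (sex, deaths_band), with a cluster label for each row.
rows = [(1, "low"), (2, "low"), (1, "high"), (2, "high")]
labels = ["cluster1", "cluster1", "cluster4", "cluster4"]
names = ["sex", "deaths_band"]

gains = {name: information_gain(rows, labels, i) for i, name in enumerate(names)}
root = max(gains, key=gains.get)  # attribute with the largest gain becomes the root
```

In this toy case `deaths_band` separates the two labels perfectly while `sex` does not separate them at all, so `deaths_band` is selected as the root.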

CONCLUSION
The study shows that the K-Means and Decision Tree methods were applied successfully. With these results, the areas with higher infant mortality can be identified, so that infant mortality in those areas can be reduced. Data mining models were applied to group the infant mortality data, and the resulting pattern was then modeled using a decision tree algorithm. The analysis shows that:
a. The closest distance used to form the k-means pattern is determined using the Euclidean distance.
b. Determining the closest distance in this way is more optimal than using the Manhattan distance or the Chebyshev distance in classifying infant mortality.
c. The choice of centroids in the early stage of the k-means algorithm strongly influences the clusters obtained on the infant mortality dataset taken from data.go.id, since different initial centroids give different results.
d. The results of the clustering model can be evaluated by the government or the Health department to prevent infant mortality.
e. From the clustering results, four labels were tested using the decision tree algorithm.
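Points a and b refer to three distance measures. The sketch below shows how they differ on the first and last sample records of Table 1. It illustrates the definitions only and does not reproduce the comparison reported in this study; the function names are illustrative.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


# (city, district, ward, sex, number_of_deaths) for the first and last
# sample records of Table 1; the coordinate differences are 0, 1, 2, 1, 8.
p = (1, 1, 1, 1, 1)
q = (1, 2, 3, 2, 9)

d_euclidean = euclidean(p, q)  # sqrt(0 + 1 + 4 + 1 + 64) = sqrt(70)
d_manhattan = manhattan(p, q)  # 0 + 1 + 2 + 1 + 8 = 12
d_chebyshev = chebyshev(p, q)  # max(0, 1, 2, 1, 8) = 8
```

The three measures weight the large difference in the number of deaths differently, which is why the choice of distance can change which centroid a record is assigned to.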