This article examines the dataset ‘Recapitulation of Health HR for each Province that is utilized in Health Service Facilities (Fasyankes) in 2018’. The aim is to group the provinces by the availability of health facilities and health workers, using two clustering methods: k-Means and Self-Organizing Maps (SOM).

Data Source: (http://bppsdmk.kemkes.go.id/info_sdmk/history/#)

After the index column (No.) and the total column (Jumlah) are removed, the data consists of 34 rows and 11 columns, namely:

Nama.Provinsi: the name of each province.
Jumlah.Puskesmas: the number of community health centers (Puskesmas) in each province.
Jumlah.RS: the number of hospitals in each province.
Dokter.Spesialis: the number of specialist doctors in each province.
Dokter.Umum: the number of general practitioners in each province.
Dokter.Gigi...Spesialis: the number of dental specialists in each province.
Keperawatan: the number of nurses in each province.
Kebidanan: the number of midwives in each province.
Farmasi: the number of pharmacists in each province.
Nakes.Lainnya: the number of other health workers, not listed above, in each province.
Tenaga.Penunjang: the number of supporting staff in each province.

Data Wrangling

library(dplyr)  # provides %>% and select()
data <- read.csv("data.csv")
names(data)

##  [1] "No."                     "Nama.Provinsi"          
##  [3] "Jumlah.Puskesmas"        "Jumlah.RS"              
##  [5] "Dokter.Spesialis"        "Dokter.Umum"            
##  [7] "Dokter.Gigi...Spesialis" "Keperawatan"            
##  [9] "Kebidanan"               "Farmasi"                
## [11] "Nakes.Lainnya"           "Tenaga.Penunjang"       
## [13] "Jumlah"

data <- data %>% 
  select(-No., -Jumlah)

data_result <- data %>% 
  select(Nama.Provinsi)

rownames(data) <- data[,"Nama.Provinsi"]
data <- data[,-1]
head(data)

##                  Jumlah.Puskesmas Jumlah.RS Dokter.Spesialis Dokter.Umum
## ACEH                          347        70             1328        1580
## SUMATERA UTARA                575       237             3799        3378
## SUMATERA BARAT                276        82             1141        1107
## RIAU                          232        73             1209        1629
## JAMBI                         207        41              496        1006
## SUMATERA SELATAN              341        76             1563        1283
##                  Dokter.Gigi...Spesialis Keperawatan Kebidanan Farmasi
## ACEH                                 346       11635     12270    1441
## SUMATERA UTARA                       826       16221     16938    1716
## SUMATERA BARAT                       408        7978      5986    1499
## RIAU                                 450        8287      7333    1554
## JAMBI                                266        6593      5263    1102
## SUMATERA SELATAN                     275       12854     11805    1591
##                  Nakes.Lainnya Tenaga.Penunjang
## ACEH                      6513             7808
## SUMATERA UTARA            5730            10774
## SUMATERA BARAT            4362             7155
## RIAU                      3240             7345
## JAMBI                     2785             4878
## SUMATERA SELATAN          5432             9806

Elbow Method

wss <- function(data, maxCluster = 15) {
    # For k = 1 the within sum of squares equals the total sum of squares
    SSw <- numeric(maxCluster)
    SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
    # For k >= 2, fit k-means and record the total within-cluster sum of squares
    for (i in 2:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}

wss(data)

The elbow method is one way to choose the number of clusters: plot the within-cluster sum of squares against the number of clusters and look for the point where the curve starts to flatten.

In the visualization above, the curve starts to flatten at around k = 9. However, keeping k = 9 would be unwise, because the dataset has only 34 observations (one per province), so many clusters would contain only a few provinces.
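The idea behind the elbow plot can be sketched on synthetic data. The example below is only an illustration under stated assumptions: the data set `toy` (three well-separated groups) is made up, and `nstart = 25` is an added choice to make the k-means fits more stable. It computes the total within-cluster sum of squares for k = 1 to 6 and the relative drop at each step.

```r
# Synthetic data: three well-separated 2-D groups (made up for illustration)
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
max_k <- 6
wss_values <- sapply(1:max_k, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Relative drop in WSS when moving from k to k + 1 clusters
relative_drop <- -diff(wss_values) / wss_values[-max_k]
print(round(wss_values, 1))
print(round(relative_drop, 2))
```

On data like this, the drop is large up to the true number of groups and small afterwards; that change in slope is the "elbow". On real data the elbow is often less clear, which is why domain knowledge can also inform the choice of k.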

I chose k = 3 because Indonesia is commonly divided into three major regions (Western, Central, and Eastern Indonesia). This should make it easier to interpret the characteristics of the resulting clusters.

K-Means Algorithm with k=3

The code below builds a k-Means clustering model with the kmeans function, with the number of clusters k set to 3.

set.seed(27)
kmeans_model <- kmeans(data, centers = 3)

Characteristics of grouping based on k-means clustering

The characteristics of each cluster are shown below, namely the mean of each variable within each cluster.

kmeans_model[2]

## $centers
##   Jumlah.Puskesmas Jumlah.RS Dokter.Spesialis Dokter.Umum
## 1         195.9583  42.75000         684.2083    812.7917
## 2         819.2500 307.75000        7494.2500   6746.0000
## 3         347.1667  97.83333        1848.5000   2274.3333
##   Dokter.Gigi...Spesialis Keperawatan Kebidanan   Farmasi Nakes.Lainnya
## 1                210.1250    6201.917  3619.708  882.0833      2656.667
## 2               1829.7500   41295.000 19121.500 8651.0000     14248.500
## 3                544.1667   16006.333 12341.167 2086.6667      6684.833
##   Tenaga.Penunjang
## 1          4931.00
## 2         40045.25
## 3         12774.00
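To make the meaning of the centers concrete, the sketch below checks on made-up data that the centers returned by kmeans are the per-cluster column means. The data frame `toy` and its column names `doctors` and `nurses` are hypothetical and are not taken from the dataset above.

```r
# Made-up data with two clearly separated groups (hypothetical column names)
set.seed(2)
toy <- data.frame(
  doctors = c(rnorm(10, 100, 10), rnorm(10, 500, 10)),
  nurses  = c(rnorm(10, 1000, 50), rnorm(10, 4000, 50))
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Recompute the centers by hand: column means within each cluster
manual_centers <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
print(fit$centers)
print(manual_centers)
```

The two printed tables should agree row by row, which is why the centers can be read as the "average province" of each cluster.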

Showing Cluster Members

library(factoextra)  # provides fviz_cluster()
fviz_cluster(kmeans_model, data = data)

The cluster plot above shows the provinces divided into 3 clusters.

Cluster 1 is drawn in red and consists of 24 provinces.
Cluster 2 is drawn in green and consists of 4 provinces.
Cluster 3 is drawn in blue and consists of 6 provinces.

To see exactly which provinces fall into clusters 1, 2, or 3, the cluster assignment of each province is stored alongside its name:

data_result$cluster_kmeans <- kmeans_model$cluster
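As a small illustration of how such cluster labels can be summarized, the sketch below counts the members of each cluster and groups names by cluster. The vector `cluster_labels` and the names P1 to P6 are hypothetical placeholders, not results from the model above.

```r
# Hypothetical labels for six made-up provinces (not the article's results)
cluster_labels <- c(P1 = 1, P2 = 1, P3 = 2, P4 = 3, P5 = 1, P6 = 3)

# Number of members per cluster
cluster_sizes <- table(cluster_labels)
print(cluster_sizes)

# Province names grouped by cluster
members <- split(names(cluster_labels), cluster_labels)
print(members)
```

Applied to a fitted model, the same two calls would take the model's cluster vector as input in place of the placeholder labels.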

Self Organizing Maps Algorithm

The code below builds a Self-Organizing Map (SOM) model using the som function from the kohonen package.

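library(kohonen)  # provides somgrid(), som(), and add.cluster.boundaries()
data <- as.matrix(data)
# 6 x 5 rectangular grid, i.e. 30 nodes
som_grid <- somgrid(xdim = 6, ydim = 5, topo = "rectangular")
sommodel <- som(data, grid = som_grid, rlen = 500, alpha = c(0.05, 0.01), keep.data = TRUE)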

Iteration Process

The figure below shows the training progress: as the number of iterations increases, the mean distance from each observation to its closest unit decreases. Training appears to converge from around iteration 200.

plot(sommodel, type = "changes")

Characteristics of grouping based on SOM clustering

The graph below illustrates the characteristics of each node. For example, node (5,6) has high values on all ten variables. Each bar in a node represents the magnitude of one variable in that node; for example, the variable “Jumlah.Puskesmas” (number of community health centers) is the green bar at 3 o’clock, and the following variables are shown by the subsequent bars in counterclockwise order.

plot(sommodel, type = "codes")

A weakness of the visualization above is that it does not show which cluster a given node, such as node (5,6), belongs to.

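set.seed(99)
# Cluster the SOM codebook vectors hierarchically and cut the tree into 3 groups
# (dist(), hclust(), and cutree() come from the stats package)
som_cluster <- cutree(hclust(dist(sommodel$codes[[1]])),3)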

To overcome this shortcoming, the plot below colors each node by its cluster: red marks cluster 1, orange marks cluster 2, and yellow marks cluster 3.

plot(sommodel, type="codes", bgcol = rainbow(10)[som_cluster], main = "Clusters") 
add.cluster.boundaries(sommodel,som_cluster)

These plots show that the nodes of the Kohonen SOM model are divided into 3 groups, each with its own characteristics. The characteristics of each cluster can be read from the size of the bars inside each circle: the larger a bar, the higher the average value of that variable in the cluster.

From the visualization above, cluster 2 (orange nodes) has the highest values on all ten variables, while cluster 1 (red nodes) has the lowest values on all ten variables.

Member Comparison of Each Cluster based on K-Means and SOM

Below is a list of the members of clusters 1, 2, and 3 based on k-Means and SOM clustering.

data_result$cluster_som <- som_cluster[sommodel$unit.classif]
rmarkdown::paged_table(data_result)
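The indexing step that carries node clusters over to observations can be sketched with base R only. In this illustration, `codes` is a made-up stand-in for the SOM codebook matrix and `unit_classif` plays the role of the per-observation node index; both are assumptions for demonstration and are computed by a simple nearest-node search rather than by a trained SOM.

```r
# Toy "codebook": 4 nodes in 2-D (stand-in for a SOM codebook matrix)
codes <- rbind(c(0, 0), c(0.5, 0.5), c(10, 10), c(10.5, 10.5))

# Toy observations, and the index of the nearest node for each one
obs <- rbind(c(0.1, 0.2), c(10.2, 9.9), c(0.6, 0.4), c(10.4, 10.6))
unit_classif <- apply(obs, 1, function(x) {
  which.min(colSums((t(codes) - x)^2))
})

# Cluster the nodes hierarchically, then look up each observation's node
node_cluster <- cutree(hclust(dist(codes)), k = 2)
obs_cluster <- node_cluster[unit_classif]
print(unit_classif)
print(obs_cluster)
```

Each observation inherits the cluster of the node it is mapped to, so the result has one label per observation even though the hierarchical clustering was run on the nodes.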

Conclusion

The comparison above shows that both k-Means and SOM produce a cluster 1 containing the provinces with the lowest values on the ten variables, a cluster 2 containing the provinces with the highest values, and a cluster 3 containing the provinces in between. However, several provinces are assigned to different clusters by the two algorithms.
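One way to quantify how far two clusterings agree is a cross-tabulation of their labels. The sketch below uses hypothetical label vectors `labels_kmeans` and `labels_som` for eight made-up provinces; they are placeholders and not the results reported here.

```r
# Hypothetical cluster labels for eight made-up provinces (not the article's results)
labels_kmeans <- c(1, 1, 1, 2, 2, 3, 3, 1)
labels_som    <- c(1, 1, 3, 2, 2, 3, 3, 1)

# Rows: k-means cluster, columns: SOM cluster
agreement <- table(kmeans = labels_kmeans, som = labels_som)
print(agreement)

# Share of provinces given the same label by both methods
# (meaningful only if cluster numbers mean the same thing in both methods)
same_label <- mean(labels_kmeans == labels_som)
print(same_label)
```

Counts on the diagonal are provinces on which the two methods agree; off-diagonal counts are provinces that moved between clusters.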

Insight

As a large, developing country, Indonesia needs its government to pay more attention to health issues by distributing health personnel and health facilities more evenly across provinces, prioritizing the provinces in cluster 1.

Comparison of k-Means and SOM Clustering Methods

Hafizah Ilma

October 28, 2019