In this notebook, we will be discussing cluster analysis and its application on a dataset, as well as a beginner introduction to R and Rstudio.
Cluster analysis involves finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Cluster analysis is applied for understanding or summarization of datasets. Notion of a cluster can be ambiguous: a dataset can be clustered into different number of clusters.
There are two main types of clustering:
For both types above, characteristics of the input data are important, such as:
Basically, k-means is a partitional centroid-based clustering algorithm that works as follows.
There are 2 hyperparameters that must be determined by the user in implementing k-means clustering: then number of centroids \(k\), and the distance metric used (usually Euclidean distance or Manhattan distance). In this exercise, we will use Euclidean distance which comes by the formula \(d(x,y) = \sqrt{\sum_{i=1}^n (x_{i} - y_{i})^2}\) where \(x = (x_1, x_2, \dots, x_n)\) and \(y = (y_1, y_2, \dots, y_n)\) are observations with dimension \(n\). Hence, we only need to determine \(k\).
K-means clustering has several limitations when:
clusters are of differing sizes, densities, and has non-globular shapes, and
the data contains outliers.
One solution to overcome these limitations is to use many clusters: find parts of the clusters, but then need to put the clusters together.
Initial centroid randomization is also important since bad initialization produces bad clustering. But this issue can be tackled by running the algorithm multiple times.
[RStudio tour]
Before jumping into the case study, we need to understand some operations and terms in data analysis. Starting this section, we will do some programming.
In R markdown, we recognize 2 types of text writing:
## [1] "D:/Olimpiade UNS"
We use <- as assignment operator.
Some arithmetic operators in R:
+ addition- subtraction* multiplication/ division^ exponent%% modulus## [1] 18
## [1] 8
## [1] 65
## [1] 2.6
## [1] 371293
## [1] 3
Some relational operators in R:
< less than> greater than<= less than or equal to>= greater than or equal to== equal to!= not equal to## [1] TRUE
## [1] FALSE
Basic data types in R:
character for characters and stringscomplex for complex numbersnumeric for real numbers (can be decimals or integers)integer for integerslogical for TRUE or FALSE## [1] "character"
## [1] "complex"
## [1] "numeric"
## [1] "integer"
## [1] "logical"
Basic data structures in R:
vector: 1-dimensional, contains objects with the same classmatrix: 2-dimensional, contains objects with the same classlist: may contain objects with different classes and each object may have different lengthfactor: contains categorical objectsdataframe: a special list, may contain objects with different classes but each object must have the same length## [1] "Albert" "Robert" "Hilbert" "Math" "Physics" "Astronomy"
## [,1] [,2]
## [1,] "Albert" "Math"
## [2,] "Robert" "Physics"
## [3,] "Hilbert" "Astronomy"
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] "FALSE"
##
## [[3]]
## [1] 1 3 7
##
## [[4]]
## [1] 1+5i
fct <- factor(
x = c('female', 'male', 'male', 'female', 'female'),
levels = c('male', 'female'),
labels = c('M', 'F')
)
fct## [1] F M M F F
## Levels: M F
df <- data.frame(
name = c('Albert', 'Robert', 'Hilbert'),
score = c(60, 100, 90),
pass = factor(c(FALSE, TRUE, TRUE))
)
dfRow filtering.
Column selection by index.
Row filtering and column selection by name.
Select all column except the last.
Slice only for score above 70.
Get the student name for score below 100 and pass is TRUE.
## [1] "Hilbert"
We will use 2020 Sebelas Maret Statistics Olympiad’s problem statement for this case study regarding COVID-19 data.
This study aims to accelerate a more focused handling of COVID-19 for all provinces in Indonesia. The government needs to understand the characteristics of each province and the appropriate treatment for that province, hence it is necessary to group each province based on data. The results of the clustering can be considered as a preliminary strategy for the government’s effort in handling COVID-19 more effectively and efficiently.
First, import all needed libraries. In this case study, we use factoextra for clustering visualization and GGally for plotting heatmaps and pairplots.
Let’s read the dataset.
## 'data.frame': 34 obs. of 18 variables:
## $ Provinsi.Asal: chr "Aceh" "Bali" "Banten" "Bangka Belitung" ...
## $ X1 : int 6436 10697 7593 499 862 3285 93356 968 30043 28723 ...
## $ X2 : int 4318 9505 5735 430 701 2551 77969 368 19924 22980 ...
## $ X3 : int 182 92 127 11 23 24 974 26 500 416 ...
## $ X4 : int 131 99 114 12 2 48 1106 27 440 320 ...
## $ X5 : int 226 343 226 6 43 85 2015 18 559 1579 ...
## $ X6 : int 1 2 7 0 3 0 21 0 0 10 ...
## $ X7 : num 0.67 0.89 0.76 0.86 0.81 0.78 0.84 0.38 0.66 0.8 ...
## $ X8 : num 0.04 0.03 0.03 0.01 0.05 0.03 0.02 0.02 0.02 0.06 ...
## $ X9 : num 0.5 0 2.67 3.86 3.4 4.47 1.76 3.46 2.77 1.46 ...
## $ X10 : int 0 0 26179 5459 0 0 262702 0 72119 52084 ...
## $ X11 : int 0 0 22679 5103 0 0 237573 0 61684 NA ...
## $ X12 : int 2742 0 748 0 2851 1507 0 0 0 13457 ...
## $ X13 : int 189 0 13 0 0 70 0 0 0 770 ...
## $ X14 : int 2553 0 735 0 0 1421 0 0 0 12672 ...
## $ X15 : int 415 0 188 1886 0 212 0 0 0 3781 ...
## $ X16 : int 93 200 130 80 72 92 100 244 91 267 ...
## $ X17 : int 512886 484717 527830 526431 453502 421650 506010 433843 732570 595249 ...
We can look at variable types in df and compare them to the given table below, then change them accordingly.
| Variable | Definition |
|---|---|
| X1 | Kumulatif total terinfeksi COVID-19 (jiwa) |
| X2 | Kumulatif total sembuh dari COVID-19 (jiwa) |
| X3 | Total kasus positif COVID-19 aktif pada tanggal 17 Oktober 2020 (jiwa) |
| X4 | Total kasus sembuh dari COVID-19 pada tanggal 17 Oktober 2020 (jiwa) |
| X5 | Kumulatif total meninggal akibat terinfeksi COVID-19 (jiwa) |
| X6 | Total kasus meninggal akibat COVID-19 pada tanggal 17 Oktober 2020 (jiwa) |
| X7 | Recovery Index (persen) |
| X8 | Case Fatality Ratio (persen) |
| X9 | Rasio lacak & isolasi (persen) |
| X10 | Kumulatif total kasus OTG (jiwa) |
| X11 | Total kasus OTG selesai (jiwa) |
| X12 | Kumulatif total kasus ODP (jiwa) |
| X13 | Total kasus ODP dalam proses (jiwa) |
| X14 | Total kasus ODP selesai (jiwa) |
| X15 | Kumulatif total kasus PDP (jiwa) |
| X16 | Kepadatan Penduduk (jiwa/km2) |
| X17 | Garis Kemiskinan (Rupiah/kapita/bulan) |
Some descriptive statistics.
## Provinsi.Asal X1 X2 X3
## Length:34 Min. : 499 Min. : 368 Min. : 0.0
## Class :character 1st Qu.: 1590 1st Qu.: 1394 1st Qu.: 24.5
## Mode :character Median : 4052 Median : 3216 Median : 48.5
## Mean :10522 Mean : 8282 Mean :126.5
## 3rd Qu.:11194 3rd Qu.: 8112 3rd Qu.:123.2
## Max. :93356 Max. :77969 Max. :974.0
##
## X4 X5 X6 X7
## Min. : 0.00 Min. : 5.00 Min. : 0.000 Min. :0.3800
## 1st Qu.: 17.25 1st Qu.: 43.75 1st Qu.: 0.000 1st Qu.:0.6850
## Median : 41.00 Median : 131.00 Median : 0.500 Median :0.7900
## Mean : 119.06 Mean : 365.62 Mean : 2.471 Mean :0.7585
## 3rd Qu.: 117.00 3rd Qu.: 376.00 3rd Qu.: 2.000 3rd Qu.:0.8550
## Max. :1106.00 Max. :3529.00 Max. :21.000 Max. :0.9400
##
## X8 X9 X10 X11
## Min. :0.01000 Min. : 0.000 Min. : 0 Min. : 0
## 1st Qu.:0.02000 1st Qu.: 0.710 1st Qu.: 0 1st Qu.: 0
## Median :0.03000 Median : 2.200 Median : 0 Median : 0
## Mean :0.03029 Mean : 2.647 Mean : 25716 Mean : 19995
## 3rd Qu.:0.04000 3rd Qu.: 3.390 3rd Qu.: 5454 3rd Qu.: 3983
## Max. :0.07000 Max. :13.130 Max. :385352 Max. :284304
## NA's :1
## X12 X13 X14 X15
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0
## Median : 0 Median : 0.0 Median : 0 Median : 0.0
## Mean : 5430 Mean : 294.5 Mean : 4542 Mean : 461.0
## 3rd Qu.: 2551 3rd Qu.: 113.0 3rd Qu.: 1346 3rd Qu.: 369.2
## Max. :96844 Max. :4519.0 Max. :92325 Max. :3781.0
##
## X16 X17
## Min. : 9.0 Min. : 0
## 1st Qu.: 54.5 1st Qu.:395236
## Median : 103.5 Median :454847
## Mean : 742.0 Mean :457415
## 3rd Qu.: 261.2 3rd Qu.:523045
## Max. :15900.0 Max. :732570
##
Now, inspect missing values.
## Provinsi.Asal X1 X2 X3 X4
## 0 0 0 0 0
## X5 X6 X7 X8 X9
## 0 0 0 0 0
## X10 X11 X12 X13 X14
## 0 1 0 0 0
## X15 X16 X17
## 0 0 0
There is one on X11, at Jawa Tengah province. We can estimate this using the ratio of X11 to X10 at nearby province such as Jawa Barat and Jawa Timur.
## [1] 0.796543
Inspect missing values one more time.
## [1] 0
Great! No more missing values.
Let’s see one more time the definition of each feature in the table at the beginning of this Case Study section. Specifically, look at X3, X4, and X6. These features are positive, recovered, and death cases of COVID-19 on October 17th, 2020, respectively. These figures may not be representative and already captured in X1, X2, and X5 as cumulative positive, recovered, and death cases. Hence, we will drop them.
Now, inspect the heatmap of the correlation coefficient of each feature pair below.
## Warning in ggcorr(df, label = TRUE): data in column(s) 'Provinsi.Asal' are not
## numeric and were ignored
We can see that:
X1 and X2 have strong correlation, since they are cumulative positive and recovered casesX10 and X11 have strong correlation, since both are related to OTG casesX12, X13, and X14 are highly correlated, since all of them are related to ODP casesIt may be redundant to keep all these features. Hence, we pick X1, X10, and X12 as representative and drop the others.
Now, X10, X12, and X15 has many zero values. It’s unlikely that we have zero OTG, ODP, or PDP cases in some province in October 2020, months after the first COVID-19 case has been announced. Hence, this zero values may indicate that the data is unavailable. They will influence k-means algorithm in a bad way if we are persisted to use them. So, let’s drop X10, X12, and X15 altogether.
For the remaining features, we can see the pairplot as follows.
Some notes:
X1, X5, and X16 has many small values and some extremely big values. We can log-transform these features to make a more normal distribution.X17 has zero value for Jawa Timur province, which is impossible since this feature is poverty line. We can use the corresponding value for nearby province such as Jawa Tengah instead.Great! All seems good. Lastly, since we don’t need province names for clustering, we can drop Provinsi.Asal.
We will use k-means clustering to categorize provinces. Since k-means clustering is a distance-based model, the dataset has to be normalized prior to modeling so that the model can treat each feature equally. In other words, if a feature has relatively big values compared to others, then it will dominantly influence the model in selecting centroids. We will use scale() function to the dataset.
A good clustering is the one with the lowest Within Sum of Squares (WSS), that is, the lowest sum of quadratic distance from each observation to the centroid of the cluster it belongs to. A good clustering also has a bigger Between Sum of Squares (BSS). Here, BSS is just the sum of quadratic distance from each centroid to the global mean, weighted by the number of observations in each cluster. It’s easy to see that higher \(k\) corresponds to lower WSS. However, \(k\) has to be determined by business point of view, or more objectively using what’s called the elbow method.
\[WSS = \sum_{i=1}^{N_C} \sum_{x \in C_i} d(x, \bar{x}_{C_i})^2\]
\[BSS = \sum_{i=1}^{N_C} |C_i| \cdot d(\bar{x}_{C_i}, \bar{x})^2\]
Elbow method says that we should pick \(k\) where increasing it will result in no more significant decrease of WSS. For example, we can choose \(k = 4\). Since k-means clustering is a non-deterministic algorithm (due to random initialization of centroids), we will run k-means 5 times.
set.seed(0) # for reproducibility
result <- c()
for (i in 1:5) {
df_km <- kmeans(x = df_scaled, centers = 4)
wss <- df_km$tot.withinss
bss <- df_km$betweenss
print(glue::glue("Trial {i} WSS: {wss} \t BSS: {bss}"))
result <- c(result, list(df_km))
}## Trial 1 WSS: 123.079136453051 BSS: 107.920863546949
## Trial 2 WSS: 128.225814265618 BSS: 102.774185734382
## Trial 3 WSS: 125.178352886141 BSS: 105.821647113859
## Trial 4 WSS: 131.23630399328 BSS: 99.7636960067199
## Trial 5 WSS: 124.829641123719 BSS: 106.170358876281
Then take the clustering with the lowest WSS.
## K-means clustering with 4 clusters of sizes 15, 3, 5, 11
##
## Cluster means:
## X1 X5 X7 X8 X9 X16
## 1 0.06187671 0.2743736 0.4691381 0.5505091 -0.3054288 0.08825619
## 2 -1.31929482 -1.7281097 0.8555248 -1.2359586 1.7849176 0.85964253
## 3 1.59257372 1.4442596 0.3875071 0.5911106 -0.4157849 0.27662998
## 4 -0.44846680 -0.5593248 -1.0491983 -0.6823012 0.1186913 -0.48053821
## X17
## 1 -0.3343062
## 2 -0.1161377
## 3 1.4281663
## 4 -0.1616205
##
## Clustering vector:
## Aceh Bali Banten Bangka Belitung
## 1 1 1 2
## Bengkulu DI Yogyakarta DKI Jakarta Jambi
## 1 1 3 4
## Jawa Barat Jawa Tengah Jawa Timur Kalimantan Barat
## 3 3 3 2
## Kalimantan Timur Kalimantan Tengah Kalimantan Selatan Kalimantan Utara
## 1 1 1 2
## Kepulauan Riau Nusa Tenggara Barat Sumatera Selatan Sumatera Barat
## 1 1 1 4
## Sulawesi Utara Sumatera Utara Sulawesi Tenggara Sulawesi Selatan
## 1 1 4 3
## Sulawesi Tengah Lampung Riau Maluku Utara
## 4 4 4 1
## Maluku Papua Barat Papua Sulawesi Barat
## 4 4 4 4
## Nusa Tenggara Timur Gorontalo
## 4 1
##
## Within cluster sum of squares by cluster:
## [1] 38.95366 10.04327 26.75843 47.32378
## (between_SS / total_SS = 46.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Note that BSS or WSS gives different number for each trial because the clustering process starts by randomization of centroids.
We can visualize this clustering as follows.
Each province in the same cluster has similar characteristics based on the data, hence has similar treatment. We can see that many provinces in Java such as Jakarta, Jawa Barat, Jawa Tengah, and Jawa Timur belong to the same cluster, while some provinces with low population density such as Kalimantan Barat, Kalimantan Utara, and Bangka Belitung form another cluster. The transition from lower to higher Dim1 may indicate the urgency transition of the treatment. As an example, let’s say the government has a regulation to wear face coverings with the following urgency levels:
Then we can assign each of these levels to cluster 2, 4, 1, and 3 respectively.
We have completed the following topics: