In this notebook, we will be discussing cluster analysis and its application on a dataset, as well as a beginner introduction to R and Rstudio.

1 Cluster Analysis

1.1 What is Cluster Analysis?

Cluster analysis involves finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Cluster analysis is applied for understanding or summarization of datasets. Notion of a cluster can be ambiguous: a dataset can be clustered into different number of clusters.

1.2 Types of Clusterings

There are two main types of clustering:

  1. Partitional Clustering: A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

  1. Hierarchical clustering: A set of nested clusters organized as a hierarchical tree.

For both types above, characteristics of the input data are important, such as:

  1. Type of proximity measure (distance or similarity)
  2. Dimensionality, attribute type, special relationships in the data, distribution of the data
  3. Noise and outliers

1.3 K-means Clustering

Basically, k-means is a partitional centroid-based clustering algorithm that works as follows.

  1. Random initialization: place \(k\) centroids randomly
  2. Cluster assignment: assign each observation to the closest cluster, based on the distance to centroids.
  3. Centroid update: move centroids to the means of observations of the same cluster.
  4. Repeat step 2 and 3 until convergence is reached.

There are 2 hyperparameters that must be determined by the user in implementing k-means clustering: then number of centroids \(k\), and the distance metric used (usually Euclidean distance or Manhattan distance). In this exercise, we will use Euclidean distance which comes by the formula \(d(x,y) = \sqrt{\sum_{i=1}^n (x_{i} - y_{i})^2}\) where \(x = (x_1, x_2, \dots, x_n)\) and \(y = (y_1, y_2, \dots, y_n)\) are observations with dimension \(n\). Hence, we only need to determine \(k\).

K-means clustering has several limitations when:

  1. clusters are of differing sizes, densities, and has non-globular shapes, and

  2. the data contains outliers.

One solution to overcome these limitations is to use many clusters: find parts of the clusters, but then need to put the clusters together.

Initial centroid randomization is also important since bad initialization produces bad clustering. But this issue can be tackled by running the algorithm multiple times.

2 Preliminary Knowledge

[RStudio tour]

Before jumping into the case study, we need to understand some operations and terms in data analysis. Starting this section, we will do some programming.

In R markdown, we recognize 2 types of text writing:

  1. Markdown: used to write regular narratives. Markdown is characterized by a white sheet and no running button.
  2. Chunk: used to write R code. Chunk can be used to write narration but it must be marked with a comment (#). Otherwise, the written narration will be considered as R code. Chunk is characterized by starting with a command in the form of {r}.
# this line is a comment, whereas the next line is a code
getwd()
## [1] "D:/Olimpiade UNS"

2.1 Basic Operations

We use <- as assignment operator.

Some arithmetic operators in R:

  • + addition
  • - subtraction
  • * multiplication
  • / division
  • ^ exponent
  • %% modulus
13 + 5
## [1] 18
13 - 5
## [1] 8
13 * 5
## [1] 65
13 / 5
## [1] 2.6
13 ^ 5
## [1] 371293
13 %% 5
## [1] 3

Some relational operators in R:

  • < less than
  • > greater than
  • <= less than or equal to
  • >= greater than or equal to
  • == equal to
  • != not equal to
a <- 7
b <- 4
print(a >= b)
## [1] TRUE
a <- TRUE
b <- FALSE
print(a == b)
## [1] FALSE

2.2 Data Types

Basic data types in R:

  • character for characters and strings
  • complex for complex numbers
  • numeric for real numbers (can be decimals or integers)
  • integer for integers
  • logical for TRUE or FALSE
a <- 'olympiad'
b <- 3 + 4i
c <- 3.14
d <- -3L
e <- FALSE
class(a)
## [1] "character"
class(b)
## [1] "complex"
class(c)
## [1] "numeric"
class(d)
## [1] "integer"
class(e)
## [1] "logical"

Basic data structures in R:

  • vector: 1-dimensional, contains objects with the same class
  • matrix: 2-dimensional, contains objects with the same class
  • list: may contain objects with different classes and each object may have different length
  • factor: contains categorical objects
  • dataframe: a special list, may contain objects with different classes but each object must have the same length
vec <- c('Albert', 'Robert', 'Hilbert', 'Math', 'Physics', 'Astronomy')
vec
## [1] "Albert"    "Robert"    "Hilbert"   "Math"      "Physics"   "Astronomy"
mat <- matrix(data = vec, nrow = 3, ncol = 2)
mat
##      [,1]      [,2]       
## [1,] "Albert"  "Math"     
## [2,] "Robert"  "Physics"  
## [3,] "Hilbert" "Astronomy"
lst <- list(TRUE, "FALSE", c(1,3,7), 1+5i)
lst
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] "FALSE"
## 
## [[3]]
## [1] 1 3 7
## 
## [[4]]
## [1] 1+5i
fct <- factor(
    x = c('female', 'male', 'male', 'female', 'female'),
    levels = c('male', 'female'),
    labels = c('M', 'F')
)
fct
## [1] F M M F F
## Levels: M F
df <- data.frame(
    name = c('Albert', 'Robert', 'Hilbert'),
    score = c(60, 100, 90),
    pass = factor(c(FALSE, TRUE, TRUE))
)
df

2.3 Dataframe subsetting

Row filtering.

df[2:3, ]

Column selection by index.

df[, c(1,3)]

Row filtering and column selection by name.

df[2:3, c('name', 'pass')]

Select all column except the last.

df[, -c(3)]
subset(df, select = -c(pass))

Slice only for score above 70.

df[df$score > 70, ]

Get the student name for score below 100 and pass is TRUE.

df[df$score < 100 & df$pass == TRUE, 'name']
## [1] "Hilbert"

3 Case Study

We will use 2020 Sebelas Maret Statistics Olympiad’s problem statement for this case study regarding COVID-19 data.

This study aims to accelerate a more focused handling of COVID-19 for all provinces in Indonesia. The government needs to understand the characteristics of each province and the appropriate treatment for that province, hence it is necessary to group each province based on data. The results of the clustering can be considered as a preliminary strategy for the government’s effort in handling COVID-19 more effectively and efficiently.

3.1 Import Libraries

First, import all needed libraries. In this case study, we use factoextra for clustering visualization and GGally for plotting heatmaps and pairplots.

library(factoextra)     # clustering visualization
library(GGally)         # heatmap and pairplot

3.2 Load and Clean the Dataset

Let’s read the dataset.

df <- read.csv('data.csv', row.names = 'No')
df
str(df)
## 'data.frame':    34 obs. of  18 variables:
##  $ Provinsi.Asal: chr  "Aceh" "Bali" "Banten" "Bangka Belitung" ...
##  $ X1           : int  6436 10697 7593 499 862 3285 93356 968 30043 28723 ...
##  $ X2           : int  4318 9505 5735 430 701 2551 77969 368 19924 22980 ...
##  $ X3           : int  182 92 127 11 23 24 974 26 500 416 ...
##  $ X4           : int  131 99 114 12 2 48 1106 27 440 320 ...
##  $ X5           : int  226 343 226 6 43 85 2015 18 559 1579 ...
##  $ X6           : int  1 2 7 0 3 0 21 0 0 10 ...
##  $ X7           : num  0.67 0.89 0.76 0.86 0.81 0.78 0.84 0.38 0.66 0.8 ...
##  $ X8           : num  0.04 0.03 0.03 0.01 0.05 0.03 0.02 0.02 0.02 0.06 ...
##  $ X9           : num  0.5 0 2.67 3.86 3.4 4.47 1.76 3.46 2.77 1.46 ...
##  $ X10          : int  0 0 26179 5459 0 0 262702 0 72119 52084 ...
##  $ X11          : int  0 0 22679 5103 0 0 237573 0 61684 NA ...
##  $ X12          : int  2742 0 748 0 2851 1507 0 0 0 13457 ...
##  $ X13          : int  189 0 13 0 0 70 0 0 0 770 ...
##  $ X14          : int  2553 0 735 0 0 1421 0 0 0 12672 ...
##  $ X15          : int  415 0 188 1886 0 212 0 0 0 3781 ...
##  $ X16          : int  93 200 130 80 72 92 100 244 91 267 ...
##  $ X17          : int  512886 484717 527830 526431 453502 421650 506010 433843 732570 595249 ...

We can look at variable types in df and compare them to the given table below, then change them accordingly.

Variable Definition
X1 Kumulatif total terinfeksi COVID-19 (jiwa)
X2 Kumulatif total sembuh dari COVID-19 (jiwa)
X3 Total kasus positif COVID-19 aktif pada tanggal 17 Oktober 2020 (jiwa)
X4 Total kasus sembuh dari COVID-19 pada tanggal 17 Oktober 2020 (jiwa)
X5 Kumulatif total meninggal akibat terinfeksi COVID-19 (jiwa)
X6 Total kasus meninggal akibat COVID-19 pada tanggal 17 Oktober 2020 (jiwa)
X7 Recovery Index (persen)
X8 Case Fatality Ratio (persen)
X9 Rasio lacak & isolasi (persen)
X10 Kumulatif total kasus OTG (jiwa)
X11 Total kasus OTG selesai (jiwa)
X12 Kumulatif total kasus ODP (jiwa)
X13 Total kasus ODP dalam proses (jiwa)
X14 Total kasus ODP selesai (jiwa)
X15 Kumulatif total kasus PDP (jiwa)
X16 Kepadatan Penduduk (jiwa/km2)
X17 Garis Kemiskinan (Rupiah/kapita/bulan)
df[, c('X16', 'X17')] <- sapply(df[, c('X16', 'X17')], as.numeric)
df

Some descriptive statistics.

summary(df)
##  Provinsi.Asal            X1              X2              X3       
##  Length:34          Min.   :  499   Min.   :  368   Min.   :  0.0  
##  Class :character   1st Qu.: 1590   1st Qu.: 1394   1st Qu.: 24.5  
##  Mode  :character   Median : 4052   Median : 3216   Median : 48.5  
##                     Mean   :10522   Mean   : 8282   Mean   :126.5  
##                     3rd Qu.:11194   3rd Qu.: 8112   3rd Qu.:123.2  
##                     Max.   :93356   Max.   :77969   Max.   :974.0  
##                                                                    
##        X4                X5                X6               X7        
##  Min.   :   0.00   Min.   :   5.00   Min.   : 0.000   Min.   :0.3800  
##  1st Qu.:  17.25   1st Qu.:  43.75   1st Qu.: 0.000   1st Qu.:0.6850  
##  Median :  41.00   Median : 131.00   Median : 0.500   Median :0.7900  
##  Mean   : 119.06   Mean   : 365.62   Mean   : 2.471   Mean   :0.7585  
##  3rd Qu.: 117.00   3rd Qu.: 376.00   3rd Qu.: 2.000   3rd Qu.:0.8550  
##  Max.   :1106.00   Max.   :3529.00   Max.   :21.000   Max.   :0.9400  
##                                                                       
##        X8                X9              X10              X11        
##  Min.   :0.01000   Min.   : 0.000   Min.   :     0   Min.   :     0  
##  1st Qu.:0.02000   1st Qu.: 0.710   1st Qu.:     0   1st Qu.:     0  
##  Median :0.03000   Median : 2.200   Median :     0   Median :     0  
##  Mean   :0.03029   Mean   : 2.647   Mean   : 25716   Mean   : 19995  
##  3rd Qu.:0.04000   3rd Qu.: 3.390   3rd Qu.:  5454   3rd Qu.:  3983  
##  Max.   :0.07000   Max.   :13.130   Max.   :385352   Max.   :284304  
##                                                      NA's   :1       
##       X12             X13              X14             X15        
##  Min.   :    0   Min.   :   0.0   Min.   :    0   Min.   :   0.0  
##  1st Qu.:    0   1st Qu.:   0.0   1st Qu.:    0   1st Qu.:   0.0  
##  Median :    0   Median :   0.0   Median :    0   Median :   0.0  
##  Mean   : 5430   Mean   : 294.5   Mean   : 4542   Mean   : 461.0  
##  3rd Qu.: 2551   3rd Qu.: 113.0   3rd Qu.: 1346   3rd Qu.: 369.2  
##  Max.   :96844   Max.   :4519.0   Max.   :92325   Max.   :3781.0  
##                                                                   
##       X16               X17        
##  Min.   :    9.0   Min.   :     0  
##  1st Qu.:   54.5   1st Qu.:395236  
##  Median :  103.5   Median :454847  
##  Mean   :  742.0   Mean   :457415  
##  3rd Qu.:  261.2   3rd Qu.:523045  
##  Max.   :15900.0   Max.   :732570  
## 

Now, inspect missing values.

colSums(is.na(df))
## Provinsi.Asal            X1            X2            X3            X4 
##             0             0             0             0             0 
##            X5            X6            X7            X8            X9 
##             0             0             0             0             0 
##           X10           X11           X12           X13           X14 
##             0             1             0             0             0 
##           X15           X16           X17 
##             0             0             0

There is one on X11, at Jawa Tengah province. We can estimate this using the ratio of X11 to X10 at nearby province such as Jawa Barat and Jawa Timur.

df$X11_X10_ratio <- df$X11 / df$X10
df[9:11, c('Provinsi.Asal', 'X11', 'X10', 'X11_X10_ratio')]
ratio <- mean(df[9:11, 'X11_X10_ratio'], na.rm = TRUE)
ratio
## [1] 0.796543
df[10, 'X11'] <- df[10, 'X10'] * ratio
df[9:11, c('Provinsi.Asal', 'X11', 'X10', 'X11_X10_ratio')]

Inspect missing values one more time.

df <- df[, -c(ncol(df))]
sum(is.na(df))
## [1] 0

Great! No more missing values.

3.3 Feature Selection

Let’s see one more time the definition of each feature in the table at the beginning of this Case Study section. Specifically, look at X3, X4, and X6. These features are positive, recovered, and death cases of COVID-19 on October 17th, 2020, respectively. These figures may not be representative and already captured in X1, X2, and X5 as cumulative positive, recovered, and death cases. Hence, we will drop them.

df <- subset(df, select = -c(X3, X4, X6))
df

Now, inspect the heatmap of the correlation coefficient of each feature pair below.

ggcorr(df, label = TRUE)
## Warning in ggcorr(df, label = TRUE): data in column(s) 'Provinsi.Asal' are not
## numeric and were ignored

We can see that:

  • X1 and X2 have strong correlation, since they are cumulative positive and recovered cases
  • X10 and X11 have strong correlation, since both are related to OTG cases
  • X12, X13, and X14 are highly correlated, since all of them are related to ODP cases

It may be redundant to keep all these features. Hence, we pick X1, X10, and X12 as representative and drop the others.

df <- subset(df, select = -c(X2, X11, X13, X14))
df

Now, X10, X12, and X15 has many zero values. It’s unlikely that we have zero OTG, ODP, or PDP cases in some province in October 2020, months after the first COVID-19 case has been announced. Hence, this zero values may indicate that the data is unavailable. They will influence k-means algorithm in a bad way if we are persisted to use them. So, let’s drop X10, X12, and X15 altogether.

df <- subset(df, select = -c(X10, X12, X15))
df

For the remaining features, we can see the pairplot as follows.

ggpairs(df, columns = 2:ncol(df))

Some notes:

  • X1, X5, and X16 has many small values and some extremely big values. We can log-transform these features to make a more normal distribution.
  • X17 has zero value for Jawa Timur province, which is impossible since this feature is poverty line. We can use the corresponding value for nearby province such as Jawa Tengah instead.
df$X1 <- log(df$X1)
df$X5 <- log(df$X5)
df$X16 <- log(df$X16)
df[11, 'X17'] <- df[10, 'X17']
ggpairs(df, columns = 2:ncol(df))

Great! All seems good. Lastly, since we don’t need province names for clustering, we can drop Provinsi.Asal.

province <- df$Provinsi.Asal
rownames(df) <- province
df <- df[, -c(1)]
df

3.4 Clustering

We will use k-means clustering to categorize provinces. Since k-means clustering is a distance-based model, the dataset has to be normalized prior to modeling so that the model can treat each feature equally. In other words, if a feature has relatively big values compared to others, then it will dominantly influence the model in selecting centroids. We will use scale() function to the dataset.

df_scaled <- scale(df)

A good clustering is the one with the lowest Within Sum of Squares (WSS), that is, the lowest sum of quadratic distance from each observation to the centroid of the cluster it belongs to. A good clustering also has a bigger Between Sum of Squares (BSS). Here, BSS is just the sum of quadratic distance from each centroid to the global mean, weighted by the number of observations in each cluster. It’s easy to see that higher \(k\) corresponds to lower WSS. However, \(k\) has to be determined by business point of view, or more objectively using what’s called the elbow method.

\[WSS = \sum_{i=1}^{N_C} \sum_{x \in C_i} d(x, \bar{x}_{C_i})^2\]

\[BSS = \sum_{i=1}^{N_C} |C_i| \cdot d(\bar{x}_{C_i}, \bar{x})^2\]

  • \(C_i\) = \(i\)-th cluster
  • \(N_C\) = number of clusters
  • \(\bar{x}_{C_i}\) = \(i\)-th cluster centroid
  • \(\bar{x}\) = mean of all observations
fviz_nbclust(x = df_scaled, FUNcluster = kmeans, method = "wss")

Elbow method says that we should pick \(k\) where increasing it will result in no more significant decrease of WSS. For example, we can choose \(k = 4\). Since k-means clustering is a non-deterministic algorithm (due to random initialization of centroids), we will run k-means 5 times.

set.seed(0) # for reproducibility
result <- c()
for (i in 1:5) {
    df_km <- kmeans(x = df_scaled, centers = 4)
    wss <- df_km$tot.withinss
    bss <- df_km$betweenss
    print(glue::glue("Trial {i} WSS: {wss} \t BSS: {bss}"))
    result <- c(result, list(df_km))
}
## Trial 1 WSS: 123.079136453051     BSS: 107.920863546949
## Trial 2 WSS: 128.225814265618     BSS: 102.774185734382
## Trial 3 WSS: 125.178352886141     BSS: 105.821647113859
## Trial 4 WSS: 131.23630399328      BSS: 99.7636960067199
## Trial 5 WSS: 124.829641123719     BSS: 106.170358876281

Then take the clustering with the lowest WSS.

df_km <- result[[1]]
df_km
## K-means clustering with 4 clusters of sizes 15, 3, 5, 11
## 
## Cluster means:
##            X1         X5         X7         X8         X9         X16
## 1  0.06187671  0.2743736  0.4691381  0.5505091 -0.3054288  0.08825619
## 2 -1.31929482 -1.7281097  0.8555248 -1.2359586  1.7849176  0.85964253
## 3  1.59257372  1.4442596  0.3875071  0.5911106 -0.4157849  0.27662998
## 4 -0.44846680 -0.5593248 -1.0491983 -0.6823012  0.1186913 -0.48053821
##          X17
## 1 -0.3343062
## 2 -0.1161377
## 3  1.4281663
## 4 -0.1616205
## 
## Clustering vector:
##                Aceh                Bali              Banten     Bangka Belitung 
##                   1                   1                   1                   2 
##            Bengkulu       DI Yogyakarta         DKI Jakarta               Jambi 
##                   1                   1                   3                   4 
##          Jawa Barat         Jawa Tengah          Jawa Timur    Kalimantan Barat 
##                   3                   3                   3                   2 
##    Kalimantan Timur   Kalimantan Tengah  Kalimantan Selatan    Kalimantan Utara 
##                   1                   1                   1                   2 
##      Kepulauan Riau Nusa Tenggara Barat    Sumatera Selatan      Sumatera Barat 
##                   1                   1                   1                   4 
##      Sulawesi Utara      Sumatera Utara   Sulawesi Tenggara    Sulawesi Selatan 
##                   1                   1                   4                   3 
##     Sulawesi Tengah             Lampung                Riau        Maluku Utara 
##                   4                   4                   4                   1 
##              Maluku         Papua Barat               Papua      Sulawesi Barat 
##                   4                   4                   4                   4 
## Nusa Tenggara Timur           Gorontalo 
##                   4                   1 
## 
## Within cluster sum of squares by cluster:
## [1] 38.95366 10.04327 26.75843 47.32378
##  (between_SS / total_SS =  46.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Note that BSS or WSS gives different number for each trial because the clustering process starts by randomization of centroids.

result <- data.frame('Cluster' = df_km$cluster, row.names = province)
result

We can visualize this clustering as follows.

fviz_cluster(object = df_km, data = df_scaled, repel = TRUE)

Each province in the same cluster has similar characteristics based on the data, hence has similar treatment. We can see that many provinces in Java such as Jakarta, Jawa Barat, Jawa Tengah, and Jawa Timur belong to the same cluster, while some provinces with low population density such as Kalimantan Barat, Kalimantan Utara, and Bangka Belitung form another cluster. The transition from lower to higher Dim1 may indicate the urgency transition of the treatment. As an example, let’s say the government has a regulation to wear face coverings with the following urgency levels:

  1. Recommended
  2. Required in some specified shared/public spaces outside the home with other people present, or some situations when social distancing not possible
  3. Required in all shared/public spaces outside the home with other people present or all situations when social distancing not possible
  4. Required outside the home at all times regardless of location or presence of other people

Then we can assign each of these levels to cluster 2, 4, 1, and 3 respectively.

4 Summary

We have completed the following topics:

  1. Understanding cluster analysis, it’s types, and k-means algorithm
  2. Data types and basic operations in R, and dataframe subsetting
  3. Solving a case study of COVID-19 in Indonesia

5 Reference