Cluster analysis is a technique that groups a large number of cases (observations) into a small number of clusters based on a set of measurements collected from each observation. Cluster analysis can also be applied to variables as a tool for data reduction, such as choosing the best variables or cluster components for analysis. In this study, however, we focus on clustering cases.
Cluster analysis is classified as an unsupervised learning technique in machine learning. This is because, unlike supervised learning, there is no target variable in the data. Many applications of cluster analysis exist. For example, in business applications, cluster analysis can be used to understand customer groups for marketing purposes. With gene expression data, clustering can be used to understand the natural structure inherent in the data.
Clustering approaches can be grouped into three broad categories:
1. Hierarchical methods: Two types of hierarchical methods exist:
The agglomerative hierarchical method is a “bottom-up” approach that starts by treating each observation as its own cluster. Then the two closest clusters, as measured by some distance measure, are combined into a new cluster. In each subsequent step, the two closest existing clusters are merged into a single cluster (a minimal base-R illustration appears after this list).
The divisive hierarchical method is a “top-down” approach in which all observations start in one cluster. The initial cluster is then divided into two clusters, and in each subsequent step an existing cluster is divided into two.
2. Non-hierarchical methods: Under non-hierarchical methods, observations are initially partitioned into a set of k clusters. The number k has to be pre-specified by the user and may be based on existing domain knowledge or on the output of some algorithm. The initial assignment of observations to clusters may be random. In subsequent steps, observations are iteratively moved between clusters until no meaningful reassignment is possible.
3. Model-based methods: These are statistical approaches to clustering. The observed multivariate data are assumed to have been generated from a finite mixture of component models, where each component is a probability distribution, typically multivariate. For instance, in a Gaussian mixture model, each component is a multivariate Gaussian distribution. The clustering algorithm provides a partition of the dataset that maximizes the likelihood function defined by the mixture model.
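To make the agglomerative idea concrete, here is a minimal sketch in base R. The built-in USArrests dataset and the choice of complete linkage are assumptions made purely for illustration; they are not part of this study, which uses the K-means approach instead.
# Agglomerative ("bottom-up") hierarchical clustering on an illustrative dataset
X  <- scale(USArrests)                      # standardize the variables
hc <- hclust(dist(X), method = "complete")  # repeatedly merge the two closest clusters
plot(hc)                                    # dendrogram showing the merge sequence
cutree(hc, k = 4)                           # cut the tree to obtain 4 clusters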
In this study we aim to group the 50 US states into clusters of states with similar values on the attributes considered. The data were taken from the CDC website and consist of three variables: (i) number of cases per 100,000 population, (ii) number of deaths per 100,000 population, and (iii) vaccination rate. The figures are cumulative over nearly 21 months, from January 21, 2020 to October 15, 2021. Prior to clustering, let’s visualize the data with some plots.
# Read in the data
library(ggplot2)
setwd("U:/COVID")
df = read.csv('coviddata1.csv')
head(df)
## State Cases Deaths Vaccinated
## 1 Alabama 16575 308 43.8
## 2 Alaska 16637 80 51.5
## 3 Arizona 15485 281 52.1
## 4 Arkansas 16744 270 46.7
## 5 California 12093 177 60.2
## 6 Colorado 12190 136 60.5
summary(df[,2:4])
## Cases Deaths Vaccinated
## Min. : 5419 Min. : 52.0 Min. :40.80
## 1st Qu.:12166 1st Qu.:159.2 1st Qu.:49.05
## Median :14306 Median :216.0 Median :53.35
## Mean :13707 Mean :204.1 Mean :55.22
## 3rd Qu.:16161 3rd Qu.:245.5 3rd Qu.:61.55
## Max. :18469 Max. :332.0 Max. :70.40
ggplot(data = df, mapping = aes(x = Cases, y = Deaths)) +
geom_point() +
ggtitle("Plot of Deaths vs Cases") +
xlab("Cases per 100,000 Population") + ylab("Deaths per 100,000 Population")
Deaths appear to increase with cases only up to a point: for roughly 12,000 cases per 100,000 population and above, the relationship is no longer obvious.
ggplot(data = df, mapping = aes(x = Cases, y = Vaccinated))+
geom_point() +
ggtitle("Cases vs Vaccination Rate") +
xlab("Cases per 100,000 Population") + ylab("Percentage of Vaccinated Population")
The relationship between cases and vaccination rate appears negative: states with fewer cases tend to have higher vaccination rates, or, put the other way around, states with higher vaccination rates tend to have fewer cases.
ggplot(data = df, mapping = aes(x = Deaths, y = Vaccinated)) +
geom_point() +
ggtitle("Plot of Deaths vs Vaccination Rate") +
xlab("Deaths per 100,000 Population") + ylab("Percentage of Vaccinated Population")
There is no obvious linear relationship between deaths and vaccination rate. Some states with low death rates have high vaccination rates, while others with high death rates have lower vaccination rates.
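As a quick supplementary check (not part of the original analysis), the pairwise relationships described above can be quantified with a correlation matrix; the output is omitted here.
# Pairwise correlations between cases, deaths, and vaccination rate
round(cor(df[, 2:4]), 2)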
This study uses K-means clustering, a popular non-hierarchical method, to cluster the US states based on the three COVID-19 measures.
This method is a partitioning approach that begins with a random partition of the observations into a pre-defined number of clusters, k. An alternative to the random partition exists, which instead starts from a set of user-supplied points that form the initial cluster centers. Clustering proceeds so that objects within the same cluster are as similar as possible (high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (low inter-class similarity). The random partition approach works in the following way (a minimal R sketch follows these steps):
(i) A number from 1 to k is randomly assigned to each observation in the dataset. This serves as the initial cluster assignment.
(ii) Iterate until the cluster assignments stop changing:
(a) For each of the k clusters, the cluster centroid is computed. The kth cluster centroid is the vector of the p variable means for the observations in the kth cluster.
(b) Each observation is assigned to the cluster whose centroid is closest (where closeness may be defined using a distance measure such as Euclidean distance).
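The following is a minimal sketch of the random-partition iteration written out by hand, for illustration only; the analysis below uses R's built-in kmeans(), and this toy version does not handle empty clusters.
# Toy implementation of the random-partition K-means steps (i)-(ii) above
simple_kmeans <- function(X, k, max_iter = 100) {
  cl <- sample(1:k, nrow(X), replace = TRUE)     # (i) random initial assignment
  for (iter in 1:max_iter) {
    # (ii)(a) centroid of each cluster = vector of the p variable means
    centroids <- sapply(1:k, function(j) colMeans(X[cl == j, , drop = FALSE]))
    # (ii)(b) reassign each observation to the closest centroid (squared Euclidean distance)
    d2 <- apply(centroids, 2, function(ctr) rowSums(sweep(X, 2, ctr)^2))
    new_cl <- apply(d2, 1, which.min)
    if (all(new_cl == cl)) break                 # stop when assignments no longer change
    cl <- new_cl
  }
  cl
}
Calling, for example, simple_kmeans(scale(USArrests), 3) returns a vector of cluster labels; in practice kmeans() is preferred, since it uses better algorithms and multiple random starts.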
Domain knowledge may play an important role in deciding the number of clusters, k. In addition, there are techniques that can help decide this number. We apply some of these techniques to our dataset.
1. Elbow Method
In this method, K-means clustering is performed for varying values of k (say 1 to 15). For each k, the total within-cluster sum of squares (WSS), also called the within-cluster variation, is calculated. WSS measures the total variation within clusters, and we want to keep it at a minimum. It is given by the sum of squared distances between each point and the centroid of its cluster. Suppose n observations have been grouped into g clusters based on p variables, where the \(k^{th}\) cluster consists of \(n_{k}\) cases such that \[\sum_{k=1}^{g} n_{k} = n\] Let \(\overline{x}_{sk}\) be the mean of the \(k^{th}\) cluster for variable \(x_{s}\), s = 1, 2, ..., p. The within-cluster sum of squares for the \(k^{th}\) cluster is given by \[
W_k = \sum_{s=1}^{p}\sum_{i=1}^{n_k}(x_{is}- \overline{x}_{sk})^2
\]
The total within-cluster sum of squares over the g clusters is then \[ W = \sum_{k=1}^{g}W_k \]
We plot WSS for various values of k and locate the bend (the “elbow”) to get an idea of the appropriate number of clusters. The bend is the point up to which WSS decreases rapidly and beyond which further decreases are not noteworthy.
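As a quick sanity check on the formula (a hedged sketch, separate from the main analysis that starts below): for a fixed clustering, the hand-computed W should equal the tot.withinss value reported by kmeans(). The choice of 4 clusters here is only for illustration.
# Verify that W = sum of the W_k matches kmeans()' tot.withinss on the scaled data
X  <- scale(df[, 2:4])
km <- kmeans(X, centers = 4, nstart = 25)
Wk <- sapply(1:4, function(k) {
  Xk <- X[km$cluster == k, , drop = FALSE]
  sum(sweep(Xk, 2, colMeans(Xk))^2)   # W_k: squared deviations from the cluster's variable means
})
sum(Wk)            # total within-cluster sum of squares, W
km$tot.withinss    # should agree up to floating-point error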
To demonstrate in R, let’s begin by scaling the data so that all variables are on a comparable scale.
rownames(df) = df$State
newdf = df[,2:4]
# Scale the data
scaled.df = scale(newdf)
head(scaled.df)
## Cases Deaths Vaccinated
## Alabama 0.9233847 1.5013577 -1.4165100
## Alaska 0.9433449 -1.7945171 -0.4615873
## Arizona 0.5724715 1.1110567 -0.3871777
## Arkansas 0.9777924 0.9520452 -1.0568638
## California -0.5195448 -0.3923247 0.6173513
## Colorado -0.4883167 -0.9850040 0.6545561
set.seed(123)
wss = numeric()
# Run k means algorithm for different values of k.
for (k in 1:10){
  wss[k] = kmeans(scaled.df, centers = k, nstart = 25)$tot.withinss
}
# nstart = 25 option generates 25 initial configurations and the best one is reported.
plot(1:10, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters, K",
ylab="Total within-clusters sum of squares (WSS)")
It seems that 4 would be an appropriate number of clusters, since that is where the bend in the plot occurs. We will check this further with the other methods.
2. Average Silhouette Method
The method is based on a measure of the quality of clustering. The silhouette value s(i) for each observation i measures how similar the observation is to its own cluster (cohesion) compared to other clusters (separation). This value ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. The method works in the following way: for each observation i, first the average distance between i and all other observations in the same cluster is calculated; this is the measure of cohesion, a(i). Next, the average distance between i and all points in the nearest other cluster is calculated; this is the measure of separation, b(i). The silhouette coefficient for the \(i^{th}\) observation is then \[
s(i) = \frac{b(i)-a(i)}{\max\{a(i),\, b(i)\}}
\]
The average silhouette method calculates the average s(i) for different pre-specified values of k. The optimal k is the one for which the average silhouette is maximized.
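To connect the formula to the data, here is a hand computation of a(i) and b(i) for a single state, assuming a 2-cluster K-means solution chosen purely for illustration (with only two clusters, the nearest other cluster is simply the remaining one); the full computation below relies on cluster::silhouette().
# Manual silhouette value for one state under an assumed 2-cluster solution
set.seed(123)
km2 <- kmeans(scaled.df, centers = 2, nstart = 25)
d <- as.matrix(dist(scaled.df))          # pairwise Euclidean distances between states
i <- "Alabama"
own   <- names(km2$cluster)[km2$cluster == km2$cluster[i]]
other <- names(km2$cluster)[km2$cluster != km2$cluster[i]]
a_i <- mean(d[i, setdiff(own, i)])       # cohesion: average distance to own cluster
b_i <- mean(d[i, other])                 # separation: average distance to the other cluster
(b_i - a_i) / max(a_i, b_i)              # silhouette value s(i) for this state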
library(cluster)
# function to compute average silhouette for k clusters
sil_score <- function(k) {
  km <- kmeans(scaled.df, centers = k, nstart = 25)
  ss <- silhouette(km$cluster, dist(scaled.df))
  mean(ss[, 3])
}
# Choose k from 2 to 10
k <- 2:10
# extract avg silhouette for 2-10 clusters
avg_sil <- sapply(k,sil_score)
plot(k, avg_sil,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters, K",
ylab = "Average Silhouettes")
The plot shows that the average silhouette is maximized at 2 clusters, followed by 3 and 4. We will see what the third method suggests.
3. Gap Statistic Method
This method compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data (i.e., a distribution with no obvious clustering). As the reference dataset, points are sampled uniformly over the range of each variable using Monte Carlo simulation. The gap statistic for a particular k is given by,
\[\mathrm{Gap}_n(k) = E^*_n\{\log(W_k)\}-\log(W_k)\]
where \(E^*_n\) denotes expectation under a sample of size n from the reference distribution and \(W_k\) is the pooled within-cluster sum of squares around the cluster means. The optimal value of k is the one maximizing \(\mathrm{Gap}_n(k)\).
set.seed(123)
gap_stat <- clusGap(scaled.df, FUN = kmeans, nstart = 25, K.max = 10, B=50) # B is the number of Monte Carlo ("bootstrap") samples
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = scaled.df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 1
## logW E.logW gap SE.sim
## [1,] 3.282770 3.544523 0.2617536 0.04606714
## [2,] 2.991342 3.188880 0.1975384 0.04375235
## [3,] 2.800672 3.020723 0.2200513 0.04124840
## [4,] 2.611536 2.875898 0.2643616 0.04266198
## [5,] 2.499115 2.753141 0.2540269 0.04264541
## [6,] 2.415336 2.651477 0.2361406 0.04331299
## [7,] 2.313148 2.564112 0.2509639 0.04523043
## [8,] 2.216863 2.484119 0.2672566 0.04714274
## [9,] 2.142364 2.408140 0.2657763 0.04804113
## [10,] 2.055956 2.334714 0.2787574 0.04962355
# visualize the result
library(factoextra)
fviz_gap_stat(gap_stat)
The gap statistic has a local maximum at K = 4. Although it takes higher values at K = 8, 9, and 10, considering the size of the data and the results from the previous two methods, K = 4 seems a reasonable choice for the number of clusters. Finally, we run K-means clustering with K = 4.
set.seed(123)
final.res <- kmeans(scaled.df, centers = 4, nstart = 25)
# The cluster element gives the cluster labels for each of the n observations.
print(final.res$cluster)
## Alabama Alaska Arizona Arkansas California
## 3 2 3 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 4 2 3 3
## Hawaii Idaho Illinois Indiana Iowa
## 1 3 2 3 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 3 1 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 4 2 2 3 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 3 1 4
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 2 3 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 1 4 4 3
## South Dakota Tennessee Texas Utah Vermont
## 3 3 2 2 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 1 3 2 3
# fviz_cluster is for visualizing clustering results using principal components.
fviz_cluster(final.res, data = scaled.df)
We can examine the average values of the features, in terms of the original (unscaled) data, for the four clusters obtained.
library(tidyverse)
newdf %>%
group_by(Cluster = final.res$cluster) %>%
summarise_all("mean")
## # A tibble: 4 x 4
## Cluster Cases Deaths Vaccinated
## <int> <dbl> <dbl> <dbl>
## 1 1 8077. 105. 64.1
## 2 2 14133. 179. 54.3
## 3 3 16245. 252. 47.6
## 4 4 12880 267. 66.1
Four clusters have been identified in terms of the chosen COVID measures. States in Cluster 1 have the lowest average cases and deaths per 100,000 population, while their average vaccination rate is close to that of the most-vaccinated cluster, Cluster 4. Cluster 2 contains states with the second-highest average cases, fewer deaths than Clusters 3 and 4, and a vaccination rate lower than in Clusters 1 and 4. Cluster 3 has the highest average cases, a relatively high number of deaths, and the lowest vaccination rate of the four. Cluster 4 includes states with the highest average deaths and also the highest average vaccination rate.
K-means clustering is a simple and elegant approach that scales to very large datasets; unlike hierarchical and model-based approaches, it is not computationally intensive. However, it has some disadvantages. K-means requires K to be chosen by the user; as discussed, in the absence of domain knowledge there are methods to help, but it can be difficult to reach a consensus based on their output. Another disadvantage is that the results depend on the initial random assignment of observations to clusters, so the final results may not be reproducible unless a seed is set before running the algorithm. Finally, since the approach is based on centroids, outliers can drag the centroids; this may be dealt with by treating outliers prior to clustering or by using other approaches such as K-medoids clustering, sketched briefly below.
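As a brief, hedged sketch of that alternative (not part of the analysis above), K-medoids can be run on the same scaled data with pam() from the cluster package; the choice of 4 clusters simply mirrors the K-means solution.
# K-medoids (PAM) on the scaled data, for comparison with the K-means solution
library(cluster)
pam_res <- pam(scaled.df, k = 4)                              # partitioning around medoids
pam_res$medoids                                               # each medoid is an actual state
table(KMeans = final.res$cluster, PAM = pam_res$clustering)   # cross-tabulate the two labelings
Because each medoid is an actual observation rather than an average, PAM is less sensitive to extreme values than K-means.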