In this paper I will describe how to conduct clustering with various algorithms. I will show the results obtained with the K-means algorithm, the PAM algorithm and hierarchical clustering. There is also a popular algorithm called CLARA, but it is mainly intended for large datasets, and with only 167 observations there is no need to introduce it. I will also show how to assess clustering tendency and clustering quality.
The dataset I'm going to work on comes from Kaggle and can be found under this link: https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data. It describes 167 countries with 9 variables, namely:
- child mortality - number of deaths of children under 5 years of age per 1000 live births
- exports - exports of goods and services, given as a percentage of GDP per capita
- health - health spending per capita, given as a percentage of GDP per capita
- imports - value of imports of goods and services, given as a percentage of GDP per capita
- income - net income per person
- inflation - the annual growth rate of total GDP
- life expectancy - the average number of years a newborn would live if current mortality patterns were to remain the same
- total fertility - the number of children that would be born to each woman if current age-fertility rates remained the same
- GDP per capita - calculated as total GDP divided by total population
Load the data
library(data.table)
data <- read.csv("Country_data.csv")
head(data)
## country child_mort exports health imports income inflation
## 1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44
## 2 Albania 16.6 28.0 6.55 48.6 9930 4.49
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.10
## 4 Angola 119.0 62.3 2.85 42.9 5900 22.40
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44
## 6 Argentina 14.5 18.9 8.10 16.0 18700 20.90
## life_expec total_fer gdpp
## 1 56.2 5.82 553
## 2 76.3 1.65 4090
## 3 76.5 2.89 4460
## 4 60.1 6.16 3530
## 5 76.8 2.13 12200
## 6 75.8 2.37 10300
sum(is.na(data))
## [1] 0
Fortunately, our dataset doesn't include any missing values. There seems to be no need to tidy the data, as it is already well prepared. I will now look at summary statistics for all the variables.
summary(data)
## country child_mort exports health
## Length:167 Min. : 2.60 Min. : 0.109 Min. : 1.810
## Class :character 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920
## Mode :character Median : 19.30 Median : 35.000 Median : 6.320
## Mean : 38.27 Mean : 41.109 Mean : 6.816
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600
## Max. :208.00 Max. :200.000 Max. :17.900
## imports income inflation life_expec
## Min. : 0.0659 Min. : 609 Min. : -4.210 Min. :32.10
## 1st Qu.: 30.2000 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30
## Median : 43.3000 Median : 9960 Median : 5.390 Median :73.10
## Mean : 46.8902 Mean : 17145 Mean : 7.782 Mean :70.56
## 3rd Qu.: 58.7500 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80
## Max. :174.0000 Max. :125000 Max. :104.000 Max. :82.80
## total_fer gdpp
## Min. :1.150 Min. : 231
## 1st Qu.:1.795 1st Qu.: 1330
## Median :2.410 Median : 4660
## Mean :2.948 Mean : 12964
## 3rd Qu.:3.880 3rd Qu.: 14050
## Max. :7.490 Max. :105000
library(plyr)
library(psych)
multi.hist(data[, 2:10])
As can be seen from the histograms, some of the variables are highly skewed. There also seem to be outliers for almost every variable. It's worth keeping that in mind when choosing the best algorithm for clustering, because some algorithms perform worse in the presence of outliers.
I will also standardize the data. Standardization is needed because the variables are measured in different units, and without it variables with a larger magnitude would dominate the distance calculations. To avoid this, each variable is transformed to have a mean of 0 and a variance of 1. I will also create another data frame with only the numerical variables (without country names) for clustering.
data_cl <- as.data.frame(lapply(data[,2:10], scale))
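As a quick optional sanity check (not part of the original workflow), after scaling every column should have a mean of approximately 0 and a standard deviation of 1:
# Optional check: column means should be ~0 and standard deviations ~1 after scaling
round(colMeans(data_cl), 3)
round(apply(data_cl, 2, sd), 3)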
First of all, before conducting the clustering, I will check the clusterability of the data. To do that I will use the Hopkins statistic, which can be obtained from the get_clust_tendency() function in the factoextra package. The Hopkins statistic takes values from 0 to 1, with values close to 0 indicating no statistically significant clusters and values close to 1 indicating the presence of statistically significant clusters.
library(factoextra)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#n is a sample size for the Hopkins stat. The higher the n, the better the estimate, although if too many observations are selected some assumptions can be violated.
#(R. G. Lawson and P. C. Jurs, 1990) suggest sampling around 5-10% of the data. I will go for 10%.
get_clust_tendency(data_cl, 17)
## $hopkins_stat
## [1] 0.8533165
##
## $plot
The Hopkins statistic is equal to 0.85, which seems satisfactory. Here (https://www.datanovia.com/en/lessons/assessing-clustering-tendency/) we can read that an H higher than 0.75 indicates a clustering tendency at the 90% confidence level, so there is at least a 90% chance that there are meaningful clusters in the dataset.
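For intuition only, the statistic can be approximated by hand using one common formulation: sample m observations and record their nearest-neighbour distances within the data (w), generate m points uniformly over the range of each variable and record their nearest-neighbour distances to the data (u), and take sum(u) / (sum(u) + sum(w)). A rough sketch (the function name is mine and purely illustrative; get_clust_tendency() is what I actually use):
# Illustrative, hand-rolled approximation of the Hopkins statistic
hopkins_by_hand <- function(X, m = 17) {
  X <- as.matrix(X)
  n <- nrow(X)
  # distance from point p to its nearest neighbour among the rows of X
  nn_dist <- function(p) {
    d <- sqrt(colSums((t(X) - p)^2))
    min(d[d > 0])
  }
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(X[i, ]))              # real observations
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(U, 1, nn_dist)                                  # uniform points
  sum(u) / (sum(u) + sum(w))
}
set.seed(123)
hopkins_by_hand(data_cl)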
For every algorithm I will follow the same procedure: first I will determine the optimal number of clusters, then I will do the clustering and check the quality of the partitioning. At the end of the paper I will compare the results of the different algorithms.
There are various methods for determining the appropriate number of clusters. However, there is usually no straightforward or single correct answer to that problem. It can depend on the purpose of the clustering and what we want to achieve, so the final decision is up to the researcher. The most popular methods for deciding on the number of clusters are the elbow, silhouette and gap statistic methods.
The elbow method uses the total sum of squares (variance explained) to determine the appropriate number of clusters. Here I will use within-group heterogeneity, measured by the total variation within clusters. We want it to be as low as possible (so that observations within a cluster are as similar as possible), but after a certain point the reduction gained from adding another cluster starts to diminish. The general idea is to choose that point (the elbow point, hence the name of the method), after which the total within sum of squares falls more slowly and the plot takes a more linear shape.
However, it's worth keeping in mind that this is a heuristic, a rule of thumb that may not always give the best answer.
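The same kind of curve can be produced by hand from kmeans() output, since kmeans() reports the total within-cluster sum of squares in tot.withinss; this is a closely related quantity to what Optimal_Clusters_KMeans() visualises below (a sketch; the object name is mine):
# Total within-cluster sum of squares for k = 1..10, computed by hand
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(data_cl, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within SS")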
library(ClusterR)
opt<-Optimal_Clusters_KMeans(data_cl, max_clusters=10, plot_clusters = TRUE)
If we followed the elbow method, we would choose 3 clusters for the K-means algorithm, because after that point the returns start to diminish (from 2 to 3 clusters there is a 0.15 gain, from 3 to 4 only a 0.08 gain).
I will now also use the silhouette and gap statistic methods from the factoextra package. The silhouette method compares the average silhouette width for every number of clusters. The silhouette measures how similar a point is to the cluster it belongs to: values close to 1 indicate that it lies very close to the centre of its own cluster (the mean for K-means, the medoid for PAM), values close to -1 indicate that it lies far from its own centre and close to the centre of another cluster, and an average value near 0 means that the clusters overlap. The average silhouette method suggests 5 clusters as the optimal number, although the average value is quite similar across the range of 2 to 5 clusters (around 0.3).
The gap statistic method focuses on within-cluster dispersion. Higher values indicate that the clustering structure is significantly different from a random distribution of points (R. Tibshirani, G. Walther and T. Hastie, 2001). We use a similar approach to the elbow method here, choosing the point after which the returns diminish. The gap statistic method suggests using 3 clusters.
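To make the definition concrete: for an observation i, a(i) is its average distance to the other members of its own cluster, b(i) is the lowest average distance to the members of any other cluster, and the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)). A minimal sketch of that computation (silhouette() from the cluster package does this for all observations; the helper name is mine and illustrative only):
# Illustrative helper: silhouette value of observation i, given a dist object d
# and a vector of cluster assignments
silhouette_of_point <- function(i, d, clusters) {
  d <- as.matrix(d)
  own <- clusters == clusters[i]
  a <- mean(d[i, own & seq_along(clusters) != i])     # average within-cluster distance
  b <- min(tapply(d[i, !own], clusters[!own], mean))  # nearest other cluster
  (b - a) / max(a, b)
}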
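The gap statistic can also be computed directly with clusGap() from the cluster package (as far as I can tell, fviz_nbclust() builds on it). A sketch, where B is the number of bootstrap reference datasets and 50 is just an example value:
# Direct computation of the gap statistic (illustrative; results not shown)
library(cluster)
set.seed(123)
gap <- clusGap(data_cl, FUNcluster = kmeans, K.max = 10, B = 50)
plot(gap)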
library(pdp)
library(factoextra)
a <- fviz_nbclust(data_cl, kmeans, method = "silhouette",k.max=10)+
labs(subtitle = "Silhouette method")
b <- fviz_nbclust(data_cl, kmeans, method = "wss",k.max=10)+
labs(subtitle = "Total within sum of squares method")
c <- fviz_nbclust(data_cl, kmeans, method = "gap_stat",k.max=10)+
labs(subtitle = "Gap statistic method")
grid.arrange(a, b, c, ncol=2)
I will now compare the results for K-means with 3 and 5 clusters.
set.seed(123)
countries_clusters3 <- kmeans(data_cl, 3)
set.seed(123)
countries_clusters5 <- kmeans(data_cl, 5)
We can visualize clusters.
a <- fviz_cluster(list(data = data_cl, cluster = countries_clusters3$cluster),
ellipse.type = "norm", geom = "point", stand = FALSE,
palette = "jco", ggtheme = theme_classic())
b <- fviz_cluster(list(data = data_cl, cluster = countries_clusters5$cluster),
ellipse.type = "norm", geom = "point", stand = FALSE,
palette = "jco", ggtheme = theme_classic())
grid.arrange(a, b, ncol=2)
And their average silhouette width (ASW). The ASW for 5 clusters is actually lower than the optimal-number-of-clusters analysis would suggest (the average silhouette method indicated the ASW would be highest for 5 clusters, at around 0.3). However, the K-means implementation has elements of randomness (the choice of initial cluster centres) and may give different results with each run. That's why it is important to set the seed in R in order to get the same results every time.
The fifth cluster of the 5-cluster K-means variant mainly consists of outliers (remember that we had some for almost every variable), and its average silhouette is very low (even negative). Not optimal.
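Apart from setting the seed, the sensitivity to the random initialisation can be reduced by letting kmeans() try several random starts and keep the best solution, i.e. the one with the lowest total within-cluster sum of squares (a sketch; the object name is mine):
# 25 random initialisations; kmeans() keeps the run with the lowest tot.withinss
set.seed(123)
countries_clusters3_stable <- kmeans(data_cl, 3, nstart = 25)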
library(cluster)
SIL <- silhouette(countries_clusters5$cluster, dist(data_cl))
aggregate(SIL[,3], list(SIL[,1]), mean)
## Group.1 x
## 1 1 0.46469762
## 2 2 0.18964702
## 3 3 0.19718413
## 4 4 0.22187313
## 5 5 -0.03034447
library(cluster)
sil1<-silhouette(countries_clusters3$cluster, dist(data_cl))
c <- fviz_silhouette(sil1)
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
sil2<-silhouette(countries_clusters5$cluster, dist(data_cl))
d <- fviz_silhouette(sil2)
## cluster size ave.sil.width
## 1 1 22 0.46
## 2 2 40 0.19
## 3 3 45 0.20
## 4 4 52 0.22
## 5 5 8 -0.03
grid.arrange(c, d, ncol=2)
Based on the analysis above, I think K-means with 3 clusters is the better choice. Contrary to the optimal-number-of-clusters analysis, the average silhouette width is actually higher for K-means with 3 clusters. For the 5-cluster K-means the 5th cluster is also quite problematic.
If we go for the 3-cluster K-means, we can also check which observations belong to which cluster.
setDT(data)
data$clusters <- countries_clusters3$cluster
as.vector(data[clusters == "1", country])
## [1] "Australia" "Austria" "Bahrain"
## [4] "Belgium" "Brunei" "Canada"
## [7] "Cyprus" "Czech Republic" "Denmark"
## [10] "Finland" "France" "Germany"
## [13] "Greece" "Iceland" "Ireland"
## [16] "Israel" "Italy" "Japan"
## [19] "Kuwait" "Luxembourg" "Malta"
## [22] "Netherlands" "New Zealand" "Norway"
## [25] "Portugal" "Qatar" "Singapore"
## [28] "Slovak Republic" "Slovenia" "South Korea"
## [31] "Spain" "Sweden" "Switzerland"
## [34] "United Arab Emirates" "United Kingdom" "United States"
as.vector(data[clusters == "2", country])
## [1] "Albania" "Algeria"
## [3] "Antigua and Barbuda" "Argentina"
## [5] "Armenia" "Azerbaijan"
## [7] "Bahamas" "Bangladesh"
## [9] "Barbados" "Belarus"
## [11] "Belize" "Bhutan"
## [13] "Bolivia" "Bosnia and Herzegovina"
## [15] "Brazil" "Bulgaria"
## [17] "Cambodia" "Cape Verde"
## [19] "Chile" "China"
## [21] "Colombia" "Costa Rica"
## [23] "Croatia" "Dominican Republic"
## [25] "Ecuador" "Egypt"
## [27] "El Salvador" "Estonia"
## [29] "Fiji" "Georgia"
## [31] "Grenada" "Guatemala"
## [33] "Guyana" "Hungary"
## [35] "India" "Indonesia"
## [37] "Iran" "Jamaica"
## [39] "Jordan" "Kazakhstan"
## [41] "Kyrgyz Republic" "Latvia"
## [43] "Lebanon" "Libya"
## [45] "Lithuania" "Macedonia, FYR"
## [47] "Malaysia" "Maldives"
## [49] "Mauritius" "Micronesia, Fed. Sts."
## [51] "Moldova" "Mongolia"
## [53] "Montenegro" "Morocco"
## [55] "Myanmar" "Nepal"
## [57] "Oman" "Panama"
## [59] "Paraguay" "Peru"
## [61] "Philippines" "Poland"
## [63] "Romania" "Russia"
## [65] "Samoa" "Saudi Arabia"
## [67] "Serbia" "Seychelles"
## [69] "Solomon Islands" "Sri Lanka"
## [71] "St. Vincent and the Grenadines" "Suriname"
## [73] "Tajikistan" "Thailand"
## [75] "Tonga" "Tunisia"
## [77] "Turkey" "Turkmenistan"
## [79] "Ukraine" "Uruguay"
## [81] "Uzbekistan" "Vanuatu"
## [83] "Venezuela" "Vietnam"
as.vector(data[clusters == "3", country])
## [1] "Afghanistan" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Equatorial Guinea"
## [15] "Eritrea" "Gabon"
## [17] "Gambia" "Ghana"
## [19] "Guinea" "Guinea-Bissau"
## [21] "Haiti" "Iraq"
## [23] "Kenya" "Kiribati"
## [25] "Lao" "Lesotho"
## [27] "Liberia" "Madagascar"
## [29] "Malawi" "Mali"
## [31] "Mauritania" "Mozambique"
## [33] "Namibia" "Niger"
## [35] "Nigeria" "Pakistan"
## [37] "Rwanda" "Senegal"
## [39] "Sierra Leone" "South Africa"
## [41] "Sudan" "Tanzania"
## [43] "Timor-Leste" "Togo"
## [45] "Uganda" "Yemen"
## [47] "Zambia"
After a quick look it would seem that the 3rd cluster consists mainly of poor or less developed countries, the 2nd cluster of developing countries and the 1st of rich countries with a good standard of living; however, there are some exceptions and a much more detailed and thorough analysis would be needed.
We can visualize general statistics for every cluster with a simple boxplot.
We can see that countries in the first cluster have relatively high GDP per capita, life expectancy, income and health spending and low child mortality, but on the other hand low fertility rates. Countries in the third cluster have low GDP per capita and low incomes, but high fertility rates, high inflation and high child mortality.
library(flexclust)
groupBWplot(data_cl, countries_clusters3$cluster, alpha=0.05)
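Another optional way to profile the clusters is to compute the mean of every original (unscaled) variable within each cluster, here using the data.table syntax since the clusters column was added above:
# Mean of each original (unscaled) variable per cluster
data[, lapply(.SD, mean), by = clusters, .SDcols = 2:10]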
The PAM method (partitioning around medoids) uses medoids (the most centrally located objects in a cluster) rather than means, as in K-means, and is therefore less affected by outliers (because they influence the mean but not the medoid). I will now see how the clustering results differ for PAM.
But first, let's determine the optimal number of clusters.
library(pdp)
a <- fviz_nbclust(data_cl, pam, method = "silhouette",k.max=10)+
labs(subtitle = "Silhouette method")
b <- fviz_nbclust(data_cl, pam, method = "wss",k.max=10)+
labs(subtitle = "Total within sum of squares method")
c <- fviz_nbclust(data_cl, pam, method = "gap_stat",k.max=10)+
labs(subtitle = "Gap statistic method")
grid.arrange(a, b, c, ncol=2)
The average silhouette width suggests 2 clusters (although the ASW is similar for PAM with 3 clusters), the elbow method 3 clusters and the gap statistic method 3 clusters. I will therefore compare the results for 2 and 3 clusters.
set.seed(123)
pam2 <- pam(data_cl,2)
set.seed(123)
pam3 <- pam(data_cl,3)
We can see the clusters.
a <- fviz_cluster(list(data = data_cl, cluster = pam2$cluster),
ellipse.type = "norm", geom = "point", stand = FALSE,
palette = "jco", ggtheme = theme_classic())
b <- fviz_cluster(list(data = data_cl, cluster = pam3$cluster),
ellipse.type = "norm", geom = "point", stand = FALSE,
palette = "jco", ggtheme = theme_classic())
grid.arrange(a, b, ncol=2)
library(cluster)
sil1<-silhouette(pam2$cluster, dist(data_cl))
a <- fviz_silhouette(sil1)
## cluster size ave.sil.width
## 1 1 69 0.28
## 2 2 98 0.29
sil2<-silhouette(pam3$cluster, dist(data_cl))
b <- fviz_silhouette(sil2)
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
grid.arrange(a, b, ncol=2)
The average silhouette width is almost the same for 2 and 3 clusters, but with 3 clusters we have more observations with a negative silhouette width.
It's also possible to use a distance metric other than the Euclidean one. Possible metrics are Manhattan, Canberra, binary, maximum or Minkowski. You can play with different ones on your own, but it shouldn't change the results significantly. Here is the average silhouette width for PAM with the Manhattan distance.
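Since PAM's cluster centres are actual observations (the medoids), we can also look up which countries play that role in the 3-cluster solution; their row indices are stored in id.med (an optional check, output not shown):
# Countries acting as medoids of the 3-cluster PAM solution
data$country[pam3$id.med]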
pam3 <- eclust(data_cl, "pam", k=3, hc_metric="manhattan")
fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
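An alternative, more direct way to use a non-Euclidean metric is to pass a precomputed dissimilarity object to pam(); a sketch (the object names are mine):
# PAM on a precomputed Manhattan dissimilarity matrix
d_man <- dist(data_cl, method = "manhattan")
pam3_man <- pam(d_man, k = 3)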
Let's check which countries belong to which cluster.
setDT(data)
data$clusters <- pam3$cluster
as.vector(data[clusters == "1", country])
## [1] "Afghanistan" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Equatorial Guinea"
## [15] "Eritrea" "Gabon"
## [17] "Gambia" "Ghana"
## [19] "Guinea" "Guinea-Bissau"
## [21] "Haiti" "India"
## [23] "Iraq" "Kenya"
## [25] "Kiribati" "Lao"
## [27] "Lesotho" "Liberia"
## [29] "Madagascar" "Malawi"
## [31] "Mali" "Mauritania"
## [33] "Mozambique" "Myanmar"
## [35] "Namibia" "Nepal"
## [37] "Niger" "Nigeria"
## [39] "Pakistan" "Rwanda"
## [41] "Senegal" "Sierra Leone"
## [43] "South Africa" "Sudan"
## [45] "Tajikistan" "Tanzania"
## [47] "Timor-Leste" "Togo"
## [49] "Uganda" "Yemen"
## [51] "Zambia"
as.vector(data[clusters == "2", country])
## [1] "Albania" "Algeria"
## [3] "Antigua and Barbuda" "Argentina"
## [5] "Armenia" "Azerbaijan"
## [7] "Bahrain" "Bangladesh"
## [9] "Barbados" "Belarus"
## [11] "Belize" "Bhutan"
## [13] "Bolivia" "Bosnia and Herzegovina"
## [15] "Brazil" "Bulgaria"
## [17] "Cambodia" "Cape Verde"
## [19] "Chile" "China"
## [21] "Colombia" "Costa Rica"
## [23] "Croatia" "Czech Republic"
## [25] "Dominican Republic" "Ecuador"
## [27] "Egypt" "El Salvador"
## [29] "Estonia" "Fiji"
## [31] "Georgia" "Grenada"
## [33] "Guatemala" "Guyana"
## [35] "Hungary" "Indonesia"
## [37] "Iran" "Jamaica"
## [39] "Jordan" "Kazakhstan"
## [41] "Kyrgyz Republic" "Latvia"
## [43] "Lebanon" "Libya"
## [45] "Lithuania" "Macedonia, FYR"
## [47] "Malaysia" "Maldives"
## [49] "Malta" "Mauritius"
## [51] "Micronesia, Fed. Sts." "Moldova"
## [53] "Mongolia" "Montenegro"
## [55] "Morocco" "Oman"
## [57] "Panama" "Paraguay"
## [59] "Peru" "Philippines"
## [61] "Poland" "Romania"
## [63] "Russia" "Samoa"
## [65] "Saudi Arabia" "Serbia"
## [67] "Seychelles" "Singapore"
## [69] "Slovak Republic" "Solomon Islands"
## [71] "South Korea" "Sri Lanka"
## [73] "St. Vincent and the Grenadines" "Suriname"
## [75] "Thailand" "Tonga"
## [77] "Tunisia" "Turkey"
## [79] "Turkmenistan" "Ukraine"
## [81] "Uruguay" "Uzbekistan"
## [83] "Vanuatu" "Venezuela"
## [85] "Vietnam"
as.vector(data[clusters == "3", country])
## [1] "Australia" "Austria" "Bahamas"
## [4] "Belgium" "Brunei" "Canada"
## [7] "Cyprus" "Denmark" "Finland"
## [10] "France" "Germany" "Greece"
## [13] "Iceland" "Ireland" "Israel"
## [16] "Italy" "Japan" "Kuwait"
## [19] "Luxembourg" "Netherlands" "New Zealand"
## [22] "Norway" "Portugal" "Qatar"
## [25] "Slovenia" "Spain" "Sweden"
## [28] "Switzerland" "United Arab Emirates" "United Kingdom"
## [31] "United States"
Hierarchical clustering is an unsupervised learning algorithm that clusters observations iteratively, either starting with every observation as a single cluster and joining the two most similar clusters together until there is just one cluster (the agglomerative approach), or starting with one cluster containing all observations and splitting it until each observation is its own cluster (the divisive approach). It is usually presented as a tree (a dendrogram), with each level of the tree showing the clusters at that level. The problem for the researcher is to choose an optimal number of clusters by cutting the tree at an appropriate level.
We can use various methods for hierarchical clustering, such as average linkage, single linkage, complete linkage or Ward's method. To choose between them we can compare the agglomerative coefficient, which measures the strength of the clustering structure found (values closer to 1 suggest a stronger structure).
methods <- c("average", "single", "complete", "ward")
names(methods) <- methods
# agglomerative coefficient for a given linkage method
hierarchical_ac <- function(x) {
  agnes(data_cl, method = x)$ac
}
library(purrr)
map_dbl(methods, hierarchical_ac)
## average single complete ward
## 0.8752786 0.8430072 0.9130134 0.9519134
The agglomerative coefficient is highest for the complete linkage and Ward's methods, so I will choose those two for hierarchical clustering.
Perform hierarchical clustering with the agnes function.
library(cluster)
set.seed(123)
hc <- agnes(data_cl, method = "ward")
And visualize the dendrogram.
fviz_dend(hc, k = 3, k_colors = "jco", as.ggplot = TRUE, show_labels = FALSE)
Now hierarchical clustering with the complete linkage method.
set.seed(123)
hc2 <- hclust(dist(data_cl), method = "complete")
If we decide on 3 clusters with complete linkage, we have 2 large clusters and one with very few observations (only 3), which may not always be satisfactory. It usually depends on the problem, but for clustering countries it is not necessarily optimal.
fviz_dend(hc2, k = 3, k_colors = "jco", as.ggplot = TRUE, show_labels = FALSE)
groups <- cutree(hc2, k = 3)
table(groups)
## groups
## 1 2 3
## 55 109 3
For hierarchical clustering we can also plot average silhouette width statistics. For Ward's method the average silhouette width is equal to 0.25 and for the complete linkage method to 0.28; however, in both cases (especially for Ward's method) we have many observations with negative silhouette widths. That means that some countries are more similar to countries outside their own cluster.
library(cluster)
plot(silhouette(cutree(hc, k = 3), dist(data_cl)))
library(cluster)
plot(silhouette(cutree(hc2, k = 2), dist(data_cl)))
I will also check how divisive hierarchical clustering performs.
set.seed(123)
hc_div <- diana(data_cl)
As we can see, divisive hierarchical clustering with 3 clusters results in one large cluster and 2 clusters with only a few observations. For 5 clusters we obtain 2 large clusters and 3 rather small ones. It all depends on the problem, but in our case I don't think it is optimal.
a <- fviz_dend(hc_div, k = 3, k_colors = "jco", as.ggplot = TRUE, show_labels = FALSE)
b <- fviz_dend(hc_div, k = 5, k_colors = "jco", as.ggplot = TRUE, show_labels = FALSE)
grid.arrange(a, b, ncol = 2)
Such a partitioning results in a very high average silhouette width, but it is not necessarily the best choice. For example, for only 2 clusters the ASW is equal to 0.63, but we have one cluster with 166 observations and a cluster with only 1 observation. Such a clustering is rather pointless.
library(cluster)
plot(silhouette(cutree(hc_div, k = 2), dist(data_cl)))
For 4 clusters the average silhouette width is equal to 0.3, but this time we don't have any observations with a negative silhouette width.
library(cluster)
plot(silhouette(cutree(hc_div, k = 4), dist(data_cl)))
Let's see which observations are in the clusters that contain only a few of them.
data$clusters <- cutree(hc_div, k = 4)
as.vector(data[clusters == "3", country])
## [1] "Ireland" "Luxembourg" "Malta" "Singapore"
as.vector(data[clusters == "4", country])
## [1] "Nigeria"
Clusters for hierarchical clustering using Ward's method.
setDT(data)
data$clusters <- cutree(hc, k = 3)
as.vector(data[clusters == "1", country])
## [1] "Afghanistan" "Benin"
## [3] "Burkina Faso" "Burundi"
## [5] "Cameroon" "Central African Republic"
## [7] "Chad" "Comoros"
## [9] "Congo, Dem. Rep." "Cote d'Ivoire"
## [11] "Gambia" "Guinea"
## [13] "Guinea-Bissau" "Haiti"
## [15] "Kenya" "Madagascar"
## [17] "Malawi" "Mali"
## [19] "Mozambique" "Niger"
## [21] "Rwanda" "Senegal"
## [23] "Sierra Leone" "Tanzania"
## [25] "Togo" "Uganda"
## [27] "Zambia"
as.vector(data[clusters == "2", country])
## [1] "Albania" "Algeria"
## [3] "Angola" "Antigua and Barbuda"
## [5] "Argentina" "Armenia"
## [7] "Azerbaijan" "Bahamas"
## [9] "Bangladesh" "Barbados"
## [11] "Belarus" "Belize"
## [13] "Bhutan" "Bolivia"
## [15] "Bosnia and Herzegovina" "Botswana"
## [17] "Brazil" "Bulgaria"
## [19] "Cambodia" "Cape Verde"
## [21] "Chile" "China"
## [23] "Colombia" "Congo, Rep."
## [25] "Costa Rica" "Croatia"
## [27] "Cyprus" "Czech Republic"
## [29] "Dominican Republic" "Ecuador"
## [31] "Egypt" "El Salvador"
## [33] "Equatorial Guinea" "Eritrea"
## [35] "Estonia" "Fiji"
## [37] "Gabon" "Georgia"
## [39] "Ghana" "Grenada"
## [41] "Guatemala" "Guyana"
## [43] "Hungary" "India"
## [45] "Indonesia" "Iran"
## [47] "Iraq" "Jamaica"
## [49] "Jordan" "Kazakhstan"
## [51] "Kiribati" "Kyrgyz Republic"
## [53] "Lao" "Latvia"
## [55] "Lebanon" "Lesotho"
## [57] "Liberia" "Lithuania"
## [59] "Macedonia, FYR" "Malaysia"
## [61] "Maldives" "Mauritania"
## [63] "Mauritius" "Micronesia, Fed. Sts."
## [65] "Moldova" "Mongolia"
## [67] "Montenegro" "Morocco"
## [69] "Myanmar" "Namibia"
## [71] "Nepal" "Nigeria"
## [73] "Pakistan" "Panama"
## [75] "Paraguay" "Peru"
## [77] "Philippines" "Poland"
## [79] "Romania" "Russia"
## [81] "Samoa" "Serbia"
## [83] "Seychelles" "Slovak Republic"
## [85] "Slovenia" "Solomon Islands"
## [87] "South Africa" "South Korea"
## [89] "Sri Lanka" "St. Vincent and the Grenadines"
## [91] "Sudan" "Suriname"
## [93] "Tajikistan" "Thailand"
## [95] "Timor-Leste" "Tonga"
## [97] "Tunisia" "Turkey"
## [99] "Turkmenistan" "Ukraine"
## [101] "Uruguay" "Uzbekistan"
## [103] "Vanuatu" "Venezuela"
## [105] "Vietnam" "Yemen"
as.vector(data[clusters == "3", country])
## [1] "Australia" "Austria" "Bahrain"
## [4] "Belgium" "Brunei" "Canada"
## [7] "Denmark" "Finland" "France"
## [10] "Germany" "Greece" "Iceland"
## [13] "Ireland" "Israel" "Italy"
## [16] "Japan" "Kuwait" "Libya"
## [19] "Luxembourg" "Malta" "Netherlands"
## [22] "New Zealand" "Norway" "Oman"
## [25] "Portugal" "Qatar" "Saudi Arabia"
## [28] "Singapore" "Spain" "Sweden"
## [31] "Switzerland" "United Arab Emirates" "United Kingdom"
## [34] "United States"
There is also a package that compares different clustering algorithms and returns the best one based on the connectivity statistic, Dunn's index and the average silhouette width.
library(cluster)
par(mfrow=c(4,4))
library(clValid)
## Warning: package 'clValid' was built under R version 4.0.4
valid <- clValid(data_cl, nClust = 2:30, clMethods = c("hierarchical","kmeans","pam"), validation = "internal", maxitems = 100000)
## Warning in clValid(data_cl, nClust = 2:30, clMethods = c("hierarchical", :
## rownames for data not specified, using 1:nrow(data)
summary(valid)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
##
## hierarchical Connectivity 2.9290 7.8825 11.0794 14.0083 16.9373 18.7706 22.6286 39.7468 42.6758 47.7865 47.8976 49.8976 59.8468 63.3881 70.2635 75.4393 83.7095 86.5135 89.7242 97.6333 98.0444 111.7754 113.8944 125.4528 126.2591 129.3992 133.3254 139.7155 141.9417
## Dunn 0.5457 0.3431 0.3284 0.3771 0.2914 0.2914 0.2786 0.1775 0.1775 0.1775 0.1775 0.1775 0.1546 0.1768 0.1929 0.1929 0.1943 0.1943 0.1943 0.2120 0.2120 0.1798 0.1798 0.1798 0.1798 0.1798 0.1798 0.2240 0.2240
## Silhouette 0.6303 0.5620 0.4970 0.3858 0.2850 0.2786 0.1357 0.2342 0.1775 0.1730 0.1613 0.1608 0.2219 0.2456 0.2381 0.2255 0.2182 0.1713 0.1762 0.1715 0.1630 0.1614 0.1601 0.1829 0.1812 0.1712 0.1749 0.1795 0.1768
## kmeans Connectivity 2.9290 39.2329 44.1865 42.5603 44.4325 46.2659 68.6206 79.5123 95.2468 97.7734 110.2825 124.5762 125.3575 124.1448 125.8198 128.3825 142.7984 150.7829 155.5222 167.8960 165.2579 162.0881 161.4103 163.8698 167.3627 172.4409 177.1528 183.1333 185.3595
## Dunn 0.5457 0.0666 0.0838 0.0861 0.0972 0.0972 0.1107 0.1164 0.0675 0.0804 0.1166 0.0804 0.0804 0.0939 0.0939 0.0939 0.0956 0.1143 0.1177 0.1366 0.1366 0.1561 0.2045 0.2045 0.2045 0.2045 0.2045 0.2045 0.2045
## Silhouette 0.6303 0.2859 0.3005 0.3044 0.3107 0.3064 0.2471 0.2423 0.2177 0.2221 0.2068 0.1876 0.1858 0.1873 0.1978 0.2001 0.1927 0.2098 0.1988 0.2041 0.2029 0.2187 0.2223 0.2219 0.2200 0.2157 0.2197 0.2190 0.2160
## pam Connectivity 37.8583 48.0448 80.9861 82.9302 77.9345 89.9044 102.7837 103.4865 115.7131 122.1310 131.5242 142.6385 153.3155 165.6421 167.4472 168.9333 174.5635 156.3500 158.1833 160.2893 162.2893 164.0282 179.2409 179.8758 181.6437 195.9143 195.9198 196.3365 198.1258
## Dunn 0.0666 0.0600 0.0633 0.0633 0.0633 0.0633 0.0537 0.0937 0.0974 0.0736 0.0471 0.0471 0.0471 0.0471 0.0974 0.0974 0.0471 0.0974 0.1022 0.1022 0.1022 0.1022 0.0651 0.1019 0.1019 0.1054 0.1054 0.1182 0.1182
## Silhouette 0.2846 0.2810 0.2054 0.2203 0.2412 0.2168 0.1893 0.1986 0.1907 0.2055 0.1813 0.1769 0.1743 0.1711 0.1662 0.1764 0.1741 0.1869 0.1839 0.1667 0.1650 0.1743 0.1742 0.1745 0.1757 0.1775 0.1858 0.1870 0.1909
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 2.9290 hierarchical 2
## Dunn 0.5457 hierarchical 2
## Silhouette 0.6303 hierarchical 2
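If only the winning combinations are needed, clValid also provides the optimalScores() accessor, which returns the same information as the "Optimal Scores" table above:
# Best score, method and number of clusters for each validation measure
optimalScores(valid)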
If we run it on our data, we can see that all of the statistics are best for hierarchical clustering with 2 clusters. However, we've already seen that a 2-cluster hierarchical solution puts almost all observations in one cluster and only a few in the other. Despite the high scores, that's not what we're looking for.
All of the algorithms performed fairly similarly in terms of statistics like the average silhouette width. K-means and PAM with 3 clusters resulted in a similar clustering structure, but PAM had fewer observations with a negative silhouette width (possibly because PAM handles outliers better), and therefore of the two I am going to choose the PAM algorithm. Hierarchical clustering with 4 clusters also did quite well in terms of clustering quality and could be considered.