Clustering is grouping a set of objects in such a way that objects in the same cluster are more similar (in some sense) to each other than to those in other groups. CLustering has many applications such as market research, pattern recognition, data analysis, and image processing.
In this project I used countries data from https://www.kaggle.com/fernandol/countries-of-the-world. After checking the clusteribility of the data, the best clustering method is to be chosen together with the optimum number of clusters. I will also discuss the efficiency of the method applied comparing it to the true classification of world countries, based on another data.
All steps applied are discussed here with the code blocks.
#### Necessary Packages ####
requiredPackages = c("clValid","factoextra","flexclust","fpc","haven","clustertend","cluster","ClusterR","tidyverse","dendextend","NbClust","gridExtra","stats","modeest")
for(i in requiredPackages){if(!require(i,character.only = TRUE)) install.packages(i)}
for(i in requiredPackages){if(!require(i,character.only = TRUE)) library(i,character.only = TRUE) }
I believe there is a pattern across countries when it comes to the differences in infant mortality rate, GDP Per Capita and birthrate. Other variables in the data could be equally useful, but these 3 are chosen for simplicity.
countriesData <- read.csv2 ("countries of the world.csv",header = T, sep = ",")
countriesData <- data.frame(countriesData[,c(3,8,9,16)], row.names = countriesData$Country)
colnames(countriesData ) = c("Population","Infant mortality rate","GDP per Capita","Birthrate")
head(countriesData)
## Population Infant mortality rate GDP per Capita Birthrate
## Afghanistan 31056997 163.07 700 46.60
## Albania 3581655 21.52 4500 15.11
## Algeria 32930091 31.00 6000 17.14
## American Samoa 57794 9.27 8000 22.46
## Andorra 71201 4.05 19000 8.71
## Angola 12127071 191.19 1900 45.11
# number of observations with missing values
sum(is.na(countriesData))
## [1] 7
#remove all observations with missing data
countriesData <- countriesData[complete.cases(countriesData), ]
sum(is.na(countriesData))
## [1] 0
Small countries with smaller than a million population usually behave differently to other, therefore removed.This simplifies our task, hopefully improving the clustering efficiency. Removing countries with less than 20 million populations will remove the small countries with a valuable natural resources having high GDP with low human development statistics. It will also be easier to plot and discuss the results with smaller number of countries.
#remove countries with smaller population than 20000000
countriesData <- countriesData %>% filter(Population >= 20000000)
nrow(countriesData)
## [1] 53
# list of 53 countries
head(countriesData)
## Population Infant mortality rate GDP per Capita Birthrate
## Afghanistan 31056997 163.07 700 46.60
## Algeria 32930091 31.00 6000 17.14
## Argentina 39921833 15.18 11200 16.73
## Australia 20264082 4.69 29000 12.14
## Bangladesh 147365352 62.60 1900 29.80
## Brazil 188078227 29.61 7600 16.56
# 53 countries remain in the list
TestCountriesList = row.names(countriesData)
# without population variable.
countData_interest <- countriesData[,c(2,3,4)]
head(countData_interest)
## Infant mortality rate GDP per Capita Birthrate
## Afghanistan 163.07 700 46.60
## Algeria 31.00 6000 17.14
## Argentina 15.18 11200 16.73
## Australia 4.69 29000 12.14
## Bangladesh 62.60 1900 29.80
## Brazil 29.61 7600 16.56
Normalization (standardization) of the data may result in wrong shape, however, in most cases it is useful. It avoids one or the other variable dominating the distance. z - score normalization is applied to every variable separately.
countData_interest <- scale(countData_interest)
head(countData_interest)
## Infant mortality rate GDP per Capita Birthrate
## Afghanistan 3.8140189 -0.8477670 2.2539345
## Algeria -0.2163901 -0.3162095 -0.4089954
## Argentina -0.6991725 0.2053186 -0.4460558
## Australia -1.0192981 1.9905494 -0.8609523
## Bangladesh 0.7479540 -0.7274144 0.7353594
## Brazil -0.2588090 -0.1557393 -0.4614224
There are many ways of doing this. Some of the popular clusterability checks are applied here. I assumed there would be 3 clusters, considering the natural classification of countries as first, second and third world.
A. Distance matrix Observing the distance marix, we can get the overview of the data and tell if it is clusterable. However, this should be only the first stage.
# Calculate the euclidean distance
distance<-get_dist(countData_interest, method="euclidean")
# Visualizes a distance matrix
fviz_dist(distance, show_labels = FALSE)+ labs(title="Data") #factoextra::
B. Rand, Jaccard & Fowlkes-Mallows # let’s check if partitioning in interests is robust
dista<-cclust(countData_interest, 3, dist="euclidean") # assuming 3 clusters
dista
dista
## kcca object of family 'kmeans'
##
## call:
## cclust(x = countData_interest, k = 3, dist = "euclidean")
##
## cluster sizes:
##
## 1 2 3
## 15 11 27
C. Calinski-Harabasz & Duda-Hart
km1<-kmeans(countData_interest, 3)
round(calinhara(countData_interest, km1$cluster),digits=2)
## [1] 100.11
dudahart2(countData_interest, km1$cluster)
## $p.value
## [1] 3.101963e-13
##
## $dh
## [1] 0.09833575
##
## $compare
## [1] 0.4917111
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
D. Shadow statistics
C_clusters<-cclust(countData_interest, 3, dist="euclidean")
shadow(C_clusters)
## 1 2 3
## 0.2762533 0.5421405 0.4417922
plot(shadow(C_clusters))
Now that we know the data is clusterable, we need to find the best clustering technique. The code below checks the results of 5 different methods for different number of clusters. ‘clValid’ function reports validation measures for clustering results.
A. Finding the best method
All 5 methods tested below have very similar Silhoutte, connectivity and Dunn values for the optimum number of clusters, 3. For the subsequent stages, I chose to continue with “clara” clustering method.Other methods can also be used with little change to the r code used below.
methods <- clValid(countData_interest, 2:6, clMethods = c("hierarchical","kmeans", "diana", "pam", "clara"), validation = "internal")
summary(methods)
##
## Clustering Methods:
## hierarchical kmeans diana pam clara
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical Connectivity 2.9290 4.0790 10.0937 16.8563 19.9282
## Dunn 0.3838 0.1617 0.1646 0.2260 0.2260
## Silhouette 0.5152 0.4521 0.5295 0.5068 0.4620
## kmeans Connectivity 4.6024 4.5508 7.4798 15.4440 27.1778
## Dunn 0.1167 0.1648 0.2792 0.2260 0.0878
## Silhouette 0.4841 0.5723 0.5614 0.5091 0.4133
## diana Connectivity 4.6024 7.5313 8.6813 15.4440 19.0952
## Dunn 0.1167 0.1246 0.1732 0.2260 0.2335
## Silhouette 0.4841 0.4473 0.5428 0.5091 0.4692
## pam Connectivity 1.1500 4.5508 14.4429 28.7024 31.2905
## Dunn 0.1149 0.1648 0.0921 0.0558 0.0788
## Silhouette 0.5003 0.5723 0.5080 0.3893 0.4003
## clara Connectivity 1.1500 4.5508 14.4429 28.7024 31.1524
## Dunn 0.1149 0.1648 0.0921 0.0558 0.0853
## Silhouette 0.5003 0.5723 0.5080 0.3893 0.3934
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 1.1500 pam 2
## Dunn 0.3838 hierarchical 2
## Silhouette 0.5723 kmeans 3
# plot the results
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(methods, legend = FALSE)
legend("right", clusterMethods(methods), col = 1:9, lty = 1:9,pch = paste(1:9))
par(mfrow = c(1, 1))
B. Finding the optimum number of clusters. All methods indicate that the optimum number of clusters is 3. 1. bootstrapping
#bootstrapping while using kmeans
fviz_nbclust(countData_interest, cluster::clara, method = "gap_stat")
wss=function(k) {kmeans(countData_interest, k, nstart=10)$tot.withinss}
k.values = 1:15
# extract wss
wss_values = map_dbl(k.values, wss)
plot(k.values, wss_values, type="b", pch=19, frame=FALSE,
main = "Elbow plot for k-means output",
xlab = "No of clusters",
ylab = "Total within-clusters sum of squares")
3.Silhoutte method Using Silhoutte criteria
optimal_n<-Optimal_Clusters_KMeans(countData_interest, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
With the the optimum number of clusters determined, we can apply the the chosen method and see the results both graphically and statistically.
country_cluster<-clara(countData_interest, 3, metric="euclidean", stand=FALSE, samples=6,
sampsize=50, trace=0, medoids.x=TRUE,
rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
# plot clara output
fviz_cluster(country_cluster, geom="point", ellipse.type="norm")
fviz_cluster(country_cluster, palette=c("#00AFBB", "#FC4E07", "#E7B800"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic())
fviz_silhouette(country_cluster)
## cluster size ave.sil.width
## 1 1 13 0.55
## 2 2 26 0.53
## 3 3 11 0.73
List of countries in each class:
# List of countries in each class.TW: third world, SW: second world, FW: first world
TW <- row.names(as.data.frame(country_cluster$clustering[country_cluster$clustering==1]))
SW <- row.names(as.data.frame(country_cluster$clustering[country_cluster$clustering==2]))
FW <- row.names(as.data.frame(country_cluster$clustering[country_cluster$clustering==3]))
I have created the list of countries based on human development ranking at this website.http://hdr.undp.org/en/content/latest-human-development-index-ranking
The data has 186 countries rated from top to bottom. I divided the data to three categories ( first world, second world and third world). This division is totally based on my opinion. I consider the top 40 countries first world, the next 60, second world and the rest as third world countries.
# load the countries rating data
ranking <- read.csv2('countriesrated.csv', header = T, sep = ",")
head(ranking)
## ï..Country Rating
## 1 Norway 1
## 2 Ireland 2
## 3 Switzerland 3
## 4 Hong Kong, China (SAR) 4
## 5 Iceland 5
## 6 Germany 6
Classify the countries into three ‘true’ clusters. Based on this assumption, Hungary is the last developed country, the likes of Croatia go in to second world countries. Note that this list has all 186 countries, where as the test data has only 53.
# classify the countries in
ranking$class <- ifelse(ranking$Rating < 40, 'First world', ifelse(ranking$Rating < 100, 'Second world','Third world'))
head(ranking)
## ï..Country Rating class
## 1 Norway 1 First world
## 2 Ireland 2 First world
## 3 Switzerland 3 First world
## 4 Hong Kong, China (SAR) 4 First world
## 5 Iceland 5 First world
## 6 Germany 6 First world
# find the three groups based on the class assigned.
True_TW <- ranking$ï..Country[ranking$class == 'Third world']
True_SW <- ranking$ï..Country[ranking$class=='Second world']
True_FW <- ranking$ï..Country[ranking$class=='First world']
Confusion Matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. We do not need TN (true negative) in this case since TN for one group is TP for the other, which is specified.
# confusionMatrix
TruePositive <- rep(0,3) # TruePositive: correctly put into a cluster
FalsePositive <- rep(0,3) # FalsePositive: wrongly put into a cluster
FalseNegative <- rep(0,3) # FalseNegative: wrongly assigned to another cluster
#
names <- c("First world","Second world","Third World")
confusionMatrix <- cbind(TruePositive,FalsePositive,FalseNegative)
rownames(confusionMatrix)<- names
confusionMatrix <- as.data.frame(confusionMatrix)
# TruePositive
confusionMatrix[1,1] <- length(unique(intersect(FW,True_FW)))
confusionMatrix[2,1] <- length(unique(intersect(SW,True_SW)))
confusionMatrix[3,1] <- length(unique(intersect(TW,True_TW)))
# FalsePositive
confusionMatrix[1,2] <- length(unique(append(intersect(FW,True_SW),intersect(FW,True_TW))))
confusionMatrix[2,2] <- length(unique(append(intersect(SW,True_FW),intersect(SW,True_TW))))
confusionMatrix[3,2] <- length(unique(append(intersect(TW,True_FW),intersect(TW,True_SW))))
# FalseNegative
confusionMatrix[1,3] <- length(unique(append(intersect(TW,True_FW),intersect(SW,True_FW))))
confusionMatrix[2,3] <- length(unique(append(intersect(TW,True_SW),intersect(FW,True_SW))))
confusionMatrix[3,3] <- length(unique(append(intersect(FW,True_TW),intersect(SW,True_TW))))
confusionMatrix
## TruePositive FalsePositive FalseNegative
## First world 9 0 1
## Second world 13 7 0
## Third World 13 0 6
We can calculate the efficiency of this clustering method comparing it to the ideal classification results.
TestCountriesList = row.names(countriesData)
Ideal_FW <- unique(intersect(TestCountriesList,True_FW))
Ideal_SW <- unique(intersect(TestCountriesList,True_SW))
Ideal_TW <- unique(intersect(TestCountriesList,True_TW))
Average Accuracy: the mean of cluster efficiencies can be a good estimation of the efficiency of the method applied.
# efficiency function return the ratio of true positive to the length of a cluster
efficiency <- function(ideal,result)
{
k = 0
for (i in result) { if ( i %in% ideal) {k <- k + 1}
}
return(k/length(unique(result)))
}
# average efficiency is the average of efficiency of each cluster.
AverageEfficiency <- (efficiency(Ideal_FW,FW) + efficiency(Ideal_SW,SW) + efficiency(Ideal_TW,TW))/3
AverageEfficiency
## [1] 0.72211
I am happy with the results I have got. With better reference of countries rating, I believe the result could be better.