Clustering countries based on descriptive variables

Clustering groups a set of objects in such a way that objects in the same cluster are more similar (in some sense) to each other than to those in other clusters. Clustering has many applications, such as market research, pattern recognition, data analysis, and image processing.

In this project I used countries data from https://www.kaggle.com/fernandol/countries-of-the-world. After checking the clusterability of the data, the best clustering method is chosen together with the optimum number of clusters. I will also discuss the efficiency of the applied method by comparing it to the true classification of world countries, based on a separate dataset.

All steps are discussed below along with their code blocks.

1. Install the necessary packages

#### Necessary Packages #### 
requiredPackages <- c("clValid","factoextra","flexclust","fpc","haven","clustertend","cluster","ClusterR","tidyverse","dendextend","NbClust","gridExtra","stats","modeest")

# install any missing packages, then load them all
for (p in requiredPackages) {
  if (!require(p, character.only = TRUE)) { install.packages(p); library(p, character.only = TRUE) }
}

2. Load the data and select the variables of interest

I believe there is a pattern across countries in infant mortality rate, GDP per capita and birthrate. Other variables in the data could be equally useful, but these three are chosen for simplicity.

# read.csv2 is used because the file's numeric fields use decimal commas; sep is overridden to ","
countriesData <- read.csv2("countries of the world.csv", header = TRUE, sep = ",")
countriesData <- data.frame(countriesData[, c(3, 8, 9, 16)], row.names = countriesData$Country)
colnames(countriesData) <- c("Population", "Infant mortality rate", "GDP per Capita", "Birthrate")
head(countriesData)
##                 Population Infant mortality rate GDP per Capita Birthrate
## Afghanistan       31056997                163.07            700     46.60
## Albania            3581655                 21.52           4500     15.11
## Algeria           32930091                 31.00           6000     17.14
## American Samoa       57794                  9.27           8000     22.46
## Andorra              71201                  4.05          19000      8.71
## Angola            12127071                191.19           1900     45.11

3. Check for missing entries and remove observations with missing values

# total number of missing values in the data
sum(is.na(countriesData)) 
## [1] 7
#remove all observations with missing data
countriesData <- countriesData[complete.cases(countriesData), ]
sum(is.na(countriesData))
## [1] 0

4. Do further clean-up

Countries with small populations often behave differently from the rest, so they are removed; this simplifies the task and hopefully improves the clustering efficiency. Removing countries with populations below 20 million also drops small countries whose valuable natural resources give them a high GDP despite low human-development statistics. A smaller number of countries is also easier to plot and discuss.

# remove countries with a population below 20,000,000
# (base subsetting keeps the country row names, which dplyr::filter would drop)
countriesData <- countriesData[countriesData$Population >= 20000000, ]
nrow(countriesData) 
## [1] 53
# first rows of the 53 remaining countries
head(countriesData)
##              Population Infant mortality rate GDP per Capita Birthrate
## Afghanistan    31056997                163.07            700     46.60
## Algeria        32930091                 31.00           6000     17.14
## Argentina      39921833                 15.18          11200     16.73
## Australia      20264082                  4.69          29000     12.14
## Bangladesh    147365352                 62.60           1900     29.80
## Brazil        188078227                 29.61           7600     16.56
# 53 countries remain in the list
TestCountriesList <- row.names(countriesData) 

# keep only the three clustering variables; Population was needed only for filtering
countData_interest <- countriesData[,c(2,3,4)]
head(countData_interest)
##              Infant mortality rate GDP per Capita Birthrate
## Afghanistan                 163.07            700     46.60
## Algeria                      31.00           6000     17.14
## Argentina                    15.18          11200     16.73
## Australia                     4.69          29000     12.14
## Bangladesh                   62.60           1900     29.80
## Brazil                       29.61           7600     16.56

5. Scaling

Normalization (standardization) of the data may occasionally distort the cluster shapes, but in most cases it is useful: it prevents one variable from dominating the distance computation. Z-score normalization is applied to each variable separately.

countData_interest <- scale(countData_interest)
head(countData_interest)
##              Infant mortality rate GDP per Capita  Birthrate
## Afghanistan              3.8140189     -0.8477670  2.2539345
## Algeria                 -0.2163901     -0.3162095 -0.4089954
## Argentina               -0.6991725      0.2053186 -0.4460558
## Australia               -1.0192981      1.9905494 -0.8609523
## Bangladesh               0.7479540     -0.7274144  0.7353594
## Brazil                  -0.2588090     -0.1557393 -0.4614224
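For reference, scale() computes the z-score column-wise. A quick sanity check on one column, using the unscaled Birthrate still available in countriesData:

# z-score by hand: z = (x - mean(x)) / sd(x); matches scale()'s column-wise output
x <- countriesData$Birthrate
all.equal((x - mean(x)) / sd(x), as.numeric(scale(x))) # returns TRUE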

6. Check clusterability statistics of the data

There are many ways of doing this; some of the popular clusterability checks are applied here. I assume there will be 3 clusters, considering the natural classification of countries into first, second and third world.

A. Distance matrix

By observing the distance matrix we can get an overview of the data and a first hint of whether it is clusterable. However, this should only be the first stage.

# Calculate the euclidean distance
distance<-get_dist(countData_interest, method="euclidean")

# Visualizes a distance matrix
fviz_dist(distance, show_labels = FALSE)+ labs(title="Data") #factoextra::
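As an additional numeric check (my addition; the clustertend package loaded in step 1 provides it), the Hopkins statistic compares the data to a uniform random sample; values far from 0.5 suggest the data are not uniformly random:

# Hopkins statistic; sampling-based, so fix the seed for reproducibility
set.seed(123)
hopkins(countData_interest, n = nrow(countData_interest) - 1)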

B. Rand, Jaccard & Fowlkes-Mallows

Let's check whether the partitioning of interest is robust.

dista <- cclust(countData_interest, 3, dist = "euclidean") # assuming 3 clusters
dista
## kcca object of family 'kmeans' 
## 
## call:
## cclust(x = countData_interest, k = 3, dist = "euclidean")
## 
## cluster sizes:
## 
##  1  2  3 
## 15 11 27
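The indices named in this heading are not actually computed above; a minimal sketch of how one could check robustness with flexclust::comPart, comparing two independent cclust runs (my addition; values near 1 indicate a stable partition):

# Rand, Jaccard and Fowlkes-Mallows indices between two independent runs
set.seed(1); run1 <- cclust(countData_interest, 3, dist = "euclidean")
set.seed(2); run2 <- cclust(countData_interest, 3, dist = "euclidean")
comPart(clusters(run1), clusters(run2), type = c("RI", "J", "FM"))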

C. Calinski-Harabasz & Duda-Hart

A high Calinski-Harabasz index indicates compact, well-separated clusters; in the Duda-Hart output below, cluster1 = FALSE means the hypothesis of a single homogeneous cluster is rejected at alpha = 0.001.

km1<-kmeans(countData_interest, 3)
round(calinhara(countData_interest, km1$cluster),digits=2) 
## [1] 100.11
dudahart2(countData_interest, km1$cluster)
## $p.value
## [1] 3.101963e-13
## 
## $dh
## [1] 0.09833575
## 
## $compare
## [1] 0.4917111
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232

D. Shadow statistics

The shadow value of an observation is close to 0 if it lies near its cluster centroid and close to 1 if it sits on the boundary between two clusters; the values below are cluster averages.

C_clusters<-cclust(countData_interest, 3, dist="euclidean") 
shadow(C_clusters)
##         1         2         3 
## 0.2762533 0.5421405 0.4417922
plot(shadow(C_clusters))

7. Find the best clustering technique and optimal number of clusters

Now that we know the data is clusterable, we need to find the best clustering technique. The code below checks the results of 5 different methods for different numbers of clusters; the clValid() function reports validation measures for the clustering results.

A. Finding the best method

All 5 methods tested below have very similar silhouette, connectivity and Dunn values for the optimum number of clusters, 3. For the subsequent stages I chose to continue with the “clara” clustering method. Other methods can also be used with little change to the R code below.

methods <- clValid(countData_interest, 2:6, clMethods = c("hierarchical","kmeans", "diana", "pam", "clara"), validation = "internal")

summary(methods)
## 
## Clustering Methods:
##  hierarchical kmeans diana pam clara 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                  2       3       4       5       6
##                                                                   
## hierarchical Connectivity   2.9290  4.0790 10.0937 16.8563 19.9282
##              Dunn           0.3838  0.1617  0.1646  0.2260  0.2260
##              Silhouette     0.5152  0.4521  0.5295  0.5068  0.4620
## kmeans       Connectivity   4.6024  4.5508  7.4798 15.4440 27.1778
##              Dunn           0.1167  0.1648  0.2792  0.2260  0.0878
##              Silhouette     0.4841  0.5723  0.5614  0.5091  0.4133
## diana        Connectivity   4.6024  7.5313  8.6813 15.4440 19.0952
##              Dunn           0.1167  0.1246  0.1732  0.2260  0.2335
##              Silhouette     0.4841  0.4473  0.5428  0.5091  0.4692
## pam          Connectivity   1.1500  4.5508 14.4429 28.7024 31.2905
##              Dunn           0.1149  0.1648  0.0921  0.0558  0.0788
##              Silhouette     0.5003  0.5723  0.5080  0.3893  0.4003
## clara        Connectivity   1.1500  4.5508 14.4429 28.7024 31.1524
##              Dunn           0.1149  0.1648  0.0921  0.0558  0.0853
##              Silhouette     0.5003  0.5723  0.5080  0.3893  0.3934
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 1.1500 pam          2       
## Dunn         0.3838 hierarchical 2       
## Silhouette   0.5723 kmeans       3
# plot the results
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(methods, legend = FALSE)
legend("right", clusterMethods(methods), col = 1:9, lty = 1:9,pch = paste(1:9))
par(mfrow = c(1, 1))

B. Finding the optimum number of clusters

All methods below indicate that the optimum number of clusters is 3.

1. Gap statistic (bootstrapping)

# bootstrapped gap statistic using clara
fviz_nbclust(countData_interest, cluster::clara, method = "gap_stat")

2. Elbow method
# total within-cluster sum of squares for a given k
wss <- function(k) kmeans(countData_interest, k, nstart = 10)$tot.withinss
k.values <- 1:15
# compute wss for each number of clusters
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values, type="b", pch=19, frame=FALSE,
     main = "Elbow plot for k-means output",
     xlab = "No of clusters",
     ylab = "Total within-clusters sum of squares")

3. Silhouette method

Using the silhouette criterion:

optimal_n<-Optimal_Clusters_KMeans(countData_interest, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

8. Result visualization

With the optimum number of clusters determined, we can apply the chosen method and see the results both graphically and statistically.

country_cluster<-clara(countData_interest, 3, metric="euclidean", stand=FALSE, samples=6,
                   sampsize=50, trace=0, medoids.x=TRUE,
                   rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
# plot clara output
fviz_cluster(country_cluster, geom="point", ellipse.type="norm") 

fviz_cluster(country_cluster, palette=c("#00AFBB", "#FC4E07", "#E7B800"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic())

fviz_silhouette(country_cluster)
##   cluster size ave.sil.width
## 1       1   13          0.55
## 2       2   26          0.53
## 3       3   11          0.73

List of countries in each class:

# List of countries in each class. TW: third world, SW: second world, FW: first world.
# Note: clara's cluster numbers are arbitrary; the mapping 1 = TW, 2 = SW, 3 = FW was read off this run.
TW <- names(country_cluster$clustering)[country_cluster$clustering == 1]
SW <- names(country_cluster$clustering)[country_cluster$clustering == 2]
FW <- names(country_cluster$clustering)[country_cluster$clustering == 3]
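Since the mapping of cluster numbers to world classes is run-dependent, it is worth confirming it by inspecting the medoids that clara stores:

# medoids on the standardized scale: high infant mortality and birthrate with low
# GDP per capita marks the third-world cluster, and so on
country_cluster$medoids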

9. Compare to the real world classification

I created the list of countries based on the human development ranking at http://hdr.undp.org/en/content/latest-human-development-index-ranking.

The data has 186 countries ranked from top to bottom. I divided them into three categories (first world, second world and third world). This division is entirely my own judgment: I consider roughly the top 40 countries first world, the next 60 second world, and the rest third world.

# load the countries rating data; fileEncoding = "UTF-8-BOM" keeps the byte-order
# mark from mangling the first column name into "ï..Country"
ranking <- read.csv2('countriesrated.csv', header = TRUE, sep = ",", fileEncoding = "UTF-8-BOM")
head(ranking)
##                  Country Rating
## 1                 Norway       1
## 2                Ireland       2
## 3            Switzerland       3
## 4 Hong Kong, China (SAR)       4
## 5                Iceland       5
## 6                Germany       6

Classify the countries into three ‘true’ clusters. Based on this division, Hungary is the last country in the first-world group, while the likes of Croatia fall into the second world. Note that this list covers all 186 countries, whereas the test data has only 53.

# classify the countries into the three 'true' classes based on their rating
ranking$class <- ifelse(ranking$Rating < 40, 'First world', ifelse(ranking$Rating < 100, 'Second world', 'Third world'))
head(ranking)
##                  Country Rating       class
## 1                 Norway       1 First world
## 2                Ireland       2 First world
## 3            Switzerland       3 First world
## 4 Hong Kong, China (SAR)       4 First world
## 5                Iceland       5 First world
## 6                Germany       6 First world
# extract the three groups based on the class assigned
True_TW <- ranking$Country[ranking$class == 'Third world']
True_SW <- ranking$Country[ranking$class == 'Second world'] 
True_FW <- ranking$Country[ranking$class == 'First world']

Confusion Matrix

A confusion matrix is a table often used to describe the performance of a classification model (a “classifier”) on a set of test data for which the true values are known. We do not need true negatives (TN) here, since a TN for one group is counted as a TP for another, which is already recorded.

# confusionMatrix
TruePositive <- rep(0,3)  # TruePositive:  correctly put into a cluster
FalsePositive <- rep(0,3) # FalsePositive: wrongly put into a cluster
FalseNegative <- rep(0,3) # FalseNegative: wrongly assigned to another cluster
names <- c("First world","Second world","Third World")
confusionMatrix <- cbind(TruePositive,FalsePositive,FalseNegative)
rownames(confusionMatrix)<- names
confusionMatrix <- as.data.frame(confusionMatrix)

Fill the confusion matrix

# TruePositive 
confusionMatrix[1,1] <- length(unique(intersect(FW,True_FW)))
confusionMatrix[2,1] <- length(unique(intersect(SW,True_SW)))
confusionMatrix[3,1] <- length(unique(intersect(TW,True_TW)))

# FalsePositive
confusionMatrix[1,2] <- length(unique(append(intersect(FW,True_SW),intersect(FW,True_TW))))
confusionMatrix[2,2] <- length(unique(append(intersect(SW,True_FW),intersect(SW,True_TW))))
confusionMatrix[3,2] <- length(unique(append(intersect(TW,True_FW),intersect(TW,True_SW))))

# FalseNegative
confusionMatrix[1,3] <- length(unique(append(intersect(TW,True_FW),intersect(SW,True_FW))))
confusionMatrix[2,3] <- length(unique(append(intersect(TW,True_SW),intersect(FW,True_SW))))
confusionMatrix[3,3] <- length(unique(append(intersect(FW,True_TW),intersect(SW,True_TW))))

confusionMatrix
##              TruePositive FalsePositive FalseNegative
## First world             9             0             1
## Second world           13             7             0
## Third World            13             0             6
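From these counts one can also derive per-class precision and recall; a short sketch using the confusionMatrix data frame built above (my addition):

# precision = TP / (TP + FP); recall = TP / (TP + FN), computed per class
with(confusionMatrix, data.frame(
  row.names = rownames(confusionMatrix),
  precision = TruePositive / (TruePositive + FalsePositive),
  recall    = TruePositive / (TruePositive + FalseNegative)
))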

Accuracy calculations

We can calculate the efficiency of this clustering method by comparing it to the ideal classification results.

TestCountriesList = row.names(countriesData)
Ideal_FW <- unique(intersect(TestCountriesList,True_FW))
Ideal_SW <- unique(intersect(TestCountriesList,True_SW))
Ideal_TW <- unique(intersect(TestCountriesList,True_TW))

Average accuracy: the mean of the per-cluster efficiencies gives a good estimate of the overall efficiency of the method applied.

# efficiency: the ratio of true positives to the number of unique countries in a cluster
efficiency <- function(ideal, result) {
  k <- sum(result %in% ideal)   # countries correctly placed in this cluster
  k / length(unique(result))
}
# average efficiency is the average of efficiency of each cluster. 
AverageEfficiency <- (efficiency(Ideal_FW,FW) + efficiency(Ideal_SW,SW) + efficiency(Ideal_TW,TW))/3
AverageEfficiency
## [1] 0.72211
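As an alternative aggregate (my addition), micro-averaged accuracy counts all correctly clustered countries over the whole test set:

# total true positives over all 53 test countries: (9 + 13 + 13) / 53 ≈ 0.66
sum(confusionMatrix$TruePositive) / length(TestCountriesList)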

Conclusion

I am happy with the results I have got. With a better reference rating of countries, I believe the result could be improved further.