HELP International is an international humanitarian NGO committed to fighting poverty and to providing the people of the least developed countries with basic amenities and relief during disasters and natural calamities. It runs numerous operational projects, along with advocacy drives to raise awareness as well as funding.
Through its recent funding programmes, the NGO has raised around $10 million, and its CEO now needs to decide how to use this money strategically and effectively. The most significant issue in making this decision is choosing the countries in the direst need of aid. Our job, therefore, is to categorise the countries using socio-economic and health factors that determine their overall development, and then to suggest the countries the CEO should focus on most when investing.
We will use the Humanitarian Aid to Underdeveloped Countries dataset, which records socio-economic and health factors of 167 countries (GDP per capita, child mortality, life expectancy, etc.), and group the countries using various clustering techniques.
We are importing the country data and performing a preliminary analysis.
library(rmarkdown)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(tidyr)
library(knitr)
library(dplyr)
library(factoextra)
library(fpc)
library(cluster)
library(tidyverse)
library(clValid)
set.seed(12345)
country_data <- read.csv("~/Country-data.csv")
dim(country_data)
## [1] 167 10
str(country_data)
## 'data.frame': 167 obs. of 10 variables:
## $ country : Factor w/ 167 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
There are 167 records and 10 variables, including the country name.
All variables are numeric except the country name, which is a factor.
Here is the full data.
paged_table(country_data)
We are checking if there are any missing values in any of the variables.
colSums(is.na(country_data))
## country child_mort exports health imports income inflation
## 0 0 0 0 0 0 0
## life_expec total_fer gdpp
## 0 0 0
print(paste(sum(complete.cases(country_data)),"Complete cases!"))
## [1] "167 Complete cases!"
any(is.null(country_data))
## [1] FALSE
We see that there are no missing values in any of the variables and there are 167 complete cases.
We are checking if there are any negative values in any of the variables.
# Fraction-based check over all numeric columns: a non-zero result means negatives exist
sum(rowMeans(country_data[,-1] < 0), na.rm = TRUE)
## [1] 0.8888889
# The same check excluding inflation: a zero result means inflation is the only variable with negatives
sum(rowMeans(country_data[,!names(country_data) %in% c("country", "inflation")] < 0), na.rm = TRUE)
## [1] 0
We see that the only negative values occur in inflation; they represent deflation in those countries and are perfectly normal, so no cleaning is required.
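For a more direct view, a per-variable count of negative values makes the same point (a small sketch, relying only on the numeric columns already present):
# Count of negative values per numeric variable; only inflation should be non-zero
colSums(country_data[,-1] < 0)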
We are checking whether there are any duplicate records in the dataset; duplicates would be redundant and would need to be removed.
dim(unique(country_data))[1]
## [1] 167
There are no duplicate records in the dataset.
| Variable | Type | Description |
|---|---|---|
| country | factor | Name of the country |
| child_mort | double | Deaths of children under 5 years of age per 1000 live births |
| exports | double | Exports of goods and services, given as a percentage of total GDP |
| health | double | Total health spending, given as a percentage of total GDP |
| imports | double | Imports of goods and services, given as a percentage of total GDP |
| income | double | Net income per person |
| inflation | double | The measured annual growth rate of total GDP |
| life_expec | double | The average number of years a newborn child would live if current mortality patterns remain the same |
| total_fer | double | The number of children that would be born to each woman if current age-fertility rates remain the same |
| gdpp | integer | GDP per capita, calculated as total GDP divided by total population |
corrplot(cor(country_data[,2:10]), type = "lower", method = "number")
The plot shows that child_mort is strongly positively correlated with total_fer and strongly negatively correlated with life_expec, while income and gdpp are strongly positively correlated with each other; exports and imports also move together. Several variables, notably exports, imports, income, and gdpp, contain extreme values (outliers) that will matter later when we compare linkage methods.
We will convert the three variables imports, exports, and health spending from percentages of GDP to actual per-capita values, since the percentage values alone do not give a clear picture of a country. For example, Austria and Belarus have almost the same exports percentage, but there is a huge gap between their gdpp values, so the percentages alone do not accurately indicate which country is more developed.
Then we will remove the country field, keeping it as the row names of the final data frame, and scale the remaining data.
# Convert percentage-of-GDP figures into per-capita dollar values
country_data['exports'] = country_data['exports'] * country_data['gdpp']/100
country_data['imports'] = country_data['imports'] * country_data['gdpp']/100
country_data['health'] = country_data['health'] * country_data['gdpp']/100
# Move the country names into the row names and standardise all variables
row.names(country_data) <- country_data$country
country_data <- country_data[,-1]
country_data_scaled <- scale(country_data)
K-means is a type of unsupervised learning and, thanks to its simplicity, one of the most widely used clustering algorithms.
The algorithm starts with a first group of randomly selected centroids, which serve as the starting points of the clusters, and then performs iterative calculations to optimise the positions of the centroids: each point is assigned to its nearest centroid, and each centroid is then moved to the mean of the points assigned to it. The 'means' in K-means refers to this averaging of the data to find the centroids. A sketch of one such iteration is shown below.
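To make the two alternating steps concrete, here is a minimal, illustrative sketch of a single iteration (kmeans_step is a hypothetical helper written for exposition only; the actual analysis below uses the built-in kmeans()):
# Illustrative only: one iteration of Lloyd's algorithm for K-means.
# X is a numeric matrix; centroids is a k x ncol(X) matrix of current centers.
kmeans_step <- function(X, centroids) {
  # Assignment step: squared Euclidean distance of every point to every centroid
  d2 <- sapply(seq_len(nrow(centroids)),
               function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
  cluster <- max.col(-d2)  # index of the nearest centroid for each point
  # Update step: move each centroid to the mean of its assigned points
  # (empty clusters are not handled in this sketch)
  new_centroids <- t(sapply(seq_len(nrow(centroids)),
                            function(j) colMeans(X[cluster == j, , drop = FALSE])))
  list(cluster = cluster, centroids = new_centroids)
}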
In the Elbow Curve method, the total within-cluster sum of squares (wss) measures the compactness of the clustering and should be as small as possible. Since wss always decreases as k grows, we look for the 'elbow': the value of k beyond which adding clusters no longer reduces wss substantially.
# wss for k = 1 equals the total sum of squares of the scaled data
wss <- (nrow(country_data_scaled) - 1)*sum(apply(country_data_scaled, 2, var))
# wss for k = 2..10 from k-means (k = 1 is already covered above)
for (i in 2:10) wss[i] <- sum(kmeans(country_data_scaled,
centers = i)$withinss)
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
We observe an elbow at k = 4, suggesting that the optimal number of clusters would be 4.
The average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
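Concretely, for each observation $i$, let $a(i)$ be its average distance to the other points of its own cluster and $b(i)$ its average distance to the points of the nearest other cluster. The silhouette width is then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

and the average of $s(i)$ over all observations is the quantity computed below.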
d = dist(country_data_scaled, method = "euclidean")
result = matrix(nrow = 9, ncol = 2)  # one row per k = 2..10: (k, avg. silhouette width)
for (i in 2:10) {
  cluster_result = kmeans(country_data_scaled, i)
  clusterstat = cluster.stats(d, cluster_result$cluster)
  result[i - 1, 1] = i
  result[i - 1, 2] = clusterstat$avg.silwidth
}
plot(result[,c(1,2)], type = "l", ylab = 'silhouette width', xlab = 'number of clusters')
The results show that 2 clusters maximise the average silhouette width, with 4 clusters coming in as the second-best choice. Since two clusters would give too coarse a grouping for our purpose and the elbow method also points to 4, k = 4 remains a good choice.
The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).
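Formally, writing $W_k$ for the pooled within-cluster sum of squares with $k$ clusters,

$$\mathrm{Gap}(k) = E^*\left[\log(W_k)\right] - \log(W_k),$$

where the expectation is taken over reference data sets drawn from the null distribution. These are the logW, E.logW, and gap columns in the output below.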
gap_stat <- clusGap(country_data_scaled, FUN = kmeans, nstart = 25, K.max = 10)
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = country_data_scaled, FUNcluster = kmeans, K.max = 10, nstart = 25)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 4
## logW E.logW gap SE.sim
## [1,] 4.968025 5.960431 0.9924059 0.01975041
## [2,] 4.720247 5.749259 1.0290118 0.01593695
## [3,] 4.475412 5.673630 1.1982176 0.01619197
## [4,] 4.395930 5.609470 1.2135406 0.01614209
## [5,] 4.345731 5.556311 1.2105798 0.01599800
## [6,] 4.274437 5.509614 1.2351770 0.01577231
## [7,] 4.197304 5.468737 1.2714336 0.01552893
## [8,] 4.135668 5.432355 1.2966867 0.01547706
## [9,] 4.071479 5.399072 1.3275926 0.01570677
## [10,] 4.041642 5.369123 1.3274804 0.01560518
fviz_gap_stat(gap_stat)
The plot indicates 3 as the optimal number of clusters (fviz_gap_stat applies the more conservative firstSEmax rule by default), whereas the firstmax rule in the printed output above selects 4.
Since the elbow (wss) method indicates 4, the silhouette method ranks 4 as its second-best choice, and the gap statistic's firstmax rule also selects 4, we proceed by taking k = 4 in our final clustering model.
k4 <- kmeans(country_data_scaled, centers = 4, nstart = 25)
fviz_cluster(k4, geom = "point", data = country_data_scaled) + ggtitle("k = 4") + theme_bw()
table(k4$cluster)
##
## 1 2 3 4
## 29 48 88 2
The table shows the number of countries in each cluster: clusters 1 to 4 contain 29, 48, 88, and 2 countries respectively.
cluster_means <- aggregate(country_data_scaled,by = list(k4$cluster), FUN = mean)
kable(cluster_means)
| Group.1 | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.8229155 | 0.7839417 | 1.6810380 | 0.7845623 | 1.4864203 | -0.4677212 | 1.1073394 | -0.7491976 | 1.6942324 |
| 2 | 1.3226325 | -0.3639478 | -0.5228749 | -0.3916387 | -0.6871713 | 0.3906375 | -1.2724487 | 1.3502808 | -0.6031494 |
| 3 | -0.4302588 | -0.2255406 | -0.3185481 | -0.2125591 | -0.1913636 | -0.0460506 | 0.2998973 | -0.4662325 | -0.3072514 |
| 4 | -0.8795190 | 7.2913772 | 2.1900618 | 7.3757764 | 3.3590146 | -0.5671176 | 1.2868650 | -1.0291420 | 3.4282752 |
res <- cbind(country_data, ClusterId = unname(k4$cluster))
rownames(res) <- NULL
res <- as.data.frame(res)
cluster_data <- aggregate(res, by = list(res$ClusterId), FUN = mean)
kable(cluster_data)
| Group.1 | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | ClusterId |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.082759 | 21511.0966 | 4084.9700 | 18129.9000 | 45800.000 | 2.837690 | 80.40345 | 1.813793 | 44017.241 | 1 |
| 2 | 91.610417 | 879.0635 | 114.8218 | 827.0288 | 3897.354 | 11.911146 | 59.23958 | 4.992083 | 1909.208 | 2 |
| 3 | 20.918182 | 3366.7784 | 482.8978 | 3461.4355 | 13455.568 | 7.295045 | 73.22273 | 2.242159 | 7332.636 | 3 |
| 4 | 2.800000 | 138475.0000 | 5001.9300 | 115092.0000 | 81900.000 | 1.787000 | 82.00000 | 1.390000 | 75800.000 | 4 |
Using the above table about the clusters, we can draw the following inferences:

- Cluster 2 (48 countries) has by far the highest average child mortality (about 92 per 1000 live births) and total fertility (about 5.0), and the lowest income, life expectancy, and gdpp: these are the least developed countries and the natural candidates for aid.
- Cluster 1 (29 countries) contains highly developed countries, with very low child mortality and high income, health spending, life expectancy, and gdpp.
- Cluster 3 (88 countries) sits in between, with moderate values on all indicators.
- Cluster 4 (2 countries) consists of outliers with extremely high per-capita exports, imports, income, and gdpp relative to the rest.
Hierarchical clustering is an unsupervised learning technique that groups data over a variety of scales by creating a cluster tree or dendrogram. The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one level are joined as clusters at the next level.
fviz_nbclust(country_data_scaled, FUN = hcut, method = "wss")
The optimal number of clusters would be 4.
fviz_nbclust(country_data_scaled, FUN = hcut, method = "silhouette")
The optimal number of clusters would be 4 or 5.
As before, the gap statistic compares the total intracluster variation for different values of k with its expected value under a null reference distribution of the data (i.e. a distribution with no obvious clustering).
gap_stat <- clusGap(country_data_scaled, FUN = hcut, K.max = 10)
fviz_gap_stat(gap_stat)
The optimal number of clusters would be 3.
We will take 4 as the optimal number of clusters, since the elbow curve and silhouette methods both suggest it.
We will perform hierarchical clustering using the single linkage and complete linkage methods and select the one that yields the better result.
Single linkage computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and takes the smallest of these dissimilarities as the linkage criterion. It tends to produce long, "loose" clusters.
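In symbols, the single-linkage distance between clusters $C_1$ and $C_2$ is

$$d_{\text{single}}(C_1, C_2) = \min_{a \in C_1,\, b \in C_2} d(a, b).$$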
# Compute the Euclidean distance matrix once and reuse it for both linkages
euclidean_dist <- dist(country_data_scaled, method = "euclidean")
# Hierarchical clustering using Single Linkage
hc1 <- hclust(euclidean_dist, method = "single")
fviz_dend(hc1, k = 4,
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07", "#2E9FDF"), # one colour per cluster
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07", "#2E9FDF"),
          rect_fill = TRUE)
The single-linkage clusters are not truly satisfying: the method appears to place each outlier in its own cluster.
Complete linkage computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and takes the largest (i.e., maximum) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
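In symbols, the complete-linkage distance between clusters $C_1$ and $C_2$ is

$$d_{\text{complete}}(C_1, C_2) = \max_{a \in C_1,\, b \in C_2} d(a, b).$$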
# Hierarchical clustering using Complete Linkage
hc2 <- hclust(euclidean_dist, method = "complete")
fviz_dend(hc2, k = 4,
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07", "#2E9FDF"), # one colour per cluster
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07", "#2E9FDF"),
          rect_fill = TRUE)
The result of complete linkage looks good.
We will use Dunn's index to check whether complete linkage yields better results than single linkage. Dunn's index is the ratio between the minimum intercluster distance and the maximum intracluster diameter; the larger the Dunn's index, the better the clustering result.
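In symbols,

$$D = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)},$$

where $\delta(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$ and $\operatorname{diam}(C_k)$ is the diameter (the maximum within-cluster distance) of cluster $C_k$.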
memb_single = cutree(hc1, k = 4)
memb_complete = cutree(hc2, k = 4)
dunn_single <- dunn(clusters = memb_single, Data = country_data_scaled)
dunn_complete <- dunn(clusters = memb_complete, Data = country_data_scaled)
dunn_single
## [1] 0.3801017
dunn_complete
## [1] 0.07284464
By this measure the single-linkage solution scores higher, but its score is an artifact: single linkage isolates individual outliers into singleton clusters (here, three clusters contain just one country each), which shrinks the maximum cluster diameter and inflates the index. The lower Dunn's index of the complete-linkage solution tells us that at least one pair of its clusters is less separated and/or its clusters are less compact in absolute terms.
Because clusters consisting of single outlier countries are useless for our purpose, and because complete linkage is generally preferred over single linkage in practice, we choose the complete-linkage solution despite its lower index.
So our final hierarchical clustering model is formed using the complete linkage method and is visualized as below.
# creating final model
country_4clust = cutree(hc2, k = 4)
table(country_4clust)
## country_4clust
## 1 2 3 4
## 147 18 1 1
The table shows that the four clusters contain 147, 18, 1, and 1 countries respectively; note that complete linkage, too, isolates two outlier countries into singleton clusters.
fviz_cluster(list(data = country_data_scaled, cluster = country_4clust), geom = "point") + ggtitle("k = 4") + theme_bw()
cluster_mean <- aggregate(country_data_scaled, by = list(country_4clust), FUN = mean)
kable(cluster_mean)
| Group.1 | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0933820 | -0.2114956 | -0.3009356 | -0.2118442 | -0.2309403 | 0.0123848 | -0.1430328 | 0.0873536 | -0.2933753 |
| 2 | -0.8401207 | 1.2033114 | 2.2675681 | 1.2152139 | 1.7057254 | -0.5849552 | 1.1637993 | -0.7711533 | 2.1491644 |
| 3 | -0.8795190 | 9.8103094 | 3.9423402 | 9.6875457 | 3.8673643 | -0.3937138 | 1.2081529 | -0.8706055 | 5.0214047 |
| 4 | 2.2745443 | -0.3800585 | -0.5210378 | -0.4202985 | -0.6221935 | 9.1023425 | -1.1307201 | 1.9103878 | -0.5801913 |
hc_clusters <- cbind(country_data, ClusterId = unname(country_4clust))
rownames(hc_clusters) <- NULL
hc_clusters <- as.data.frame(hc_clusters)
cluster_data_hc <- aggregate(hc_clusters, by = list(hc_clusters$ClusterId), FUN = mean)
kable(cluster_data_hc)
| Group.1 | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | ClusterId |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 42.036054 | 3619.221 | 514.6252 | 3471.952 | 12692.61 | 7.912748 | 69.28367 | 3.080204 | 7586.966 | 1 |
| 2 | 4.388889 | 29048.800 | 5141.5506 | 24465.133 | 50027.78 | 1.598444 | 80.90556 | 1.780556 | 52355.556 | 2 |
| 3 | 2.800000 | 183750.000 | 8158.5000 | 149100.000 | 91700.00 | 3.620000 | 81.30000 | 1.630000 | 105000.000 | 3 |
| 4 | 130.000000 | 589.490 | 118.1310 | 405.420 | 5150.00 | 104.000000 | 60.50000 | 5.840000 | 2330.000 | 4 |
Using the above table about the clusters, we can draw the following inferences:

- Cluster 1 (147 countries) covers the broad middle of the distribution, with moderate child mortality, income, and gdpp.
- Cluster 2 (18 countries) contains highly developed countries with very low child mortality and high income, health spending, and gdpp.
- Cluster 3 (a single country) is an extreme outlier with enormous per-capita exports, imports, income, and gdpp.
- Cluster 4 (a single country) is by far the worst off: the highest child mortality (130 per 1000 live births), total fertility (5.84), and inflation (104%), together with low income, life expectancy, and gdpp.
K-means clustering is a very simple and fast algorithm. Furthermore, it can efficiently deal with very large data sets. However, there are some weaknesses of the k-means approach.
One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters. Hierarchical clustering is an alternative approach that does not require committing to a particular choice of clusters up front, and it has the added advantage of producing an attractive tree-based representation of the observations, called a dendrogram; this is the approach we applied in the previous section.
An additional disadvantage of K-means is that it is sensitive to outliers, and different results can occur if you change the ordering of your data. The Partitioning Around Medoids (PAM) clustering approach is less sensitive to outliers and provides a robust alternative to K-means in such situations.
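Although we do not pursue PAM further here, a minimal sketch on our scaled data would look as follows (pam() comes from the cluster package loaded earlier; k = 4 is assumed purely for comparability with the models above):
# PAM: like k-means, but each cluster centre is an actual data point (a medoid),
# which makes the result more robust to outliers
pam4 <- pam(country_data_scaled, k = 4)
table(pam4$clustering)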
fviz_cluster(k4, data = country_data_scaled, geom = "point") + ggtitle("K-means clustering") + theme_bw()
fviz_cluster(list(data = country_data_scaled, cluster = country_4clust), geom = "point") + ggtitle("Hierarchical clustering") + theme_bw()
Dunn's Index of K-means and hierarchical clustering:
dunn_km = dunn(clusters = k4$cluster, Data = country_data_scaled)
dunn_km
## [1] 0.03117421
dunn_complete
## [1] 0.07284464
We have analysed both K-means and hierarchical clustering and found that the clusters formed are not identical. Both solutions are reasonable, but the hierarchical one is better, as also indicated by Dunn's index (the higher the Dunn's index, the better the clustering result): 0.073 versus 0.031 for K-means. So we will proceed with the hierarchical clusters, and based on the information provided by the final clusters we will deduce the final list of countries in need of aid.
We saw that Cluster 4 had the highest child mortality and total fertility, so the country in Cluster 4 needs the aid most urgently.
There is only one country in Cluster 4, namely Nigeria, as can be seen from the hierarchical clusters below.
fviz_cluster(list(data = country_data_scaled, cluster = country_4clust)) + ggtitle("Hierarchical clustering") + theme_bw()