HELP International is an international humanitarian NGO committed to fighting poverty and to providing the people of the least developed countries with basic amenities and relief during disasters and natural calamities. It runs numerous operational projects, along with advocacy drives to raise awareness as well as funding.
Through its recent funding programmes, the NGO has raised around $10 million, and its CEO now needs to decide how to use this money strategically and effectively. The most significant issue in making this decision is choosing the countries in the direst need of aid. Our job, therefore, is to categorise the countries using socio-economic and health factors that determine their overall development, and then to suggest the countries the CEO should focus on most when investing.
We will use the Humanitarian Aid to Underdeveloped Countries dataset, which records socio-economic and health factors of 167 countries (GDP per capita, child mortality, life expectancy, etc.), and group the countries using various clustering techniques.
We are importing the country data and performing a preliminary analysis.
library(rmarkdown)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(tidyr)
library(knitr)
library(dplyr)
library(factoextra)
library(fpc)
library(cluster)
library(tidyverse)
library(clValid)
set.seed(12345)
country_data <- read.csv("~/Country-data.csv")
dim(country_data)
## [1] 167 10
str(country_data)
## 'data.frame': 167 obs. of 10 variables:
## $ country : Factor w/ 167 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
There are 167 records and 10 variables, including the country name.
All variables are numeric except the country name, which is a factor.
Here is the full data.
paged_table(country_data)
We are checking if there are any missing values in any of the variables.
colSums(is.na(country_data))
## country child_mort exports health imports income inflation
## 0 0 0 0 0 0 0
## life_expec total_fer gdpp
## 0 0 0
print(paste(sum(complete.cases(country_data)),"Complete cases!"))
## [1] "167 Complete cases!"
any(is.null(country_data))
## [1] FALSE
We see that there are no missing values in any of the variables and there are 167 complete cases.
We are checking if there are any negative values in any of the variables.
# Fraction-based check over all numeric columns: a non-zero result means negatives exist
sum(rowMeans(country_data[,-1] < 0), na.rm = TRUE)
## [1] 0.8888889
# The same check excluding inflation: a zero result means inflation is the only variable with negatives
sum(rowMeans(country_data[,!names(country_data) %in% c("country", "inflation")] < 0), na.rm = TRUE)
## [1] 0
We see that the only negative values occur in inflation; they represent deflation in those countries and are perfectly normal, so no cleaning is required.
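For a more direct view, a per-variable count of negative values makes the same point (a small sketch, relying only on the numeric columns already present):
# Count of negative values per numeric variable; only inflation should be non-zero
colSums(country_data[,-1] < 0)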
We are checking whether there are any duplicate records in the dataset; duplicates would be redundant and would need to be removed.
dim(unique(country_data))[1]
## [1] 167
There are no duplicate records in the dataset.
| Variable | Type | Description |
|---|---|---|
| country | factor | Name of the country |
| child_mort | double | Deaths of children under 5 years of age per 1000 live births |
| exports | double | Exports of goods and services, given as a percentage of total GDP |
| health | double | Total health spending, given as a percentage of total GDP |
| imports | double | Imports of goods and services, given as a percentage of total GDP |
| income | double | Net income per person |
| inflation | double | The measured annual growth rate of total GDP |
| life_expec | double | The average number of years a newborn child would live if current mortality patterns remain the same |
| total_fer | double | The number of children that would be born to each woman if current age-fertility rates remain the same |
| gdpp | integer | GDP per capita, calculated as total GDP divided by total population |
corrplot(cor(country_data[,2:10]), type = "lower", method = "number")
The plot shows that child_mort is strongly positively correlated with total_fer and strongly negatively correlated with life_expec, while income and gdpp are strongly positively correlated with each other; exports and imports also move together. Several variables, notably exports, imports, income, and gdpp, contain extreme values (outliers) that will matter later when we compare linkage methods.
We will convert the three variables imports, exports, and health spending from percentages of GDP to actual per-capita values, since the percentage values alone do not give a clear picture of a country. For example, Austria and Belarus have almost the same exports percentage, but there is a huge gap between their gdpp values, so the percentages alone do not accurately indicate which country is more developed.
Then we will remove the country field, keeping it as the row names of the final data frame, and scale the remaining data.
# Convert percentage-of-GDP figures into per-capita dollar values
country_data['exports'] = country_data['exports'] * country_data['gdpp']/100
country_data['imports'] = country_data['imports'] * country_data['gdpp']/100
country_data['health'] = country_data['health'] * country_data['gdpp']/100
# Move the country names into the row names and standardise all variables
row.names(country_data) <- country_data$country
country_data <- country_data[,-1]
country_data_scaled <- scale(country_data)
K-means is a type of unsupervised learning and, thanks to its simplicity, one of the most widely used clustering algorithms.
The algorithm starts with a first group of randomly selected centroids, which serve as the starting points of the clusters, and then performs iterative calculations to optimise the positions of the centroids: each point is assigned to its nearest centroid, and each centroid is then moved to the mean of the points assigned to it. The 'means' in K-means refers to this averaging of the data to find the centroids. A sketch of one such iteration is shown below.
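To make the two alternating steps concrete, here is a minimal, illustrative sketch of a single iteration (kmeans_step is a hypothetical helper written for exposition only; the actual analysis below uses the built-in kmeans()):
# Illustrative only: one iteration of Lloyd's algorithm for K-means.
# X is a numeric matrix; centroids is a k x ncol(X) matrix of current centers.
kmeans_step <- function(X, centroids) {
  # Assignment step: squared Euclidean distance of every point to every centroid
  d2 <- sapply(seq_len(nrow(centroids)),
               function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
  cluster <- max.col(-d2)  # index of the nearest centroid for each point
  # Update step: move each centroid to the mean of its assigned points
  # (empty clusters are not handled in this sketch)
  new_centroids <- t(sapply(seq_len(nrow(centroids)),
                            function(j) colMeans(X[cluster == j, , drop = FALSE])))
  list(cluster = cluster, centroids = new_centroids)
}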
In the Elbow Curve method, the total within-cluster sum of squares (wss) measures the compactness of the clustering and should be as small as possible. Since wss always decreases as k grows, we look for the 'elbow': the value of k beyond which adding clusters no longer reduces wss substantially.
# wss for k = 1 equals the total sum of squares of the scaled data
wss <- (nrow(country_data_scaled) - 1)*sum(apply(country_data_scaled, 2, var))
# wss for k = 2..10 from k-means (k = 1 is already covered above)
for (i in 2:10) wss[i] <- sum(kmeans(country_data_scaled,
centers = i)$withinss)
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
We observe an elbow at k = 4, suggesting that the optimal number of clusters would be 4.
The average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
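Concretely, for each observation $i$, let $a(i)$ be its average distance to the other points of its own cluster and $b(i)$ its average distance to the points of the nearest other cluster. The silhouette width is then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

and the average of $s(i)$ over all observations is the quantity computed below.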
d = dist(country_data_scaled, method = "euclidean")
result = matrix(nrow = 9, ncol = 2)  # one row per k = 2..10: (k, avg. silhouette width)
for (i in 2:10) {
  cluster_result = kmeans(country_data_scaled, i)
  clusterstat = cluster.stats(d, cluster_result$cluster)
  result[i - 1, 1] = i
  result[i - 1, 2] = clusterstat$avg.silwidth
}
plot(result[,c(1,2)], type = "l", ylab = 'silhouette width', xlab = 'number of clusters')
The results show that 2 clusters maximise the average silhouette width, with 4 clusters coming in as the second-best choice. Since two clusters would give too coarse a grouping for our purpose and the elbow method also points to 4, k = 4 remains a good choice.
The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).
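Formally, writing $W_k$ for the pooled within-cluster sum of squares with $k$ clusters,

$$\mathrm{Gap}(k) = E^*\left[\log(W_k)\right] - \log(W_k),$$

where the expectation is taken over reference data sets drawn from the null distribution. These are the logW, E.logW, and gap columns in the output below.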
gap_stat <- clusGap(country_data_scaled, FUN = kmeans, nstart = 25, K.max = 10)
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = country_data_scaled, FUNcluster = kmeans, K.max = 10, nstart = 25)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 4
## logW E.logW gap SE.sim
## [1,] 4.968025 5.960431 0.9924059 0.01975041
## [2,] 4.720247 5.749259 1.0290118 0.01593695
## [3,] 4.475412 5.673630 1.1982176 0.01619197
## [4,] 4.395930 5.609470 1.2135406 0.01614209
## [5,] 4.345731 5.556311 1.2105798 0.01599800
## [6,] 4.274437 5.509614 1.2351770 0.01577231
## [7,] 4.197304 5.468737 1.2714336 0.01552893
## [8,] 4.135668 5.432355 1.2966867 0.01547706
## [9,] 4.071479 5.399072 1.3275926 0.01570677
## [10,] 4.041642 5.369123 1.3274804 0.01560518
fviz_gap_stat(gap_stat)
The plot indicates 3 as the optimal number of clusters (fviz_gap_stat applies the more conservative firstSEmax rule by default), whereas the firstmax rule in the printed output above selects 4.
Since the elbow (wss) method indicates 4, the silhouette method ranks 4 as its second-best choice, and the gap statistic's firstmax rule also selects 4, we proceed by taking k = 4 in our final clustering model.
k4 <- kmeans(country_data_scaled, centers = 4, nstart = 25)
fviz_cluster(k4, geom = "point", data = country_data_scaled) + ggtitle("k = 4") + theme_bw()
table(k4$cluster)
##
## 1 2 3 4
## 29 48 88 2
The table shows the number of countries in each cluster: clusters 1 to 4 contain 29, 48, 88, and 2 countries respectively.
cluster_means <- aggregate(country_data_scaled,by = list(k4$cluster), FUN = mean)
kable(cluster_means)
| Group.1 | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.8229155 | 0.7839417 | 1.6810380 | 0.7845623 | 1.4864203 | -0.4677212 | 1.1073394 | -0.7491976 | 1.6942324 |
| 2 | 1.3226325 | -0.3639478 | -0.5228749 | -0.3916387 | -0.6871713 | 0.3906375 | -1.2724487 | 1.3502808 | -0.6031494 |
| 3 | -0.4302588 | -0.2255406 | -0.3185481 | -0.2125591 | -0.1913636 | -0.0460506 | 0.2998973 | -0.4662325 | -0.3072514 |
| 4 | -0.8795190 | 7.2913772 | 2.1900618 | 7.3757764 | 3.3590146 | -0.5671176 | 1.2868650 | -1.0291420 | 3.4282752 |
res <- cbind(country_data, ClusterId = unname(k4$cluster))
rownames(res) <- NULL
res <- as.data.frame(res)
cluster_data <- aggregate(res, by = list(res$ClusterId), FUN = mean)
kable(cluster_data)
| Group.1 | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | ClusterId |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.082759 | 21511.0966 | 4084.9700 | 18129.9000 | 45800.000 | 2.837690 | 80.40345 | 1.813793 | 44017.241 | 1 |
| 2 | 91.610417 | 879.0635 | 114.8218 | 827.0288 | 3897.354 | 11.911146 | 59.23958 | 4.992083 | 1909.208 | 2 |
| 3 | 20.918182 | 3366.7784 | 482.8978 | 3461.4355 | 13455.568 | 7.295045 | 73.22273 | 2.242159 | 7332.636 | 3 |
| 4 | 2.800000 | 138475.0000 | 5001.9300 | 115092.0000 | 81900.000 | 1.787000 | 82.00000 | 1.390000 | 75800.000 | 4 |
Using the above table about the clusters, we can draw the following inferences:

- Cluster 2 (48 countries) has by far the highest average child mortality (about 92 per 1000 live births) and total fertility (about 5.0), and the lowest income, life expectancy, and gdpp: these are the least developed countries and the natural candidates for aid.
- Cluster 1 (29 countries) contains highly developed countries, with very low child mortality and high income, health spending, life expectancy, and gdpp.
- Cluster 3 (88 countries) sits in between, with moderate values on all indicators.
- Cluster 4 (2 countries) consists of outliers with extremely high per-capita exports, imports, income, and gdpp relative to the rest.
Hierarchical clustering is an unsupervised learning technique that groups data over a variety of scales by creating a cluster tree or dendrogram. The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one level are joined as clusters at the next level.
fviz_nbclust(country_data_scaled, FUN = hcut, method = "wss")
The optimal number of clusters would be 4.
fviz_nbclust(country_data_scaled, FUN = hcut, method = "silhouette")
The optimal number of clusters would be 4 or 5.
As before, the gap statistic compares the total intracluster variation for different values of k with its expected value under a null reference distribution of the data (i.e. a distribution with no obvious clustering).
gap_stat <- clusGap(country_data_scaled, FUN = hcut, K.max = 10)
fviz_gap_stat(gap_stat)
The optimal number of clusters would be 3.
We will take 4 as the optimal number of clusters, since the elbow curve and silhouette methods both suggest it.
We will perform hierarchical clustering using the single linkage and complete linkage methods and select the one that yields the better result.
Single linkage computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and takes the smallest of these dissimilarities as the linkage criterion. It tends to produce long, "loose" clusters.
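In symbols, the single-linkage distance between clusters $C_1$ and $C_2$ is

$$d_{\text{single}}(C_1, C_2) = \min_{a \in C_1,\, b \in C_2} d(a, b).$$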
# Compute the Euclidean distance matrix once and reuse it for both linkages
euclidean_dist <- dist(country_data_scaled, method = "euclidean")
# Hierarchical clustering using Single Linkage
hc1 <- hclust(euclidean_dist, method = "single")
fviz_dend(hc1, k = 4,
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07", "#2E9FDF"), # one colour per cluster
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07", "#2E9FDF"),
          rect_fill = TRUE)
The single-linkage clusters are not truly satisfying: the method appears to place each outlier in its own cluster.
Complete linkage computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and takes the largest (i.e., maximum) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
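In symbols, the complete-linkage distance between clusters $C_1$ and $C_2$ is

$$d_{\text{complete}}(C_1, C_2) = \max_{a \in C_1,\, b \in C_2} d(a, b).$$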
# Hierarchical clustering using Complete Linkage
hc2 <- hclust(euclidean_dist, method = "complete")
fviz_dend(hc2, k = 4,
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07", "#2E9FDF"), # one colour per cluster
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07", "#2E9FDF"),
          rect_fill = TRUE)
The result of complete linkage looks good.
We will use Dunn's index to check whether complete linkage yields better results than single linkage. Dunn's index is the ratio between the minimum intercluster distance and the maximum intracluster diameter; the larger the Dunn's index, the better the clustering result.
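In symbols,

$$D = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)},$$

where $\delta(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$ and $\operatorname{diam}(C_k)$ is the diameter (the maximum within-cluster distance) of cluster $C_k$.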
memb_single = cutree(hc1, k = 4)
memb_complete = cutree(hc2, k = 4)
dunn_single <- dunn(clusters = memb_single, Data = country_data_scaled)
dunn_complete <- dunn(clusters = memb_complete, Data = country_data_scaled)
dunn_single
## [1] 0.3801017
dunn_complete
## [1] 0.07284464
By this measure the single-linkage solution scores higher, but its score is an artifact: single linkage isolates individual outliers into singleton clusters (here, three clusters contain just one country each), which shrinks the maximum cluster diameter and inflates the index. The lower Dunn's index of the complete-linkage solution tells us that at least one pair of its clusters is less separated and/or its clusters are less compact in absolute terms.
Because clusters consisting of single outlier countries are useless for our purpose, and because complete linkage is generally preferred over single linkage in practice, we choose the complete-linkage solution despite its lower index.
So our final hierarchical clustering model is formed using the complete linkage method and is visualized as below.
# creating final model
country_4clust = cutree(hc2, k = 4)
table(country_4clust)
## country_4clust
## 1 2 3 4
## 147 18 1 1
The table shows that the four clusters contain 147, 18, 1, and 1 countries respectively; note that complete linkage, too, isolates two outlier countries into singleton clusters.
fviz_cluster(list(data = country_data_scaled, cluster = country_4clust), geom = "point") + ggtitle("k = 4") + theme_bw()
cluster_mean <- aggregate(country_data_scaled, by = list(country_4clust), FUN = mean)
kable(cluster_mean)
| Group.1 | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0933820 | -0.2114956 | -0.3009356 | -0.2118442 | -0.2309403 | 0.0123848 | -0.1430328 | 0.0873536 | -0.2933753 |
| 2 | -0.8401207 | 1.2033114 | 2.2675681 | 1.2152139 | 1.7057254 | -0.5849552 | 1.1637993 | -0.7711533 | 2.1491644 |
| 3 | -0.8795190 | 9.8103094 | 3.9423402 | 9.6875457 | 3.8673643 | -0.3937138 | 1.2081529 | -0.8706055 | 5.0214047 |
| 4 | 2.2745443 | -0.3800585 | -0.5210378 | -0.4202985 | -0.6221935 | 9.1023425 | -1.1307201 | 1.9103878 | -0.5801913 |
hc_clusters <- cbind(country_data, ClusterId = unname(country_4clust))
rownames(hc_clusters) <- NULL
hc_clusters <- as.data.frame(hc_clusters)
cluster_data_hc <- aggregate(hc_clusters, by = list(hc_clusters$ClusterId), FUN = mean)
kable(cluster_data_hc)
| Group.1 | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | ClusterId |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 42.036054 | 3619.221 | 514.6252 | 3471.952 | 12692.61 | 7.912748 | 69.28367 | 3.080204 | 7586.966 | 1 |
| 2 | 4.388889 | 29048.800 | 5141.5506 | 24465.133 | 50027.78 | 1.598444 | 80.90556 | 1.780556 | 52355.556 | 2 |
| 3 | 2.800000 | 183750.000 | 8158.5000 | 149100.000 | 91700.00 | 3.620000 | 81.30000 | 1.630000 | 105000.000 | 3 |
| 4 | 130.000000 | 589.490 | 118.1310 | 405.420 | 5150.00 | 104.000000 | 60.50000 | 5.840000 | 2330.000 | 4 |
Using the above table about the clusters, we can draw the following inferences:

- Cluster 1 (147 countries) covers the broad middle of the distribution, with moderate child mortality, income, and gdpp.
- Cluster 2 (18 countries) contains highly developed countries with very low child mortality and high income, health spending, and gdpp.
- Cluster 3 (a single country) is an extreme outlier with enormous per-capita exports, imports, income, and gdpp.
- Cluster 4 (a single country) is by far the worst off: the highest child mortality (130 per 1000 live births), total fertility (5.84), and inflation (104%), together with low income, life expectancy, and gdpp.
K-means clustering is a very simple and fast algorithm. Furthermore, it can efficiently deal with very large data sets. However, there are some weaknesses of the k-means approach.
One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters. Hierarchical clustering is an alternative approach that does not require committing to a particular choice of clusters up front, and it has the added advantage of producing an attractive tree-based representation of the observations, called a dendrogram; this is the approach we applied in the previous section.
An additional disadvantage of K-means is that it is sensitive to outliers, and different results can occur if you change the ordering of your data. The Partitioning Around Medoids (PAM) clustering approach is less sensitive to outliers and provides a robust alternative to K-means in such situations.
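Although we do not pursue PAM further here, a minimal sketch on our scaled data would look as follows (pam() comes from the cluster package loaded earlier; k = 4 is assumed purely for comparability with the models above):
# PAM: like k-means, but each cluster centre is an actual data point (a medoid),
# which makes the result more robust to outliers
pam4 <- pam(country_data_scaled, k = 4)
table(pam4$clustering)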
fviz_cluster(k4, data = country_data_scaled, geom = "point") + ggtitle("K-means clustering") + theme_bw()
fviz_cluster(list(data = country_data_scaled, cluster = country_4clust), geom = "point") + ggtitle("Hierarchical clustering") + theme_bw()
Dunn's Index of K-means and hierarchical clustering:
dunn_km = dunn(clusters = k4$cluster, Data = country_data_scaled)
dunn_km
## [1] 0.03117421
dunn_complete
## [1] 0.07284464
We have analysed both K-means and hierarchical clustering and found that the clusters formed are not identical. Both solutions are reasonable, but the hierarchical one is better, as also indicated by Dunn's index (the higher the Dunn's index, the better the clustering result): 0.073 versus 0.031 for K-means. So we will proceed with the hierarchical clusters, and based on the information provided by the final clusters we will deduce the final list of countries in need of aid.
We saw that Cluster 4 had the highest child mortality and total fertility, so the country in Cluster 4 needs the aid most urgently.
There is only one country in Cluster 4, namely Nigeria, as can be seen from the hierarchical clusters below.
fviz_cluster(list(data = country_data_scaled, cluster = country_4clust)) + ggtitle("Hierarchical clustering") + theme_bw()