1 Introduction

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. HELP International have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. So, CEO has to make decision to choose the countries that are in the direst need of aid.

The aim of this paper is to categorise the countries using some socio-economic and health factors that determine the overall development of the country, to finally suggest the countries which the CEO needs to focus on the most.

Review of the Dataset

The dataset used in this project was found in a public Kaggle profile, and it describes 9 socio-economic and health factors from a list of 167 countries:

country: Name of the country
child_mort: Death of children under 5 years of age per 1000 live births
exports: Exports of goods and services per capita. Given as %age of the GDP per capita
health: Total health spending per capita. Given as %age of GDP per capita
imports: Imports of goods and services per capita. Given as %age of the GDP per capita
Income: Net income per person
Inflation: The measurement of the annual growth rate of the Total GDP
life_expec: The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same
gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.

Preparing the data

#Libraries needed in the whole paper
library(knitr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(factoextra)
library(tidyverse)
library(cluster)    
library(dendextend) 
library(highcharter)
library(maps)

Loading the data, showing the header of the data and checking if there is any missing value in the dataset:

CD <- read.csv('/Users/albertodelgadolopez/Desktop/Final clustering/Country-data.csv', header=TRUE)
kable(head(CD))

country	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Afghanistan	90.2	10.0	7.58	44.9	1610	9.44	56.2	5.82	553
Albania	16.6	28.0	6.55	48.6	9930	4.49	76.3	1.65	4090
Algeria	27.3	38.4	4.17	31.4	12900	16.10	76.5	2.89	4460
Angola	119.0	62.3	2.85	42.9	5900	22.40	60.1	6.16	3530
Antigua and Barbuda	10.3	45.5	6.03	58.9	19100	1.44	76.8	2.13	12200
Argentina	14.5	18.9	8.10	16.0	18700	20.90	75.8	2.37	10300

any(is.na(CD))

## [1] FALSE

There are no missing values so we do not have to modify anything related with that matter in our dataset. It is also possible to appreciate that some features tend to have larger values than others, so to avoid a problem in which some features come to dominate solely, let’s scale our data.

CDscaled <- as.data.frame(lapply(CD[,2:10], scale))
kable(summary(CDscaled))

child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Min. :-0.8845	Min. :-1.4957	Min. :-1.8223	Min. :-1.9341	Min. :-0.8577	Min. :-1.1344	Min. :-4.3242	Min. :-1.1877	Min. :-0.69471
1st Qu.:-0.7444	1st Qu.:-0.6314	1st Qu.:-0.6901	1st Qu.:-0.6894	1st Qu.:-0.7153	1st Qu.:-0.5649	1st Qu.:-0.5910	1st Qu.:-0.7616	1st Qu.:-0.63475
Median :-0.4704	Median :-0.2229	Median :-0.1805	Median :-0.1483	Median :-0.3727	Median :-0.2263	Median : 0.2861	Median :-0.3554	Median :-0.45307
Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.00000
3rd Qu.: 0.5909	3rd Qu.: 0.3736	3rd Qu.: 0.6496	3rd Qu.: 0.4899	3rd Qu.: 0.2934	3rd Qu.: 0.2808	3rd Qu.: 0.7021	3rd Qu.: 0.6157	3rd Qu.: 0.05924
Max. : 4.2086	Max. : 5.7964	Max. : 4.0353	Max. : 5.2504	Max. : 5.5947	Max. : 9.1023	Max. : 1.3768	Max. : 3.0003	Max. : 5.02140

Let’s have a look now at the relationship between the variables, below there is a fragment of the correlation matrix.

corrplot(cor(CDscaled, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="CDscaled", mar=c(0,0,1,0))

From the indexes of correlation we can obtain a first impression of the relationship between the variables and the possible relationship between the final clusters. As we can see, the most correlated pair of data are the income/gdpp followed by the child_mort/total_fer and the exports/imports. These results are, in deed, pretty logical and would be very good indicators for dividing the countries in clusters accordingly with their overall development.

Nevertheless, our intention is to take into cosideration all the variables included in our data set together, and classify our data accordingly, that is why the next step is so important, the dimension reduction.

2. Dimension reduction

Since our dataset is multidimensional with nine variables, it is highly important to do the next Principal Component Analysis. This is a dimension reduction technique to create linear combinations of the original variables that explain as much of the variance in our data as possible. So, this allow us going from 10, 15, 30 or even more variables down to just two or three which explain most of the variation in the data and make it easier to look for patterns.

We will build a PCA model that will hopefully reduce our 9 continuous variables into just two PCAs that capture most of the variability in our data. And perhaps once we have constructed our principal components, these will allow us to perform a proper cluster analysis.

Principal Component Analysis

myPr <- prcomp(CDscaled)
myPr

## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
## 
## Rotation (n x k) = (9 x 9):
##                   PC1          PC2         PC3          PC4         PC5
## child_mort -0.4195194 -0.192883937  0.02954353 -0.370653262  0.16896968
## exports     0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health      0.1508378  0.243086779  0.59663237 -0.461897497 -0.51800037
## imports     0.1614824 -0.671820644  0.29992674  0.071907461 -0.25537642
## income      0.3984411 -0.022535530 -0.30154750 -0.392159039  0.24714960
## inflation  -0.1931729  0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec  0.4258394  0.222706743 -0.11391854  0.203797235 -0.10821980
## total_fer  -0.4037290 -0.155233106 -0.01954925 -0.378303645  0.13526221
## gdpp        0.3926448  0.046022396 -0.12297749 -0.531994575  0.18016662
##                     PC6         PC7         PC8         PC9
## child_mort -0.200628153  0.07948854  0.68274306  0.32754180
## exports     0.059332832  0.70730269  0.01419742 -0.12308207
## health     -0.007276456  0.24983051 -0.07249683  0.11308797
## imports     0.030031537 -0.59218953  0.02894642  0.09903717
## income     -0.160346990 -0.09556237 -0.35262369  0.61298247
## inflation  -0.066285372 -0.10463252  0.01153775 -0.02523614
## life_expec  0.601126516 -0.01848639  0.50466425  0.29403981
## total_fer   0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp       -0.016778761 -0.24299776  0.24969636 -0.62564572

As result our original scaled variables have been combined to form 9 linear combinations, 9 principal components. Principal component one is perhaps the most important which explains most of the variability in the data followed by the principal component 2 which explains the second highest amount. Within each principal component we have a value for each variable, and they form the so called Eigenvectors. These are the coefficients which our original variables are multiplied by to calculate the principal component score for any particular observation.

summary(myPr)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion  0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
##                            PC8     PC9
## Standard deviation     0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion  0.99257 1.00000

For each of the principal components we have three values: a standard deviation, a proportion of variance and the cumulative proportion.

The standard deviation is the standard deviation of the data along a single principal components, so, a measure of the variability across that principal component. The proportion of variance is the proportion of all the variability in the original data, explained away by the principal component, i.e., in our case the 46% variance in the data is explained by PC1, 17% by PC2, 13% by PC3… and so on. The coumulative proportion is the same idea but cumulative, so, if you take PC1 you will be able to explain 46%, if you take PC1 and PC2 63%, PC1 PC2 and PC3 76%… and if we take all nine this explains 100% of the variability in our data.

Let’s represent this idea in a scree plot:

plot(myPr, type = "l")

This plot shows the variances across its principal components. Of course, the variance is the square of the standard deviation and so that you can see that when it has been taken into account the first and second principal components big part of the variability in our data has been accounted for.

To actually interpret our principal components we can use a function biplot:

biplot(myPr, scale = 0)

In this plot, each number is an observation from our data frame, and each observation is plotted with its values for PC1 and PC2. The red arrows represent the eigenvectors for each variable in our data frame. This plot helps us interpret the importance or the relationship between the different variables and the principal components.

Another statistically way of interpreting the relationship between each variable in our original dataset and the principal components is by looking at the correlations between our original variables and the PCs as a way of helping us understand how as one PC changes whether our variables go up or down and by how much.

Lets extract the PC1 and PC2 because they are the most important and attach them to our data frame.

CD_Pr <- cbind(CDscaled, myPr$x)
head(CD_Pr)

##   child_mort     exports      health     imports      income  inflation
## 1  1.2876597 -1.13486665  0.27825140 -0.08220771 -0.80582187  0.1568645
## 2 -0.5373329 -0.47822017 -0.09672528  0.07062429 -0.37424335 -0.3114109
## 3 -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227  0.7869076
## 4  2.0017872  0.77305618 -1.44372888 -0.16481961 -0.58328920  1.3828944
## 5 -0.6935483  0.16018613 -0.28603389  0.49607554  0.10142673 -0.5999442
## 6 -0.5894047 -0.81019144  0.46756001 -1.27594958  0.08067776  1.2409928
##   life_expec   total_fer        gdpp         PC1         PC2        PC3
## 1 -1.6142372  1.89717646 -0.67714308 -2.90428986 -0.09533386  0.7159652
## 2  0.6459238 -0.85739418 -0.48416709  0.42862224  0.58639208  0.3324855
## 3  0.6684130 -0.03828924 -0.46398018 -0.28436983  0.45380957 -1.2178421
## 4 -1.1756985  2.12176975 -0.51472026 -2.92362976 -1.69047094 -1.5204709
## 5  0.7021467 -0.54032130 -0.04169175  1.03047668 -0.13624894  0.2250441
## 6  0.5897009 -0.38178486 -0.14535428  0.02234007  1.77385167 -0.8673884
##           PC4        PC5          PC6         PC7         PC8         PC9
## 1 -1.00224038  0.1578353  0.253834026 -0.38185183 -0.41383141 -0.01410602
## 2  1.15757715 -0.1741535 -0.084325021 -0.24817249  0.22037967  0.17279609
## 3  0.86551146 -0.1560055  0.400491017  0.08695208  0.18360988  0.08378519
## 4 -0.83710739  0.2723897  0.546352696  0.43951279  0.35493006 -0.09106518
## 5  0.84452276  0.1924282  0.206298298 -0.24125232  0.02361042  0.09398709
## 6  0.03685602 -0.9781148  0.003585307  0.15038009 -0.12557185  0.12570099

cor(CD_Pr[,1:9], CD_Pr[,10:11])

##                   PC1         PC2
## child_mort -0.8531479 -0.23985537
## exports     0.5773418 -0.76248213
## health      0.3067485  0.30228369
## imports     0.3283958 -0.83542357
## income      0.8102823 -0.02802342
## inflation  -0.3928425  0.01045115
## life_expec  0.8660003  0.27694068
## total_fer  -0.8210359 -0.19303574
## gdpp        0.7984948  0.05722985

These above are Pearson correlation coefficients, so, for example, the correlation between the variable ‘income’ and PC1 is large and positive which means that generally as the value for an observation increases along PC1 so does it the variable ‘income’. Alternatively, for variable ‘child_mort’ we have moderate negative correlation between them, so that, as PC1 increases, generally variable ‘child_mort’ decreases. And, for example, although we have negative correlation at ‘income’ and PC2 this is quite small and so that there is no significant relationship between them.

This is actually how the arrows are distributed in the biplot graph: when the correlations with PC2 are negative the arrows are going down some degree, for example, the largest negative is with variable ‘import’; when the correlations are positive, the arrows are pointing up. With PC1 it happens the same but with right for positive and left for negative correlations. When the correlations are not significant lines are almost horizontal for the case of PC2 and almost vertical for the case of PC1.

Since we have already the coordenates of PC1 and PC2 for each point of data let’s finally plot these using ggplot:

ggplot(CD_Pr, aes(PC1, PC2)) +
  geom_point(shape = 21, col = "black")

So now, each data point in our original data set is plotted based on its coordinates rewarded by PC1 and PC2. So we can see, that starting with nine original variables and just condensing them down into two linear combinations of those variables has been able to separate our data of countries. Next step after performing PCA is to apply a cluster analysis to our data projected onto the principal components to try automatically sort and categorize each data point into a cluster.

*** For all the following pre-clustering analysis the information about PCs obtained during the dimension reduction analysis will not be used, since we have to remember the fact that using only PC1 and PC2 it represent only the 63% of the variability of our data, and for this pre-analyis we need the 100%. Then PC1 and PC2 will be mentioned later on again during the projection of the data process when plotting the clusters.

3. Clustering

In addition, before proceeding with any clustering method, it is worth to assess the general clustering tendency of the data. For this purpose, Hopkins statistic and visual assessment are used.

get_clust_tendency(CDscaled, 2, graph=TRUE, gradient=list(low="blue",  high="white"), seed=1234)

## $hopkins_stat
## [1] 0.8325281
## 
## $plot

A value close to 1 tends to indicate the data is highly clustered, random data will tend to result in values around 0.5, and uniformly distributed data will tend to result in values close to 0. In the case of our dataset the value of the Hopkins statistic is 0.83 which claims that the dataset is significantly clusterable. Also the darker rectangular blocks shown in the graph are another proof of the clusterability of the dataset.

3.1 K-means and PAM clustering

Silhoutte statistic:

f1 <- fviz_nbclust(CDscaled, FUNcluster = kmeans, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(CDscaled, FUNcluster = cluster::pam, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n PAM")

grid.arrange(f1, f2, ncol=2)

The silhouette statistic shows that for K-means the proper number of clusters is 5, but since too many clusters could lead to overlapping, we will also do K-means analysis for 4 clusters because the average silhouette width is quite similar. For PAM analysis the proper number of clusters according with the the silhouette statistic is 2, but diving the countries in only two groups would be too general information and would not really help to decide which countries need help. Therefore, as 3 clusters has almost the same average silhouette width than 2, it will finally be considered a PAM analysis of 3 clusters.

Total within-clusters sum of square:

f1 <- fviz_nbclust(CDscaled, FUNcluster = kmeans, method = "wss") + 
  ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(CDscaled, FUNcluster = cluster::pam, method = "wss") + 
  ggtitle("Optimal number of clusters \n PAM")

grid.arrange(f1, f2, ncol=2)

With the within-clusters sum of squares statistics it is possible to check that the elbow point is around the number of clusters mentioned before: 4~5 for K-means and 3 for PAM.

3.1.1 K-means clustering

K-means for 5 clusters:

km5 <- eclust(CDscaled, k=5, FUNcluster="kmeans", hc_metric="euclidean", graph=F)

c5 <- fviz_cluster(km5, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("K-means with 5 clusters")
s5 <- fviz_silhouette(km5)

##   cluster size ave.sil.width
## 1       1   22          0.46
## 2       2   40          0.19
## 3       3   45          0.20
## 4       4   52          0.22
## 5       5    8         -0.03

grid.arrange(c5, s5, ncol=2)

Not a very good result since we have overlappings between all the clusters and also some negatives values of the Silhouette width for all the clusters but 1.

K-means for 4 clusters:

km4 <- eclust(CDscaled, k=4 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)

c4 <- fviz_cluster(km4, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("K-means with 4 clusters")
s4 <- fviz_silhouette(km4)

##   cluster size ave.sil.width
## 1       1   30          0.30
## 2       2   87          0.33
## 3       3   47          0.25
## 4       4    3          0.39

grid.arrange(c4, s4, ncol=2)

table(km4$cluster)

## 
##  1  2  3  4 
## 30 87 47  3

This result looks much better since we got a higher average silhouette width with less negatives values, and only clusters 2 and 3 overlap a little.

3.1.2 Partition Around Medoids

PAM for 3 clusters

pam3 <- eclust(CDscaled, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp3 <- fviz_cluster(pam3, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)

##   cluster size ave.sil.width
## 1       1   51          0.25
## 2       2   85          0.31
## 3       3   31          0.26

grid.arrange(cp3, sp3, ncol=2)

The clustering with PAM and k=3 gives a better result than K-means for k=3, but lightly worst result than K-means for k=4 since PAM is influenced by the 3 points from cluster 4 in K-means wich are behaving as outliers. PAM with 4 clusters has been tested too but it did not improve the result, that is why it has not been included here.

3.1.3 Discussion

Before commenting the clustering, it is very important to highlight two numbers located in every clustering plot that has been created: 17.2% and 46% which are found in the axis for Dim2 and Dim1 respectively. It is very important to notice that the algorithms used to represent the clusters have obtained the same values of “Proportion of Variance” that we found during the dimension reduction analysis (PC1:0.4595 and PC2:0.1718) so we can clearly say that these two dimensions are the Principal Components 1 and 2 that we obtain before, and the data is projected on them to be understandable when plotting it, that is why the points are equally distributed than in our ggplot in the dimension reduction analysis.

Let’s come back the the clustering analysis:

K-means with 4 clusters is the best option according with the silhouette statistics so let’s see what information can we obtain from this clusters:

CD_cl <- as.data.frame(cbind(CD, km4$cluster))
colnames(CD_cl) <- c(colnames(CD),"cluster")
table(km4$cluster)

## 
##  1  2  3  4 
## 30 87 47  3

Cluster 1

summary(CD_cl[CD_cl$cluster==1,c("gdpp","income","child_mort")]) %>% kable()

gdpp	income	child_mort
Min. :22100	Min. : 27200	Min. : 2.600
1st Qu.:34025	1st Qu.: 34375	1st Qu.: 3.825
Median :41850	Median : 40550	Median : 4.200
Mean :43333	Mean : 45250	Mean : 4.953
3rd Qu.:48625	3rd Qu.: 45650	3rd Qu.: 5.100
Max. :87800	Max. :125000	Max. :10.800

cat(CD_cl[CD_cl$cluster==1,c("country")], sep = ", ")

## Australia, Austria, Belgium, Brunei, Canada, Cyprus, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Netherlands, New Zealand, Norway, Portugal, Qatar, Slovenia, South Korea, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States

dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==1,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 2

summary(CD_cl[CD_cl$cluster==2,c("gdpp","income","child_mort")]) %>% kable()

gdpp	income	child_mort
Min. : 592	Min. : 1780	Min. : 3.40
1st Qu.: 2970	1st Qu.: 6705	1st Qu.:11.00
Median : 5020	Median :10500	Median :18.10
Mean : 6919	Mean :12969	Mean :21.39
3rd Qu.:10500	3rd Qu.:17600	3rd Qu.:27.70
Max. :28000	Max. :45400	Max. :64.40

cat(CD_cl[CD_cl$cluster==2,c("country")], sep = ", ")

## Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Czech Republic, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Slovak Republic, Solomon Islands, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam

dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==2,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 3

summary(CD_cl[CD_cl$cluster==3,c("gdpp","income","child_mort")]) %>% kable()

gdpp	income	child_mort
Min. : 231	Min. : 609	Min. : 36.90
1st Qu.: 550	1st Qu.: 1390	1st Qu.: 63.80
Median : 897	Median : 1870	Median : 90.20
Mean : 1922	Mean : 3942	Mean : 92.96
3rd Qu.: 1470	3rd Qu.: 3675	3rd Qu.:111.00
Max. :17100	Max. :33700	Max. :208.00

cat(CD_cl[CD_cl$cluster==3,c("country")], sep = ", ")

## Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d'Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia

dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==3,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 4

summary(CD_cl[CD_cl$cluster==4,c("gdpp","income","child_mort")]) %>% kable()

gdpp	income	child_mort
Min. : 21100	Min. :28300	Min. :2.800
1st Qu.: 33850	1st Qu.:50200	1st Qu.:2.800
Median : 46600	Median :72100	Median :2.800
Mean : 57567	Mean :64033	Mean :4.133
3rd Qu.: 75800	3rd Qu.:81900	3rd Qu.:4.800
Max. :105000	Max. :91700	Max. :6.800

cat(CD_cl[CD_cl$cluster==4,c("country")], sep = ", ")

## Luxembourg, Malta, Singapore

Looking the variables gdpp, income and child_mort we can get and idea about wich group of countries are more in need of help. So, according with the stats summarised above, the cluster 3 is the one with lowest income, gpp and highest child mortality and therefore the countries that make it up will be considered the ones with extreme need of help. If we take a look closer to the list and the map, this cluster of 47 countries mostly consist of African countries, also some from South Asia (Pakistan, Laos…), Middle East (Yemen, Iraq), Caribbean (Haiti) and Oceania (Kiribati).

The countries from cluster 2 have got better statistics than cluster 3 so they are in a better situation. These countries are mostly from South America, North Africa, Asia, and East Europe.

The countries from cluster 1 are the most developed countries in the world with no need of help at all, i.e., North and West Europe, U.S., Canada, Australia, New Zeland, Japan…

The cluster 4 can be consider as the ones with best conditions of all, and it is good they have been included in a different cluster cause otherwise they would have been interfering in the statistics as outliers.

3.2. Hierarchical clustering

It is good to use extra methods for clustering our data to check if our results are consistent. That is why a Hierarchical clustering analysis has been done.

To decide which method to use for the Hierarchical clustering let’s calculate the agglomerative coefficient for each method. This coefficient measures the amount of clustering structure found, values closer to 1 suggest strong clustering structure.

# multiple methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

# the coefficient
ac <- function(x) {
  agnes(CDscaled, method = x)$ac
}

map_dbl(m, ac)

##   average    single  complete      ward 
## 0.8752786 0.8430072 0.9130134 0.9519134

I will choose the Ward’s and Complete Linkage’s methods since they are the ones with closest values to 1.

Now to choose the proper number of clusters I will use the gap statistic which compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data.

gap_stat <- clusGap(CDscaled, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

It shows that the proper value of clusters will be 3.

3.3.1 Ward’s method

3 clusters

# dissimilarity matrix
d <- dist(CDscaled, method = "euclidean")
hcw <- hclust(d, method = "ward.D2" )

# cut tree into 3 groups
sub_grpW <- cutree(hcw, k = 3)

# number of members in each cluster
table(sub_grpW)

## sub_grpW
##   1   2   3 
##  27 106  34

# plots with borders
hcw$labels <- CD$country
plot(hcw, cex = 0.5, hang=-1)
rect.hclust(hcw, k = 3, border = 2:5)

# more visualization
fviz_cluster(list(data = CDscaled, cluster = sub_grpW))

This clustering does not look good since the three clusters overlap each other.

2.3.2 Complete linkage’s method

3 clusters

hcc <- hclust(d, method = "complete" )

# cut tree into 3 groups
sub_grpC <- cutree(hcc, k = 3)

# number of members in each cluster
table(sub_grpC)

## sub_grpC
##   1   2   3 
##  55 109   3

# plots with borders
hcc$labels <- CD$country
plot(hcc, cex = 0.5, hang=-1)
rect.hclust(hcc, k = 3, border = 2:5)

# more visualization - way like with kmeans or pam
fviz_cluster(list(data = CDscaled, cluster = sub_grpC))

This clustering looks more similar to the one obtained with the K-means analysis. And it was also tested for the case of k=4 but the result did not improved at all, the fourth cluster turned out to be one single point of data, the 114 representing Nigeria with a very high child mortality and located on the top of the inflation list.

2.3.3 Discussion

Comparing the silhouette statistics of both methods:

complete_statistics <- fpc::cluster.stats(dist(CDscaled), sub_grpW)
complete_statistics$avg.silwidth

## [1] 0.24563

complete_statistics <- fpc::cluster.stats(dist(CDscaled), sub_grpC)
complete_statistics$avg.silwidth

## [1] 0.2900525

The best method is the Complete Linkage with a higher value of the Average Silhouette Width. Let’s see now what the statistics of these 3 clusters say about the countries.

CD_cl <- as.data.frame(cbind(CD_cl, sub_grpC))
colnames(CD_cl) <- c(colnames(CD),"cluster","clusterC")
table(sub_grpC)

## sub_grpC
##   1   2   3 
##  55 109   3

Cluster 1:

summary(CD_cl[CD_cl$clusterC==1,c("gdpp","income","child_mort")]) %>% kable()

gdpp	income	child_mort
Min. : 231	Min. : 609	Min. : 17.40
1st Qu.: 569	1st Qu.: 1420	1st Qu.: 54.85
Median : 1000	Median : 2180	Median : 80.30
Mean : 1918	Mean : 3812	Mean : 82.23
3rd Qu.: 2535	3rd Qu.: 4230	3rd Qu.:110.00
Max. :17100	Max. :33700	Max. :208.00

cat(CD_cl[CD_cl$clusterC==1,c("country")], sep = ", ")

## Afghanistan, Angola, Benin, Bhutan, Botswana, Burkina Faso, Burundi, Cambodia, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d'Ivoire, Equatorial Guinea, Fiji, Gambia, Ghana, Guinea, Guinea-Bissau, Guyana, Haiti, Iraq, Kenya, Kiribati, Kyrgyz Republic, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Micronesia, Fed. Sts., Mozambique, Namibia, Niger, Nigeria, Rwanda, Samoa, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tajikistan, Tanzania, Timor-Leste, Togo, Tonga, Uganda, Vanuatu, Yemen, Zambia

dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$clusterC==1,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 2:

summary(CD_cl[CD_cl$clusterC==2,c("gdpp","income","child_mort")]) %>% kable()

gdpp	income	child_mort
Min. : 482	Min. : 1420	Min. : 2.60
1st Qu.: 4440	1st Qu.: 9720	1st Qu.: 5.60
Median : 9070	Median : 16500	Median :11.70
Mean :17310	Mean : 22582	Mean :17.03
3rd Qu.:26900	3rd Qu.: 32300	3rd Qu.:20.30
Max. :87800	Max. :125000	Max. :92.10

cat(CD_cl[CD_cl$clusterC==2,c("country")], sep = ", ")

## Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Bolivia, Bosnia and Herzegovina, Brazil, Brunei, Bulgaria, Canada, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czech Republic, Denmark, Dominican Republic, Ecuador, Egypt, El Salvador, Eritrea, Estonia, Finland, France, Gabon, Georgia, Germany, Greece, Grenada, Guatemala, Hungary, Iceland, India, Indonesia, Iran, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kuwait, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Netherlands, New Zealand, Norway, Oman, Pakistan, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Qatar, Romania, Russia, Saudi Arabia, Serbia, Seychelles, Slovak Republic, Slovenia, South Korea, Spain, Sri Lanka, St. Vincent and the Grenadines, Suriname, Sweden, Switzerland, Thailand, Tunisia, Turkey, Turkmenistan, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Venezuela, Vietnam

dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$clusterC==2,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 3:

summary(CD_cl[CD_cl$clusterC==3,c("gdpp","income","child_mort")]) %>% kable()

gdpp	income	child_mort
Min. : 21100	Min. :28300	Min. :2.800
1st Qu.: 33850	1st Qu.:50200	1st Qu.:2.800
Median : 46600	Median :72100	Median :2.800
Mean : 57567	Mean :64033	Mean :4.133
3rd Qu.: 75800	3rd Qu.:81900	3rd Qu.:4.800
Max. :105000	Max. :91700	Max. :6.800

cat(CD_cl[CD_cl$clusterC==3,c("country")], sep = ", ")

## Luxembourg, Malta, Singapore

Looking the variables gdpp, income and child_mort it is possible to see the clear relation between these new clusters and the clusters obtained during the K-means analysis:

Cluster 1 in the Complete Linkage is almost the same than the cluster 3 for the K-means, they have a difference of 8 countries but the statistics are quite similar, so this is the cluster with the countrieas which are extreme need of help.

Then clusters 1 and 2 from K-means are now represented together by the cluster 2 and, as it is possible to see in the map, it consists in developed and high developed countries.

The cluster 3 is exactly the same three countries than in the cluster 4 in K-means case.

4. Conclusion

It can be concluded that the classification made by the K-means analysis with 4 clusters was consistent and the list of countries is ready to be submitted the HELP International organization.

5. Sources

Jacek Lewkowicz. Unsupervised Learning - Master in Data Science and Business Analytics - University of Warsaw. Lecture slides
https://rpubs.com/sgroszkiewicz/clustering
Map-countries plot algorithms: https://towardsdatascience.com/world-map-of-visited-countries-in-r-d46b9be7d52b
Data - Kaggle: https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data?select=data-dictionary.csv

Unsupervised Learning on Country Data

Alberto Delgado Lopez