1 Introduction

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. HELP International have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. So, CEO has to make decision to choose the countries that are in the direst need of aid.

The aim of this paper is to categorise the countries using some socio-economic and health factors that determine the overall development of the country, to finally suggest the countries which the CEO needs to focus on the most.

Review of the Dataset

The dataset used in this project was found in a public Kaggle profile, and it describes 9 socio-economic and health factors from a list of 167 countries:

  • country: Name of the country
  • child_mort: Death of children under 5 years of age per 1000 live births
  • exports: Exports of goods and services per capita. Given as %age of the GDP per capita
  • health: Total health spending per capita. Given as %age of GDP per capita
  • imports: Imports of goods and services per capita. Given as %age of the GDP per capita
  • Income: Net income per person
  • Inflation: The measurement of the annual growth rate of the Total GDP
  • life_expec: The average number of years a new born child would live if the current mortality patterns are to remain the same
  • total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same
  • gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.

Preparing the data

#Libraries needed in the whole paper
library(knitr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(factoextra)
library(tidyverse)
library(cluster)    
library(dendextend) 
library(highcharter)
library(maps)

Loading the data, showing the header of the data and checking if there is any missing value in the dataset:

CD <- read.csv('/Users/albertodelgadolopez/Desktop/Final clustering/Country-data.csv', header=TRUE)
kable(head(CD))
country child_mort exports health imports income inflation life_expec total_fer gdpp
Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553
Albania 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090
Algeria 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460
Angola 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530
Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200
Argentina 14.5 18.9 8.10 16.0 18700 20.90 75.8 2.37 10300
any(is.na(CD))
## [1] FALSE

There are no missing values so we do not have to modify anything related with that matter in our dataset. It is also possible to appreciate that some features tend to have larger values than others, so to avoid a problem in which some features come to dominate solely, let’s scale our data.

CDscaled <- as.data.frame(lapply(CD[,2:10], scale))
kable(summary(CDscaled))
child_mort exports health imports income inflation life_expec total_fer gdpp
Min. :-0.8845 Min. :-1.4957 Min. :-1.8223 Min. :-1.9341 Min. :-0.8577 Min. :-1.1344 Min. :-4.3242 Min. :-1.1877 Min. :-0.69471
1st Qu.:-0.7444 1st Qu.:-0.6314 1st Qu.:-0.6901 1st Qu.:-0.6894 1st Qu.:-0.7153 1st Qu.:-0.5649 1st Qu.:-0.5910 1st Qu.:-0.7616 1st Qu.:-0.63475
Median :-0.4704 Median :-0.2229 Median :-0.1805 Median :-0.1483 Median :-0.3727 Median :-0.2263 Median : 0.2861 Median :-0.3554 Median :-0.45307
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.5909 3rd Qu.: 0.3736 3rd Qu.: 0.6496 3rd Qu.: 0.4899 3rd Qu.: 0.2934 3rd Qu.: 0.2808 3rd Qu.: 0.7021 3rd Qu.: 0.6157 3rd Qu.: 0.05924
Max. : 4.2086 Max. : 5.7964 Max. : 4.0353 Max. : 5.2504 Max. : 5.5947 Max. : 9.1023 Max. : 1.3768 Max. : 3.0003 Max. : 5.02140

Let’s have a look now at the relationship between the variables, below there is a fragment of the correlation matrix.

corrplot(cor(CDscaled, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="CDscaled", mar=c(0,0,1,0))

From the indexes of correlation we can obtain a first impression of the relationship between the variables and the possible relationship between the final clusters. As we can see, the most correlated pair of data are the income/gdpp followed by the child_mort/total_fer and the exports/imports. These results are, in deed, pretty logical and would be very good indicators for dividing the countries in clusters accordingly with their overall development.

Nevertheless, our intention is to take into cosideration all the variables included in our data set together, and classify our data accordingly, that is why the next step is so important, the dimension reduction.

2. Dimension reduction

Since our dataset is multidimensional with nine variables, it is highly important to do the next Principal Component Analysis. This is a dimension reduction technique to create linear combinations of the original variables that explain as much of the variance in our data as possible. So, this allow us going from 10, 15, 30 or even more variables down to just two or three which explain most of the variation in the data and make it easier to look for patterns.

We will build a PCA model that will hopefully reduce our 9 continuous variables into just two PCAs that capture most of the variability in our data. And perhaps once we have constructed our principal components, these will allow us to perform a proper cluster analysis.

Principal Component Analysis

myPr <- prcomp(CDscaled)
myPr
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
## 
## Rotation (n x k) = (9 x 9):
##                   PC1          PC2         PC3          PC4         PC5
## child_mort -0.4195194 -0.192883937  0.02954353 -0.370653262  0.16896968
## exports     0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health      0.1508378  0.243086779  0.59663237 -0.461897497 -0.51800037
## imports     0.1614824 -0.671820644  0.29992674  0.071907461 -0.25537642
## income      0.3984411 -0.022535530 -0.30154750 -0.392159039  0.24714960
## inflation  -0.1931729  0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec  0.4258394  0.222706743 -0.11391854  0.203797235 -0.10821980
## total_fer  -0.4037290 -0.155233106 -0.01954925 -0.378303645  0.13526221
## gdpp        0.3926448  0.046022396 -0.12297749 -0.531994575  0.18016662
##                     PC6         PC7         PC8         PC9
## child_mort -0.200628153  0.07948854  0.68274306  0.32754180
## exports     0.059332832  0.70730269  0.01419742 -0.12308207
## health     -0.007276456  0.24983051 -0.07249683  0.11308797
## imports     0.030031537 -0.59218953  0.02894642  0.09903717
## income     -0.160346990 -0.09556237 -0.35262369  0.61298247
## inflation  -0.066285372 -0.10463252  0.01153775 -0.02523614
## life_expec  0.601126516 -0.01848639  0.50466425  0.29403981
## total_fer   0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp       -0.016778761 -0.24299776  0.24969636 -0.62564572

As result our original scaled variables have been combined to form 9 linear combinations, 9 principal components. Principal component one is perhaps the most important which explains most of the variability in the data followed by the principal component 2 which explains the second highest amount. Within each principal component we have a value for each variable, and they form the so called Eigenvectors. These are the coefficients which our original variables are multiplied by to calculate the principal component score for any particular observation.

summary(myPr)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion  0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
##                            PC8     PC9
## Standard deviation     0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion  0.99257 1.00000

For each of the principal components we have three values: a standard deviation, a proportion of variance and the cumulative proportion.

The standard deviation is the standard deviation of the data along a single principal components, so, a measure of the variability across that principal component. The proportion of variance is the proportion of all the variability in the original data, explained away by the principal component, i.e., in our case the 46% variance in the data is explained by PC1, 17% by PC2, 13% by PC3… and so on. The coumulative proportion is the same idea but cumulative, so, if you take PC1 you will be able to explain 46%, if you take PC1 and PC2 63%, PC1 PC2 and PC3 76%… and if we take all nine this explains 100% of the variability in our data.

Let’s represent this idea in a scree plot:

plot(myPr, type = "l")

This plot shows the variances across its principal components. Of course, the variance is the square of the standard deviation and so that you can see that when it has been taken into account the first and second principal components big part of the variability in our data has been accounted for.

To actually interpret our principal components we can use a function biplot:

biplot(myPr, scale = 0)

In this plot, each number is an observation from our data frame, and each observation is plotted with its values for PC1 and PC2. The red arrows represent the eigenvectors for each variable in our data frame. This plot helps us interpret the importance or the relationship between the different variables and the principal components.

Another statistically way of interpreting the relationship between each variable in our original dataset and the principal components is by looking at the correlations between our original variables and the PCs as a way of helping us understand how as one PC changes whether our variables go up or down and by how much.

Lets extract the PC1 and PC2 because they are the most important and attach them to our data frame.

CD_Pr <- cbind(CDscaled, myPr$x)
head(CD_Pr)
##   child_mort     exports      health     imports      income  inflation
## 1  1.2876597 -1.13486665  0.27825140 -0.08220771 -0.80582187  0.1568645
## 2 -0.5373329 -0.47822017 -0.09672528  0.07062429 -0.37424335 -0.3114109
## 3 -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227  0.7869076
## 4  2.0017872  0.77305618 -1.44372888 -0.16481961 -0.58328920  1.3828944
## 5 -0.6935483  0.16018613 -0.28603389  0.49607554  0.10142673 -0.5999442
## 6 -0.5894047 -0.81019144  0.46756001 -1.27594958  0.08067776  1.2409928
##   life_expec   total_fer        gdpp         PC1         PC2        PC3
## 1 -1.6142372  1.89717646 -0.67714308 -2.90428986 -0.09533386  0.7159652
## 2  0.6459238 -0.85739418 -0.48416709  0.42862224  0.58639208  0.3324855
## 3  0.6684130 -0.03828924 -0.46398018 -0.28436983  0.45380957 -1.2178421
## 4 -1.1756985  2.12176975 -0.51472026 -2.92362976 -1.69047094 -1.5204709
## 5  0.7021467 -0.54032130 -0.04169175  1.03047668 -0.13624894  0.2250441
## 6  0.5897009 -0.38178486 -0.14535428  0.02234007  1.77385167 -0.8673884
##           PC4        PC5          PC6         PC7         PC8         PC9
## 1 -1.00224038  0.1578353  0.253834026 -0.38185183 -0.41383141 -0.01410602
## 2  1.15757715 -0.1741535 -0.084325021 -0.24817249  0.22037967  0.17279609
## 3  0.86551146 -0.1560055  0.400491017  0.08695208  0.18360988  0.08378519
## 4 -0.83710739  0.2723897  0.546352696  0.43951279  0.35493006 -0.09106518
## 5  0.84452276  0.1924282  0.206298298 -0.24125232  0.02361042  0.09398709
## 6  0.03685602 -0.9781148  0.003585307  0.15038009 -0.12557185  0.12570099
cor(CD_Pr[,1:9], CD_Pr[,10:11])
##                   PC1         PC2
## child_mort -0.8531479 -0.23985537
## exports     0.5773418 -0.76248213
## health      0.3067485  0.30228369
## imports     0.3283958 -0.83542357
## income      0.8102823 -0.02802342
## inflation  -0.3928425  0.01045115
## life_expec  0.8660003  0.27694068
## total_fer  -0.8210359 -0.19303574
## gdpp        0.7984948  0.05722985

These above are Pearson correlation coefficients, so, for example, the correlation between the variable ‘income’ and PC1 is large and positive which means that generally as the value for an observation increases along PC1 so does it the variable ‘income’. Alternatively, for variable ‘child_mort’ we have moderate negative correlation between them, so that, as PC1 increases, generally variable ‘child_mort’ decreases. And, for example, although we have negative correlation at ‘income’ and PC2 this is quite small and so that there is no significant relationship between them.

This is actually how the arrows are distributed in the biplot graph: when the correlations with PC2 are negative the arrows are going down some degree, for example, the largest negative is with variable ‘import’; when the correlations are positive, the arrows are pointing up. With PC1 it happens the same but with right for positive and left for negative correlations. When the correlations are not significant lines are almost horizontal for the case of PC2 and almost vertical for the case of PC1.

Since we have already the coordenates of PC1 and PC2 for each point of data let’s finally plot these using ggplot:

ggplot(CD_Pr, aes(PC1, PC2)) +
  geom_point(shape = 21, col = "black")

So now, each data point in our original data set is plotted based on its coordinates rewarded by PC1 and PC2. So we can see, that starting with nine original variables and just condensing them down into two linear combinations of those variables has been able to separate our data of countries. Next step after performing PCA is to apply a cluster analysis to our data projected onto the principal components to try automatically sort and categorize each data point into a cluster.

*** For all the following pre-clustering analysis the information about PCs obtained during the dimension reduction analysis will not be used, since we have to remember the fact that using only PC1 and PC2 it represent only the 63% of the variability of our data, and for this pre-analyis we need the 100%. Then PC1 and PC2 will be mentioned later on again during the projection of the data process when plotting the clusters.

3. Clustering

In addition, before proceeding with any clustering method, it is worth to assess the general clustering tendency of the data. For this purpose, Hopkins statistic and visual assessment are used.

get_clust_tendency(CDscaled, 2, graph=TRUE, gradient=list(low="blue",  high="white"), seed=1234)
## $hopkins_stat
## [1] 0.8325281
## 
## $plot

A value close to 1 tends to indicate the data is highly clustered, random data will tend to result in values around 0.5, and uniformly distributed data will tend to result in values close to 0. In the case of our dataset the value of the Hopkins statistic is 0.83 which claims that the dataset is significantly clusterable. Also the darker rectangular blocks shown in the graph are another proof of the clusterability of the dataset.

3.1 K-means and PAM clustering

Silhoutte statistic:

f1 <- fviz_nbclust(CDscaled, FUNcluster = kmeans, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(CDscaled, FUNcluster = cluster::pam, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n PAM")

grid.arrange(f1, f2, ncol=2)

The silhouette statistic shows that for K-means the proper number of clusters is 5, but since too many clusters could lead to overlapping, we will also do K-means analysis for 4 clusters because the average silhouette width is quite similar. For PAM analysis the proper number of clusters according with the the silhouette statistic is 2, but diving the countries in only two groups would be too general information and would not really help to decide which countries need help. Therefore, as 3 clusters has almost the same average silhouette width than 2, it will finally be considered a PAM analysis of 3 clusters.

Total within-clusters sum of square:

f1 <- fviz_nbclust(CDscaled, FUNcluster = kmeans, method = "wss") + 
  ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(CDscaled, FUNcluster = cluster::pam, method = "wss") + 
  ggtitle("Optimal number of clusters \n PAM")

grid.arrange(f1, f2, ncol=2)

With the within-clusters sum of squares statistics it is possible to check that the elbow point is around the number of clusters mentioned before: 4~5 for K-means and 3 for PAM.

3.1.1 K-means clustering

K-means for 5 clusters:

km5 <- eclust(CDscaled, k=5, FUNcluster="kmeans", hc_metric="euclidean", graph=F)

c5 <- fviz_cluster(km5, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("K-means with 5 clusters")
s5 <- fviz_silhouette(km5)
##   cluster size ave.sil.width
## 1       1   22          0.46
## 2       2   40          0.19
## 3       3   45          0.20
## 4       4   52          0.22
## 5       5    8         -0.03
grid.arrange(c5, s5, ncol=2)

Not a very good result since we have overlappings between all the clusters and also some negatives values of the Silhouette width for all the clusters but 1.

K-means for 4 clusters:

km4 <- eclust(CDscaled, k=4 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)

c4 <- fviz_cluster(km4, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("K-means with 4 clusters")
s4 <- fviz_silhouette(km4)
##   cluster size ave.sil.width
## 1       1   30          0.30
## 2       2   87          0.33
## 3       3   47          0.25
## 4       4    3          0.39
grid.arrange(c4, s4, ncol=2)

table(km4$cluster)
## 
##  1  2  3  4 
## 30 87 47  3

This result looks much better since we got a higher average silhouette width with less negatives values, and only clusters 2 and 3 overlap a little.

3.1.2 Partition Around Medoids

PAM for 3 clusters

pam3 <- eclust(CDscaled, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp3 <- fviz_cluster(pam3, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)
##   cluster size ave.sil.width
## 1       1   51          0.25
## 2       2   85          0.31
## 3       3   31          0.26
grid.arrange(cp3, sp3, ncol=2)

The clustering with PAM and k=3 gives a better result than K-means for k=3, but lightly worst result than K-means for k=4 since PAM is influenced by the 3 points from cluster 4 in K-means wich are behaving as outliers. PAM with 4 clusters has been tested too but it did not improve the result, that is why it has not been included here.

3.1.3 Discussion

Before commenting the clustering, it is very important to highlight two numbers located in every clustering plot that has been created: 17.2% and 46% which are found in the axis for Dim2 and Dim1 respectively. It is very important to notice that the algorithms used to represent the clusters have obtained the same values of “Proportion of Variance” that we found during the dimension reduction analysis (PC1:0.4595 and PC2:0.1718) so we can clearly say that these two dimensions are the Principal Components 1 and 2 that we obtain before, and the data is projected on them to be understandable when plotting it, that is why the points are equally distributed than in our ggplot in the dimension reduction analysis.

Let’s come back the the clustering analysis:

K-means with 4 clusters is the best option according with the silhouette statistics so let’s see what information can we obtain from this clusters:

CD_cl <- as.data.frame(cbind(CD, km4$cluster))
colnames(CD_cl) <- c(colnames(CD),"cluster")
table(km4$cluster)
## 
##  1  2  3  4 
## 30 87 47  3

Cluster 1

summary(CD_cl[CD_cl$cluster==1,c("gdpp","income","child_mort")]) %>% kable()
gdpp income child_mort
Min. :22100 Min. : 27200 Min. : 2.600
1st Qu.:34025 1st Qu.: 34375 1st Qu.: 3.825
Median :41850 Median : 40550 Median : 4.200
Mean :43333 Mean : 45250 Mean : 4.953
3rd Qu.:48625 3rd Qu.: 45650 3rd Qu.: 5.100
Max. :87800 Max. :125000 Max. :10.800
cat(CD_cl[CD_cl$cluster==1,c("country")], sep = ", ")
## Australia, Austria, Belgium, Brunei, Canada, Cyprus, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Netherlands, New Zealand, Norway, Portugal, Qatar, Slovenia, South Korea, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==1,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 2

summary(CD_cl[CD_cl$cluster==2,c("gdpp","income","child_mort")]) %>% kable()
gdpp income child_mort
Min. : 592 Min. : 1780 Min. : 3.40
1st Qu.: 2970 1st Qu.: 6705 1st Qu.:11.00
Median : 5020 Median :10500 Median :18.10
Mean : 6919 Mean :12969 Mean :21.39
3rd Qu.:10500 3rd Qu.:17600 3rd Qu.:27.70
Max. :28000 Max. :45400 Max. :64.40
cat(CD_cl[CD_cl$cluster==2,c("country")], sep = ", ")
## Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Czech Republic, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Slovak Republic, Solomon Islands, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==2,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 3

summary(CD_cl[CD_cl$cluster==3,c("gdpp","income","child_mort")]) %>% kable()
gdpp income child_mort
Min. : 231 Min. : 609 Min. : 36.90
1st Qu.: 550 1st Qu.: 1390 1st Qu.: 63.80
Median : 897 Median : 1870 Median : 90.20
Mean : 1922 Mean : 3942 Mean : 92.96
3rd Qu.: 1470 3rd Qu.: 3675 3rd Qu.:111.00
Max. :17100 Max. :33700 Max. :208.00
cat(CD_cl[CD_cl$cluster==3,c("country")], sep = ", ")
## Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d'Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==3,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 4

summary(CD_cl[CD_cl$cluster==4,c("gdpp","income","child_mort")]) %>% kable()
gdpp income child_mort
Min. : 21100 Min. :28300 Min. :2.800
1st Qu.: 33850 1st Qu.:50200 1st Qu.:2.800
Median : 46600 Median :72100 Median :2.800
Mean : 57567 Mean :64033 Mean :4.133
3rd Qu.: 75800 3rd Qu.:81900 3rd Qu.:4.800
Max. :105000 Max. :91700 Max. :6.800
cat(CD_cl[CD_cl$cluster==4,c("country")], sep = ", ")
## Luxembourg, Malta, Singapore

Looking the variables gdpp, income and child_mort we can get and idea about wich group of countries are more in need of help. So, according with the stats summarised above, the cluster 3 is the one with lowest income, gpp and highest child mortality and therefore the countries that make it up will be considered the ones with extreme need of help. If we take a look closer to the list and the map, this cluster of 47 countries mostly consist of African countries, also some from South Asia (Pakistan, Laos…), Middle East (Yemen, Iraq), Caribbean (Haiti) and Oceania (Kiribati).

The countries from cluster 2 have got better statistics than cluster 3 so they are in a better situation. These countries are mostly from South America, North Africa, Asia, and East Europe.

The countries from cluster 1 are the most developed countries in the world with no need of help at all, i.e., North and West Europe, U.S., Canada, Australia, New Zeland, Japan…

The cluster 4 can be consider as the ones with best conditions of all, and it is good they have been included in a different cluster cause otherwise they would have been interfering in the statistics as outliers.

3.2. Hierarchical clustering

It is good to use extra methods for clustering our data to check if our results are consistent. That is why a Hierarchical clustering analysis has been done.

To decide which method to use for the Hierarchical clustering let’s calculate the agglomerative coefficient for each method. This coefficient measures the amount of clustering structure found, values closer to 1 suggest strong clustering structure.

# multiple methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

# the coefficient
ac <- function(x) {
  agnes(CDscaled, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.8752786 0.8430072 0.9130134 0.9519134

I will choose the Ward’s and Complete Linkage’s methods since they are the ones with closest values to 1.

Now to choose the proper number of clusters I will use the gap statistic which compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data.

gap_stat <- clusGap(CDscaled, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

It shows that the proper value of clusters will be 3.

3.3.1 Ward’s method

3 clusters

# dissimilarity matrix
d <- dist(CDscaled, method = "euclidean")
hcw <- hclust(d, method = "ward.D2" )

# cut tree into 3 groups
sub_grpW <- cutree(hcw, k = 3)

# number of members in each cluster
table(sub_grpW)
## sub_grpW
##   1   2   3 
##  27 106  34
# plots with borders
hcw$labels <- CD$country
plot(hcw, cex = 0.5, hang=-1)
rect.hclust(hcw, k = 3, border = 2:5)

# more visualization
fviz_cluster(list(data = CDscaled, cluster = sub_grpW))

This clustering does not look good since the three clusters overlap each other.

2.3.2 Complete linkage’s method

3 clusters

hcc <- hclust(d, method = "complete" )

# cut tree into 3 groups
sub_grpC <- cutree(hcc, k = 3)

# number of members in each cluster
table(sub_grpC)
## sub_grpC
##   1   2   3 
##  55 109   3
# plots with borders
hcc$labels <- CD$country
plot(hcc, cex = 0.5, hang=-1)
rect.hclust(hcc, k = 3, border = 2:5)

# more visualization - way like with kmeans or pam
fviz_cluster(list(data = CDscaled, cluster = sub_grpC))

This clustering looks more similar to the one obtained with the K-means analysis. And it was also tested for the case of k=4 but the result did not improved at all, the fourth cluster turned out to be one single point of data, the 114 representing Nigeria with a very high child mortality and located on the top of the inflation list.

2.3.3 Discussion

Comparing the silhouette statistics of both methods:

complete_statistics <- fpc::cluster.stats(dist(CDscaled), sub_grpW)
complete_statistics$avg.silwidth
## [1] 0.24563
complete_statistics <- fpc::cluster.stats(dist(CDscaled), sub_grpC)
complete_statistics$avg.silwidth
## [1] 0.2900525

The best method is the Complete Linkage with a higher value of the Average Silhouette Width. Let’s see now what the statistics of these 3 clusters say about the countries.

CD_cl <- as.data.frame(cbind(CD_cl, sub_grpC))
colnames(CD_cl) <- c(colnames(CD),"cluster","clusterC")
table(sub_grpC)
## sub_grpC
##   1   2   3 
##  55 109   3

Cluster 1:

summary(CD_cl[CD_cl$clusterC==1,c("gdpp","income","child_mort")]) %>% kable()
gdpp income child_mort
Min. : 231 Min. : 609 Min. : 17.40
1st Qu.: 569 1st Qu.: 1420 1st Qu.: 54.85
Median : 1000 Median : 2180 Median : 80.30
Mean : 1918 Mean : 3812 Mean : 82.23
3rd Qu.: 2535 3rd Qu.: 4230 3rd Qu.:110.00
Max. :17100 Max. :33700 Max. :208.00
cat(CD_cl[CD_cl$clusterC==1,c("country")], sep = ", ")
## Afghanistan, Angola, Benin, Bhutan, Botswana, Burkina Faso, Burundi, Cambodia, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d'Ivoire, Equatorial Guinea, Fiji, Gambia, Ghana, Guinea, Guinea-Bissau, Guyana, Haiti, Iraq, Kenya, Kiribati, Kyrgyz Republic, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Micronesia, Fed. Sts., Mozambique, Namibia, Niger, Nigeria, Rwanda, Samoa, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tajikistan, Tanzania, Timor-Leste, Togo, Tonga, Uganda, Vanuatu, Yemen, Zambia
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$clusterC==1,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 2:

summary(CD_cl[CD_cl$clusterC==2,c("gdpp","income","child_mort")]) %>% kable()
gdpp income child_mort
Min. : 482 Min. : 1420 Min. : 2.60
1st Qu.: 4440 1st Qu.: 9720 1st Qu.: 5.60
Median : 9070 Median : 16500 Median :11.70
Mean :17310 Mean : 22582 Mean :17.03
3rd Qu.:26900 3rd Qu.: 32300 3rd Qu.:20.30
Max. :87800 Max. :125000 Max. :92.10
cat(CD_cl[CD_cl$clusterC==2,c("country")], sep = ", ")
## Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Bolivia, Bosnia and Herzegovina, Brazil, Brunei, Bulgaria, Canada, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czech Republic, Denmark, Dominican Republic, Ecuador, Egypt, El Salvador, Eritrea, Estonia, Finland, France, Gabon, Georgia, Germany, Greece, Grenada, Guatemala, Hungary, Iceland, India, Indonesia, Iran, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kuwait, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Netherlands, New Zealand, Norway, Oman, Pakistan, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Qatar, Romania, Russia, Saudi Arabia, Serbia, Seychelles, Slovak Republic, Slovenia, South Korea, Spain, Sri Lanka, St. Vincent and the Grenadines, Suriname, Sweden, Switzerland, Thailand, Tunisia, Turkey, Turkmenistan, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Venezuela, Vietnam
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$clusterC==2,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)

hcmap(
  map = "custom/world-highres3", # high resolution world map
  data = dat, # name of dataset
  joinBy = "iso-a3",
  value = "cluster",
  showInLegend = FALSE, # hide legend
  nullColor = "#DADADA",
  download_map_data = TRUE
) %>%
  hc_mapNavigation(enabled = FALSE) %>%
  hc_legend("none") %>%
  hc_title(text = "World map") # title

Cluster 3:

summary(CD_cl[CD_cl$clusterC==3,c("gdpp","income","child_mort")]) %>% kable()
gdpp income child_mort
Min. : 21100 Min. :28300 Min. :2.800
1st Qu.: 33850 1st Qu.:50200 1st Qu.:2.800
Median : 46600 Median :72100 Median :2.800
Mean : 57567 Mean :64033 Mean :4.133
3rd Qu.: 75800 3rd Qu.:81900 3rd Qu.:4.800
Max. :105000 Max. :91700 Max. :6.800
cat(CD_cl[CD_cl$clusterC==3,c("country")], sep = ", ")
## Luxembourg, Malta, Singapore

Looking the variables gdpp, income and child_mort it is possible to see the clear relation between these new clusters and the clusters obtained during the K-means analysis:

Cluster 1 in the Complete Linkage is almost the same than the cluster 3 for the K-means, they have a difference of 8 countries but the statistics are quite similar, so this is the cluster with the countrieas which are extreme need of help.

Then clusters 1 and 2 from K-means are now represented together by the cluster 2 and, as it is possible to see in the map, it consists in developed and high developed countries.

The cluster 3 is exactly the same three countries than in the cluster 4 in K-means case.

4. Conclusion

It can be concluded that the classification made by the K-means analysis with 4 clusters was consistent and the list of countries is ready to be submitted the HELP International organization.

5. Sources

  1. Jacek Lewkowicz. Unsupervised Learning - Master in Data Science and Business Analytics - University of Warsaw. Lecture slides
  2. https://rpubs.com/sgroszkiewicz/clustering
  3. Map-countries plot algorithms: https://towardsdatascience.com/world-map-of-visited-countries-in-r-d46b9be7d52b
  4. Data - Kaggle: https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data?select=data-dictionary.csv