HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. HELP International have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. So, CEO has to make decision to choose the countries that are in the direst need of aid.
The aim of this paper is to categorise the countries using some socio-economic and health factors that determine the overall development of the country, to finally suggest the countries which the CEO needs to focus on the most.
The dataset used in this project was found in a public Kaggle profile, and it describes 9 socio-economic and health factors from a list of 167 countries:
Preparing the data
#Libraries needed in the whole paper
library(knitr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(factoextra)
library(tidyverse)
library(cluster)
library(dendextend)
library(highcharter)
library(maps)
Loading the data, showing the header of the data and checking if there is any missing value in the dataset:
CD <- read.csv('/Users/albertodelgadolopez/Desktop/Final clustering/Country-data.csv', header=TRUE)
kable(head(CD))
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.90 | 75.8 | 2.37 | 10300 |
any(is.na(CD))
## [1] FALSE
There are no missing values so we do not have to modify anything related with that matter in our dataset. It is also possible to appreciate that some features tend to have larger values than others, so to avoid a problem in which some features come to dominate solely, let’s scale our data.
CDscaled <- as.data.frame(lapply(CD[,2:10], scale))
kable(summary(CDscaled))
| child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|
| Min. :-0.8845 | Min. :-1.4957 | Min. :-1.8223 | Min. :-1.9341 | Min. :-0.8577 | Min. :-1.1344 | Min. :-4.3242 | Min. :-1.1877 | Min. :-0.69471 | |
| 1st Qu.:-0.7444 | 1st Qu.:-0.6314 | 1st Qu.:-0.6901 | 1st Qu.:-0.6894 | 1st Qu.:-0.7153 | 1st Qu.:-0.5649 | 1st Qu.:-0.5910 | 1st Qu.:-0.7616 | 1st Qu.:-0.63475 | |
| Median :-0.4704 | Median :-0.2229 | Median :-0.1805 | Median :-0.1483 | Median :-0.3727 | Median :-0.2263 | Median : 0.2861 | Median :-0.3554 | Median :-0.45307 | |
| Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.00000 | |
| 3rd Qu.: 0.5909 | 3rd Qu.: 0.3736 | 3rd Qu.: 0.6496 | 3rd Qu.: 0.4899 | 3rd Qu.: 0.2934 | 3rd Qu.: 0.2808 | 3rd Qu.: 0.7021 | 3rd Qu.: 0.6157 | 3rd Qu.: 0.05924 | |
| Max. : 4.2086 | Max. : 5.7964 | Max. : 4.0353 | Max. : 5.2504 | Max. : 5.5947 | Max. : 9.1023 | Max. : 1.3768 | Max. : 3.0003 | Max. : 5.02140 |
Let’s have a look now at the relationship between the variables, below there is a fragment of the correlation matrix.
corrplot(cor(CDscaled, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="CDscaled", mar=c(0,0,1,0))
From the indexes of correlation we can obtain a first impression of the relationship between the variables and the possible relationship between the final clusters. As we can see, the most correlated pair of data are the income/gdpp followed by the child_mort/total_fer and the exports/imports. These results are, in deed, pretty logical and would be very good indicators for dividing the countries in clusters accordingly with their overall development.
Nevertheless, our intention is to take into cosideration all the variables included in our data set together, and classify our data accordingly, that is why the next step is so important, the dimension reduction.
Since our dataset is multidimensional with nine variables, it is highly important to do the next Principal Component Analysis. This is a dimension reduction technique to create linear combinations of the original variables that explain as much of the variance in our data as possible. So, this allow us going from 10, 15, 30 or even more variables down to just two or three which explain most of the variation in the data and make it easier to look for patterns.
We will build a PCA model that will hopefully reduce our 9 continuous variables into just two PCAs that capture most of the variability in our data. And perhaps once we have constructed our principal components, these will allow us to perform a proper cluster analysis.
myPr <- prcomp(CDscaled)
myPr
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
## exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
## imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
## income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
## inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
## total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
## gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
## PC6 PC7 PC8 PC9
## child_mort -0.200628153 0.07948854 0.68274306 0.32754180
## exports 0.059332832 0.70730269 0.01419742 -0.12308207
## health -0.007276456 0.24983051 -0.07249683 0.11308797
## imports 0.030031537 -0.59218953 0.02894642 0.09903717
## income -0.160346990 -0.09556237 -0.35262369 0.61298247
## inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
## life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
## total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
As result our original scaled variables have been combined to form 9 linear combinations, 9 principal components. Principal component one is perhaps the most important which explains most of the variability in the data followed by the principal component 2 which explains the second highest amount. Within each principal component we have a value for each variable, and they form the so called Eigenvectors. These are the coefficients which our original variables are multiplied by to calculate the principal component score for any particular observation.
summary(myPr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion 0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
## PC8 PC9
## Standard deviation 0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion 0.99257 1.00000
For each of the principal components we have three values: a standard deviation, a proportion of variance and the cumulative proportion.
The standard deviation is the standard deviation of the data along a single principal components, so, a measure of the variability across that principal component. The proportion of variance is the proportion of all the variability in the original data, explained away by the principal component, i.e., in our case the 46% variance in the data is explained by PC1, 17% by PC2, 13% by PC3… and so on. The coumulative proportion is the same idea but cumulative, so, if you take PC1 you will be able to explain 46%, if you take PC1 and PC2 63%, PC1 PC2 and PC3 76%… and if we take all nine this explains 100% of the variability in our data.
Let’s represent this idea in a scree plot:
plot(myPr, type = "l")
This plot shows the variances across its principal components. Of course, the variance is the square of the standard deviation and so that you can see that when it has been taken into account the first and second principal components big part of the variability in our data has been accounted for.
To actually interpret our principal components we can use a function biplot:
biplot(myPr, scale = 0)
In this plot, each number is an observation from our data frame, and each observation is plotted with its values for PC1 and PC2. The red arrows represent the eigenvectors for each variable in our data frame. This plot helps us interpret the importance or the relationship between the different variables and the principal components.
Another statistically way of interpreting the relationship between each variable in our original dataset and the principal components is by looking at the correlations between our original variables and the PCs as a way of helping us understand how as one PC changes whether our variables go up or down and by how much.
Lets extract the PC1 and PC2 because they are the most important and attach them to our data frame.
CD_Pr <- cbind(CDscaled, myPr$x)
head(CD_Pr)
## child_mort exports health imports income inflation
## 1 1.2876597 -1.13486665 0.27825140 -0.08220771 -0.80582187 0.1568645
## 2 -0.5373329 -0.47822017 -0.09672528 0.07062429 -0.37424335 -0.3114109
## 3 -0.2720146 -0.09882442 -0.96317624 -0.63983800 -0.22018227 0.7869076
## 4 2.0017872 0.77305618 -1.44372888 -0.16481961 -0.58328920 1.3828944
## 5 -0.6935483 0.16018613 -0.28603389 0.49607554 0.10142673 -0.5999442
## 6 -0.5894047 -0.81019144 0.46756001 -1.27594958 0.08067776 1.2409928
## life_expec total_fer gdpp PC1 PC2 PC3
## 1 -1.6142372 1.89717646 -0.67714308 -2.90428986 -0.09533386 0.7159652
## 2 0.6459238 -0.85739418 -0.48416709 0.42862224 0.58639208 0.3324855
## 3 0.6684130 -0.03828924 -0.46398018 -0.28436983 0.45380957 -1.2178421
## 4 -1.1756985 2.12176975 -0.51472026 -2.92362976 -1.69047094 -1.5204709
## 5 0.7021467 -0.54032130 -0.04169175 1.03047668 -0.13624894 0.2250441
## 6 0.5897009 -0.38178486 -0.14535428 0.02234007 1.77385167 -0.8673884
## PC4 PC5 PC6 PC7 PC8 PC9
## 1 -1.00224038 0.1578353 0.253834026 -0.38185183 -0.41383141 -0.01410602
## 2 1.15757715 -0.1741535 -0.084325021 -0.24817249 0.22037967 0.17279609
## 3 0.86551146 -0.1560055 0.400491017 0.08695208 0.18360988 0.08378519
## 4 -0.83710739 0.2723897 0.546352696 0.43951279 0.35493006 -0.09106518
## 5 0.84452276 0.1924282 0.206298298 -0.24125232 0.02361042 0.09398709
## 6 0.03685602 -0.9781148 0.003585307 0.15038009 -0.12557185 0.12570099
cor(CD_Pr[,1:9], CD_Pr[,10:11])
## PC1 PC2
## child_mort -0.8531479 -0.23985537
## exports 0.5773418 -0.76248213
## health 0.3067485 0.30228369
## imports 0.3283958 -0.83542357
## income 0.8102823 -0.02802342
## inflation -0.3928425 0.01045115
## life_expec 0.8660003 0.27694068
## total_fer -0.8210359 -0.19303574
## gdpp 0.7984948 0.05722985
These above are Pearson correlation coefficients, so, for example, the correlation between the variable ‘income’ and PC1 is large and positive which means that generally as the value for an observation increases along PC1 so does it the variable ‘income’. Alternatively, for variable ‘child_mort’ we have moderate negative correlation between them, so that, as PC1 increases, generally variable ‘child_mort’ decreases. And, for example, although we have negative correlation at ‘income’ and PC2 this is quite small and so that there is no significant relationship between them.
This is actually how the arrows are distributed in the biplot graph: when the correlations with PC2 are negative the arrows are going down some degree, for example, the largest negative is with variable ‘import’; when the correlations are positive, the arrows are pointing up. With PC1 it happens the same but with right for positive and left for negative correlations. When the correlations are not significant lines are almost horizontal for the case of PC2 and almost vertical for the case of PC1.
Since we have already the coordenates of PC1 and PC2 for each point of data let’s finally plot these using ggplot:
ggplot(CD_Pr, aes(PC1, PC2)) +
geom_point(shape = 21, col = "black")
So now, each data point in our original data set is plotted based on its coordinates rewarded by PC1 and PC2. So we can see, that starting with nine original variables and just condensing them down into two linear combinations of those variables has been able to separate our data of countries. Next step after performing PCA is to apply a cluster analysis to our data projected onto the principal components to try automatically sort and categorize each data point into a cluster.
*** For all the following pre-clustering analysis the information about PCs obtained during the dimension reduction analysis will not be used, since we have to remember the fact that using only PC1 and PC2 it represent only the 63% of the variability of our data, and for this pre-analyis we need the 100%. Then PC1 and PC2 will be mentioned later on again during the projection of the data process when plotting the clusters.
In addition, before proceeding with any clustering method, it is worth to assess the general clustering tendency of the data. For this purpose, Hopkins statistic and visual assessment are used.
get_clust_tendency(CDscaled, 2, graph=TRUE, gradient=list(low="blue", high="white"), seed=1234)
## $hopkins_stat
## [1] 0.8325281
##
## $plot
A value close to 1 tends to indicate the data is highly clustered, random data will tend to result in values around 0.5, and uniformly distributed data will tend to result in values close to 0. In the case of our dataset the value of the Hopkins statistic is 0.83 which claims that the dataset is significantly clusterable. Also the darker rectangular blocks shown in the graph are another proof of the clusterability of the dataset.
Silhoutte statistic:
f1 <- fviz_nbclust(CDscaled, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(CDscaled, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
The silhouette statistic shows that for K-means the proper number of clusters is 5, but since too many clusters could lead to overlapping, we will also do K-means analysis for 4 clusters because the average silhouette width is quite similar. For PAM analysis the proper number of clusters according with the the silhouette statistic is 2, but diving the countries in only two groups would be too general information and would not really help to decide which countries need help. Therefore, as 3 clusters has almost the same average silhouette width than 2, it will finally be considered a PAM analysis of 3 clusters.
Total within-clusters sum of square:
f1 <- fviz_nbclust(CDscaled, FUNcluster = kmeans, method = "wss") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(CDscaled, FUNcluster = cluster::pam, method = "wss") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
With the within-clusters sum of squares statistics it is possible to check that the elbow point is around the number of clusters mentioned before: 4~5 for K-means and 3 for PAM.
K-means for 5 clusters:
km5 <- eclust(CDscaled, k=5, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c5 <- fviz_cluster(km5, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("K-means with 5 clusters")
s5 <- fviz_silhouette(km5)
## cluster size ave.sil.width
## 1 1 22 0.46
## 2 2 40 0.19
## 3 3 45 0.20
## 4 4 52 0.22
## 5 5 8 -0.03
grid.arrange(c5, s5, ncol=2)
Not a very good result since we have overlappings between all the clusters and also some negatives values of the Silhouette width for all the clusters but 1.
K-means for 4 clusters:
km4 <- eclust(CDscaled, k=4 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c4 <- fviz_cluster(km4, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("K-means with 4 clusters")
s4 <- fviz_silhouette(km4)
## cluster size ave.sil.width
## 1 1 30 0.30
## 2 2 87 0.33
## 3 3 47 0.25
## 4 4 3 0.39
grid.arrange(c4, s4, ncol=2)
table(km4$cluster)
##
## 1 2 3 4
## 30 87 47 3
This result looks much better since we got a higher average silhouette width with less negatives values, and only clusters 2 and 3 overlap a little.
PAM for 3 clusters
pam3 <- eclust(CDscaled, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp3 <- fviz_cluster(pam3, data=CDscaled, elipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
grid.arrange(cp3, sp3, ncol=2)
The clustering with PAM and k=3 gives a better result than K-means for k=3, but lightly worst result than K-means for k=4 since PAM is influenced by the 3 points from cluster 4 in K-means wich are behaving as outliers. PAM with 4 clusters has been tested too but it did not improve the result, that is why it has not been included here.
Before commenting the clustering, it is very important to highlight two numbers located in every clustering plot that has been created: 17.2% and 46% which are found in the axis for Dim2 and Dim1 respectively. It is very important to notice that the algorithms used to represent the clusters have obtained the same values of “Proportion of Variance” that we found during the dimension reduction analysis (PC1:0.4595 and PC2:0.1718) so we can clearly say that these two dimensions are the Principal Components 1 and 2 that we obtain before, and the data is projected on them to be understandable when plotting it, that is why the points are equally distributed than in our ggplot in the dimension reduction analysis.
Let’s come back the the clustering analysis:
K-means with 4 clusters is the best option according with the silhouette statistics so let’s see what information can we obtain from this clusters:
CD_cl <- as.data.frame(cbind(CD, km4$cluster))
colnames(CD_cl) <- c(colnames(CD),"cluster")
table(km4$cluster)
##
## 1 2 3 4
## 30 87 47 3
Cluster 1
summary(CD_cl[CD_cl$cluster==1,c("gdpp","income","child_mort")]) %>% kable()
| gdpp | income | child_mort | |
|---|---|---|---|
| Min. :22100 | Min. : 27200 | Min. : 2.600 | |
| 1st Qu.:34025 | 1st Qu.: 34375 | 1st Qu.: 3.825 | |
| Median :41850 | Median : 40550 | Median : 4.200 | |
| Mean :43333 | Mean : 45250 | Mean : 4.953 | |
| 3rd Qu.:48625 | 3rd Qu.: 45650 | 3rd Qu.: 5.100 | |
| Max. :87800 | Max. :125000 | Max. :10.800 |
cat(CD_cl[CD_cl$cluster==1,c("country")], sep = ", ")
## Australia, Austria, Belgium, Brunei, Canada, Cyprus, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Japan, Kuwait, Netherlands, New Zealand, Norway, Portugal, Qatar, Slovenia, South Korea, Spain, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==1,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)
hcmap(
map = "custom/world-highres3", # high resolution world map
data = dat, # name of dataset
joinBy = "iso-a3",
value = "cluster",
showInLegend = FALSE, # hide legend
nullColor = "#DADADA",
download_map_data = TRUE
) %>%
hc_mapNavigation(enabled = FALSE) %>%
hc_legend("none") %>%
hc_title(text = "World map") # title
Cluster 2
summary(CD_cl[CD_cl$cluster==2,c("gdpp","income","child_mort")]) %>% kable()
| gdpp | income | child_mort | |
|---|---|---|---|
| Min. : 592 | Min. : 1780 | Min. : 3.40 | |
| 1st Qu.: 2970 | 1st Qu.: 6705 | 1st Qu.:11.00 | |
| Median : 5020 | Median :10500 | Median :18.10 | |
| Mean : 6919 | Mean :12969 | Mean :21.39 | |
| 3rd Qu.:10500 | 3rd Qu.:17600 | 3rd Qu.:27.70 | |
| Max. :28000 | Max. :45400 | Max. :64.40 |
cat(CD_cl[CD_cl$cluster==2,c("country")], sep = ", ")
## Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belize, Bhutan, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Cambodia, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Czech Republic, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Fiji, Georgia, Grenada, Guatemala, Guyana, Hungary, India, Indonesia, Iran, Jamaica, Jordan, Kazakhstan, Kyrgyz Republic, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Micronesia, Fed. Sts., Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Oman, Panama, Paraguay, Peru, Philippines, Poland, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Slovak Republic, Solomon Islands, Sri Lanka, St. Vincent and the Grenadines, Suriname, Tajikistan, Thailand, Tonga, Tunisia, Turkey, Turkmenistan, Ukraine, Uruguay, Uzbekistan, Vanuatu, Venezuela, Vietnam
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==2,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)
hcmap(
map = "custom/world-highres3", # high resolution world map
data = dat, # name of dataset
joinBy = "iso-a3",
value = "cluster",
showInLegend = FALSE, # hide legend
nullColor = "#DADADA",
download_map_data = TRUE
) %>%
hc_mapNavigation(enabled = FALSE) %>%
hc_legend("none") %>%
hc_title(text = "World map") # title
Cluster 3
summary(CD_cl[CD_cl$cluster==3,c("gdpp","income","child_mort")]) %>% kable()
| gdpp | income | child_mort | |
|---|---|---|---|
| Min. : 231 | Min. : 609 | Min. : 36.90 | |
| 1st Qu.: 550 | 1st Qu.: 1390 | 1st Qu.: 63.80 | |
| Median : 897 | Median : 1870 | Median : 90.20 | |
| Mean : 1922 | Mean : 3942 | Mean : 92.96 | |
| 3rd Qu.: 1470 | 3rd Qu.: 3675 | 3rd Qu.:111.00 | |
| Max. :17100 | Max. :33700 | Max. :208.00 |
cat(CD_cl[CD_cl$cluster==3,c("country")], sep = ", ")
## Afghanistan, Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d'Ivoire, Equatorial Guinea, Eritrea, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Haiti, Iraq, Kenya, Kiribati, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, Pakistan, Rwanda, Senegal, Sierra Leone, South Africa, Sudan, Tanzania, Timor-Leste, Togo, Uganda, Yemen, Zambia
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$cluster==3,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)
hcmap(
map = "custom/world-highres3", # high resolution world map
data = dat, # name of dataset
joinBy = "iso-a3",
value = "cluster",
showInLegend = FALSE, # hide legend
nullColor = "#DADADA",
download_map_data = TRUE
) %>%
hc_mapNavigation(enabled = FALSE) %>%
hc_legend("none") %>%
hc_title(text = "World map") # title
Cluster 4
summary(CD_cl[CD_cl$cluster==4,c("gdpp","income","child_mort")]) %>% kable()
| gdpp | income | child_mort | |
|---|---|---|---|
| Min. : 21100 | Min. :28300 | Min. :2.800 | |
| 1st Qu.: 33850 | 1st Qu.:50200 | 1st Qu.:2.800 | |
| Median : 46600 | Median :72100 | Median :2.800 | |
| Mean : 57567 | Mean :64033 | Mean :4.133 | |
| 3rd Qu.: 75800 | 3rd Qu.:81900 | 3rd Qu.:4.800 | |
| Max. :105000 | Max. :91700 | Max. :6.800 |
cat(CD_cl[CD_cl$cluster==4,c("country")], sep = ", ")
## Luxembourg, Malta, Singapore
Looking the variables gdpp, income and child_mort we can get and idea about wich group of countries are more in need of help. So, according with the stats summarised above, the cluster 3 is the one with lowest income, gpp and highest child mortality and therefore the countries that make it up will be considered the ones with extreme need of help. If we take a look closer to the list and the map, this cluster of 47 countries mostly consist of African countries, also some from South Asia (Pakistan, Laos…), Middle East (Yemen, Iraq), Caribbean (Haiti) and Oceania (Kiribati).
The countries from cluster 2 have got better statistics than cluster 3 so they are in a better situation. These countries are mostly from South America, North Africa, Asia, and East Europe.
The countries from cluster 1 are the most developed countries in the world with no need of help at all, i.e., North and West Europe, U.S., Canada, Australia, New Zeland, Japan…
The cluster 4 can be consider as the ones with best conditions of all, and it is good they have been included in a different cluster cause otherwise they would have been interfering in the statistics as outliers.
It is good to use extra methods for clustering our data to check if our results are consistent. That is why a Hierarchical clustering analysis has been done.
To decide which method to use for the Hierarchical clustering let’s calculate the agglomerative coefficient for each method. This coefficient measures the amount of clustering structure found, values closer to 1 suggest strong clustering structure.
# multiple methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
# the coefficient
ac <- function(x) {
agnes(CDscaled, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.8752786 0.8430072 0.9130134 0.9519134
I will choose the Ward’s and Complete Linkage’s methods since they are the ones with closest values to 1.
Now to choose the proper number of clusters I will use the gap statistic which compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data.
gap_stat <- clusGap(CDscaled, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
It shows that the proper value of clusters will be 3.
3 clusters
# dissimilarity matrix
d <- dist(CDscaled, method = "euclidean")
hcw <- hclust(d, method = "ward.D2" )
# cut tree into 3 groups
sub_grpW <- cutree(hcw, k = 3)
# number of members in each cluster
table(sub_grpW)
## sub_grpW
## 1 2 3
## 27 106 34
# plots with borders
hcw$labels <- CD$country
plot(hcw, cex = 0.5, hang=-1)
rect.hclust(hcw, k = 3, border = 2:5)
# more visualization
fviz_cluster(list(data = CDscaled, cluster = sub_grpW))
This clustering does not look good since the three clusters overlap each other.
3 clusters
hcc <- hclust(d, method = "complete" )
# cut tree into 3 groups
sub_grpC <- cutree(hcc, k = 3)
# number of members in each cluster
table(sub_grpC)
## sub_grpC
## 1 2 3
## 55 109 3
# plots with borders
hcc$labels <- CD$country
plot(hcc, cex = 0.5, hang=-1)
rect.hclust(hcc, k = 3, border = 2:5)
# more visualization - way like with kmeans or pam
fviz_cluster(list(data = CDscaled, cluster = sub_grpC))
This clustering looks more similar to the one obtained with the K-means analysis. And it was also tested for the case of k=4 but the result did not improved at all, the fourth cluster turned out to be one single point of data, the 114 representing Nigeria with a very high child mortality and located on the top of the inflation list.
Comparing the silhouette statistics of both methods:
complete_statistics <- fpc::cluster.stats(dist(CDscaled), sub_grpW)
complete_statistics$avg.silwidth
## [1] 0.24563
complete_statistics <- fpc::cluster.stats(dist(CDscaled), sub_grpC)
complete_statistics$avg.silwidth
## [1] 0.2900525
The best method is the Complete Linkage with a higher value of the Average Silhouette Width. Let’s see now what the statistics of these 3 clusters say about the countries.
CD_cl <- as.data.frame(cbind(CD_cl, sub_grpC))
colnames(CD_cl) <- c(colnames(CD),"cluster","clusterC")
table(sub_grpC)
## sub_grpC
## 1 2 3
## 55 109 3
Cluster 1:
summary(CD_cl[CD_cl$clusterC==1,c("gdpp","income","child_mort")]) %>% kable()
| gdpp | income | child_mort | |
|---|---|---|---|
| Min. : 231 | Min. : 609 | Min. : 17.40 | |
| 1st Qu.: 569 | 1st Qu.: 1420 | 1st Qu.: 54.85 | |
| Median : 1000 | Median : 2180 | Median : 80.30 | |
| Mean : 1918 | Mean : 3812 | Mean : 82.23 | |
| 3rd Qu.: 2535 | 3rd Qu.: 4230 | 3rd Qu.:110.00 | |
| Max. :17100 | Max. :33700 | Max. :208.00 |
cat(CD_cl[CD_cl$clusterC==1,c("country")], sep = ", ")
## Afghanistan, Angola, Benin, Bhutan, Botswana, Burkina Faso, Burundi, Cambodia, Cameroon, Central African Republic, Chad, Comoros, Congo, Dem. Rep., Congo, Rep., Cote d'Ivoire, Equatorial Guinea, Fiji, Gambia, Ghana, Guinea, Guinea-Bissau, Guyana, Haiti, Iraq, Kenya, Kiribati, Kyrgyz Republic, Lao, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Micronesia, Fed. Sts., Mozambique, Namibia, Niger, Nigeria, Rwanda, Samoa, Senegal, Sierra Leone, Solomon Islands, South Africa, Sudan, Tajikistan, Tanzania, Timor-Leste, Togo, Tonga, Uganda, Vanuatu, Yemen, Zambia
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$clusterC==1,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)
hcmap(
map = "custom/world-highres3", # high resolution world map
data = dat, # name of dataset
joinBy = "iso-a3",
value = "cluster",
showInLegend = FALSE, # hide legend
nullColor = "#DADADA",
download_map_data = TRUE
) %>%
hc_mapNavigation(enabled = FALSE) %>%
hc_legend("none") %>%
hc_title(text = "World map") # title
Cluster 2:
summary(CD_cl[CD_cl$clusterC==2,c("gdpp","income","child_mort")]) %>% kable()
| gdpp | income | child_mort | |
|---|---|---|---|
| Min. : 482 | Min. : 1420 | Min. : 2.60 | |
| 1st Qu.: 4440 | 1st Qu.: 9720 | 1st Qu.: 5.60 | |
| Median : 9070 | Median : 16500 | Median :11.70 | |
| Mean :17310 | Mean : 22582 | Mean :17.03 | |
| 3rd Qu.:26900 | 3rd Qu.: 32300 | 3rd Qu.:20.30 | |
| Max. :87800 | Max. :125000 | Max. :92.10 |
cat(CD_cl[CD_cl$clusterC==2,c("country")], sep = ", ")
## Albania, Algeria, Antigua and Barbuda, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Bolivia, Bosnia and Herzegovina, Brazil, Brunei, Bulgaria, Canada, Cape Verde, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czech Republic, Denmark, Dominican Republic, Ecuador, Egypt, El Salvador, Eritrea, Estonia, Finland, France, Gabon, Georgia, Germany, Greece, Grenada, Guatemala, Hungary, Iceland, India, Indonesia, Iran, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kuwait, Latvia, Lebanon, Libya, Lithuania, Macedonia, FYR, Malaysia, Maldives, Mauritius, Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Netherlands, New Zealand, Norway, Oman, Pakistan, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Qatar, Romania, Russia, Saudi Arabia, Serbia, Seychelles, Slovak Republic, Slovenia, South Korea, Spain, Sri Lanka, St. Vincent and the Grenadines, Suriname, Sweden, Switzerland, Thailand, Tunisia, Turkey, Turkmenistan, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Venezuela, Vietnam
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
countries <- CD_cl[CD_cl$clusterC==2,c("country")]
dat$cluster <- ifelse(dat$`ISOname` %in% countries, 1, 0)
hcmap(
map = "custom/world-highres3", # high resolution world map
data = dat, # name of dataset
joinBy = "iso-a3",
value = "cluster",
showInLegend = FALSE, # hide legend
nullColor = "#DADADA",
download_map_data = TRUE
) %>%
hc_mapNavigation(enabled = FALSE) %>%
hc_legend("none") %>%
hc_title(text = "World map") # title
Cluster 3:
summary(CD_cl[CD_cl$clusterC==3,c("gdpp","income","child_mort")]) %>% kable()
| gdpp | income | child_mort | |
|---|---|---|---|
| Min. : 21100 | Min. :28300 | Min. :2.800 | |
| 1st Qu.: 33850 | 1st Qu.:50200 | 1st Qu.:2.800 | |
| Median : 46600 | Median :72100 | Median :2.800 | |
| Mean : 57567 | Mean :64033 | Mean :4.133 | |
| 3rd Qu.: 75800 | 3rd Qu.:81900 | 3rd Qu.:4.800 | |
| Max. :105000 | Max. :91700 | Max. :6.800 |
cat(CD_cl[CD_cl$clusterC==3,c("country")], sep = ", ")
## Luxembourg, Malta, Singapore
Looking the variables gdpp, income and child_mort it is possible to see the clear relation between these new clusters and the clusters obtained during the K-means analysis:
Cluster 1 in the Complete Linkage is almost the same than the cluster 3 for the K-means, they have a difference of 8 countries but the statistics are quite similar, so this is the cluster with the countrieas which are extreme need of help.
Then clusters 1 and 2 from K-means are now represented together by the cluster 2 and, as it is possible to see in the map, it consists in developed and high developed countries.
The cluster 3 is exactly the same three countries than in the cluster 4 in K-means case.
It can be concluded that the classification made by the K-means analysis with 4 clusters was consistent and the list of countries is ready to be submitted the HELP International organization.