Clustering the Countries by using Unsupervised Learning for HELP International Organization

Margareth Devina

01 April 2021

Introduction and Objective

This RMarkdown document is a learning exercise: we build partitions, or groups, of countries by applying the K-means clustering algorithm to socio-economic and health data.

We will use data from Kaggle: https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data?select=Country-data.csv. The dataset consists of socio-economic and health factors for various countries that can be used to gauge the overall development of those countries. We will also adopt the case, or objective, set out in the Kaggle dataset description.

The reason we want to categorize the overall development of these countries is to identify which of them need help from the HELP International organization.

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.

Currently, based on the dataset description, HELP International has raised around $10 million, and the CEO of the NGO needs to decide how to use this money strategically and effectively. The CEO therefore has to choose the countries that are in the direst need of aid. Hence, we will help the CEO categorize the countries using the socio-economic and health factors that determine a country’s overall development.

Library Used

library(tidyverse)
library(FactoMineR)
library(factoextra)

Read Data and Exploratory Data Analysis

We will first read the dataset and then take a look at each column’s data type.

country <- read.csv("data/HELP international/Country-data.csv")
glimpse(country)

## Rows: 167
## Columns: 10
## $ country    <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Antigua ...
## $ child_mort <dbl> 90.2, 16.6, 27.3, 119.0, 10.3, 14.5, 18.1, 4.8, 4.3, 39....
## $ exports    <dbl> 10.0, 28.0, 38.4, 62.3, 45.5, 18.9, 20.8, 19.8, 51.3, 54...
## $ health     <dbl> 7.58, 6.55, 4.17, 2.85, 6.03, 8.10, 4.40, 8.73, 11.00, 5...
## $ imports    <dbl> 44.9, 48.6, 31.4, 42.9, 58.9, 16.0, 45.3, 20.9, 47.8, 20...
## $ income     <int> 1610, 9930, 12900, 5900, 19100, 18700, 6700, 41400, 4320...
## $ inflation  <dbl> 9.440, 4.490, 16.100, 22.400, 1.440, 20.900, 7.770, 1.16...
## $ life_expec <dbl> 56.2, 76.3, 76.5, 60.1, 76.8, 75.8, 73.3, 82.0, 80.5, 69...
## $ total_fer  <dbl> 5.82, 1.65, 2.89, 6.16, 2.13, 2.37, 1.69, 1.93, 1.44, 1....
## $ gdpp       <int> 553, 4090, 4460, 3530, 12200, 10300, 3220, 51900, 46900,...

Below is the description of each column in the dataset:

country : Name of the country
child_mort : Death of children under 5 years of age per 1000 live births.
exports : Exports of goods and services per capita. Given as %age of the GDP per capita.
health : Total health spending per capita. Given as %age of GDP per capita.
imports : Imports of goods and services per capita. Given as %age of the GDP per capita.
income : Net income per person.
inflation : The measurement of the annual growth rate of the Total GDP.
life_expec : The average number of years a new born child would live if the current mortality patterns are to remain the same.
total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same.
gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.

Now, let’s also take a look at the first 6 rows of the dataset.

head(country)

##               country child_mort exports health imports income inflation
## 1         Afghanistan       90.2    10.0   7.58    44.9   1610      9.44
## 2             Albania       16.6    28.0   6.55    48.6   9930      4.49
## 3             Algeria       27.3    38.4   4.17    31.4  12900     16.10
## 4              Angola      119.0    62.3   2.85    42.9   5900     22.40
## 5 Antigua and Barbuda       10.3    45.5   6.03    58.9  19100      1.44
## 6           Argentina       14.5    18.9   8.10    16.0  18700     20.90
##   life_expec total_fer  gdpp
## 1       56.2      5.82   553
## 2       76.3      1.65  4090
## 3       76.5      2.89  4460
## 4       60.1      6.16  3530
## 5       76.8      2.13 12200
## 6       75.8      2.37 10300

We will also check the dimension and the uniqueness of the country column.

dim(country)

## [1] 167  10

length(unique(country$country))

## [1] 167

Then, we check whether the dataset has any NA values.

colSums(is.na(country))

##    country child_mort    exports     health    imports     income  inflation 
##          0          0          0          0          0          0          0 
## life_expec  total_fer       gdpp 
##          0          0          0

Data Pre-processing

Earlier, when we glimpsed the dataset and looked at its first 6 rows, we saw that the variables span very different ranges. Thus, we need to scale the values first.

Scaling is needed because K-means clustering works on the (Euclidean) distance between observations, so the variables must be on comparable scales; otherwise the variables with the largest values would dominate the distance. We scale using the z-score method, which changes only the location and scale of each variable and not the shape of its distribution.

We will first move the country column into the row names because only the numeric values should be scaled.

country <- country %>% 
   column_to_rownames("country")

country_z <- scale(country)
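To illustrate why scaling matters for a distance-based method, here is a minimal sketch on a made-up table. The object names (toy, toy_z, manual_z) and the values are hypothetical and are not taken from the dataset; the sketch only shows what scale() does and how it changes the distances.

```r
# Toy data (hypothetical values): one variable in the tens, one in the thousands
toy <- data.frame(
  child_mort = c(90, 17, 5),
  income     = c(1600, 9900, 41000),
  row.names  = c("A", "B", "C")
)

# Unscaled Euclidean distances are driven almost entirely by income
dist_raw <- dist(toy)
dist_raw

# z-score scaling: subtract the column mean, divide by the column sd
toy_z <- scale(toy)

# The same transformation computed by hand for one column
manual_z <- (toy$income - mean(toy$income)) / sd(toy$income)

# After scaling, both variables contribute to the distance
dist_z <- dist(toy_z)
dist_z

# Each scaled column has mean 0 and standard deviation 1
round(colMeans(toy_z), 10)
apply(toy_z, 2, sd)
```

The same two checks (colMeans and apply with sd) could be run on country_z to confirm that the scaling behaved as expected.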

Then, to identify the number of clusters (k) to use, we can apply the elbow method. For each candidate k, it computes the total within-cluster sum of squares (wss), that is, the sum of squared distances between each observation and its cluster’s centroid (the cluster’s center). The optimum k is the point beyond which adding more clusters no longer decreases the wss significantly.

fviz_nbclust(x = country_z, FUNcluster = kmeans, method = "wss")

As seen in the elbow plot above, when the number of clusters increases from 3 to 4, the total withinss no longer decreases significantly; thus, we can categorize the data into 3 clusters.
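The quantity on the vertical axis of an elbow plot can also be computed by hand, which may make the method easier to follow. The sketch below is an illustration only: it uses the built-in USArrests data so that it runs on its own, and the names x, k_values and wss as well as the choice of nstart = 10 are assumptions, not part of the analysis above.

```r
# Built-in dataset used only so the sketch is self-contained
x <- scale(USArrests)

set.seed(77)
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within sum of squares")

# How much the wss drops each time one more cluster is added
data.frame(k = k_values[-1], wss_drop = -diff(wss))
```

Replacing x with country_z should give a curve comparable to the one drawn by fviz_nbclust, although the exact values depend on the random starts.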

Then, we create a new object, country_cluster, to store the K-means clustering result and visualize it to see how the countries are distributed across those 3 clusters.

RNGkind(sample.kind = "Rounding")
set.seed(77)
country_cluster <- kmeans(x = country_z, centers = 3)

fviz_cluster(object = country_cluster, data = country_z)

Based on the cluster plot above and the table below, we can see that clusters 2 and 3 are fairly dense. Cluster 1, in contrast, is spread fairly widely, especially because of Malta, Luxembourg, and Singapore. Overall, the withinss values of the 3 clusters are quite similar.

data.frame(cluster = c(1:3),
           wss = country_cluster$withinss) %>% 
  arrange(wss)

##   cluster      wss
## 1       2 259.5575
## 2       3 269.6604
## 3       1 297.2279

If we check the ratio of between_ss to total_ss (BSS/totss), which reflects how far the cluster centroids lie from the overall center of the data, we get 44.7%. Judged by this ratio alone, the clustering with this number of clusters (k) is not that good, since a good clustering should have a BSS/totss ratio close to 100%.

Nevertheless, from a business-knowledge perspective, analyses of the overall development of the world’s countries usually split them into 2 or 3 categories, such as developed and developing countries, or higher, middle, and lower development countries (see page 31 of the IMF publication [https://www.imf.org/en/search#q=wp1131&sort=relevancy]). Thus, even though the BSS/totss ratio is quite low, we continue the clustering with k = 3.

(country_cluster$betweenss/country_cluster$totss)*100

## [1] 44.68234
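The ratio above rests on the fact that the total sum of squares splits into a within-cluster part and a between-cluster part. The following sketch checks this decomposition on the built-in USArrests data; the names x, fit and ratio are illustrative assumptions and the numbers it produces are unrelated to the country dataset.

```r
# Built-in dataset used only so the sketch is self-contained
x <- scale(USArrests)

set.seed(77)
fit <- kmeans(x, centers = 3)

# totss = tot.withinss + betweenss
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Share of the total variation that lies between the clusters, in percent
ratio <- fit$betweenss / fit$totss * 100
ratio

# Note: this ratio tends to grow as k grows (it reaches 100% when every
# observation is its own cluster), so it is best read together with the
# elbow plot and domain knowledge rather than on its own.
```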

Principal Component Analysis (PCA)

Through PCA, we can reduce the dimension of our data while still retaining most of its information, and we can also get a better understanding of the patterns in the dataset.

country_pca <- country_z %>% prcomp()
summary(country_pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion  0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
##                            PC8     PC9
## Standard deviation     0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion  0.99257 1.00000

In this study, we want to retain at least 90% of the information in the dataset. The cumulative proportion of variance first exceeds 90% at PC5, so we keep PC1 to PC5. Selecting these PCs reduces the number of dimensions by about 44% (from 9 to 5) while still retaining about 94.5% of the information in the dataset.
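The choice of five components can be reproduced from the standard deviations printed by summary(country_pca). The sketch below copies those printed values by hand; the variable names (sdev, prop_var, cum_var, n_keep, dim_reduction) are illustrative, and the 0.90 threshold is the one stated in this study.

```r
# Standard deviations copied from the summary(country_pca) output
sdev <- c(2.0336, 1.2435, 1.0818, 0.9974, 0.8128,
          0.47284, 0.3368, 0.29718, 0.25860)

# Proportion of variance explained by each component
prop_var <- sdev^2 / sum(sdev^2)

# Cumulative proportion of variance
cum_var <- cumsum(prop_var)

# Smallest number of components that retains at least 90%
n_keep <- which(cum_var >= 0.90)[1]
n_keep                      # 5

# Variance retained by those components
round(cum_var[n_keep], 4)

# Fraction of the original dimensions that is dropped
dim_reduction <- 1 - n_keep / length(sdev)
dim_reduction
```

With the fitted object at hand, country_pca$sdev could be used instead of the copied values.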

From this selection, we extract the PC1-PC5 values and convert them into a data frame. We then combine it with the previous dataset and drop the original variables (columns), since those are replaced by the PCs. The resulting data frame can be used for further analysis, such as supervised classification, or for other purposes.

country_pca_keep <- country_pca$x[,1:5] %>% 
  as.data.frame()

country_pca_new <- data.frame(country_z) %>%
  bind_cols(country_pca_keep)

country_pca_new <- country_pca_new[,c(10:14)]

head(country_pca_new)

##                             PC1         PC2        PC3         PC4        PC5
## Afghanistan         -2.90428986 -0.09533386  0.7159652 -1.00224038  0.1578353
## Albania              0.42862224  0.58639208  0.3324855  1.15757715 -0.1741535
## Algeria             -0.28436983  0.45380957 -1.2178421  0.86551146 -0.1560055
## Angola              -2.92362976 -1.69047094 -1.5204709 -0.83710739  0.2723897
## Antigua and Barbuda  1.03047668 -0.13624894  0.2250441  0.84452276  0.1924282
## Argentina            0.02234007  1.77385167 -0.8673884  0.03685602 -0.9781148

Combining Clustering and PCA

To understand the patterns in the dataset and the characteristics of each cluster, we can use a biplot to visualize PC1 and PC2 from the PCA above, since the characteristics of each cluster can be traced visually via the loading arrows.

biplot(country_pca)

Variable Factor Map

Based on the biplot above, we can draw the following information:

Cluster 1 countries, especially Malta, Luxembourg, and Singapore, are significantly influenced by the imports and exports variables.
Cluster 1 countries are also heavily influenced by the life_expec, gdpp, and income variables.
Cluster 3 countries are heavily influenced by the total_fer and child_mort variables.
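The arrows in a biplot are drawn from the PCA loadings, so the same reading can be cross-checked numerically. This is a minimal sketch on the built-in USArrests data; the names x, pca and loadings are illustrative assumptions, not objects from the analysis.

```r
# Built-in dataset used only so the sketch is self-contained
x <- scale(USArrests)
pca <- prcomp(x)

# Loadings: one row per variable, one column per principal component
loadings <- pca$rotation[, 1:2]
round(loadings, 3)

# Variables ordered by the size of their loading on PC1
sort(abs(loadings[, "PC1"]), decreasing = TRUE)

# Scores: where each observation falls on PC1 and PC2
head(pca$x[, 1:2])
```

For the country data, the equivalent objects would be country_pca$rotation and country_pca$x. Variables whose loadings point in the same direction as a group of countries are the ones that characterize that group.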

Since the biplot can only pinpoint a few of the variables that characterize a cluster, we can trace each cluster’s characteristics in more detail by plotting them in a bar chart.

country_new <- data.frame(country_z) %>% 
   rownames_to_column("country") %>% 
   mutate(cluster = as.factor(country_cluster$cluster))

profil <- country_new %>%
   select(-country) %>% 
   group_by(cluster) %>% 
   summarise_all(mean)

profil %>% 
   pivot_longer(cols = -cluster, names_to = "type", values_to = "value") %>% 
   ggplot(aes(x = cluster, y = value)) +
   geom_col(aes(fill = cluster)) +
   facet_wrap(~type)

By plotting the variables one by one in a bar chart, we can see which characteristics are dominant in each cluster. From the bar chart above, we get these insights:

As seen in the biplot previously, the bar chart also shows that cluster 1 is dominated by high imports, exports, gdpp, income, and life_expec values. Moreover, this cluster also has a high health value. With these variables at high levels, the countries in this cluster are most likely able to tackle their socio-economic and health problems. Further, their residents are more likely to cope should they be hit by a natural disaster or calamity.

This positive picture is also supported by their low child_mort, inflation, and total_fer values. These factors also indicate that the survivability of those countries’ residents is very good.
On the other hand, cluster 3’s characteristics are the exact opposite of cluster 1’s, and the situation is dire, especially for the child_mort, health, income, inflation, life_expec, and total_fer variables. Thus, the residents of cluster 3 countries are most probably struggling to fulfill their basic daily needs, and things will get worse should their countries be hit by a natural disaster or calamity.
Cluster 2 lies between clusters 1 and 3. Countries in cluster 2 combine characteristics of both clusters 1 and 3, except for the health variable. This may indicate that countries in cluster 2 manage their finances better and can therefore spend more on health.
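Because the bar chart is built from z-scores, it can also help to look at the cluster means in the original units. The sketch below shows one way to do this with base R on the built-in USArrests data; the names raw, fit and profile_raw are illustrative assumptions.

```r
# Built-in dataset used only so the sketch is self-contained
raw <- USArrests

set.seed(77)
fit <- kmeans(scale(raw), centers = 3)

# Mean of each original (unscaled) variable per cluster
profile_raw <- aggregate(raw, by = list(cluster = fit$cluster), FUN = mean)
profile_raw

# Number of observations in each cluster
table(fit$cluster)
```

Under the assumption that country still holds the unscaled values with the country names as row names, the analogous call would be aggregate(country, by = list(cluster = country_cluster$cluster), FUN = mean).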

Conclusion

Based on all the insights above, HELP International should focus its fund allocation on the countries in cluster 3 because they are in the greatest need of help.

The following countries are included in cluster 3:

country_new %>%
  filter(cluster == 3) %>% 
  select(country)

##                     country
## 1               Afghanistan
## 2                    Angola
## 3                     Benin
## 4                  Botswana
## 5              Burkina Faso
## 6                   Burundi
## 7                  Cameroon
## 8  Central African Republic
## 9                      Chad
## 10                  Comoros
## 11         Congo, Dem. Rep.
## 12              Congo, Rep.
## 13            Cote d'Ivoire
## 14        Equatorial Guinea
## 15                  Eritrea
## 16                    Gabon
## 17                   Gambia
## 18                    Ghana
## 19                   Guinea
## 20            Guinea-Bissau
## 21                    Haiti
## 22                     Iraq
## 23                    Kenya
## 24                 Kiribati
## 25                      Lao
## 26                  Lesotho
## 27                  Liberia
## 28               Madagascar
## 29                   Malawi
## 30                     Mali
## 31               Mauritania
## 32               Mozambique
## 33                  Namibia
## 34                    Niger
## 35                  Nigeria
## 36                 Pakistan
## 37                   Rwanda
## 38                  Senegal
## 39             Sierra Leone
## 40             South Africa
## 41                    Sudan
## 42                 Tanzania
## 43              Timor-Leste
## 44                     Togo
## 45                   Uganda
## 46                    Yemen
## 47                   Zambia