Introduction

Background

The objective of this work is to categorise countries using the socio-economic and health factors that determine their overall development.

HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Hence, your job as a data scientist is to categorise the countries using socio-economic and health factors that determine their overall development, and then to suggest the countries the CEO should focus on the most.

Dataset

The dataset lists 167 countries together with the socio-economic and health factors that determine each country's overall development. The main dataset contains 167 rows and 10 columns.

Column descriptions:
1. country : Name of the country; 167 countries are listed
2. child_mort : (Child mortality) Deaths of children under 5 years of age per 1000 live births
3. exports : Exports of goods and services per capita, given as a percentage of the GDP per capita
4. health : Total health spending per capita, given as a percentage of the GDP per capita
5. imports : Imports of goods and services per capita, given as a percentage of the GDP per capita
6. income : Net income per person
7. inflation : The measurement of the annual growth rate of the total GDP
8. life_expec : (Life expectancy) The average number of years a newborn child would live if the current mortality patterns remain the same
9. total_fer : (Total fertility) The number of children that would be born to each woman if the current age-fertility rates remain the same
10. gdpp : The GDP per capita, calculated as the total GDP divided by the total population

The dataset can be obtained from Kaggle

Importing the necessary tools

library(tidyverse)
library(gridExtra)
library(factoextra)
library(FactoMineR)

Importing the Dataset

data <- read.csv("Country-data.csv")
data

Let’s look at the data types.

glimpse(data)
## Rows: 167
## Columns: 10
## $ country    <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Antigua and~
## $ child_mort <dbl> 90.2, 16.6, 27.3, 119.0, 10.3, 14.5, 18.1, 4.8, 4.3, 39.2, ~
## $ exports    <dbl> 10.0, 28.0, 38.4, 62.3, 45.5, 18.9, 20.8, 19.8, 51.3, 54.3,~
## $ health     <dbl> 7.58, 6.55, 4.17, 2.85, 6.03, 8.10, 4.40, 8.73, 11.00, 5.88~
## $ imports    <dbl> 44.9, 48.6, 31.4, 42.9, 58.9, 16.0, 45.3, 20.9, 47.8, 20.7,~
## $ income     <int> 1610, 9930, 12900, 5900, 19100, 18700, 6700, 41400, 43200, ~
## $ inflation  <dbl> 9.440, 4.490, 16.100, 22.400, 1.440, 20.900, 7.770, 1.160, ~
## $ life_expec <dbl> 56.2, 76.3, 76.5, 60.1, 76.8, 75.8, 73.3, 82.0, 80.5, 69.1,~
## $ total_fer  <dbl> 5.82, 1.65, 2.89, 6.16, 2.13, 2.37, 1.69, 1.93, 1.44, 1.92,~
## $ gdpp       <int> 553, 4090, 4460, 3530, 12200, 10300, 3220, 51900, 46900, 58~

The data types are already appropriate for each column.

Exploratory Data Analysis

Data Cleaning

First, we check for missing values, or any values that indicate a missing value.

colSums(is.na(data))
##    country child_mort    exports     health    imports     income  inflation 
##          0          0          0          0          0          0          0 
## life_expec  total_fer       gdpp 
##          0          0          0
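Note that is.na() only catches values already stored as NA. As a minimal sketch, assuming the data source might also use placeholder codes for missing entries, the check can be extended as follows. The toy data frame and the list of placeholder codes are made up for illustration and are not taken from the country dataset.

```r
# Toy data frame standing in for the real data (values are made up).
toy <- data.frame(
  country = c("A", "B", "", "D"),
  income  = c(1200, -999, 5400, NA),
  stringsAsFactors = FALSE
)

# Hypothetical placeholder codes; adjust to whatever the data source uses.
sentinels <- c("", "NA", "N/A", "-", "?", "-999")

# Count, per column, the cells that are NA or match a placeholder code.
count_missing_like <- function(df, codes = sentinels) {
  sapply(df, function(col) {
    sum(is.na(col) | trimws(as.character(col)) %in% codes)
  })
}

missing.like <- count_missing_like(toy)
print(missing.like)
```

A numeric code such as -999 should only be treated as missing when the documentation of the data says so, since some variables (inflation, for example) can legitimately be negative.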

It seems that there are no missing values. Now, let's look at the summary of the dataset.

Data Scaling

summary(data)
##    country            child_mort        exports            health      
##  Length:167         Min.   :  2.60   Min.   :  0.109   Min.   : 1.810  
##  Class :character   1st Qu.:  8.25   1st Qu.: 23.800   1st Qu.: 4.920  
##  Mode  :character   Median : 19.30   Median : 35.000   Median : 6.320  
##                     Mean   : 38.27   Mean   : 41.109   Mean   : 6.816  
##                     3rd Qu.: 62.10   3rd Qu.: 51.350   3rd Qu.: 8.600  
##                     Max.   :208.00   Max.   :200.000   Max.   :17.900  
##     imports             income         inflation         life_expec   
##  Min.   :  0.0659   Min.   :   609   Min.   : -4.210   Min.   :32.10  
##  1st Qu.: 30.2000   1st Qu.:  3355   1st Qu.:  1.810   1st Qu.:65.30  
##  Median : 43.3000   Median :  9960   Median :  5.390   Median :73.10  
##  Mean   : 46.8902   Mean   : 17145   Mean   :  7.782   Mean   :70.56  
##  3rd Qu.: 58.7500   3rd Qu.: 22800   3rd Qu.: 10.750   3rd Qu.:76.80  
##  Max.   :174.0000   Max.   :125000   Max.   :104.000   Max.   :82.80  
##    total_fer          gdpp       
##  Min.   :1.150   Min.   :   231  
##  1st Qu.:1.795   1st Qu.:  1330  
##  Median :2.410   Median :  4660  
##  Mean   :2.948   Mean   : 12964  
##  3rd Qu.:3.880   3rd Qu.: 14050  
##  Max.   :7.490   Max.   :105000

Insights:
1. There are 167 unique countries.
2. Most of the variables have a very wide range, which means that some countries are well developed while others are not.
3. Some variables, such as income, are on a very different scale from the others. For dimensionality reduction with PCA later, we must scale the data first. Standardization is important because the larger a variable's range, the larger its variance and covariance values, which might bias our model.
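To illustrate the third point, here is a small sketch on synthetic data (not the country dataset). The two made-up variables only loosely echo the scales of income and health, and are generated independently of each other.

```r
set.seed(1)
# Synthetic stand-ins: one large-range variable and one small-range variable.
toy <- data.frame(
  income = rnorm(100, mean = 17000, sd = 19000),
  health = rnorm(100, mean = 6.8, sd = 2.7)
)

# Without scaling, PC1 is almost entirely the large-variance variable.
pca.raw <- prcomp(toy, scale. = FALSE)

# With standardization, both variables contribute on an equal footing.
pca.std <- prcomp(scale(toy))

# Share of total variance carried by each component.
share.raw <- pca.raw$sdev^2 / sum(pca.raw$sdev^2)
share.std <- pca.std$sdev^2 / sum(pca.std$sdev^2)

print(round(share.raw, 4))
print(round(share.std, 4))
print(round(pca.raw$rotation, 4))
```

On the raw data the first component simply reproduces income, so the PCA would tell us nothing about health. After scaling, the variance is split much more evenly.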

As we now know, our data contains variables with different ranges of values. We will scale the data so that it performs well in our modelling and to prevent any bias.

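data.scaled <- data %>%
  # standardize every numeric column (columns 2 to 10) to mean 0 and sd 1;
  # across() replaces the deprecated mutate_at()/funs() pair (dplyr >= 1.0.0)
  mutate(across(2:10, ~ c(scale(.x))))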

summary(data.scaled)
##    country            child_mort         exports            health       
##  Length:167         Min.   :-0.8845   Min.   :-1.4957   Min.   :-1.8223  
##  Class :character   1st Qu.:-0.7444   1st Qu.:-0.6314   1st Qu.:-0.6901  
##  Mode  :character   Median :-0.4704   Median :-0.2229   Median :-0.1805  
##                     Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##                     3rd Qu.: 0.5909   3rd Qu.: 0.3736   3rd Qu.: 0.6496  
##                     Max.   : 4.2086   Max.   : 5.7964   Max.   : 4.0353  
##     imports            income          inflation         life_expec     
##  Min.   :-1.9341   Min.   :-0.8577   Min.   :-1.1344   Min.   :-4.3242  
##  1st Qu.:-0.6894   1st Qu.:-0.7153   1st Qu.:-0.5649   1st Qu.:-0.5910  
##  Median :-0.1483   Median :-0.3727   Median :-0.2263   Median : 0.2861  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4899   3rd Qu.: 0.2934   3rd Qu.: 0.2808   3rd Qu.: 0.7021  
##  Max.   : 5.2504   Max.   : 5.5947   Max.   : 9.1023   Max.   : 1.3768  
##    total_fer            gdpp         
##  Min.   :-1.1877   Min.   :-0.69471  
##  1st Qu.:-0.7616   1st Qu.:-0.63475  
##  Median :-0.3554   Median :-0.45307  
##  Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.6157   3rd Qu.: 0.05924  
##  Max.   : 3.0003   Max.   : 5.02140

Now all variables are on the same scale. It also appears that some countries are outliers on variables such as life_expec, inflation, and gdpp. We will further assess whether to remove or keep these outliers.
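One simple way to list such candidates is to flag rows whose standardized value exceeds a cut-off in absolute terms. The sketch below uses made-up numbers and an assumed cut-off of 3 standard deviations. It is a common rule of thumb, not the criterion used later in this analysis.

```r
# Made-up values for illustration; not taken from the country dataset.
toy <- data.frame(
  country   = paste0("country_", 1:20),
  inflation = c(rep(2:6, length.out = 19), 104),
  gdpp      = seq(500, 2400, by = 100)
)

# Standardize the numeric columns (mean 0, standard deviation 1).
z <- scale(toy[, c("inflation", "gdpp")])

# Assumed cut-off: more than 3 standard deviations from the mean.
threshold <- 3
flags <- abs(z) > threshold

# Countries flagged on at least one variable.
outlier.rows <- toy$country[rowSums(flags) > 0]
print(outlier.rows)
```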

Next, we will set the row names from the country variable and drop that column.

rownames(data.scaled) <- data.scaled[,"country"]
data.scaled <- data.scaled %>% select(-country)

Data Preprocessing

Dimensionality Reduction with Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
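The definition above can be checked numerically. The following sketch, on synthetic data, computes the components from the eigen-decomposition of the covariance matrix and compares them with prcomp. The variable names x1, x2 and x3 are made up for the example.

```r
set.seed(42)
# Synthetic data in which x1 and x2 are correlated (made up for illustration).
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)
x3 <- rnorm(200)
X  <- scale(cbind(x1, x2, x3))

# Eigen-decomposition of the covariance matrix of the standardized data.
# The eigenvectors are the orthogonal directions; the eigenvalues are the
# variances along them, returned in decreasing order.
eig <- eigen(cov(X))

# Project the observations onto the eigenvectors to get the component scores.
scores <- X %*% eig$vectors

# The same decomposition as computed by prcomp (scores may differ in sign).
pca <- prcomp(X)

print(round(sqrt(eig$values), 4))
print(round(pca$sdev, 4))
print(round(cov(scores), 4))  # off-diagonal entries are numerically zero
```

The covariance matrix of the scores is diagonal, which is what "linearly uncorrelated" means in practice, and the first diagonal entry is the largest.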

PCA Measurement

PCA can be performed with the prcomp function from the stats package. Because the data has already been standardized, prcomp is called with its default settings.

pca.data <- prcomp(data.scaled)
pca.data
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
## 
## Rotation (n x k) = (9 x 9):
##                   PC1          PC2         PC3          PC4         PC5
## child_mort -0.4195194 -0.192883937  0.02954353 -0.370653262  0.16896968
## exports     0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health      0.1508378  0.243086779  0.59663237 -0.461897497 -0.51800037
## imports     0.1614824 -0.671820644  0.29992674  0.071907461 -0.25537642
## income      0.3984411 -0.022535530 -0.30154750 -0.392159039  0.24714960
## inflation  -0.1931729  0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec  0.4258394  0.222706743 -0.11391854  0.203797235 -0.10821980
## total_fer  -0.4037290 -0.155233106 -0.01954925 -0.378303645  0.13526221
## gdpp        0.3926448  0.046022396 -0.12297749 -0.531994575  0.18016662
##                     PC6         PC7         PC8         PC9
## child_mort -0.200628153  0.07948854  0.68274306  0.32754180
## exports     0.059332832  0.70730269  0.01419742 -0.12308207
## health     -0.007276456  0.24983051 -0.07249683  0.11308797
## imports     0.030031537 -0.59218953  0.02894642  0.09903717
## income     -0.160346990 -0.09556237 -0.35262369  0.61298247
## inflation  -0.066285372 -0.10463252  0.01153775 -0.02523614
## life_expec  0.601126516 -0.01848639  0.50466425  0.29403981
## total_fer   0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp       -0.016778761 -0.24299776  0.24969636 -0.62564572

Optimal Number of Principal Components

fviz_eig(pca.data, addlabels = TRUE, main = "Variance explained by each dimension")

summary(pca.data)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion  0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
##                            PC8     PC9
## Standard deviation     0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion  0.99257 1.00000

If we tolerate no more than 20% information loss, that is, we want to retain at least 80% of the information, we can use 4 principal components (PCs), namely dim 1 to dim 4. In this case we retain about 87% of the information.
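The 80% rule can also be applied programmatically. The sketch below starts from the standard deviations printed in the prcomp output above. The variable names and the threshold value are our own choices for the example.

```r
# Standard deviations of the nine components, copied from the prcomp output.
sdev <- c(2.0336314, 1.2435217, 1.0818425, 0.9973889, 0.8127847,
          0.4728437, 0.3368067, 0.2971790, 0.2586020)

# Proportion of variance per component and its running total.
prop.var <- sdev^2 / sum(sdev^2)
cum.var  <- cumsum(prop.var)

# Smallest number of components whose cumulative share reaches the threshold.
threshold <- 0.80
n.pc <- which(cum.var >= threshold)[1]

print(round(cum.var, 4))
print(n.pc)
```

With these values the threshold is first reached at the fourth component, which matches the 1:4 column selection used in the code.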

country.pca.selected <- data.frame(pca.data$x[,1:4])
head(country.pca.selected)

PCA Implementation

Once the number of PCs is selected, the retained components can be combined with the initial data and used for further analysis.

country.pca <- data %>% 
  select_if(negate(is.numeric)) %>% 
  cbind(country.pca.selected) %>% 
  select(-country)

country.pca %>% head()

K-Means Clustering

Clustering with PCA

Optimal Number of Clusters

Before we do cluster analysis, we first need to determine the optimal number of clusters. In clustering, we seek to minimize the total within-cluster sum of squares (meaning that the distance between observations in the same cluster is minimal). To find the optimal number of clusters, we can use three methods: the elbow method, the silhouette method, and the gap statistic. We will decide the number of clusters by majority vote.

Elbow Method

Choosing the number of clusters with the elbow method is somewhat subjective. The rule of thumb is to choose the number of clusters at the “bend of the elbow”, where the total within-cluster sum of squares starts to level off as the number of clusters increases.

fviz_nbclust(country.pca, kmeans, method = "wss") + labs(subtitle = "Elbow method with PCA")

Arguably, the optimal k is 3, since there is no significant decline in the total within-cluster sum of squares for higher numbers of clusters. However, the elbow is vague, so this method alone may not be enough and we need more methods to determine the best k for our model.
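The quantity behind the elbow plot can also be computed by hand. The sketch below uses synthetic two-dimensional data with three well-separated groups rather than the country data, and nstart = 25 is an assumed setting to make kmeans less sensitive to its random start.

```r
set.seed(100)
# Synthetic 2-D data with three well-separated groups (made up).
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Reduction in WSS gained by each additional cluster.
wss.drop <- -diff(wss)

print(round(wss, 1))
print(round(wss.drop, 1))
```

The elbow is the value of k after which the reductions become small. Here the gain from going from 3 to 4 clusters is a small fraction of the gain from going from 2 to 3.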

Silhouette Method

The silhouette method measures the silhouette coefficient by calculating, for each observation, the mean intra-cluster distance and the mean nearest-cluster distance. We get the optimal number of clusters by choosing the number of clusters with the highest average silhouette score (the peak).

fviz_nbclust(country.pca, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method with PCA")

Under the silhouette method, the number of clusters with the maximum score is considered the optimal k. The graph shows that the optimal number of clusters is 2.
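For reference, the average silhouette width can be computed directly with silhouette() from the cluster package. This is a sketch on synthetic data with three well-separated groups, not the country data, and nstart = 25 is an assumed setting.

```r
library(cluster)  # provides silhouette()

set.seed(100)
# Synthetic 2-D data with three well-separated groups (made up).
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

# Pairwise Euclidean distances, reused for every k.
d <- dist(toy)

# Average silhouette width for k = 2..6 (it is undefined for k = 1).
avg.sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg.sil) <- 2:6

# The k with the highest average width.
best.k <- as.integer(names(which.max(avg.sil)))

print(round(avg.sil, 3))
print(best.k)
```

Silhouette widths range from -1 to 1, and values close to 1 indicate observations that sit well inside their own cluster.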

Gap Method

The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic.

fviz_nbclust(country.pca, kmeans, method = "gap_stat") + labs(subtitle = "Gap Statistic with PCA method")
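The same statistic can be obtained without the plotting helper through clusGap() and maxSE() from the cluster package. This is a sketch on synthetic data. B = 50 reference samples, nstart = 25 and the "firstSEmax" selection rule are assumed settings for the example, and other rules offered by maxSE() may pick a different k.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(100)
# Synthetic 2-D data with three well-separated groups (made up).
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

# Gap statistic for k = 1..6, using B simulated reference data sets.
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 25, verbose = FALSE)

# Pick k from the gap values and their simulation standard errors.
k.gap <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

print(round(gap$Tab, 3))
print(k.gap)
```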

As we can see, the gap method confirms the result of the elbow method: 3 is the optimal number of clusters. So, for this model, we will cluster the data into 3 clusters. For country development level, the three levels could be underdeveloped, developing, and developed countries.
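The majority vote itself is simple enough to write down. The values below are the ones read off the three plots as reported in the text, and the object names are our own.

```r
# Number of clusters suggested by each method, as reported in the text.
k.votes <- c(elbow = 3, silhouette = 2, gap = 3)

# Tally the votes and keep the most frequent suggestion.
# In case of a tie, which.max() returns the smallest k.
vote.table <- table(k.votes)
k.final <- as.integer(names(vote.table)[which.max(vote.table)])

print(vote.table)
print(k.final)
```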

K-means clustering

RNGkind(sample.kind = "Rounding")
set.seed(100)

km.pca <- kmeans(country.pca, centers = 3) 
# add the cluster as new column
country.pca$cluster <- km.pca$cluster

Visualizing the clusters

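# pass only the columns used for clustering; the cluster label added above
# would otherwise be treated as an extra variable in the plot projection
fviz_cluster(km.pca, data = country.pca %>% select(-cluster)) +
   labs(subtitle = "K-Means With PCA & K-Value = 3")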

outliers <- c("Singapore","Malta","Luxembourg")
data %>% filter(country %in% outliers)

As we can see, Malta, Luxembourg, and Singapore are outliers that lie very far from the other countries. We can exclude these outliers and consider them developed countries, since they score better than the average country on the criteria.

# remove outliers from data
country.pca2 <- data %>% 
  select_if(negate(is.numeric)) %>% 
  cbind(country.pca.selected) %>% 
  filter(!country %in% outliers) %>% 
  select(-country)

# Clustering
RNGkind(sample.kind = "Rounding")
set.seed(100)

km.pca2 <- kmeans(country.pca2, centers = 3) 

# add the cluster as new column
country.pca2$cluster <- km.pca2$cluster

# Visualize the new cluster
# again, drop the cluster label so only the clustered columns are projected
fviz_cluster(km.pca2, data = country.pca2 %>% select(-cluster)) +
   labs(subtitle = "K-Means With PCA & K-Value = 3")

Now the separation between clusters is more distinct than before, especially between developing and developed countries, while the underdeveloped countries are separated from the other clusters by a wider gap.

Country Cluster Profiling

The purpose of profiling is to understand the characteristics of each cluster, in this case to understand which cluster of countries is in the direst need of aid.

To find the characteristics of each cluster, the values in each column can be averaged.

#Assign into new object
country.clustered <- data %>% 
  filter(!country %in% outliers) %>% 
  select(-country)

#Assign cluster result into the new object
country.clustered$cluster <- km.pca2$cluster

criteria.mean <- country.clustered %>%
  group_by(cluster) %>% 
  summarise_all(mean)

criteria.mean

Profiling from the insights (cluster averages are described relative to the overall average across countries):

  • Cluster 1 (underdeveloped countries):
    • From an economic point of view, the population of cluster 1 countries does not enjoy a good economic level, as shown by below-average values in the income and gdpp columns and an above-average value in the inflation column. Cluster 1 countries are also weak on the trade side, since the exports and imports columns are below average.
    • From a health point of view, the picture is bleak: cluster 1 countries have a well above-average value in the child_mort column and below-average values in the health and life_expec columns.
  • Cluster 2 (developed countries):
    • From an economic point of view, the population of cluster 2 countries enjoys a good economic level, as shown by above-average values in the income and gdpp columns and a below-average value in the inflation column. In addition, cluster 2 can be described as developed since the exports and imports columns are well above average.
    • From a health point of view, cluster 2 countries have a well below-average value in the child_mort column, but are also below average in the health column.
  • Cluster 3 (developing countries):
    • In both economic and health terms, cluster 3 sits on average between clusters 1 and 2, so we can conclude that cluster 3 countries are developing countries.
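A cluster average reads as positive or negative, that is, above or below the overall average, only when it is computed on standardized columns. Here is a minimal sketch of that calculation, using made-up numbers and made-up cluster labels rather than the actual clustering result.

```r
# Made-up values and cluster labels, purely for illustration.
toy <- data.frame(
  income     = c(800, 1200, 1500, 40000, 45000, 38000, 9000, 11000, 10000),
  child_mort = c(110, 95, 120, 4, 5, 3, 25, 30, 20),
  cluster    = rep(1:3, each = 3)
)

# Standardize the indicators so that 0 corresponds to the overall average.
z <- as.data.frame(scale(toy[, c("income", "child_mort")]))
z$cluster <- toy$cluster

# Mean of each standardized indicator within each cluster.
profile <- aggregate(. ~ cluster, data = z, FUN = mean)

print(round(profile, 2))
```

In this toy example cluster 1 has a negative income mean and a positive child_mort mean, which is the pattern the profile above associates with underdeveloped countries.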

Conclusion and Insights

The value of k, that is, the number of groups desired in the final result, must be determined in advance, since K-Means requires it to group the data by similar characteristics. There are several ways to find the optimal k, such as the elbow method, the silhouette method, and the gap statistic. It is important to apply several methods because, as the case above shows, they can give different results. Despite the differences, the final suggestion is k = 3, since two out of three methods produced the same result.

Once the ideal number of groups is known, the characteristics of the countries in each cluster can be determined by averaging the values in each column; this is called cluster profiling. From the cluster profiling results, countries in cluster 1 are the ones most in need of aid compared with countries in cluster 2 and cluster 3.

To help the countries with the greatest needs, we should consider economic and health criteria separately so that the aid is more effective.

Economic Sector

# get all mean values for cluster 1
mean.clus.1 <- criteria.mean[criteria.mean$cluster==1,]

#Parameter filter
economic.help <- data %>% 
  filter(exports < mean.clus.1 %>% pull(exports),
         imports < mean.clus.1 %>% pull(imports),
         income < mean.clus.1 %>% pull(income),
         gdpp < mean.clus.1 %>% pull(gdpp)
         ) %>% 
  select(country,income,exports,imports, gdpp)

list.metrics <- economic.help %>% select(-country) %>% colnames()

# see the worst country on each economic category
for (col in list.metrics) {
  help.country <- economic.help %>% 
    arrange(!!as.symbol(col)) %>% 
    select(country, !!as.symbol(col)) %>% 
    head(1)
  print(help.country)
}
##   country income
## 1 Burundi    764
##   country exports
## 1 Myanmar   0.109
##   country imports
## 1 Myanmar  0.0659
##   country gdpp
## 1 Burundi  231

In the economic sector, the two countries that need help the most are Myanmar and Burundi. Their needs differ: Myanmar needs more help with its productivity (imports and exports), while Burundi needs it with its income and GDP.

Health Sector

# get all mean values for cluster 1
mean.clus.1 <- criteria.mean[criteria.mean$cluster==1,]

#Parameter filter
health.help <- data %>% 
  filter(health < mean.clus.1 %>% pull(health),
         life_expec < mean.clus.1 %>% pull(life_expec)
         ) %>% 
  select(country, health, life_expec)

list.metrics <- health.help %>% select(-country) %>% colnames()

# see the worst country in each health category
for (col in list.metrics) {
  help.country <- health.help %>% 
    arrange(!!as.symbol(col)) %>% 
    select(country, !!as.symbol(col)) %>% 
    head(1)
  print(help.country)
}
##                    country health
## 1 Central African Republic   3.98
##                    country life_expec
## 1 Central African Republic       47.5

In the health sector, the country that needs help the most is the Central African Republic.