Clustering Countries With K-Means For HELP International

Sunarto Rusli

3/11/2022

HELP International

knitr::include_graphics("HELP-int.jfif")

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.

HELP International have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively.

Dataset

With this dataset, we will be using K-Means Clustering to cluster the countries, to categorise the countries using socio-economic and health factors that determine the overall development of the country. And then, with PCA, we will decide which countries need help the most.

country <- read.csv("Country-data.csv")

rownames(country) <- country$country

country_clean <- country[,-1]

str(country_clean)

## 'data.frame':    167 obs. of  9 variables:
##  $ child_mort: num  90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
##  $ exports   : num  10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
##  $ health    : num  7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
##  $ imports   : num  44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
##  $ income    : int  1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
##  $ inflation : num  9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
##  $ life_expec: num  56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
##  $ total_fer : num  5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
##  $ gdpp      : int  553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...

About the columns :

country: Name of the country, we will use this column as rownames
child_mort: Death of children under 5 years of age per 1000 live births
exports: Exports of goods and services per capita. Given as %age of the GDP per capita
health: Total health spending per capita. Given as %age of GDP per capita
imports: Imports of goods and services per capita. Given as %age of the GDP per capita
income: Net income per person
inflation: The measurement of the annual growth rate of the Total GDP
life_expec: The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same.
gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.

Exploratory Data Analysis

There are no missing value in the dataset.

colSums(is.na(country_clean))

## child_mort    exports     health    imports     income  inflation life_expec 
##          0          0          0          0          0          0          0 
##  total_fer       gdpp 
##          0          0

The columns are in different scales. We will scale it while doing clustering and PCA.

summary(country_clean)

##    child_mort        exports            health          imports        
##  Min.   :  2.60   Min.   :  0.109   Min.   : 1.810   Min.   :  0.0659  
##  1st Qu.:  8.25   1st Qu.: 23.800   1st Qu.: 4.920   1st Qu.: 30.2000  
##  Median : 19.30   Median : 35.000   Median : 6.320   Median : 43.3000  
##  Mean   : 38.27   Mean   : 41.109   Mean   : 6.816   Mean   : 46.8902  
##  3rd Qu.: 62.10   3rd Qu.: 51.350   3rd Qu.: 8.600   3rd Qu.: 58.7500  
##  Max.   :208.00   Max.   :200.000   Max.   :17.900   Max.   :174.0000  
##      income         inflation         life_expec      total_fer    
##  Min.   :   609   Min.   : -4.210   Min.   :32.10   Min.   :1.150  
##  1st Qu.:  3355   1st Qu.:  1.810   1st Qu.:65.30   1st Qu.:1.795  
##  Median :  9960   Median :  5.390   Median :73.10   Median :2.410  
##  Mean   : 17145   Mean   :  7.782   Mean   :70.56   Mean   :2.948  
##  3rd Qu.: 22800   3rd Qu.: 10.750   3rd Qu.:76.80   3rd Qu.:3.880  
##  Max.   :125000   Max.   :104.000   Max.   :82.80   Max.   :7.490  
##       gdpp       
##  Min.   :   231  
##  1st Qu.:  1330  
##  Median :  4660  
##  Mean   : 12964  
##  3rd Qu.: 14050  
##  Max.   :105000

country_clean_s <- scale(country_clean)

There are outliers in the dataset. But we will include them, as it might be as represent the real case in the country.

country_clean %>% 
  gather(attr, values, c(1:9)) %>%
  ggplot(aes(x = attr, y = values)) +
  geom_boxplot() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  facet_wrap(facets = ~ attr, scales = "free_y")

There are strong positive correlation between :

exports - imports
child_mort - total_fer
income - gdpp

Strong negative correlation between :

child_mort - life_expec
life_expec - total_fer

ggcorr(country_clean, label = T)

K-Means Clustering

Optimal Number of Clusters

We will use two method, Elbow and Silhouette, to determine the optimal number of clusters. Both method recommend different number of clusters. Elbow method recommend 2 clusters (where the elbow is), but Silhouette recommend 5 clusters. With logical thinking, it will be not logic to devide the world to only 2 cluster. We will build models with 3, 4 and 5 clusters. And decide later which cluster we will use.

Elbow Method

fviz_nbclust(x = country_clean_s,
             FUNcluster = kmeans,
             method = "wss") +
#  geom_vline(xintercept = 6, linetype = 2)+
  labs(subtitle = "Elbow method")

Silhouette Method

set.seed(310)
fviz_nbclust(country_clean_s, FUN = kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

Clustering

K = 3

options(warn = - 1)
RNGkind(sample.kind = "Rounding")
set.seed(310)

country_km3 <- kmeans(x = country_clean_s, centers = 3)

country_k3 <- country_clean
country_k3$cluster <- country_km3$cluster

country_k3 %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

## # A tibble: 3 x 10
##   cluster child_mort exports health imports income inflation life_expec
##     <int>      <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1       1       93.0    29.2   6.39    42.3  3942.     12.0        59.2
## 2       2        5      58.7   8.81    51.5 45672.      2.67       80.1
## 3       3       21.9    40.2   6.20    47.5 12306.      7.60       72.8
## # ... with 2 more variables: total_fer <dbl>, gdpp <dbl>

K = 4

options(warn = - 1)
RNGkind(sample.kind = "Rounding")
set.seed(310)

country_km4 <- kmeans(x = country_clean_s, centers = 4)

country_k4 <- country_clean
country_k4$cluster <- country_km4$cluster

country_k4 %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

## # A tibble: 4 x 10
##   cluster child_mort exports health imports income inflation life_expec
##     <int>      <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1       1      93.0     29.2   6.39    42.3  3942.     12.0        59.2
## 2       2       4.22    39.5  10.3     38.1 39292.      1.46       80.8
## 3       3      21.2     41.4   6.24    48.0 13167.      7.37       73.0
## 4       4       6.94   113.    5.06    87.8 72025       6.45       79.5
## # ... with 2 more variables: total_fer <dbl>, gdpp <dbl>

K = 5

options(warn = - 1)
RNGkind(sample.kind = "Rounding")
set.seed(310)

country_km5 <- kmeans(x = country_clean_s, centers = 5)

country_k5 <- country_clean
country_k5$cluster <- country_km5$cluster

country_k5 %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

## # A tibble: 5 x 10
##   cluster child_mort exports health imports income inflation life_expec
##     <int>      <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1       1      94.9     28.3   6.41    43.0  3493.     11.8        58.9
## 2       2       4.3     37.9  10.6     36.1 40018.      1.54       80.9
## 3       3      15.2     52.2   7.19    62.2 14792.      3.85       74.0
## 4       4       6.94   113.    5.06    87.8 72025       6.45       79.5
## 5       5      27.7     31.6   5.35    33.5 12369.     11.0        72.0
## # ... with 2 more variables: total_fer <dbl>, gdpp <dbl>

Goodness of Fit

A good number of cluster can be judge by its :

Within Sum of Squares (wss) : the sum of square of the distance of the point from its centroids. wss should be close to 0.
Between Sun of Squares (bss) : the sum of square of the distance of the centroids from center of the centroid.
Total Sum of Squres (tss) : the sum of square of the distance of the point from the center of the centroids. bss/tss should be close to 1.

All of our wss and bss/tss is far from the goodness. But for this dataset, we will ignore this numbers.

K = 3

country_km3$withinss

## [1] 269.6604 297.2279 259.5575

country_km3$betweenss/country_km3$totss

## [1] 0.4468234

K = 4

country_km4$withinss

## [1] 269.66044  55.13569 277.01987 115.75637

country_km4$betweenss/country_km3$totss

## [1] 0.5196972

K = 5

country_km5$withinss

## [1] 260.44689  46.96944 107.03581 115.75637 117.75126

country_km5$betweenss/country_km3$totss

## [1] 0.566292

Choosing Cluster

Based on the wss and bss/tss above, and from the plot generated below, there are no significat differences between 3 and 4 or 5 clusters. One thing to notice, Cluster 1 is consistent, no shape change between k=3, k=4, k=5. Hence, we will continue with 3 clusters.

K = 3

fviz_cluster(object = country_km3, data = country_clean, geom = "point")+ ggtitle("k = 3")

K = 4

fviz_cluster(object = country_km4, data = country_clean, geom = "point") + ggtitle("k = 4")

K = 5

fviz_cluster(object = country_km5, data = country_clean, geom = "point") + ggtitle("k = 5")

Cluster Characteristic

We will use PCA to find out what are the characteristic of each cluster. From the resulting plot :

Cluster 1 : high on child mortality, fertility, and low GDP. We can say, cluster 1 are Poor Countries.
Cluster 2 : high income, GDP, and good health. We can say, cluster 3 are Developed Countries.
Cluster 3 : high on export, imports, and middle income. We can say, cluster 2 are Developing Countries.

FactoMineR

country_k3_pca <- PCA(X = country_k3,
                      quali.sup = 10,
                      scale.unit = T,
                      graph = F)

fviz_pca_biplot(country_k3_pca,
                habillage = 10,
                addEllipses = T,
                geom.ind = "point")

prcomp & pca3d

country_k3_pr <- prcomp(country_k3, scale = T)

# pca3d(country_k3_pr, group = country_k3$cluster, 
#       biplot = TRUE, biplot.vars = 5,
#       show.ellipses=TRUE, show.group.labels = T)

knitr::include_graphics("pca-3d.JPG")

Contribution of Variable

From Dimension 1 of the PCA, the contributing variables are life_expec, child_mort, total_fer, income, gdpp.

fviz_contrib(X = country_k3_pca, choice = "var", axes = 1)

Which Country to HELP

It is clear that countries in Cluster 1 are the cluster need help. But what countries need the most ? We will choose top 20 countries of each contributing variables, and join. The resulting dataframe shows countries that need the most help.

# low life expectation, ascending
c1 <- country_k3 %>% 
  filter(cluster == 1) %>% 
  tibble::rownames_to_column("row_names") %>% 
  arrange(life_expec) %>% 
  head(20)

# high child mortality, descending
c2 <- country_k3 %>% 
  filter(cluster == 1) %>% 
  tibble::rownames_to_column("row_names") %>% 
  arrange(desc(child_mort)) %>% 
  head(20)

#  high fertility, descending
c3 <- country_k3 %>% 
  filter(cluster == 1) %>% 
  tibble::rownames_to_column("row_names") %>% 
  arrange(desc(total_fer)) %>% 
  head(20)

# lowest income, ascending
c4 <- country_k3 %>% 
  filter(cluster == 1) %>% 
  tibble::rownames_to_column("row_names") %>% 
  arrange(income) %>% 
  head(20)

# lowest GDP, ascending
c5 <- country_k3 %>% 
  filter(cluster == 1) %>% 
  tibble::rownames_to_column("row_names") %>% 
  arrange(gdpp) %>% 
  head(20)

# merge forward
m1 <- merge(c1, c2)
m2 <- merge(m1, c3)
m3 <- merge(m2, c4)
m4 <- merge(m3, c5)
m4

##                  row_names child_mort exports health imports income inflation
## 1             Burkina Faso      116.0   19.20   6.74    29.6   1430      6.81
## 2                  Burundi       93.6    8.92  11.60    39.2    764     12.30
## 3 Central African Republic      149.0   11.80   3.98    26.5    888      2.01
## 4         Congo, Dem. Rep.      116.0   41.10   7.91    49.6    609     20.80
## 5                   Guinea      109.0   30.30   4.93    43.2   1190     16.10
## 6               Mozambique      101.0   31.50   5.21    46.2    918      7.64
##   life_expec total_fer gdpp cluster
## 1       57.9      5.87  575       1
## 2       57.7      6.26  231       1
## 3       47.5      5.21  446       1
## 4       57.5      6.54  334       1
## 5       58.0      5.34  648       1
## 6       54.5      5.56  419       1

#merge backward
m10 <- merge(c4,c5)
m11 <- merge(m10,c3)
m12 <- merge(m11,c2)
m13 <- merge(m12,c1)
m13

##                  row_names child_mort exports health imports income inflation
## 1             Burkina Faso      116.0   19.20   6.74    29.6   1430      6.81
## 2                  Burundi       93.6    8.92  11.60    39.2    764     12.30
## 3 Central African Republic      149.0   11.80   3.98    26.5    888      2.01
## 4         Congo, Dem. Rep.      116.0   41.10   7.91    49.6    609     20.80
## 5                   Guinea      109.0   30.30   4.93    43.2   1190     16.10
## 6               Mozambique      101.0   31.50   5.21    46.2    918      7.64
##   life_expec total_fer gdpp cluster
## 1       57.9      5.87  575       1
## 2       57.7      6.26  231       1
## 3       47.5      5.21  446       1
## 4       57.5      6.54  334       1
## 5       58.0      5.34  648       1
## 6       54.5      5.56  419       1

Both merged dataframe shows the same result. Countries need the most help are : Burkina Faso, Burundi, Central African Republic, Democratic Republic Congo, Guinea, and Mozambique.

Alternative Choices

Or we can plot the PCA, and find the outliers in Cluster 1. We got Haiti, Chad, Niger, Central African Republic, Nigeria

options(ggrepel.max.overlaps = Inf)
plot.PCA(x = country_k3_pca , 
         choix = "ind",
         invisible = "quali",
         select = "contrib 13")

Indonesia

What cluster is Indonesia in ? Indonesia is in Cluster 3.

country_k3["Indonesia",]$cluster

## [1] 3