How deep is the divide? Analyzing Countries Based on the Human Development Index (HDI)

Summary
1. Importing and cleaning data
2. Clustering with K-means
- Determining Optimal Clusters
3. Data Analysis
4. Putting everything on a map
Conclusion
- Countries of the “North” and countries of the “South”: a global divide of inequalities
- Can we speak of an “iron curtain of inequalities”?

Summary

No, the gross domestic product (GDP) is not enough to fully appreciate the economic and social conditions of a country at a given moment. To estimate the well-being of a population, one must also take into account the way in which material and immaterial wealth are distributed.

Every year, the United Nations Development Programme (UNDP) calculates the Human Development Index (HDI) of each state in the world and ranks the nations according to the score obtained. Beyond gross national income per capita, the HDI takes into account life expectancy at birth and the level of education, measured by expected and mean years of schooling. The latest ranking is for 2019 figures and was published in December 2020.
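
To make the composition of the index concrete, here is a minimal sketch of how an HDI value can be rebuilt from the four indicators found in the dataset. The goalposts used below (20 and 85 years for life expectancy, 18 and 15 years for the two schooling measures, 100 and 75,000 dollars for GNI per capita) follow the UNDP technical notes as far as we understand them; they are assumptions and are not part of the dataset. The function name `hdi_from_components` is hypothetical.

```r
# Sketch: rebuild the HDI from its components.
# The goalposts below are assumptions based on the UNDP technical notes.
hdi_from_components <- function(life_exp, exp_years_edu, mean_years_edu, gni_capita) {
  # Health index: life expectancy rescaled between 20 and 85 years
  health <- (life_exp - 20) / (85 - 20)
  # Education index: average of the two schooling indices, each capped at 1
  edu_expected <- pmin(exp_years_edu / 18, 1)
  edu_mean     <- pmin(mean_years_edu / 15, 1)
  education    <- (edu_expected + edu_mean) / 2
  # Income index: log of GNI per capita rescaled between $100 and $75,000
  income <- (log(pmin(gni_capita, 75000)) - log(100)) / (log(75000) - log(100))
  # The HDI is the geometric mean of the three dimension indices
  (health * education * income)^(1 / 3)
}

# Norway's 2019 values as they appear in the dataset
norway_hdi <- hdi_from_components(82.40, 18.06615, 12.89775, 66494.25)
round(norway_hdi, 3)
```

Under these assumptions, the value obtained for Norway rounds to the 0.957 reported in the dataset. Because the index is a geometric mean, a low score on one dimension cannot be fully offset by a high score on another.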

In this project we will use the data provided by the UNDP to analyse how countries are divided based on the HDI. The dataset contains 189 countries; some countries (Korea (Democratic People’s Rep. of), Monaco, Nauru, San Marino, Somalia, Tuvalu) are excluded from the analysis because of a lack of data. The data will be analysed using the K-means clustering method.

1. Importing and cleaning data

Let’s start by loading the necessary packages and reading in the data.

library(tidyverse)
library(readr)
library(readxl)
library(factoextra)
library(NbClust)
library(ggplot2)
library(ggpubr)
library(GGally)

human_dev_index <- read_xlsx("HDI.xlsx")
df <- human_dev_index

Let’s now have a glimpse at the data set to see what it looks like:

glimpse(df)

## Rows: 189
## Columns: 9
## $ `HDI rank...1`                           <dbl> 1, 2, 2, 4, 4, 6, 7, 8, 8, 1…
## $ Country                                  <chr> "Norway", "Ireland", "Switze…
## $ HDI                                      <dbl> 0.957, 0.955, 0.955, 0.949, …
## $ Life_expectancy                          <dbl> 82.40, 82.31, 83.78, 84.86, …
## $ `Expected years of schooling`            <dbl> 18.06615, 18.70529, 16.32844…
## $ `Mean years of schooling`                <dbl> 12.89775, 12.66633, 13.38081…
## $ `Gross national income (GNI) per capita` <dbl> 66494.25, 68370.59, 69393.52…
## $ `GNI per capita rank minus HDI rank`     <dbl> 7, 4, 3, 7, 14, 11, 12, 15, …
## $ `HDI rank...9`                           <dbl> 1, 3, 2, 4, 4, 4, 7, 7, 9, 1…

This certainly looks a bit messy and needs some cleaning up before we can continue. Specifically, we need to rename the columns.

#Changing the column names
names(df) <- c("HDI_rank","Country", "HDI", "Life_exp", "Exp_years_edu","Mean_years_edu", "GNI_capita", "GNI_HDI_rank","HDI_rank2")
#Getting rid of the 2nd rank column at the end of the dataset
df$HDI_rank2 <- NULL
#We also exclude the column "GNI per capita rank minus HDI rank" as it doesn't contribute to the analysis
df$GNI_HDI_rank <- NULL
glimpse(df)

## Rows: 189
## Columns: 7
## $ HDI_rank       <dbl> 1, 2, 2, 4, 4, 6, 7, 8, 8, 10, 11, 11, 13, 14, 14, 16,…
## $ Country        <chr> "Norway", "Ireland", "Switzerland", "Hong Kong, China …
## $ HDI            <dbl> 0.957, 0.955, 0.955, 0.949, 0.949, 0.947, 0.945, 0.944…
## $ Life_exp       <dbl> 82.40, 82.31, 83.78, 84.86, 82.99, 81.33, 82.80, 83.44…
## $ Exp_years_edu  <dbl> 18.06615, 18.70529, 16.32844, 16.92947, 19.08309, 16.9…
## $ Mean_years_edu <dbl> 12.89775, 12.66633, 13.38081, 12.27996, 12.77279, 14.1…
## $ GNI_capita     <dbl> 66494.25, 68370.59, 69393.52, 62984.77, 54682.38, 5531…

Now the data looks much more presentable.

2. Clustering with K-means

It would be interesting to represent the data graphically and see if there are any clusters that we can spot. For this, we will set a random seed for reproducibility and first “guesstimate” that the number of clusters might be 4. Later on, we will use a more scientific way of determining the number of clusters and confirm the actual optimal number of clusters.

set.seed(98765) 
df_comp <- df[3:7]
#z-score standardization 
df_comp_z <- as.data.frame(lapply(df_comp, scale))
#let's arbitrarily select the cluster number = 4
df_clusters <- kmeans(df_comp_z, 4)
df$cluster <- as.factor(df_clusters$cluster)
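
K-means works with Euclidean distances, so a variable measured in tens of thousands (GNI per capita) would dominate one measured between 0 and 1 (HDI) if the columns were left on their original scales. The sketch below, on a toy data frame built from the first four rows shown in the glimpse above, illustrates what the z-score standardization does. The object names `toy`, `z_manual` and `z_scale` are hypothetical.

```r
# Toy data frame using the first four rows shown in the glimpse above
toy <- data.frame(
  HDI        = c(0.957, 0.955, 0.955, 0.949),
  GNI_capita = c(66494.25, 68370.59, 69393.52, 62984.77)
)

# Manual z-score: subtract the column mean, divide by the column standard deviation
z_manual <- as.data.frame(lapply(toy, function(col) (col - mean(col)) / sd(col)))

# scale() does the same; as.data.frame() turns the resulting matrix into a data frame
z_scale <- as.data.frame(scale(toy))

# After standardization every column has mean 0 and standard deviation 1
sapply(z_scale, mean)
sapply(z_scale, sd)
```

Calling `scale()` on the whole data frame is an equivalent, slightly shorter alternative to the `lapply()` form used in the analysis.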

We can take a look at the size of each cluster:

table(df$cluster)

## 
##  1  2  3  4 
## 51 51 48 39

And we can see the distribution of clusters by HDI:

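ggplot(df, aes(cluster, HDI, col = cluster)) +
  #geom_jitter() already draws the points, so a separate geom_point() would plot each country twice
  #height = 0 keeps the HDI values unchanged; only the horizontal position is jittered
  geom_jitter(alpha = 0.6, width = 0.2, height = 0) +
  ggtitle("Distribution of clusters by HDI") +
  theme_bw()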

The plot is quite convincing: the four clusters occupy largely distinct HDI ranges, so the number of clusters that we picked arbitrarily does a good job of representing the structure in the data.

But to find out the optimal number of clusters, we use several other methods.

Determining Optimal Clusters

Elbow Method

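Recall that the basic idea behind cluster partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation or total within-cluster sum of squares, WSS) is minimized. The elbow method plots the WSS against the number of clusters k and picks the k at the bend (the “elbow”) of the curve, beyond which adding another cluster no longer reduces the WSS by much.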
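
As an illustration of the quantity behind the elbow plot, the following sketch computes the total within-cluster sum of squares for several values of k with base R only. It runs on simulated toy data (three well-separated groups), not on the HDI dataset, and the seed and object names are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of the toy example

# Toy data: three well-separated groups of 30 points in two dimensions
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the k after which adding clusters barely reduces the WSS
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On data like these the curve drops steeply up to the true number of groups and flattens afterwards. On real data the bend is often less sharp, which is why the elbow is usually read together with other criteria.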

Average Silhouette Method

In short, the average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
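
For each observation i, let a(i) be the mean distance to the other members of its own cluster and b(i) the mean distance to the members of the nearest other cluster; the silhouette width is then (b(i) - a(i)) / max(a(i), b(i)), a value between -1 and 1. The sketch below computes the average silhouette width for several k on simulated toy data, using `silhouette()` from the `cluster` package. That package is not loaded in the analysis itself, so its use here is an assumption made for illustration.

```r
library(cluster)  # provides silhouette(); assumed to be installed

set.seed(42)  # arbitrary seed for the toy example
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))
d <- dist(x)  # pairwise Euclidean distances

# Average silhouette width for k = 2, ..., 6 (it is undefined for k = 1)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The suggested k is the one with the largest average silhouette width
best_k <- ks[which.max(avg_sil)]
```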

Gap Statistic Method

The gap statistic was published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method (e.g. K-means clustering, hierarchical clustering). The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data (i.e. a distribution with no obvious clustering). The reference datasets are generated using Monte Carlo simulations of the sampling process.
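
Written out, the definition and the selection rule proposed by Tibshirani, Walther and Hastie look as follows. Here W_k denotes the pooled within-cluster dispersion for k clusters, the expectation is estimated from B Monte Carlo reference datasets (the `nboot` argument in the code below), and sd_k is the standard deviation of log W_k over those reference datasets. The notation is ours and is only meant as a reading aid.

```latex
\mathrm{Gap}_n(k) = \mathbb{E}^{*}_n\left[\log W_k\right] - \log W_k,
\qquad
s_k = \mathrm{sd}_k \sqrt{1 + 1/B},
\qquad
\hat{k} = \min\left\{ k : \mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1} \right\}
```

In words: the chosen k is the smallest one whose gap is within one standard error of the gap at k + 1.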

m1 <- fviz_nbclust(df_comp_z, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")

m2 <- fviz_nbclust(df_comp_z, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

m3 <- fviz_nbclust(df_comp_z, kmeans, nstart = 25,  method = "gap_stat", nboot = 50) +
  labs(subtitle = "Gap statistic method")

ggarrange(m1, m2, m3, ncol = 2, nrow = 2)

The elbow method suggests 4 clusters, the silhouette method 2, and the gap statistic method 3. To see what 3 clusters look like, we can plot them:

df_clusters <- kmeans(df_comp_z, 3)
df$cluster <- as.factor(df_clusters$cluster)

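ggplot(df, aes(cluster, HDI, col = cluster)) +
  #geom_jitter() already draws the points, so a separate geom_point() would plot each country twice
  #height = 0 keeps the HDI values unchanged; only the horizontal position is jittered
  geom_jitter(alpha = 0.6, width = 0.2, height = 0) +
  ggtitle("Distribution of clusters by HDI") +
  theme_bw()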

However, even though 3 clusters would clearly be a possible choice, the 4-cluster plot shows that there are differences even between countries with a high HDI and those with a very high HDI. So we will keep the number of clusters obtained with the “Elbow method”.

df_clusters <- kmeans(df_comp_z, 4)
df$cluster <- as.factor(df_clusters$cluster)
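
One caveat when re-running K-means: the algorithm starts from randomly chosen centers, so the cluster numbers are arbitrary labels and may change from one run to the next unless the seed is set again right before the call. The sketch below illustrates this on simulated toy data; the data, the `nstart` value and the object names are assumptions made for the example.

```r
set.seed(42)  # arbitrary seed used to simulate the toy data
centers <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
x <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

# Same seed, same call: identical labels
set.seed(98765)
fit_a <- kmeans(x, centers = 4, nstart = 25)
set.seed(98765)
fit_b <- kmeans(x, centers = 4, nstart = 25)
identical(fit_a$cluster, fit_b$cluster)

# Without resetting the seed the grouping is the same on this easy example,
# but the cluster numbers may be permuted; a cross-table shows the correspondence
fit_c <- kmeans(x, centers = 4, nstart = 25)
label_map <- table(first = fit_a$cluster, later = fit_c$cluster)
label_map
```

When the text of an analysis refers to clusters by number, resetting the seed before the final `kmeans()` call keeps those references stable across runs.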

3. Data Analysis

Now, let’s group by the cluster assignment and calculate averages:

df$cluster <- df_clusters$cluster

df_clus_avg <- df %>%
  group_by(cluster) %>%
  summarize_if(is.numeric, mean, na.rm=TRUE)
df_clus_avg

## # A tibble: 4 x 7
##   cluster HDI_rank   HDI Life_exp Exp_years_edu Mean_years_edu GNI_capita
##     <int>    <dbl> <dbl>    <dbl>         <dbl>          <dbl>      <dbl>
## 1       1    170.  0.496     61.7          9.37           4.33      3024.
## 2       2    137.  0.621     69.2         11.7            6.45      5797.
## 3       3     22.5 0.908     81.0         16.7           12.0      53057.
## 4       4     80.9 0.770     74.8         14.0           10.0      16238.
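
In recent versions of dplyr, `summarize_if()` is superseded by `across()`. The following sketch shows the same kind of per-cluster averaging in that style. It uses a tiny toy tibble whose values are invented purely for illustration and are not taken from the HDI data.

```r
library(dplyr)

# Toy data with the same kind of columns as df (values are invented)
toy <- tibble(
  Country  = c("A", "B", "C", "D"),
  cluster  = c(1L, 1L, 2L, 2L),
  HDI      = c(0.40, 0.50, 0.90, NA),
  Life_exp = c(60, 62, 80, 82)
)

# Average every numeric column within each cluster, ignoring missing values
toy_avg <- toy %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
toy_avg
```

The grouping column is left out of `across()` automatically, so only `HDI` and `Life_exp` are averaged.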

We can clearly see the divide between the countries: the ones at the bottom of the ranking are far away from their closest neighbors.

#assign clusters to the normalised dataset
df_comp_z_cluster <- df_comp_z
df_comp_z_cluster$cluster <- as.factor(df_clusters$cluster) 
#group by clusters
df_clus_avg1 <- df_comp_z_cluster %>%
  group_by(cluster) %>%
  summarize_if(is.numeric, mean, na.rm=TRUE)

#Create a parallel coordinate plot of the values:
library(GGally) #ggparcoord() comes from GGally, not from ggplot2
ggparcoord(df_clus_avg1, columns = c(2:6), 
           groupColumn = "cluster", scale = "globalminmax", order = "skewness") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

From the chart above we can see that cluster 3 has the highest HDI. Let’s take a look at the countries that it contains:

#cluster 3
df %>%
  filter(cluster == 3) %>%
  select(Country, HDI) %>%
  arrange(desc(HDI)) %>%
  head(n = 10)

## # A tibble: 10 x 2
##    Country                  HDI
##    <chr>                  <dbl>
##  1 Norway                 0.957
##  2 Ireland                0.955
##  3 Switzerland            0.955
##  4 Hong Kong, China (SAR) 0.949
##  5 Iceland                0.949
##  6 Germany                0.947
##  7 Sweden                 0.945
##  8 Australia              0.944
##  9 Netherlands            0.944
## 10 Denmark                0.94

On the other hand, cluster 1 has the lowest scores. Niger will probably not be the country where data scientists go after graduating.

#cluster 1
df %>%
  filter(cluster == 1) %>%
  select(Country, HDI) %>%
  arrange(HDI)%>%
  head(n = 10)

## # A tibble: 10 x 2
##    Country                    HDI
##    <chr>                    <dbl>
##  1 Niger                    0.394
##  2 Central African Republic 0.397
##  3 Chad                     0.398
##  4 Burundi                  0.433
##  5 South Sudan              0.433
##  6 Mali                     0.434
##  7 Burkina Faso             0.452
##  8 Sierra Leone             0.452
##  9 Mozambique               0.456
## 10 Eritrea                  0.459

#cluster 2
df %>%
  filter(cluster == 2) %>%
  select(Country, HDI, GNI_capita) %>%
  arrange(desc(HDI)) %>%
  head(n = 10)

## # A tibble: 10 x 3
##    Country       HDI GNI_capita
##    <chr>       <dbl>      <dbl>
##  1 Egypt       0.707     11466.
##  2 Gabon       0.703     13930.
##  3 Morocco     0.686      7368.
##  4 Guyana      0.682      9455.
##  5 Iraq        0.674     10801.
##  6 El Salvador 0.673      8359.
##  7 Cabo Verde  0.665      7019.
##  8 Guatemala   0.663      8494.
##  9 Nicaragua   0.66       5284.
## 10 Bhutan      0.654     10746.

#cluster 4
df %>%
  filter(cluster == 4) %>%
  select(Country, HDI) %>%
  arrange(desc(HDI)) %>%
  head(n = 10)

## # A tibble: 10 x 2
##    Country              HDI
##    <chr>              <dbl>
##  1 Slovakia           0.86 
##  2 Hungary            0.854
##  3 Chile              0.851
##  4 Croatia            0.851
##  5 Argentina          0.845
##  6 Montenegro         0.829
##  7 Romania            0.828
##  8 Palau              0.826
##  9 Kazakhstan         0.825
## 10 Russian Federation 0.824

We can see where Poland is in the ranking:

df %>%
  filter(Country == "Poland") %>%
  select(Country, HDI,HDI_rank, cluster, GNI_capita) %>%
  arrange(desc(HDI)) %>%
  head(n = 15)

## # A tibble: 1 x 5
##   Country   HDI HDI_rank cluster GNI_capita
##   <chr>   <dbl>    <dbl>   <int>      <dbl>
## 1 Poland   0.88       35       3     31623.

Poland is quite well positioned among the world’s countries: it ranks 35th and falls in cluster 3, the cluster with the highest average HDI.

4. Putting everything on a map

library(plotly)
fig <- plot_ly(df, type='choropleth', locations=df$Country, z=df$HDI, 
               locationmode ="country names")
fig <- fig %>% colorbar(title = "HDI")
fig
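
With `locationmode = "country names"`, plotly has to recognise each name, and some of the names used by the UNDP (for example "Hong Kong, China (SAR)") may not be matched. One possible workaround, sketched below on a three-row toy data frame, is to convert the names to ISO-3 codes first. The `countrycode` package is an assumption here, since it is not used in the original analysis, and some names in the full dataset may still need manual fixes.

```r
library(plotly)
library(countrycode)  # assumption: installed; not part of the original analysis

# Toy data frame standing in for df (HDI values as reported above)
toy <- data.frame(
  Country = c("Norway", "Poland", "Niger"),
  HDI     = c(0.957, 0.880, 0.394)
)

# Convert country names to ISO 3166-1 alpha-3 codes
toy$iso3 <- countrycode(toy$Country, origin = "country.name", destination = "iso3c")

# Choropleth keyed on ISO-3 codes instead of free-text country names
fig_iso <- plot_ly(toy, type = "choropleth", locations = ~iso3, z = ~HDI,
                   text = ~Country, locationmode = "ISO-3")
fig_iso <- fig_iso %>% colorbar(title = "HDI")
```

Printing `fig_iso` displays the map. `countrycode()` returns `NA` with a warning for names it cannot match, which makes it easy to spot the countries that need attention.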

Conclusion

The Human Development Index is built from three main criteria: gross national income (GNI) per capita, life expectancy at birth, and the level of education, measured by expected and mean years of schooling. Since 1990, it has served as an alternative to GDP, which focuses only on economic criteria and largely obscures the level of individual and collective fulfillment. By including education and life expectancy in its reading grid, the index allows a more precise analysis of the development of states.

Countries of the “North” and countries of the “South”: a global divide of inequalities

In particular, the HDI provides a better understanding of the divide that has existed for some fifty years between developed and developing countries, often referred to as the countries of the “North” and of the “South”. This terminology simplifies and helps to represent the inequalities between the countries of the “North” (Europe, North America (United States and Canada), Russia, Japan, Australia) and the countries of the “South” (Africa, South America, India and China).

Can we speak of an “iron curtain of inequalities”?

On a map of the world, what emerges is what economist and HDI pioneer Amartya Sen calls an “iron curtain of inequality” in his essay A New Economic Model (2005). While the countries of the North continue to progress, grow richer and improve the standard of living of their inhabitants, the countries of the South at best develop haltingly. At worst, they suffer a decline in living standards and stalled development. Access to drinking water, now recognised as a universal fundamental right, is almost absent in some countries of the South (such as South Sudan or the Central African Republic).

Using the latest available data, our analysis gives a clearer view of how human development is distributed across the world.