No, the gross domestic product (GDP) is not enough to fully appreciate the economic and social conditions of a country at a given moment. To estimate the well-being of a population, one must also take into account the way in which material and immaterial wealth are distributed.
Every year, the United Nations Development Programme (UNDP) calculates the Human Development Index (HDI) of each country in the world and ranks the nations according to the score obtained. Beyond gross national income (GNI) per capita, the HDI takes into account life expectancy at birth and the level of education, measured by expected and mean years of schooling. The latest ranking is based on 2019 figures and was published in December 2020.
In this project we use the data provided by the UNDP to analyse how countries group together based on the HDI and its components. There are 189 countries in the dataset, and some countries (Korea (Democratic People’s Rep. of), Monaco, Nauru, San Marino, Somalia, Tuvalu) have been excluded from the analysis because of the lack of data. The data will be analysed using the k-means clustering method.
Let’s start by loading the necessary packages and reading in the data.
library(tidyverse)
library(readr)
library(readxl)
library(factoextra)
library(NbClust)
library(ggplot2)
library(ggpubr)
library(GGally)
human_dev_index <- read_xlsx("HDI.xlsx")
df <- human_dev_index
Let’s now have a glimpse at the data set to see what it looks like:
glimpse(df)
## Rows: 189
## Columns: 9
## $ `HDI rank...1` <dbl> 1, 2, 2, 4, 4, 6, 7, 8, 8, 1…
## $ Country <chr> "Norway", "Ireland", "Switze…
## $ HDI <dbl> 0.957, 0.955, 0.955, 0.949, …
## $ Life_expectancy <dbl> 82.40, 82.31, 83.78, 84.86, …
## $ `Expected years of schooling` <dbl> 18.06615, 18.70529, 16.32844…
## $ `Mean years of schooling` <dbl> 12.89775, 12.66633, 13.38081…
## $ `Gross national income (GNI) per capita` <dbl> 66494.25, 68370.59, 69393.52…
## $ `GNI per capita rank minus HDI rank` <dbl> 7, 4, 3, 7, 14, 11, 12, 15, …
## $ `HDI rank...9` <dbl> 1, 3, 2, 4, 4, 4, 7, 7, 9, 1…
This certainly looks a bit messy and needs some cleaning up before we can continue. Specifically, we need to rename the columns.
#Changing the column names
names(df) <- c("HDI_rank","Country", "HDI", "Life_exp", "Exp_years_edu","Mean_years_edu", "GNI_capita", "GNI_HDI_rank","HDI_rank2")
#Getting rid of the 2nd rank column at the end of the dataset
df$HDI_rank2 <- NULL
#We also exclude the column "GNI per capita rank minus HDI rank" as it doesn't contribute to the analysis
df$GNI_HDI_rank <- NULL
glimpse(df)
## Rows: 189
## Columns: 7
## $ HDI_rank <dbl> 1, 2, 2, 4, 4, 6, 7, 8, 8, 10, 11, 11, 13, 14, 14, 16,…
## $ Country <chr> "Norway", "Ireland", "Switzerland", "Hong Kong, China …
## $ HDI <dbl> 0.957, 0.955, 0.955, 0.949, 0.949, 0.947, 0.945, 0.944…
## $ Life_exp <dbl> 82.40, 82.31, 83.78, 84.86, 82.99, 81.33, 82.80, 83.44…
## $ Exp_years_edu <dbl> 18.06615, 18.70529, 16.32844, 16.92947, 19.08309, 16.9…
## $ Mean_years_edu <dbl> 12.89775, 12.66633, 13.38081, 12.27996, 12.77279, 14.1…
## $ GNI_capita <dbl> 66494.25, 68370.59, 69393.52, 62984.77, 54682.38, 5531…
Now the data looks much more presentable.
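As a sanity check on the cleaned columns, the HDI can be approximately rebuilt from its four components. The sketch below assumes the post-2010 UNDP methodology: a geometric mean of three dimension indices, with goalposts of 20 to 85 years for life expectancy, 0 to 18 and 0 to 15 years for expected and mean schooling, and 100 to 75,000 dollars for GNI per capita. These constants come from the UNDP technical notes, not from this dataset, and the helper name `hdi_from_components` is hypothetical.

```r
# Hypothetical helper: rebuild the HDI from its four component indicators.
# Goalposts are assumed from the UNDP technical notes (post-2010 methodology).
hdi_from_components <- function(life_exp, exp_years_edu, mean_years_edu, gni_capita) {
  health    <- (life_exp - 20) / (85 - 20)
  exp_edu   <- pmin(exp_years_edu / 18, 1)        # expected schooling is capped at 18 years
  mean_edu  <- mean_years_edu / 15
  education <- (exp_edu + mean_edu) / 2           # arithmetic mean of the two schooling indices
  income    <- (log(pmin(gni_capita, 75000)) - log(100)) / (log(75000) - log(100))
  (health * education * income)^(1 / 3)           # geometric mean of the three dimension indices
}

# Norway's values as printed in the glimpse() output
norway_hdi <- hdi_from_components(82.40, 18.06615, 12.89775, 66494.25)
round(norway_hdi, 3)  # should land close to the reported 0.957
```

Applied to the whole table, a call such as `hdi_from_components(df$Life_exp, df$Exp_years_edu, df$Mean_years_edu, df$GNI_capita)` could be compared with `df$HDI` to spot import or renaming mistakes; small differences are expected because the published inputs are rounded.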
It would be interesting to represent the data graphically and see if there are any clusters that we can spot. For this, we will set a random seed for reproducibility and first “guesstimate” that the number of clusters might be 4. Later on, we will use a more scientific way of determining the number of clusters and confirm the actual optimal number of clusters.
set.seed(98765)
df_comp <- df[3:7]
#z-score standardization
df_comp_z <- as.data.frame(lapply(df_comp, scale))
#let's arbitrarily select the cluster number = 4
df_clusters <- kmeans(df_comp_z, 4)
df$cluster <- as.factor(df_clusters$cluster)
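The z-score step matters because k-means relies on Euclidean distances, so a column on a large scale such as GNI per capita would otherwise dominate the clustering. The sketch below uses made-up numbers (not the UNDP data) to show what `scale()` does to each column and that it matches the formula written out by hand.

```r
# Toy data (made up for illustration): two indicators on very different scales
toy <- data.frame(
  life_exp   = c(82.4, 61.7, 74.8, 69.2),
  gni_capita = c(66494, 3024, 16238, 5797)
)

# scale() centres each column on its mean and divides by its standard deviation
toy_z <- as.data.frame(scale(toy))

# The same computation written out by hand for one column
manual_z <- (toy$gni_capita - mean(toy$gni_capita)) / sd(toy$gni_capita)

colMeans(toy_z)                        # approximately 0 for every column
sapply(toy_z, sd)                      # 1 for every column, up to floating point error
all.equal(toy_z$gni_capita, manual_z)  # the two computations agree
```

After standardization every indicator contributes to the distance on the same footing, whatever its original unit.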
We can take a look at the size of each cluster:
table(df$cluster)
##
## 1 2 3 4
## 51 51 48 39
And we can see the distribution of clusters by HDI:
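#geom_jitter() alone draws each country once; combining it with geom_point() would plot every point twice
ggplot(df, aes(cluster, HDI, col = cluster)) +
geom_jitter(alpha = 0.6, width = 0.2, height = 0) +
ggtitle("Distribution of clusters by HDI") +
theme_bw()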
The plot is quite convincing: the number of clusters that we picked arbitrarily appears to do a reasonable job of separating the countries into distinct HDI bands.
To find the optimal number of clusters more rigorously, we use three other methods: the elbow method, the average silhouette method and the gap statistic.
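Recall that the basic idea behind partitioning methods such as k-means clustering is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation or total within-cluster sum of squares, WSS) is minimized. The elbow method plots the WSS against the number of clusters k and picks the k at the bend of the curve, beyond which adding another cluster reduces the WSS only marginally.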
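The total within-cluster sum of squares can be computed directly from base R's `kmeans()`, which returns it as `tot.withinss`. The sketch below is a minimal illustration on synthetic data (three artificial groups, not the HDI data) of how the curve behaves as k grows.

```r
set.seed(1)
# Synthetic data (not the HDI data): three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6 (25 random starts per k)
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# The curve drops steeply up to the true number of groups and flattens afterwards
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On real data the bend is usually less sharp than in this toy case, which is why reading an elbow plot involves some judgment.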
In short, the average silhouette approach measures the quality of a clustering, that is, how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The method computes the average silhouette of the observations for different values of k; the optimal number of clusters is the k that maximizes the average silhouette over the range of values considered.
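As a rough illustration of the silhouette criterion, the sketch below computes the average silhouette width by hand for several values of k, using `silhouette()` from the `cluster` package (a recommended package shipped with most R installations). The data are synthetic and only serve to show the mechanics.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Synthetic data (not the HDI data): three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(toy)  # pairwise Euclidean distances, reused for every k

# Average silhouette width for k = 2..6 (the silhouette is undefined for k = 1)
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the largest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

Because the silhouette rewards compact, well-separated groups, it can favour a coarse split when some clusters sit close together.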
The gap statistic was published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method (e.g. k-means clustering, hierarchical clustering). The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data (i.e. a distribution with no obvious clustering). The reference datasets are generated using Monte Carlo simulations of the sampling process.
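The gap statistic can also be computed without a plotting wrapper. The sketch below assumes the `cluster` package is available and uses its `clusGap()` and `maxSE()` functions on synthetic data; the number of bootstrap reference sets (`B = 50`) and the selection rule are illustrative choices, not settings taken from this analysis.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)
# Synthetic data (not the HDI data): three groups placed on the corners of a triangle
centers <- rbind(c(0, 0), c(4, 0), c(2, 3.5))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.3), rnorm(30, centers[i, 2], 0.3))
}))

# Gap statistic for k = 1..6, with 50 Monte Carlo reference datasets;
# nstart is passed through to kmeans()
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)

# Rule from Tibshirani et al.: smallest k whose gap is within one
# standard error of the gap at k + 1
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
best_k
```

For clearly separated groups like these, the rule will typically return the true number of groups; on real data different selection rules (the `method` argument of `maxSE()`) can disagree, which is one reason the three criteria in this section need not point to the same k.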
m1 <- fviz_nbclust(df_comp_z, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")
m2 <- fviz_nbclust(df_comp_z, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method")
m3 <- fviz_nbclust(df_comp_z, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap statistic method")
ggarrange(m1, m2, m3, ncol = 2, nrow = 2)
The elbow method suggests 4 clusters, while the silhouette method suggests 2 and the gap statistic method suggests 3. To see what 3 clusters look like, we can plot them:
df_clusters <- kmeans(df_comp_z, 3)
df$cluster <- as.factor(df_clusters$cluster)
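#geom_jitter() alone draws each country once; combining it with geom_point() would plot every point twice
ggplot(df, aes(cluster, HDI, col = cluster)) +
geom_jitter(alpha = 0.6, width = 0.2, height = 0) +
ggtitle("Distribution of clusters by HDI") +
theme_bw()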
However, even though 3 clusters is clearly a workable choice, we can see from the 4-cluster plot that there are differences even between countries that have high HDI and very high HDI. So we will keep the number of clusters obtained by the “Elbow method”.
df_clusters <- kmeans(df_comp_z, 4)
df$cluster <- as.factor(df_clusters$cluster)
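One caveat when re-running `kmeans()`: the cluster numbers are arbitrary and depend on the random starting centres, so the group called "cluster 3" in one run may get a different number in the next. A minimal sketch of one way to make the numbering stable is to renumber the clusters by the ascending centre of a chosen column. The helper name `relabel_by_center` is hypothetical and the data are synthetic.

```r
# Hypothetical helper: renumber k-means clusters by the ascending centre of one column
relabel_by_center <- function(fit, column = 1) {
  # rank of each centre: position = old label, value = new label
  new_id <- rank(fit$centers[, column], ties.method = "first")
  unname(new_id[fit$cluster])
}

set.seed(42)
# Synthetic one-dimensional data: three groups centred on 0, 3 and 6
toy <- matrix(c(rnorm(30, 0, 0.3), rnorm(30, 3, 0.3), rnorm(30, 6, 0.3)), ncol = 1)

fit_a <- kmeans(toy, centers = 3, nstart = 25)
fit_b <- kmeans(toy, centers = 3, nstart = 25)  # a second run may number the clusters differently

labels_a <- relabel_by_center(fit_a)
labels_b <- relabel_by_center(fit_b)
identical(labels_a, labels_b)  # same numbering once both runs are relabelled
```

In this analysis, a call such as `relabel_by_center(df_clusters, "HDI")` would number the clusters from lowest to highest average HDI regardless of the random start; the numbering would then differ from the run shown here.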
Now, let’s group by the cluster assignment and calculate averages:
df$cluster <- df_clusters$cluster
df_clus_avg <- df %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean, na.rm=TRUE)
df_clus_avg
## # A tibble: 4 x 7
## cluster HDI_rank HDI Life_exp Exp_years_edu Mean_years_edu GNI_capita
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 170. 0.496 61.7 9.37 4.33 3024.
## 2 2 137. 0.621 69.2 11.7 6.45 5797.
## 3 3 22.5 0.908 81.0 16.7 12.0 53057.
## 4 4 80.9 0.770 74.8 14.0 10.0 16238.
We can clearly see the divide between the countries: the ones at the bottom of the ranking are far away from their closest neighbors.
#assign clusters to the normalised dataset
df_comp_z_cluster <- df_comp_z
df_comp_z_cluster$cluster <- as.factor(df_clusters$cluster)
#group by clusters
df_clus_avg1 <- df_comp_z_cluster %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean, na.rm=TRUE)
#Create a parallel coordinate plot of the values:
library(GGally) #ggparcoord() comes from GGally, which builds on ggplot2
ggparcoord(df_clus_avg1, columns = c(2:6),
groupColumn = "cluster", scale = "globalminmax", order = "skewness") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
From the chart above we can see that cluster 3 has the highest HDI. Let’s take a look at the countries that it represents:
#cluster 3
df %>%
filter(cluster == 3) %>%
select(Country, HDI) %>%
arrange(desc(HDI)) %>%
head(n = 10)
## # A tibble: 10 x 2
## Country HDI
## <chr> <dbl>
## 1 Norway 0.957
## 2 Ireland 0.955
## 3 Switzerland 0.955
## 4 Hong Kong, China (SAR) 0.949
## 5 Iceland 0.949
## 6 Germany 0.947
## 7 Sweden 0.945
## 8 Australia 0.944
## 9 Netherlands 0.944
## 10 Denmark 0.94
On the other hand, cluster 1 has the lowest scores. After graduating, Niger will probably not be the country where data scientists go.
#cluster 1
df %>%
filter(cluster == 1) %>%
select(Country, HDI) %>%
arrange(HDI)%>%
head(n = 10)
## # A tibble: 10 x 2
## Country HDI
## <chr> <dbl>
## 1 Niger 0.394
## 2 Central African Republic 0.397
## 3 Chad 0.398
## 4 Burundi 0.433
## 5 South Sudan 0.433
## 6 Mali 0.434
## 7 Burkina Faso 0.452
## 8 Sierra Leone 0.452
## 9 Mozambique 0.456
## 10 Eritrea 0.459
#cluster 2
df %>%
filter(cluster == 2) %>%
select(Country, HDI, GNI_capita) %>%
arrange(desc(HDI)) %>%
head(n = 10)
## # A tibble: 10 x 3
## Country HDI GNI_capita
## <chr> <dbl> <dbl>
## 1 Egypt 0.707 11466.
## 2 Gabon 0.703 13930.
## 3 Morocco 0.686 7368.
## 4 Guyana 0.682 9455.
## 5 Iraq 0.674 10801.
## 6 El Salvador 0.673 8359.
## 7 Cabo Verde 0.665 7019.
## 8 Guatemala 0.663 8494.
## 9 Nicaragua 0.66 5284.
## 10 Bhutan 0.654 10746.
#cluster 4
df %>%
filter(cluster == 4) %>%
select(Country, HDI) %>%
arrange(desc(HDI)) %>%
head(n = 10)
## # A tibble: 10 x 2
## Country HDI
## <chr> <dbl>
## 1 Slovakia 0.86
## 2 Hungary 0.854
## 3 Chile 0.851
## 4 Croatia 0.851
## 5 Argentina 0.845
## 6 Montenegro 0.829
## 7 Romania 0.828
## 8 Palau 0.826
## 9 Kazakhstan 0.825
## 10 Russian Federation 0.824
We can see where Poland is in the ranking:
df %>%
filter(Country == "Poland") %>%
select(Country, HDI,HDI_rank, cluster, GNI_capita) %>%
arrange(desc(HDI)) %>%
head(n = 15)
## # A tibble: 1 x 5
## Country HDI HDI_rank cluster GNI_capita
## <chr> <dbl> <dbl> <int> <dbl>
## 1 Poland 0.88 35 3 31623.
Poland is quite well positioned among the world’s countries: it ranks 35th and falls in the highest-HDI cluster.
library(plotly)
fig <- plot_ly(df, type='choropleth', locations=df$Country, z=df$HDI,
locationmode ="country names")
fig <- fig %>% colorbar(title = "HDI")
fig
The Human Development Index is measured using three main criteria: gross national income (GNI) per capita, the life expectancy of the citizens of a state and the level of education, captured by expected and mean years of schooling. Introduced in 1990, it was designed as an alternative to GDP, which focuses only on economic criteria and largely obscures the level of individual and collective fulfillment. By including the education and life expectancy of the population, the index makes a more precise analysis of the development of states possible.
In particular, the HDI provides a better understanding of the divide that has existed for some fifty years between the developed countries of the so-called “North” and the developing countries of the so-called “South”. These labels simplify and summarize the inequalities between the countries of the “North” (Europe, the United States and Canada, Russia, Japan, Australia) and the countries of the “South” (Africa, South America, India and China).
On a map of the world, what emerges is what economist and HDI precursor Amartya Sen calls an “iron curtain of inequality” in his essay A New Economic Model (2005). While the countries of the North continue to progress, get richer and improve the standard of living of their inhabitants, the countries of the South at best pursue halting development. In the worst case, they suffer a drop in the standard of living of their inhabitants and stalled development. Access to drinking water, which has become a universal fundamental right, is almost absent in some countries of the South (such as South Sudan or the Central African Republic).
Using the latest available data, our cluster analysis gives a clearer view of how development is distributed across the world.