The aim of this paper is to analyze the similarities and differences in the basic country statistics of the African continent with the goal of finding a useful grouping. When we cluster observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. I will apply several clustering techniques to find subgroups of observations within the country data set. These techniques include partitioning clustering (K-means, PAM) and hierarchical clustering.
The Country Statistics UN dataset used in this analysis was found on the UN website. The dataset has 230 countries from all continents, with about 50 columns describing each country's economic statistics for the year 2017. Since the focus is the African continent, we filter countries from the African regions (Southern, Northern, Eastern, Western, and Middle Africa), which leaves 52 African countries after cleaning. The columns of interest are as follows:
• Population in thousands (2017)
• GDP: Gross domestic product (million current US$)
• Unemployment (% of labor force)
• International trade: Exports (million US$)
• International trade: Imports (million US$)
• Individuals using the Internet (per 100 inhabitants)
• CO2 emissions (million tons/tons per capita)
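load.libraries <- c('cluster', 'clustertend', 'corrplot', 'tidyverse', 'ggplot2', 'factoextra')
#Attach every package listed above (defining the vector alone loads nothing)
invisible(lapply(load.libraries, library, character.only = TRUE))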
suppressPackageStartupMessages(library(dendextend))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(data.table))
#loading the data set
countries_data <- read.csv("country_profile_variables.csv")
#Select data of interest and rename some columns
data <- countries_data[,c(1,2,4,7,16,20,21,42,45)]
new_names <- c("Country","Region","Population","GDP_millions","Unemployment_rate",
"Exports(millions)","Imports(millions)","Internet_use(/100)","cO2_emission")
names(data) <- new_names
We filter countries from the African regions (Southern, Northern, Eastern, Western, and Middle Africa).
library(data.table)
#Selecting African countries
data <- data[data$Region %like% "Africa", ]
The dataset now consists of 56 African countries. However, some columns contain placeholder values for missing data, such as ‘-~0.0’, ‘-99’, ‘…’ and ‘~0’. These placeholders are converted into NA values (except for CO2 emissions, where -99 is recoded as 0 in the code below), and observations with NA values are then omitted, because there is no good justification for the assumptions that imputation would require. After omitting observations with NAs, we are left with 52 African countries. Zimbabwe belongs to the Southern African region, so its region is changed from Eastern Africa to Southern Africa.
data[data$Country=="Zimbabwe",]$Region <- "SouthernAfrica"
data$GDP_millions <- ifelse(data$GDP_millions == -99,NA,data$GDP_millions)
data$Unemployment_rate <- ifelse(data$Unemployment_rate== '...'|data$Unemployment_rate== '-99',
NA,data$Unemployment_rate)
data$`Exports(millions)`<- ifelse(data$`Exports(millions)`== '~0'|data$`Exports(millions)`=='-99',
NA,data$`Exports(millions)`)
data$`Imports(millions)`<- ifelse(data$`Imports(millions)`== '-99'|data$`Imports(millions)`=='~0'
|data$`Imports(millions)`=='...',NA,data$`Imports(millions)`)
data$cO2_emission <- ifelse(data$cO2_emission== -99,0,data$cO2_emission)
data <- na.omit(data)
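The cleaning step above can also be written as one reusable helper. The following is a minimal sketch on a hypothetical toy data frame; the names `toy`, `na_codes` and `clean_numeric` are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical toy data mimicking the placeholder codes described above
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  GDP     = c("1200", "-99", "850", "430"),
  Exports = c("~0", "310", "...", "95"),
  stringsAsFactors = FALSE
)

# Placeholder strings treated as missing (assumed list)
na_codes <- c("-99", "...", "~0", "-~0.0")

# Replace placeholders with NA, then convert the column to numeric
clean_numeric <- function(x, codes = na_codes) {
  x <- trimws(as.character(x))
  x[x %in% codes] <- NA
  as.numeric(x)
}

num_cols <- c("GDP", "Exports")
toy[num_cols] <- lapply(toy[num_cols], clean_numeric)

# Keep only rows without missing values
toy_complete <- na.omit(toy)
print(toy_complete)
```

Handling the placeholders before the numeric conversion avoids the coercion warnings that `as.numeric()` would otherwise raise for strings such as '~0'.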
The table below shows the number of countries in each region.
#Shortening country name for clear visualization when clustering
data[data$Country=="Burkina Faso",]$Country <- "Burkina.F"
data[data$Country=="Sierra Leone",]$Country <- "Sierra.L"
data[data$Country=="Guinea-Bissau",]$Country <- "Guinea-B"
data[data$Country=="Sao Tome&Principe",]$Country <- "S.T&Principe"
data[data$Country=="United Republic of Tanzania",]$Country <- "Tanzania"
data[data$Country=="Democratic Republic of the Congo",]$Country <- "D.R Congo"
table(data$Region)
##
##  EasternAfrica   MiddleAfrica NorthernAfrica SouthernAfrica  WesternAfrica
##             16              9              6              6             15
Some numeric variables are stored as characters. We therefore need to convert these character variables to numeric using a for loop.
#Changing char variables to numeric
countries <- data[-c(1,2)]
for(item in colnames(countries)) {
countries[[item]] <- as.numeric(countries[[item]])
}
Since the variables in the dataset are measured on different scales, it is recommended to scale the data to make the variables comparable. This avoids a problem in which some features dominate solely because they tend to have larger values than others.
countries_scaled <- as.data.frame(lapply(countries, scale))
#Use country names as row labels so that they appear in the cluster plots
row.names(countries_scaled) <- data[,1]
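To make the effect of scaling concrete, here is a small sketch on hypothetical values (the data frame `toy` is an assumption, not the country data). It shows that `scale()` computes z-scores, so every column ends up with mean 0 and standard deviation 1.

```r
# Hypothetical toy data: two variables on very different scales
toy <- data.frame(
  population = c(1200, 45000, 980, 23000, 150),
  internet   = c(12.5, 48.0, 5.1, 30.2, 60.7)
)

# scale() subtracts the column mean and divides by the column sd
toy_scaled <- scale(toy)

# Manual z-score of one column for comparison
manual <- (toy$population - mean(toy$population)) / sd(toy$population)

col_means <- colMeans(toy_scaled)        # approximately 0 for every column
col_sds   <- apply(toy_scaled, 2, sd)    # 1 for every column
same      <- all.equal(as.numeric(toy_scaled[, "population"]), manual)
print(same)
```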
Here we look at the relationship between the variables using a correlation matrix. The correlation matrix summarizes the strength and direction of the linear relationship between each pair of variables in the country dataset. There are fairly tight linear relationships, such as the correlation between the GDP of a country and its CO2 emissions, imports and exports. We can also observe that the strongest positive linear correlation is between CO2 emissions and international trade exports.
library(corrplot)
## corrplot 0.91 loaded
corrplot(cor(countries_scaled, use="complete"), method="number", type="upper", diag=F,
tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="African Countries 2017", mar=c(0,0,1,0))
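The strongest pair can also be read off the correlation matrix programmatically rather than from the plot. The sketch below uses simulated data; the variables `x`, `y` and `z` are assumptions chosen so that the answer is known in advance.

```r
set.seed(42)
# Simulated toy data: y is almost a linear function of x, z is unrelated
x <- rnorm(50)
toy <- data.frame(x = x, y = 2 * x + rnorm(50, sd = 0.1), z = rnorm(50))

cm <- cor(toy)
cm[lower.tri(cm, diag = TRUE)] <- NA   # keep every pair only once

# Position of the largest absolute correlation
pos <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cm)[pos[1, "row"]], colnames(cm)[pos[1, "col"]])
strongest_r <- cm[pos[1, "row"], pos[1, "col"]]

print(strongest)
print(strongest_r)
```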
Before proceeding with any clustering method, it is worth assessing the general clustering tendency of the data. For this purpose, the Hopkins statistic and a visual assessment were used.
suppressPackageStartupMessages(library(factoextra))
get_clust_tendency(countries_scaled, 2, graph=TRUE, gradient=list(low = 'blue', high = 'white'),seed=1234)
## $hopkins_stat
## [1] 0.637119
##
## $plot
In the context of the conducted analysis, the results are rather satisfactory. The Hopkins statistic is equal to 0.637119, which is above 0.5. According to the R documentation, values far above 0.5 (close to 1) indicate a significantly clusterable data set, so this value suggests that the data have a moderate clustering tendency.
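For intuition about what the statistic measures, the following is a simplified sketch of one common form of the Hopkins statistic. The function `hopkins_sketch` is hypothetical and written for illustration only; `get_clust_tendency()` may use a different sampling scheme or convention, so the values are not expected to match.

```r
# Simplified Hopkins statistic: compares nearest-neighbour distances of
# uniformly generated points (u) with those of sampled real points (w).
hopkins_sketch <- function(x, m = 10) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  stopifnot(m < n)

  # Distance from a point to its nearest neighbour among the rows of pts
  nearest <- function(pt, pts) min(sqrt(rowSums(sweep(pts, 2, pt)^2)))

  idx <- sample(n, m)                       # sampled real observations
  lo  <- apply(x, 2, min)
  hi  <- apply(x, 2, max)
  # m uniform points inside the bounding box of the data
  unif <- matrix(runif(m * p, min = rep(lo, each = m), max = rep(hi, each = m)),
                 nrow = m)

  u <- sapply(seq_len(m), function(i) nearest(unif[i, ], x))
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))
  sum(u) / (sum(u) + sum(w))
}

set.seed(2022)
# Toy data with two tight, well-separated groups
clustered <- rbind(matrix(rnorm(60, mean = 0,  sd = 0.1), ncol = 2),
                   matrix(rnorm(60, mean = 10, sd = 0.1), ncol = 2))
h <- hopkins_sketch(clustered, m = 10)
print(h)
```

Under this convention, values near 0.5 correspond to data that look uniformly scattered, and values approaching 1 correspond to strongly clustered data.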
In the next step it is necessary to obtain the optimal number of clusters for each partitioning clustering method. Since the analyzed data set is rather small, there is no need to consider CLARA, which is intended for large data sets. However, for comparative purposes, both K-means and PAM will be implemented. The optimal number of clusters will be chosen primarily based on the silhouette statistic.
f1 <- fviz_nbclust(countries_scaled, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(countries_scaled, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
The results indicate that the optimal number of clusters for the countries dataset is 2 for both K-means and PAM. The average silhouette width for K-means and PAM is comparable in the case of 2 clusters. What is more, in both cases (K-means, PAM), the average silhouette width for 3 clusters is lower than its value for the optimal number of clusters.
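The silhouette criterion used by `fviz_nbclust()` can be reproduced by hand. The sketch below does so on simulated data with three well-separated groups (the object `toy` is an assumption), using `silhouette()` from the cluster package.

```r
library(cluster)  # silhouette()

set.seed(123)
# Simulated toy data: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2),
  cbind(rnorm(20, mean = 0, sd = 0.5), rnorm(20, mean = 10, sd = 0.5))
)
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the largest average silhouette width is preferred
best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```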
To confirm the results, it is always good to look at an alternative method. Therefore, I check the stability of the results obtained above using the within-cluster sum of squares (WSS) statistic.
f1 <- fviz_nbclust(countries_scaled, FUNcluster = kmeans, method = "wss") +
geom_vline(xintercept = 2, linetype = 2)+
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(countries_scaled, FUNcluster = cluster::pam, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
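The WSS curve behind these plots can be computed directly from `kmeans()`. This is a minimal sketch on simulated data with three groups (the object `toy` is an assumption); the "elbow" is the point after which adding a cluster reduces the WSS only marginally.

```r
set.seed(123)
# Simulated toy data: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2),
  cbind(rnorm(20, mean = 0, sd = 0.5), rnorm(20, mean = 10, sd = 0.5))
)

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Reduction in WSS gained by each additional cluster
wss_drop <- -diff(wss)
print(round(wss, 1))
print(round(wss_drop, 1))
```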
Summing up, in both cases (K-means and PAM) the division into 2 clusters seems to be the most promising. However, due to the subject of interest of the analysis and the obtained results, the cases of 3 and 4 clusters will also be considered and analyzed.
Clustering will be produced based on the K-means algorithm for the case with two, three and four clusters. It was decided to use Euclidean distance to calculate dissimilarities between observations.
km2 <- eclust(countries_scaled, k=2 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c2 <- fviz_cluster(km2, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
s2 <- fviz_silhouette(km2)
##   cluster size ave.sil.width
## 1       1   48          0.72
## 2       2    4          0.23
grid.arrange(c2, s2, ncol=2)
km3 <- eclust(countries_scaled, k=3 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c3 <- fviz_cluster(km3, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 3 clusters")
s3 <- fviz_silhouette(km3)
##   cluster size ave.sil.width
## 1       1   11          0.47
## 2       2    4          0.22
## 3       3   37          0.32
grid.arrange(c3, s3, ncol=2)
km4 <- eclust(countries_scaled, k=4 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c4 <- fviz_cluster(km4, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 4 clusters")
s4 <- fviz_silhouette(km4)
##   cluster size ave.sil.width
## 1       1    9          0.53
## 2       2    4          0.21
## 3       3   36          0.42
## 4       4    3          0.51
grid.arrange(c4, s4, ncol=2)
There are a total of 5 regions on the African continent. The table below shows the distribution of country regions among clusters for K-means with k = 4.
table(data$Region, km4$cluster)
##
##                   1  2  3  4
##   EasternAfrica   2  0 12  2
##   MiddleAfrica    1  0  7  1
##   NorthernAfrica  1  2  3  0
##   SouthernAfrica  4  1  1  0
##   WesternAfrica   1  1 13  0
The silhouette statistic is highest in the case of 2 clusters; however, there is an imbalance in cluster membership, as cluster 1 has 48 countries compared to cluster 2 with just 4 countries. For both 3 and 4 clusters, negative silhouette values appear for some observations (the per-cluster averages printed above are all positive), which means that some countries may be misallocated. K-means with 4 clusters, however, has a slightly higher silhouette statistic than K-means with 3 clusters. It can also be easily noticed that the 4-cluster solution was created by splitting one of the clusters from the K-means solution with 2 clusters.
fviz_cluster(km4, data = countries_scaled)
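The overall average silhouette width of each K-means solution can be recovered from the per-cluster tables printed above as a size-weighted mean. The sketch below uses only the sizes and widths shown in those tables; the helper `avg_sil` is a hypothetical name, and because the printed widths are rounded to two decimals the results are approximate.

```r
# Size-weighted mean of the per-cluster average silhouette widths
avg_sil <- function(size, width) sum(size * width) / sum(size)

# Cluster sizes and average widths copied from the printed K-means output
k2 <- avg_sil(c(48, 4),        c(0.72, 0.23))
k3 <- avg_sil(c(11, 4, 37),    c(0.47, 0.22, 0.32))
k4 <- avg_sil(c(9, 4, 36, 3),  c(0.53, 0.21, 0.42, 0.51))

print(round(c(k2 = k2, k3 = k3, k4 = k4), 3))
# k2 = 0.682, k3 = 0.344, k4 = 0.428
```

These approximate values agree with the ranking discussed above: 2 clusters score highest, and 4 clusters score higher than 3.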
The tables below present the basic statistics for the characteristics of each cluster. It is worth recalling that the data has been scaled, hence negative values appear.
countries_cl <- as.data.frame(cbind(countries_scaled, km4$cluster))
colnames(countries_cl ) <- c(colnames(countries_scaled),"cluster")
# Cluster 1 (red)
summary(countries_cl[countries_cl$cluster==1,1:7]) %>% kable()
| Population | GDP_millions | Unemployment_rate | Exports.millions. | Imports.millions. | Internet_use..100. | cO2_emission |
|---|---|---|---|---|---|---|
| Min. :-0.6696 | Min. :-0.4572 | Min. :1.082 | Min. :-0.415946 | Min. :-0.572688 | Min. :-0.7474 | Min. :-0.4301 |
| 1st Qu.:-0.6337 | 1st Qu.:-0.4456 | 1st Qu.:1.229 | 1st Qu.:-0.380925 | 1st Qu.:-0.537428 | 1st Qu.:-0.6835 | 1st Qu.:-0.4221 |
| Median :-0.6276 | Median :-0.3418 | Median :1.883 | Median :-0.258071 | Median :-0.393973 | Median :-0.5516 | Median :-0.4120 |
| Mean :-0.5311 | Mean :-0.3495 | Mean :1.739 | Mean :-0.258833 | Mean :-0.345516 | Mean :-0.3673 | Mean :-0.2649 |
| 3rd Qu.:-0.6187 | 3rd Qu.:-0.3100 | 3rd Qu.:2.016 | 3rd Qu.:-0.159699 | 3rd Qu.:-0.186788 | 3rd Qu.:-0.3597 | 3rd Qu.:-0.1511 |
| Max. : 0.1852 | Max. :-0.0904 | Max. :2.630 | Max. :-0.008795 | Max. :-0.005722 | Max. : 0.4157 | Max. : 0.3236 |
According to these statistics, countries in cluster 1 generally have the highest unemployment rates compared to the other clusters and, like the countries in cluster 3, relatively small populations.
# Cluster 2 (green)
summary(countries_cl[countries_cl$cluster==2,1:7]) %>% kable()
| Population | GDP_millions | Unemployment_rate | Exports.millions. | Imports.millions. | Internet_use..100. | cO2_emission |
|---|---|---|---|---|---|---|
| Min. :0.5303 | Min. :1.336 | Min. :-0.61192 | Min. :0.8381 | Min. :1.800 | Min. :-0.2798 | Min. :1.329 |
| 1st Qu.:0.8724 | 1st Qu.:2.566 | 1st Qu.:-0.01155 | 1st Qu.:1.1511 | 1st Qu.:2.317 | 1st Qu.:-0.2168 | 1st Qu.:2.229 |
| Median :1.5914 | Median :2.983 | Median : 0.19525 | Median :2.4857 | Median :2.847 | Median : 0.2138 | Median :2.831 |
| Mean :2.1686 | Mean :3.062 | Mean : 0.47876 | Mean :2.7992 | Mean :2.948 | Mean : 0.4127 | Mean :3.001 |
| 3rd Qu.:2.8875 | 3rd Qu.:3.479 | 3rd Qu.: 0.68556 | 3rd Qu.:4.1337 | 3rd Qu.:3.478 | 3rd Qu.: 0.8433 | 3rd Qu.:3.602 |
| Max. :4.9612 | Max. :4.946 | Max. : 2.13647 | Max. :5.3873 | Max. :4.295 | Max. : 1.5028 | Max. :5.013 |
This cluster consists of the most developed countries on the African continent, with the highest gross domestic product and high international trade imports and exports. These countries have developed economies and are the top 4 countries in Africa by GDP according to 2021 statistics. Here, we also observe high CO2 emissions compared to countries in other clusters.
# Cluster 3 (blue)
summary(countries_cl[countries_cl$cluster==3,1:7]) %>% kable()
| Population | GDP_millions | Unemployment_rate | Exports.millions. | Imports.millions. | Internet_use..100. | cO2_emission |
|---|---|---|---|---|---|---|
| Min. :-0.6909 | Min. :-0.4639 | Min. :-1.1990 | Min. :-0.4165 | Min. :-0.5762 | Min. :-0.6955 | Min. :-0.4316 |
| 1st Qu.:-0.5541 | 1st Qu.:-0.4265 | 1st Qu.:-0.7854 | 1st Qu.:-0.3845 | 1st Qu.:-0.5213 | 1st Qu.:-0.5026 | 1st Qu.:-0.4078 |
| Median :-0.3245 | Median :-0.3725 | Median :-0.4385 | Median :-0.3201 | Median :-0.4175 | Median :-0.3297 | Median :-0.3822 |
| Mean :-0.1384 | Mean :-0.2394 | Mean :-0.4125 | Mean :-0.2257 | Mean :-0.2209 | Mean :-0.2389 | Mean :-0.2517 |
| 3rd Qu.:-0.1082 | 3rd Qu.:-0.2239 | 3rd Qu.:-0.2117 | 3rd Qu.:-0.2564 | 3rd Qu.:-0.1746 | 3rd Qu.:-0.1119 | 3rd Qu.:-0.2262 |
| Max. : 2.4156 | Max. : 0.8236 | Max. : 0.6155 | Max. : 0.8576 | Max. : 2.1373 | Max. : 1.0991 | Max. : 1.5264 |
Most of the countries fall into cluster 3, with similar characteristics of low- to middle-developed economies. Countries including Zimbabwe, Zambia, Burkina Faso, Mali and Chad are closer to the centroid of cluster 3, and it can be observed that these are among the least developed economies in the country data set, with low GDP output and lower international trade imports and exports.
# Cluster 4 (purple)
summary(countries_cl[countries_cl$cluster==4,1:7]) %>% kable()
| Population | GDP_millions | Unemployment_rate | Exports.millions. | Imports.millions. | Internet_use..100. | cO2_emission |
|---|---|---|---|---|---|---|
| Min. :0.01886 | Min. :-0.36095 | Min. :-1.0255 | Min. :-0.2983 | Min. :-0.39169 | Min. :2.278 | Min. :-0.36686 |
| 1st Qu.:0.04133 | 1st Qu.:-0.25874 | 1st Qu.:-0.9988 | 1st Qu.:-0.2948 | 1st Qu.:-0.32855 | 1st Qu.:2.892 | 1st Qu.:-0.29686 |
| Median :0.06380 | Median :-0.15652 | Median :-0.9721 | Median :-0.2913 | Median :-0.26540 | Median :3.505 | Median :-0.22687 |
| Mean :0.36223 | Mean :-0.16186 | Mean :-0.9054 | Mean :-0.2474 | Mean :-0.24270 | Mean :3.419 | Mean :-0.18538 |
| 3rd Qu.:0.53392 | 3rd Qu.:-0.06232 | 3rd Qu.:-0.8454 | 3rd Qu.:-0.2219 | 3rd Qu.:-0.16821 | 3rd Qu.:3.989 | 3rd Qu.:-0.09464 |
| Max. :1.00405 | Max. : 0.03188 | Max. :-0.7187 | Max. :-0.1526 | Max. :-0.07102 | Max. :4.472 | Max. : 0.03758 |
Countries in cluster 4 have the highest internet use per 100 inhabitants and generally larger populations than countries in clusters 1 and 3.
It is worth noting that countries like Morocco, Angola, Ethiopia, Kenya and D.R. Congo are on the boundary of cluster 3, closer to cluster 2, because they have higher GDP output and international trade imports and exports than the rest of the countries in that cluster. These countries are among the top 10 countries with the highest GDP output and can be considered mid-developed economies.
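The per-cluster summaries above can be condensed into a single profile table of cluster means, which makes it easier to see which cluster leads on which variable. This is a minimal sketch on a hypothetical scaled data frame; `toy`, `profile` and `top_cluster` are assumed names and the values are invented for illustration.

```r
# Hypothetical scaled values with a cluster label per observation
toy <- data.frame(
  gdp      = c(-0.4, -0.3, 3.1, 2.9, -0.2, -0.5),
  internet = c(-0.6, -0.4, 0.4, 0.5, 3.2, 3.6),
  cluster  = c(1, 1, 2, 2, 3, 3)
)

# Mean of every variable within each cluster
profile <- aggregate(cbind(gdp, internet) ~ cluster, data = toy, FUN = mean)
print(profile)

# For each variable, the cluster with the highest mean
top_cluster <- sapply(profile[-1], function(v) profile$cluster[which.max(v)])
print(top_cluster)
```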
The ranking of 52 African countries by GDP output (2021 statistics) can be found at the link below:
(https://www.statista.com/statistics/1120999/gdp-of-african-countries-by-country/).
In this part the clustering is produced based on the PAM algorithm.
#PAM with 2 clusters
pam2 <- eclust(countries_scaled, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp2 <- fviz_cluster(pam2, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)
##   cluster size ave.sil.width
## 1       1    4          0.23
## 2       2   48          0.72
grid.arrange(cp2, sp2, ncol=2)
#PAM with 3 clusters
pam3 <- eclust(countries_scaled, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp3 <- fviz_cluster(pam3, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)
##   cluster size ave.sil.width
## 1       1    4          0.22
## 2       2   38          0.33
## 3       3   10          0.50
grid.arrange(cp3, sp3, ncol=2)
#PAM with 4 clusters
pam4 <- eclust(countries_scaled, k=4 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp4 <- fviz_cluster(pam4, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 4 clusters")
sp4 <- fviz_silhouette(pam4)
##   cluster size ave.sil.width
## 1       1    4          0.21
## 2       2   35          0.43
## 3       3   10          0.46
## 4       4    3          0.51
grid.arrange(cp4, sp4, ncol=2)
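One practical difference between the two algorithms is that PAM represents each cluster by a medoid, which is an actual observation, whereas a K-means centre is an average and usually does not coincide with any observation. The sketch below illustrates this on simulated data; the object `toy` is an assumption and the result is not taken from the country data.

```r
library(cluster)  # pam()

set.seed(1)
# Simulated toy data: two groups in two dimensions
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2))

km <- kmeans(toy, centers = 2, nstart = 25)
pm <- pam(toy, k = 2)

# PAM medoids are rows of the data, indexed by id.med
medoid_is_row <- all(pm$medoids == toy[pm$id.med, , drop = FALSE])

print(km$centers)   # cluster means
print(pm$medoids)   # representative observations
print(pm$id.med)    # row numbers of the medoids
```

With country data, this means every PAM cluster can be described by a representative country.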
There are a total of 5 regions on the African continent. The table below shows the distribution of country regions among clusters for PAM with k = 4.
table(data$Region, pam4$cluster)
##
##                   1  2  3  4
##   EasternAfrica   0 12  2  2
##   MiddleAfrica    0  7  1  1
##   NorthernAfrica  2  2  2  0
##   SouthernAfrica  1  1  4  0
##   WesternAfrica   1 13  1  0
Summarizing PAM clustering, the silhouette statistic for PAM is almost the same as that for K-means, and negative silhouette values are also observed in PAM clustering.
fviz_cluster(pam4, data = countries_scaled)
Using PAM, we observe a much clearer cluster visualization of the countries, especially in the case of Gabon, which lay on the boundary of a neighboring cluster under K-means. Using the PAM algorithm, we also observe that Tunisia now falls into a different cluster (cluster 3, blue) than the one allocated by K-means, and there is now a clearer separation between these 2 neighboring clusters.
As for the differences in countries based on country statistics, the results are similar to the K-means cases, with highly developed economies clustered together.
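How similar two partitions are can be checked with a cross-tabulation of the cluster labels, which works even though K-means and PAM number their clusters differently. The sketch below uses simulated data; `toy` and the simple `agreement` measure are assumptions made for illustration.

```r
library(cluster)  # pam()

set.seed(7)
# Simulated toy data: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2),
  cbind(rnorm(20, mean = 0, sd = 0.5), rnorm(20, mean = 10, sd = 0.5))
)

km_cl  <- kmeans(toy, centers = 3, nstart = 25)$cluster
pam_cl <- pam(toy, k = 3)$clustering

# Rows: K-means clusters, columns: PAM clusters
tab <- table(kmeans = km_cl, pam = pam_cl)
print(tab)

# Share of observations in the best-matching PAM cluster of each K-means cluster
agreement <- sum(apply(tab, 1, max)) / sum(tab)
print(agreement)
```

Applied to the objects in this paper, the analogous call would be `table(km4$cluster, pam4$cluster)`.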
As the last clustering method, hierarchical clustering will be used. The idea is to build a hierarchy of clusters that depends on the chosen way of calculating the similarity between clusters. In the analysis below I will use the agglomerative hierarchical clustering technique. In this technique all observations start in their own clusters, and the most similar clusters are then merged iteratively until a single cluster is formed.
In the hierarchical clustering method, it is necessary to compute the dissimilarity matrix and to specify the linkage method. There are multiple options for the linkage method, but in this paper I will compare only two: the Ward.D method and complete linkage. The first is frequently claimed to be a sensible default, especially when there is no clear theoretical justification for any other linkage criterion. The second, however, does well in clustering when there is some kind of noise between clusters.
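The agglomerative procedure can be inspected directly in the object returned by `hclust()`. The sketch below uses the built-in `USArrests` data purely as a stand-in (an assumption, since the country data are not reproduced here): with n observations there are n - 1 merges, and each row of `merge` records which two clusters were joined.

```r
# Built-in data used as a stand-in for the country statistics
x  <- scale(USArrests)
hc <- hclust(dist(x, method = "euclidean"), method = "complete")

# One merge per step until a single cluster remains: n - 1 merges
n_merges <- nrow(hc$merge)

# Negative entries are single observations, positive entries refer to
# clusters formed in an earlier step
print(head(hc$merge, 3))

# Merge heights never decrease as the clusters grow
heights_sorted <- !is.unsorted(hc$height)
print(n_merges)
print(heights_sorted)
```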
#Complete linkage
dist_countries <- dist(countries_scaled, method = "euclidean")
hc_countries <- hclust(dist_countries, method = "complete")
dend_countries <- as.dendrogram(hc_countries)
#Color the branches by the same 2 clusters that rect.hclust() outlines below
dend_countries_colored <- color_branches(dend_countries, k = 2)
plot(dend_countries_colored, main = "Hierarchical clustering across 52 African Countries")
rect.hclust(hc_countries, k=2, border='blue')
#ward D
dist_countries <- dist(countries_scaled, method = "euclidean")
hc_countries <- hclust(dist_countries, method = "ward.D")
dend_countries <- as.dendrogram(hc_countries)
#Color the branches by the same 4 clusters that rect.hclust() outlines below
dend_countries_colored <- color_branches(dend_countries, k = 4)
plot(dend_countries_colored, main = "Clustering across 52 African Countries")
rect.hclust(hc_countries, k=4, border='blue')
The results obtained using Ward's method and complete linkage are different. Using the complete linkage method, countries are grouped better with 2 clusters, which was suggested as the optimal number of clusters for K-means. Using Ward's linkage method, however, countries are grouped into four clusters, which is consistent with the previous results obtained with K-means and PAM. Moreover, the structure of the dendrograms shows how countries connect based on the similarities between their country statistics (GDP, population, CO2 emissions) and ultimately how the resulting clusters merge into one large cluster.
In this paper, I analyzed the grouping of African countries according to the similarities and differences in their basic country statistics for the year 2017. Based on fundamental clustering methods, it was shown that clustering the countries into 4 clusters using PAM was the most effective approach, as it provided 4 distinct clusters of countries based on their similarities. It was also observed that some countries were clustered according to their level of economic development.
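To compare the two linkage methods beyond the dendrograms, the trees can be cut into the same number of clusters with `cutree()` and the memberships cross-tabulated. The sketch below uses the built-in `USArrests` data as a stand-in (an assumption); it shows the mechanics only and says nothing about the country results.

```r
# Built-in data used as a stand-in for the country statistics
x <- scale(USArrests)
d <- dist(x, method = "euclidean")

hc_complete <- hclust(d, method = "complete")
hc_ward     <- hclust(d, method = "ward.D")

# Cut both trees into 4 clusters
cl_complete <- cutree(hc_complete, k = 4)
cl_ward     <- cutree(hc_ward, k = 4)

# Cluster sizes under each linkage and their overlap
print(table(cl_complete))
print(table(cl_ward))
tab <- table(complete = cl_complete, ward = cl_ward)
print(tab)
```

The same membership vectors could be tabulated against `km4$cluster` or `pam4$cluster` to quantify how closely the hierarchical solution matches the partitioning results.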