Cluster Analysis

Introduction.

The aim of this paper is to analyze the similarities and differences in the basic country statistics of the African continent, with the goal of finding a useful grouping. When we cluster observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. I will apply several clustering techniques to find subgroups of observations within the country data set. These techniques include partitioning clustering (K-means, PAM) and hierarchical clustering.

Review of the Data set.

The Country Statistics UN dataset used in this analysis was found on the UN website. It covers 230 countries from all continents, with about 50 columns describing each country's economic statistics for the year 2017. Because the African continent is the subject of interest, we filter countries from the African regions (Southern, Northern, Eastern, Middle, and Western), which gives 56 African countries before cleaning and 52 after incomplete observations are removed. The columns of interest are as follows:

• Population in thousands (2017)

• GDP: Gross domestic product (million current US$)

• Unemployment (% of labor force)

• International trade: Exports (million US$)

• International trade: Imports (million US$)

• Individuals using the Internet (per 100 inhabitants)

• CO2 emissions (million tons/tons per capita)

Exploratory Data Analysis.

Loading the relevant data analysis libraries.

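# Packages used in this analysis. This vector only lists them; the ones actually
# needed are attached with library() below or called as pkg::fun()
load.libraries <- c( 'cluster', 'clustertend','corrplot', 'tidyverse','ggplot2','factoextra')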
suppressPackageStartupMessages(library(dendextend))
suppressPackageStartupMessages(library(gridExtra))

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(data.table))

#loading the data set
countries_data <- read.csv("country_profile_variables.csv")

#Select data of interest and rename some columns
data <- countries_data[,c(1,2,4,7,16,20,21,42,45)]
new_names <- c("Country","Region","Population","GDP_millions","Unemployment_rate",
               "Exports(millions)","Imports(millions)","Internet_use(/100)","cO2_emission")
names(data) <- new_names

We filter countries from the African regions (Southern, Northern, Eastern, Middle, and Western).

library(data.table)
#Selecting African countries
data <- data[data$Region %like% "Africa", ]

The dataset now consists of 56 African countries; however, some columns contain placeholder values for missing data such as ‘-~0.0’, ‘-99’, ‘...’ and ‘~0’. These values are converted into NA (in the code below, a CO2 emission value of -99 is set to 0 rather than NA), and observations with NA values are then omitted, because there isn’t any good justification for imputing them. After omitting observations with NA’s, we are left with 52 African countries. Zimbabwe is listed under Eastern Africa in the dataset; since it is a Southern African country, its region was changed to Southern Africa.

data[data$Country=="Zimbabwe",]$Region <- "SouthernAfrica"
data$GDP_millions <- ifelse(data$GDP_millions == -99,NA,data$GDP_millions)
data$Unemployment_rate <- ifelse(data$Unemployment_rate== '...'|data$Unemployment_rate== '-99',
                                    NA,data$Unemployment_rate)
data$`Exports(millions)`<- ifelse(data$`Exports(millions)`== '~0'|data$`Exports(millions)`=='-99',
                                     NA,data$`Exports(millions)`)
data$`Imports(millions)`<- ifelse(data$`Imports(millions)`== '-99'|data$`Imports(millions)`=='~0'
                                     |data$`Imports(millions)`=='...',NA,data$`Imports(millions)`)
data$cO2_emission <- ifelse(data$cO2_emission== -99,0,data$cO2_emission)
data <- na.omit(data)
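
The column-by-column cleaning above could also be written with one small helper. The following is only a sketch on made-up toy values: the helper name `to_numeric_na` and the placeholder list are assumptions for illustration, and unlike the code above it treats every placeholder as NA (the original sets a CO2 value of -99 to 0).

```r
# Hypothetical helper: turn placeholder strings into NA, then convert to numeric.
# The placeholder list is an assumption based on the values named in the text.
to_numeric_na <- function(x, placeholders = c("-99", "...", "~0", "-~0.0")) {
  x <- trimws(as.character(x))
  x[x %in% placeholders] <- NA
  as.numeric(x)
}

# Toy data frame (made-up values, not taken from the UN dataset)
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  GDP     = c("1200", "-99", "560", "80"),
  Exports = c("300", "45", "~0", "..."),
  stringsAsFactors = FALSE
)

# Apply the helper to the columns that should be numeric
num_cols <- c("GDP", "Exports")
toy[num_cols] <- lapply(toy[num_cols], to_numeric_na)

# Keep only the rows without any missing value
toy_complete <- na.omit(toy)
print(toy_complete)
```

Replacing the placeholders before calling `as.numeric()` avoids the coercion warnings that would otherwise appear, and keeps the list of missing-value codes in one place.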

The table below shows the number of countries in each region.

#Shortening country name for clear visualization when clustering
data[data$Country=="Burkina Faso",]$Country <- "Burkina.F"
data[data$Country=="Sierra Leone",]$Country <- "Sierra.L"
data[data$Country=="Guinea-Bissau",]$Country <- "Guinea-B"
data[data$Country=="Sao Tome&Principe",]$Country <- "S.T&Principe"
data[data$Country=="United Republic of Tanzania",]$Country <- "Tanzania"
data[data$Country=="Democratic Republic of the Congo",]$Country <- "D.R Congo"
table(data$Region)

## 
##  EasternAfrica   MiddleAfrica NorthernAfrica SouthernAfrica  WesternAfrica 
##             16              9              6              6             15

Some numeric variables are stored as characters. We therefore need to convert these character variables to numeric, using a for loop.

#Changing char variables to numeric
countries <- data[-c(1,2)]
for(item in colnames(countries)) {
  countries[[item]] <- as.numeric(countries[[item]])
}

Since the variables in the dataset are measured on different scales, it is recommended to scale the data in order to make those variables comparable. This avoids a problem in which some features come to dominate solely because they tend to have larger values than others.

countries_scaled <- as.data.frame(lapply(countries, scale))
# View(countries_scaled)  # optional interactive inspection; not needed for the analysis
row.names(countries_scaled) <- data[,1]
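
As a quick illustration of what `scale()` does, the sketch below standardizes two made-up columns by hand and compares the result with `scale()`. The toy values and column names are assumptions for illustration only.

```r
# Toy values (made up) measured on very different scales
toy <- data.frame(population = c(1000, 25000, 180000, 52000),
                  internet   = c(12, 55, 25, 40))

# z-score by hand: subtract the column mean, divide by the column standard deviation
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() returns a matrix, so convert it back to a data frame
z_scale <- as.data.frame(scale(toy))

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(z_scale), 10))
print(sapply(z_scale, sd))
```

Because every column ends up with the same mean and spread, no single variable dominates the Euclidean distances used later.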

Correlation Matrix.

Here we look at the relationships between the variables using a correlation matrix, which shows the strength and direction of the linear association between each pair of variables in the country dataset. There are fairly strong linear relationships, such as those between a country's GDP and its CO2 emissions, imports and exports. The strongest positive linear correlation is between CO2 emissions and international trade exports.

library(corrplot)

corrplot(cor(countries_scaled, use="complete"), method="number", type="upper", diag=F,
         tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="African Countries 2017", mar=c(0,0,1,0))

Clustering tendency.

Before proceeding with any clustering method, it is worth assessing the general clustering tendency of the data. For this purpose, the Hopkins statistic and a visual assessment were used.

suppressPackageStartupMessages(library(factoextra))

get_clust_tendency(countries_scaled, 2, graph=TRUE, gradient=list(low = 'blue', high = 'white'),seed=1234)

## $hopkins_stat
## [1] 0.637119
## 
## $plot

In the context of this analysis, the result is reasonably satisfactory. The Hopkins statistic equals 0.637119, which is above 0.5; according to the R documentation, this indicates that the data set has a tendency to cluster, although the value is only moderately above that threshold.
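
To make the statistic less of a black box, here is a minimal base-R sketch of one common formulation of the Hopkins statistic, in which values above 0.5 point towards clustering. It is an illustration on toy data, not the code used by `get_clust_tendency()`; the function name `hopkins_sketch` and the toy data are assumptions.

```r
# Minimal Hopkins statistic in base R (illustrative, not factoextra's code).
# H = sum(u) / (sum(u) + sum(w)), where
#   w = nearest-neighbour distances of n sampled real points to the other real points
#   u = nearest-neighbour distances of n uniform random points to the real points
hopkins_sketch <- function(x, n, seed = 1234) {
  x <- as.matrix(x)
  set.seed(seed)
  idx <- sample(nrow(x), n)

  # w: distance from each sampled real point to its nearest other real point
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # u: uniform points drawn inside the bounding box of the data
  rand <- apply(x, 2, function(col) runif(n, min(col), max(col)))
  rand <- matrix(rand, nrow = n)
  u <- apply(rand, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  sum(u) / (sum(u) + sum(w))
}

# Toy data: two tight, well-separated blobs (made up for illustration)
set.seed(1)
clustered <- rbind(matrix(rnorm(50, mean = 0,  sd = 0.1), ncol = 2),
                   matrix(rnorm(50, mean = 10, sd = 0.1), ncol = 2))
h <- hopkins_sketch(clustered, n = 10)
print(h)
```

For strongly clustered data such as this toy example the value is close to 1, while uniformly scattered data gives values around 0.5.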

The optimal number of clusters.

In the next step it is necessary to obtain the optimal number of clusters for each partitional clustering method. Since the analyzed data set is rather small, there is no need to consider CLARA, which is intended for large data sets. However, for comparative purposes, both K-means and PAM will be implemented. The optimal number of clusters will be chosen primarily based on the silhouette statistic.

Silhouette statistic.

f1 <- fviz_nbclust(countries_scaled, FUNcluster = kmeans, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(countries_scaled, FUNcluster = cluster::pam, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
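
The silhouette criterion plotted above can be reproduced by hand. The sketch below is an illustration on made-up toy data (not the country data): it runs base `kmeans()` for several values of k and averages the silhouette widths returned by `cluster::silhouette()`.

```r
library(cluster)  # silhouette()

set.seed(42)
# Toy data: two well-separated blobs (made up for illustration)
x <- rbind(matrix(rnorm(50, mean = 0,  sd = 0.5), ncol = 2),
           matrix(rnorm(50, mean = 10, sd = 0.5), ncol = 2))
d <- dist(x)

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

# The k with the largest average silhouette width is the suggested optimum
best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

The silhouette width is not defined for a single cluster, which is why the search starts at k = 2.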

Analysis of Silhouette statistic.

The results indicate that the optimal number of clusters for the countries dataset is 2 for both K-means and PAM, and the average silhouette width of the two methods is comparable at 2 clusters. In both cases (K-means, PAM), the average silhouette width for 3 clusters is noticeably lower than its value for the optimal number of clusters.

Total within-cluster sum of squares.

To confirm the results, it is always good to look at an alternative method. Therefore, I check the stability of the results obtained above by using the total within-cluster sum of squares (WSS).

f1 <- fviz_nbclust(countries_scaled, FUNcluster = kmeans, method = "wss") +
  geom_vline(xintercept = 2, linetype = 2)+
  ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(countries_scaled, FUNcluster = cluster::pam, method = "wss") + 
  geom_vline(xintercept = 4, linetype = 2)+
  ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
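
The WSS ("elbow") curve can also be computed directly. This is a sketch on made-up toy data using base `kmeans()`; the variable names are illustrative.

```r
set.seed(42)
# Toy data: two well-separated blobs (made up for illustration)
x <- rbind(matrix(rnorm(50, mean = 0,  sd = 0.5), ncol = 2),
           matrix(rnorm(50, mean = 10, sd = 0.5), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
names(wss) <- ks

# Reduction in WSS gained by adding one more cluster; the "elbow" is where
# this gain becomes small
wss_drop <- -diff(wss)
print(round(wss, 2))
print(round(wss_drop, 2))
```

With k = 1 the WSS equals the total sum of squares around the overall mean, so the curve always starts at its maximum; the useful information is where the gain from an extra cluster flattens out.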

Analysis of Total within-cluster sum of squares

Summing up, in both cases (K-means and PAM) the division into 2 clusters seems to be the most promising. However, due to the subject of interest of the analysis and the obtained results, the case for 3 and 4 clusters will also be considered and analyzed.

K-means Clustering

K-means with 2 clusters

Clustering will be produced based on the K-means algorithm for the case with two, three and four clusters. It was decided to use Euclidean distance to calculate dissimilarities between observations.

km2 <- eclust(countries_scaled, k=2 , FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
c2 <- fviz_cluster(km2, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
s2 <- fviz_silhouette(km2)

##   cluster size ave.sil.width
## 1       1   48          0.72
## 2       2    4          0.23

grid.arrange(c2, s2, ncol=2)

K-means with 3 clusters

km3 <- eclust(countries_scaled, k=3 , FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
c3 <- fviz_cluster(km3, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 3 clusters")
s3 <- fviz_silhouette(km3)

##   cluster size ave.sil.width
## 1       1   11          0.47
## 2       2    4          0.22
## 3       3   37          0.32

grid.arrange(c3, s3, ncol=2)

K-means with 4 clusters

km4 <- eclust(countries_scaled, k=4 , FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
c4 <- fviz_cluster(km4, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 4 clusters")
s4 <- fviz_silhouette(km4)

##   cluster size ave.sil.width
## 1       1    9          0.53
## 2       2    4          0.21
## 3       3   36          0.42
## 4       4    3          0.51

grid.arrange(c4, s4, ncol=2)

There is a total of 5 regions on the African continent. The table below shows the distribution of country regions among clusters for K-means with k = 4.

table(data$Region, km4$cluster)

##                 
##                   1  2  3  4
##   EasternAfrica   2  0 12  2
##   MiddleAfrica    1  0  7  1
##   NorthernAfrica  1  2  3  0
##   SouthernAfrica  4  1  1  0
##   WesternAfrica   1  1 13  0

Analysis

The silhouette statistic is highest in the case of 2 clusters; however, cluster membership is heavily imbalanced, as cluster 1 has 48 countries compared to cluster 2 with just 4 countries. In both the 3-cluster and 4-cluster solutions, some countries have negative silhouette widths (the per-cluster averages in the tables above are all positive), which means that some countries may be misallocated. K-means with 4 clusters, however, has a slightly higher silhouette statistic than the case of 3 clusters. It can also be easily noticed that the 4-cluster solution was created by splitting the large cluster from the 2-cluster solution, while the 4-country cluster is preserved.
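
Observations with a negative silhouette width can be listed explicitly. The sketch below uses made-up one-dimensional toy data in which one point is deliberately placed in the wrong cluster; the same pattern could be applied to a real partition, for example by passing `km4$cluster` and a distance matrix of the scaled data to `cluster::silhouette()`.

```r
library(cluster)  # silhouette()

# Toy data (made up): two obvious groups on a line
x <- c(0, 0.1, 0.2, 10, 10.1, 10.2)
names(x) <- paste0("obs", 1:6)

# Deliberately wrong partition: obs3 is assigned to the far-away cluster 2
cl <- c(1L, 1L, 2L, 2L, 2L, 2L)

# One silhouette width per observation, in the original order
sil <- silhouette(cl, dist(x))

# A negative width means the observation is closer to another cluster
neg <- which(sil[, "sil_width"] < 0)
print(names(x)[neg])
```

Only the deliberately misplaced point has a negative width here, which is the pattern to look for in a silhouette plot.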

Visualization of country clusters for K-means with 4 clusters

fviz_cluster(km4, data = countries_scaled)

The tables below present the basic statistics for the characteristics of each cluster. It is worth recalling that the data has been scaled, hence negative values appear.

countries_cl <- as.data.frame(cbind(countries_scaled, km4$cluster))
colnames(countries_cl ) <- c(colnames(countries_scaled),"cluster")

# Cluster 1 (red)
summary(countries_cl[countries_cl$cluster==1,1:7]) %>% kable()

Population	GDP_millions	Unemployment_rate	Exports.millions.	Imports.millions.	Internet_use..100.	cO2_emission
Min. :-0.6696	Min. :-0.4572	Min. :1.082	Min. :-0.415946	Min. :-0.572688	Min. :-0.7474	Min. :-0.4301
1st Qu.:-0.6337	1st Qu.:-0.4456	1st Qu.:1.229	1st Qu.:-0.380925	1st Qu.:-0.537428	1st Qu.:-0.6835	1st Qu.:-0.4221
Median :-0.6276	Median :-0.3418	Median :1.883	Median :-0.258071	Median :-0.393973	Median :-0.5516	Median :-0.4120
Mean :-0.5311	Mean :-0.3495	Mean :1.739	Mean :-0.258833	Mean :-0.345516	Mean :-0.3673	Mean :-0.2649
3rd Qu.:-0.6187	3rd Qu.:-0.3100	3rd Qu.:2.016	3rd Qu.:-0.159699	3rd Qu.:-0.186788	3rd Qu.:-0.3597	3rd Qu.:-0.1511
Max. : 0.1852	Max. :-0.0904	Max. :2.630	Max. :-0.008795	Max. :-0.005722	Max. : 0.4157	Max. : 0.3236

Cluster 1 Analysis

According to the statistics, countries in cluster 1 generally have the highest unemployment rates of all clusters and, like the countries in cluster 3, relatively small populations.

# Cluster 2 (green)
summary(countries_cl[countries_cl$cluster==2,1:7]) %>% kable()

Population	GDP_millions	Unemployment_rate	Exports.millions.	Imports.millions.	Internet_use..100.	cO2_emission
Min. :0.5303	Min. :1.336	Min. :-0.61192	Min. :0.8381	Min. :1.800	Min. :-0.2798	Min. :1.329
1st Qu.:0.8724	1st Qu.:2.566	1st Qu.:-0.01155	1st Qu.:1.1511	1st Qu.:2.317	1st Qu.:-0.2168	1st Qu.:2.229
Median :1.5914	Median :2.983	Median : 0.19525	Median :2.4857	Median :2.847	Median : 0.2138	Median :2.831
Mean :2.1686	Mean :3.062	Mean : 0.47876	Mean :2.7992	Mean :2.948	Mean : 0.4127	Mean :3.001
3rd Qu.:2.8875	3rd Qu.:3.479	3rd Qu.: 0.68556	3rd Qu.:4.1337	3rd Qu.:3.478	3rd Qu.: 0.8433	3rd Qu.:3.602
Max. :4.9612	Max. :4.946	Max. : 2.13647	Max. :5.3873	Max. :4.295	Max. : 1.5028	Max. :5.013

Cluster 2 Analysis

This cluster consists of the most developed countries on the African continent, with the highest gross domestic product and high international trade imports and exports. These countries have developed economies and are the top 4 countries in Africa by GDP in the 2021 statistics. We also observe high CO2 emissions compared to countries in other clusters.

# Cluster 3 (blue)
summary(countries_cl[countries_cl$cluster==3,1:7]) %>% kable()

Population	GDP_millions	Unemployment_rate	Exports.millions.	Imports.millions.	Internet_use..100.	cO2_emission
Min. :-0.6909	Min. :-0.4639	Min. :-1.1990	Min. :-0.4165	Min. :-0.5762	Min. :-0.6955	Min. :-0.4316
1st Qu.:-0.5541	1st Qu.:-0.4265	1st Qu.:-0.7854	1st Qu.:-0.3845	1st Qu.:-0.5213	1st Qu.:-0.5026	1st Qu.:-0.4078
Median :-0.3245	Median :-0.3725	Median :-0.4385	Median :-0.3201	Median :-0.4175	Median :-0.3297	Median :-0.3822
Mean :-0.1384	Mean :-0.2394	Mean :-0.4125	Mean :-0.2257	Mean :-0.2209	Mean :-0.2389	Mean :-0.2517
3rd Qu.:-0.1082	3rd Qu.:-0.2239	3rd Qu.:-0.2117	3rd Qu.:-0.2564	3rd Qu.:-0.1746	3rd Qu.:-0.1119	3rd Qu.:-0.2262
Max. : 2.4156	Max. : 0.8236	Max. : 0.6155	Max. : 0.8576	Max. : 2.1373	Max. : 1.0991	Max. : 1.5264

Cluster 3 Analysis

Most of the countries fall in cluster 3, which has the characteristics of low- to middle-developed economies. Countries including Zimbabwe, Zambia, Burkina Faso, Mali and Chad are closer to the centroid of cluster 3, and these are among the least developed economies in the data set, with low GDP output and lower international trade imports and exports.

# Cluster 4 (purple)
summary(countries_cl[countries_cl$cluster==4,1:7]) %>% kable()

Population	GDP_millions	Unemployment_rate	Exports.millions.	Imports.millions.	Internet_use..100.	cO2_emission
Min. :0.01886	Min. :-0.36095	Min. :-1.0255	Min. :-0.2983	Min. :-0.39169	Min. :2.278	Min. :-0.36686
1st Qu.:0.04133	1st Qu.:-0.25874	1st Qu.:-0.9988	1st Qu.:-0.2948	1st Qu.:-0.32855	1st Qu.:2.892	1st Qu.:-0.29686
Median :0.06380	Median :-0.15652	Median :-0.9721	Median :-0.2913	Median :-0.26540	Median :3.505	Median :-0.22687
Mean :0.36223	Mean :-0.16186	Mean :-0.9054	Mean :-0.2474	Mean :-0.24270	Mean :3.419	Mean :-0.18538
3rd Qu.:0.53392	3rd Qu.:-0.06232	3rd Qu.:-0.8454	3rd Qu.:-0.2219	3rd Qu.:-0.16821	3rd Qu.:3.989	3rd Qu.:-0.09464
Max. :1.00405	Max. : 0.03188	Max. :-0.7187	Max. :-0.1526	Max. :-0.07102	Max. :4.472	Max. : 0.03758

Cluster 4 Analysis

Cluster 4 has the highest internet use per 100 inhabitants and generally larger populations than the countries in clusters 1 and 3.
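
The four summary tables above can be condensed into a single table of cluster means, which makes the comparisons between clusters easier to read. This is a sketch on made-up scaled values; on the real data the same call could be applied to `countries_cl`.

```r
# Toy scaled data (made-up values) with a cluster label per observation
toy <- data.frame(
  gdp      = c(-0.4, -0.3, 2.9, 3.1, -0.2),
  internet = c(-0.5, -0.6, 0.2, 0.4, 3.4),
  cluster  = c(1, 1, 2, 2, 3)
)

# Mean of every variable within each cluster, one row per cluster
cluster_means <- aggregate(. ~ cluster, data = toy, FUN = mean)
print(cluster_means)
```

Because the data are standardized, a positive mean indicates that the cluster is above the overall average on that variable and a negative mean that it is below.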

It is worth noting that countries like Morocco, Angola, Ethiopia, Kenya and D.R. Congo are on the boundary of cluster 3, closer to cluster 2, because they have higher GDP output and international trade imports and exports than the rest of the countries in that cluster. These countries are among the top 10 countries with the highest GDP output and can be considered mid-developed economies.

The link below ranks the African countries according to their GDP output (2021 statistics):

(https://www.statista.com/statistics/1120999/gdp-of-african-countries-by-country/).

PAM Clustering

In this part the clustering is produced based on the PAM algorithm.

PAM with 2 clusters

#PAM with 2 clusters
pam2 <- eclust(countries_scaled, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
cp2 <- fviz_cluster(pam2, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)

##   cluster size ave.sil.width
## 1       1    4          0.23
## 2       2   48          0.72

grid.arrange(cp2, sp2, ncol=2)

PAM with 3 clusters

#PAM with 3 clusters
pam3 <- eclust(countries_scaled, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
cp3 <- fviz_cluster(pam3, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)

##   cluster size ave.sil.width
## 1       1    4          0.22
## 2       2   38          0.33
## 3       3   10          0.50

grid.arrange(cp3, sp3, ncol=2)

PAM with 4 clusters

#PAM with 4 clusters
pam4 <- eclust(countries_scaled, k=4 , FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
cp4 <- fviz_cluster(pam4, data=countries_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 4 clusters")
sp4 <- fviz_silhouette(pam4)

##   cluster size ave.sil.width
## 1       1    4          0.21
## 2       2   35          0.43
## 3       3   10          0.46
## 4       4    3          0.51

grid.arrange(cp4, sp4, ncol=2)

There is a total of 5 regions on the African continent. The table below shows the distribution of country regions among clusters for PAM with k = 4.

table(data$Region, pam4$cluster)

##                 
##                   1  2  3  4
##   EasternAfrica   0 12  2  2
##   MiddleAfrica    0  7  1  1
##   NorthernAfrica  2  2  2  0
##   SouthernAfrica  1  1  4  0
##   WesternAfrica   1 13  1  0

To summarize, the silhouette statistics for PAM are almost the same as those for K-means, and negative silhouette widths for individual countries are also observed with PAM.

Visualization of country clusters for PAM with 4 clusters

fviz_cluster(pam4, data = countries_scaled)

Analysis of PAM Clustering

Using PAM, we observe a much clearer visual separation of countries, especially in the case of Gabon, which lay on the boundary of a neighboring cluster under K-means. With the PAM algorithm, Tunisia now falls under a different cluster (cluster 3, blue) than the one allocated by K-means, and there is a clearer separation between these 2 neighboring clusters.

As for the differences in countries based on country statistics, the results are similar to the K-means cases, with highly developed economies clustered together.
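
One way to check how closely two partitions agree is to cross-tabulate them. Cluster numbers are arbitrary labels, so the methods agree when each row of the table has a single non-zero cell, even if the label numbers differ. The sketch below uses made-up toy data; on the real data the same idea would be `table(km4$cluster, pam4$cluster)`.

```r
library(cluster)  # pam()

set.seed(42)
# Toy data: three well-separated blobs of 20 points each (made up)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
x <- do.call(rbind, lapply(1:3, function(i) {
  sweep(matrix(rnorm(40, sd = 0.5), ncol = 2), 2, centers[i, ], "+")
}))

km <- kmeans(x, centers = 3, nstart = 10)
pm <- pam(x, k = 3)

# Rows: K-means labels, columns: PAM labels
agreement <- table(kmeans = km$cluster, pam = pm$clustering)
print(agreement)
```

Off-diagonal patterns in such a table show exactly which observations move between clusters when the method changes.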

Hierarchical clustering

As the last clustering method, hierarchical clustering will be used. It builds a hierarchy of clusters that depends on the chosen way of calculating the similarity between clusters. In the analysis below I use agglomerative hierarchical clustering: all observations start in their own clusters, and the most similar clusters are then merged iteratively until one cluster remains.

Hierarchical clustering requires a dissimilarity matrix, computed here with Euclidean distance, and a linkage method that defines how the dissimilarity between clusters is calculated. There are multiple linkage methods, but in this paper I compare only two: Ward's method (ward.D) and complete linkage. The first is frequently claimed to be a sensible default, especially when there is no clear theoretical justification for any other linkage criterion. The second, however, does well when there is some noise between clusters.
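
One possible way to compare linkage methods numerically is the cophenetic correlation, which measures how well the dendrogram heights preserve the original pairwise distances. The sketch below uses made-up toy data and is not part of the original analysis. It also includes `"ward.D2"`, the `hclust()` variant that squares the dissimilarities before updating, which the R documentation describes as the one implementing Ward's criterion.

```r
set.seed(42)
# Toy data: 20 random points in two dimensions (made up for illustration)
x <- matrix(rnorm(40), ncol = 2)
d <- dist(x, method = "euclidean")

# Correlation between the original distances and the dendrogram (cophenetic)
# distances, for each linkage method
methods <- c("complete", "ward.D", "ward.D2")
coph <- sapply(methods, function(m) cor(d, cophenetic(hclust(d, method = m))))
print(round(coph, 3))
```

A higher value means the tree is a more faithful summary of the distance matrix; it is one diagnostic among several, not a rule for choosing the linkage.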

Complete Linkage

#Complete linkage
dist_countries <- dist(countries_scaled, method = "euclidean")
hc_countries <- hclust(dist_countries, method = "complete")
dend_countries <- as.dendrogram(hc_countries) 
dend_countries_colored <- color_branches(dend_countries, k = 2)  # color the same 2 clusters that rect.hclust outlines
plot(dend_countries_colored, main = "Hierarchical clustering across 52 African Countries")
rect.hclust(hc_countries, k=2, border='blue')

Ward’s method

#ward D
dist_countries <- dist(countries_scaled, method = "euclidean")
hc_countries <- hclust(dist_countries, method = "ward.D")
dend_countries <- as.dendrogram(hc_countries) 
dend_countries_colored <- color_branches(dend_countries, k = 4)  # color the same 4 clusters that rect.hclust outlines
plot(dend_countries_colored, main = "Clustering across 52 African Countries")
rect.hclust(hc_countries, k=4, border='blue')
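
The dendrogram can also be turned into explicit cluster labels with `cutree()`, which makes the hierarchical result directly comparable with the K-means and PAM partitions. This is a sketch on made-up toy data; on the real data the call would be `cutree(hc_countries, k = 4)`.

```r
set.seed(42)
# Toy data: three well-separated blobs of 20 points each (made up)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
x <- do.call(rbind, lapply(1:3, function(i) {
  sweep(matrix(rnorm(40, sd = 0.5), ncol = 2), 2, centers[i, ], "+")
}))

hc <- hclust(dist(x, method = "euclidean"), method = "complete")

# Cut the tree into k groups; returns one integer label per observation
groups <- cutree(hc, k = 3)
print(table(groups))
```

The resulting vector can be cross-tabulated with region or with another method's labels in the same way as the tables shown earlier.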

Analysis of hierarchical clustering

The results obtained using Ward’s method and complete linkage are different. With complete linkage, countries are grouped better using 2 clusters, which was suggested as the optimal number of clusters for K-means; with Ward’s linkage, however, countries are grouped into four clusters, which is consistent with the previous results obtained with K-means and PAM. Moreover, the structure of the dendrograms shows how countries connect based on the similarities between their country statistics (GDP, population, CO2 emissions) and ultimately how the clusters merge into one large cluster.

Conclusion

In this paper, I analyzed the grouping of African countries according to the similarities and differences in their basic country statistics for the year 2017. Based on fundamental clustering methods, PAM with 4 clusters gave the most useful grouping, as it provided 4 distinct clusters of countries based on their similarities. It was also observed that some countries were clustered according to their level of economic development.

References

https://www.displayr.com/what-is-hierarchical-clustering/

https://www.datanovia.com/en/lessons/choosing-the-best-clustering-algorithms/

https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/

https://www.datanovia.com/en/blog/k-means-clustering-visualization-in-r-step-by-step-guide/

https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/

Cluster Analysis of Country Statistics UN Dataset.

Xolani Keith Mpala

2/24/2022