Introduction
Cluster analysis is a multivariate unsupervised learning method that sorts observations into subgroups, so that observations with similar traits end up in the same group. The technique applies to many kinds of data and has many applications, such as customer segmentation or the clustering of texts and films. Two well-known clustering techniques are K-means and K-medoids. This paper uses cluster analysis to classify countries into subgroups on the basis of human development indicators, so that countries with similar human development characteristics are grouped together.
The main aim of this paper is to present, apply and compare various clustering and dimension reduction methods on human development indicators of selected countries. Comparing the methods shows how each of them behaves on this kind of socio-economic data. The data covers countries from all regions of the world, so considerable variation can be expected, which should improve the quality of the analysis. The data was retrieved from the UN world explorer database. The raw data set has 189 countries as columns and five rows, one per indicator. The selected human development indicators are as follows:
Human development index (HDI) (hd_index, v1)
Expected Years of Schooling (expct_sch, v2)
Life Expectancy at Birth (life_exp, v3)
Mean Years of Schooling (mean_sch, v4)
Gross National Income (GNI) Per Capita (gni_capita, v5)
library(cluster)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
library(clustertend)
library(ClusterR)
## Loading required package: gtools
library(readxl)
library(gridExtra)
library(clValid)
## Warning: package 'clValid' was built under R version 4.0.4
##
## Attaching package: 'clValid'
## The following object is masked from 'package:flexclust':
##
## clusters
## The following object is masked from 'package:modeltools':
##
## clusters
# Set the working directory
setwd("D:")
# Read the data set
data <- read_excel("D:\\dev.xlsx")
## New names:
## * `` -> ...1
# dim() returns the dimensions of the data set; head() displays its first rows (here all five)
dim(data)
## [1] 5 190
head(data)
## # A tibble: 5 x 190
## ...1 Norway Switzerland Ireland Germany `Hong Kong, Chi~ Australia Iceland
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 hd_i~ 9.54e-1 0.946 9.42e-1 9.39e-1 0.939 0.938 9.38e-1
## 2 expc~ 1.81e+1 16.2 1.88e+1 1.71e+1 16.5 22.1 1.92e+1
## 3 life~ 8.23e+1 83.6 8.21e+1 8.12e+1 84.7 83.3 8.29e+1
## 4 mean~ 1.26e+1 13.4 1.25e+1 1.41e+1 12.0 12.7 1.25e+1
## 5 gni_~ 6.81e+4 59375. 5.57e+4 4.69e+4 60221. 44097. 4.76e+4
## # ... with 182 more variables: Sweden <dbl>, Singapore <dbl>,
## # Netherlands <dbl>, Denmark <dbl>, Finland <dbl>, Canada <dbl>, `New
## # Zealand` <dbl>, `United Kingdom` <dbl>, `United States` <dbl>,
## # Belgium <dbl>, Liechtenstein <dbl>, Japan <dbl>, Austria <dbl>,
## # Luxembourg <dbl>, Israel <dbl>, `Korea (Republic of)` <dbl>,
## # Slovenia <dbl>, Spain <dbl>, Czechia <dbl>, France <dbl>, Malta <dbl>,
## # Italy <dbl>, Estonia <dbl>, Cyprus <dbl>, Greece <dbl>, Poland <dbl>,
## # Lithuania <dbl>, `United Arab Emirates` <dbl>, Andorra <dbl>, `Saudi
## # Arabia` <dbl>, Slovakia <dbl>, Latvia <dbl>, Portugal <dbl>, Qatar <dbl>,
## # Chile <dbl>, `Brunei Darussalam` <dbl>, Hungary <dbl>, Bahrain <dbl>,
## # Croatia <dbl>, Oman <dbl>, Argentina <dbl>, `Russian Federation` <dbl>,
## # Belarus <dbl>, Kazakhstan <dbl>, Bulgaria <dbl>, Montenegro <dbl>,
## # Romania <dbl>, Palau <dbl>, Barbados <dbl>, Kuwait <dbl>, Uruguay <dbl>,
## # Turkey <dbl>, Bahamas <dbl>, Malaysia <dbl>, Seychelles <dbl>,
## # Serbia <dbl>, `Trinidad and Tobago` <dbl>, `Iran (Islamic Republic
## # of)` <dbl>, Mauritius <dbl>, Panama <dbl>, `Costa Rica` <dbl>,
## # Albania <dbl>, Georgia <dbl>, `Sri Lanka` <dbl>, Cuba <dbl>, `Saint Kitts
## # and Nevis` <dbl>, `Antigua and Barbuda` <dbl>, `Bosnia and
## # Herzegovina` <dbl>, Mexico <dbl>, Thailand <dbl>, Grenada <dbl>,
## # Brazil <dbl>, Colombia <dbl>, Armenia <dbl>, Algeria <dbl>, `North
## # Macedonia` <dbl>, Peru <dbl>, China <dbl>, Ecuador <dbl>, Azerbaijan <dbl>,
## # Ukraine <dbl>, `Dominican Republic` <dbl>, `Saint Lucia` <dbl>,
## # Tunisia <dbl>, Mongolia <dbl>, Lebanon <dbl>, Botswana <dbl>, `Saint
## # Vincent and the Grenadines` <dbl>, Jamaica <dbl>, `Venezuela (Bolivarian
## # Republic of)` <dbl>, Dominica <dbl>, Fiji <dbl>, Paraguay <dbl>,
## # Suriname <dbl>, Jordan <dbl>, Belize <dbl>, Maldives <dbl>, Tonga <dbl>,
## # Philippines <dbl>, `Moldova (Republic of)` <dbl>, ...
The data has 5 rows and 190 columns. The first column holds the indicator names as characters and is not needed for clustering, so we drop it before going further.
working_data <- as.matrix(data[ ,2:190])
# We want to cluster countries, so countries must be in rows: take the transpose of the data matrix.
working_data <- t(working_data)
round(working_data,digits=2)
## [,1] [,2] [,3] [,4] [,5]
## Norway 0.95 18.06 82.27 12.57 68058.62
## Switzerland 0.95 16.21 83.63 13.38 59374.73
## Ireland 0.94 18.79 82.10 12.53 55659.68
## Germany 0.94 17.10 81.18 14.13 46945.95
## Hong Kong, China (SAR) 0.94 16.51 84.69 12.04 60220.80
## Australia 0.94 22.10 83.28 12.68 44097.02
## Iceland 0.94 19.17 82.86 12.54 47566.45
## Sweden 0.94 18.83 82.65 12.43 47955.45
## Singapore 0.93 16.33 83.46 11.50 83792.67
## Netherlands 0.93 18.04 82.14 12.19 50012.59
## Denmark 0.93 19.07 80.78 12.59 48836.09
## Finland 0.93 19.32 81.74 12.44 41779.26
## Canada 0.92 16.09 82.32 13.32 43602.25
## New Zealand 0.92 18.84 82.14 12.68 35107.50
## United Kingdom 0.92 17.44 81.24 12.95 39507.29
## United States 0.92 16.27 78.85 13.41 56140.23
## Belgium 0.92 19.70 81.47 11.78 43820.84
## Liechtenstein 0.92 14.72 80.54 12.55 99732.14
## Japan 0.91 15.23 84.47 12.80 40799.01
## Austria 0.91 16.29 81.43 12.56 46230.57
## Luxembourg 0.91 14.23 82.10 12.20 65543.05
## Israel 0.91 15.99 82.82 12.96 33649.69
## Korea (Republic of) 0.91 16.39 82.85 12.19 36757.02
## Slovenia 0.90 17.42 81.17 12.27 32143.04
## Spain 0.89 17.88 83.43 9.82 35041.30
## Czechia 0.89 16.83 79.22 12.74 31597.07
## France 0.89 15.49 82.54 11.42 40510.78
## Malta 0.89 15.90 82.38 11.29 34795.18
## Italy 0.88 16.25 83.35 10.25 36141.43
## Estonia 0.88 16.06 78.57 13.03 30378.63
## Cyprus 0.87 14.67 80.83 12.10 33100.32
## Greece 0.87 17.34 82.07 10.54 24909.34
## Poland 0.87 16.43 78.54 12.29 27625.80
## Lithuania 0.87 16.50 75.74 12.96 29775.26
## United Arab Emirates 0.87 13.64 77.81 10.95 66911.66
## Andorra 0.86 13.30 81.79 10.16 48640.89
## Saudi Arabia 0.86 16.98 75.00 9.67 49338.41
## Slovakia 0.86 14.53 77.39 12.61 30671.87
## Latvia 0.85 15.98 75.17 12.83 26300.77
## Portugal 0.85 16.30 81.86 9.19 27935.38
## Qatar 0.85 12.18 80.10 9.67 110488.74
## Chile 0.85 16.53 80.04 10.45 21972.28
## Brunei Darussalam 0.84 14.38 75.72 9.10 76388.54
## Hungary 0.84 15.12 76.70 11.89 27144.21
## Bahrain 0.84 15.26 77.16 9.41 40399.12
## Croatia 0.84 14.96 78.34 11.41 23060.96
## Oman 0.83 14.66 77.63 9.73 37039.23
## Argentina 0.83 17.64 76.52 10.56 17611.22
## Russian Federation 0.82 15.54 72.39 12.02 25036.02
## Belarus 0.82 15.36 74.59 12.31 17038.53
## Kazakhstan 0.82 15.27 73.24 11.78 22167.70
## Bulgaria 0.82 14.81 74.93 11.81 19645.94
## Montenegro 0.82 15.03 76.77 11.39 17510.71
## Romania 0.82 14.26 75.92 10.98 23905.77
## Palau 0.81 15.55 73.68 12.40 16720.01
## Barbados 0.81 15.16 79.08 10.56 15912.28
## Kuwait 0.81 13.76 75.40 7.28 71164.22
## Uruguay 0.81 16.34 77.77 8.73 19434.85
## Turkey 0.81 16.44 77.44 7.67 24905.38
## Bahamas 0.81 12.82 73.75 11.53 28395.40
## Malaysia 0.80 13.47 76.00 10.16 27226.68
## Seychelles 0.80 15.45 73.33 9.67 25076.87
## Serbia 0.80 14.77 75.85 11.18 15217.70
## Trinidad and Tobago 0.80 12.96 73.38 11.03 28497.37
## Iran (Islamic Republic of) 0.80 14.73 76.48 10.01 18166.47
## Mauritius 0.80 14.97 74.86 9.43 22724.23
## Panama 0.80 12.88 78.33 10.17 20454.87
## Costa Rica 0.79 15.38 80.10 8.67 14789.93
## Albania 0.79 15.23 78.46 10.05 12299.80
## Georgia 0.79 15.43 73.60 12.81 9569.52
## Sri Lanka 0.78 13.97 76.81 11.05 11610.91
## Cuba 0.78 14.37 78.73 11.75 7811.36
## Saint Kitts and Nevis 0.78 13.61 74.56 8.50 26770.07
## Antigua and Barbuda 0.78 12.45 76.89 9.26 22201.23
## Bosnia and Herzegovina 0.77 13.79 77.26 9.69 12689.68
## Mexico 0.77 14.30 74.99 8.60 17628.12
## Thailand 0.76 14.65 76.93 7.73 16128.55
## Grenada 0.76 16.60 72.38 8.80 12683.83
## Brazil 0.76 15.40 75.67 7.84 14068.05
## Colombia 0.76 14.60 77.11 8.33 12895.59
## Armenia 0.76 13.17 74.94 11.79 9277.23
## Algeria 0.76 14.72 76.69 7.99 13639.43
## North Macedonia 0.76 13.46 75.69 9.68 12873.75
## Peru 0.76 13.85 76.52 9.22 12322.66
## China 0.76 13.89 76.70 7.90 16126.57
## Ecuador 0.76 14.92 76.80 8.99 10141.15
## Azerbaijan 0.75 12.40 72.86 10.48 15240.14
## Ukraine 0.75 15.07 71.95 11.34 7994.21
## Dominican Republic 0.74 14.14 73.89 7.94 15074.26
## Saint Lucia 0.74 13.87 76.06 8.49 11528.37
## Tunisia 0.74 15.10 76.50 7.17 10676.96
## Mongolia 0.73 14.21 69.69 10.17 10783.71
## Lebanon 0.73 11.29 78.88 8.70 11136.25
## Botswana 0.73 12.70 69.28 9.33 15951.33
## Saint Vincent and the Grenadines 0.73 13.57 72.42 8.62 11746.45
## Jamaica 0.73 13.14 74.37 9.80 7931.52
## Venezuela (Bolivarian Republic of) 0.73 12.82 72.13 10.32 9069.70
## Dominica 0.72 12.97 78.12 7.80 9245.16
## Fiji 0.72 14.43 67.34 10.88 9110.44
## Paraguay 0.72 12.69 74.13 8.45 11719.96
## Suriname 0.72 12.86 71.57 9.13 11932.99
## Jordan 0.72 11.88 74.40 10.45 8267.81
## Belize 0.72 13.12 74.50 9.80 7135.97
## Maldives 0.72 12.12 78.63 6.82 12549.26
## Tonga 0.72 14.30 70.80 11.21 5782.57
## Philippines 0.71 12.72 71.10 9.39 9539.70
## Moldova (Republic of) 0.71 11.61 71.81 11.56 6833.11
## Turkmenistan 0.71 10.89 68.07 9.78 16407.47
## Uzbekistan 0.71 12.01 71.57 11.52 6461.84
## Libya 0.71 12.79 72.72 7.56 11684.73
## Indonesia 0.71 12.92 71.51 7.98 11255.78
## Samoa 0.71 12.52 73.19 10.59 5884.84
## South Africa 0.70 13.67 63.86 10.24 11756.30
## Bolivia (Plurinational State of) 0.70 14.01 71.24 9.02 6849.20
## Gabon 0.70 12.90 66.19 8.32 15794.08
## Egypt 0.70 13.10 71.83 7.33 10743.81
## Marshall Islands 0.70 12.39 73.86 10.89 4633.48
## Viet Nam 0.69 12.69 75.32 8.20 6220.27
## Palestine, State of 0.69 12.84 73.89 9.10 5313.83
## Iraq 0.69 11.15 70.45 7.32 15364.96
## Morocco 0.68 13.07 76.45 5.51 7479.59
## Kyrgyzstan 0.67 13.36 71.32 10.88 3316.79
## Guyana 0.67 11.47 69.77 8.47 7615.42
## El Salvador 0.67 12.04 73.10 6.94 6973.46
## Tajikistan 0.66 11.41 70.88 10.67 3482.38
## Cabo Verde 0.65 11.87 72.78 6.24 6513.49
## Guatemala 0.65 10.62 74.06 6.47 7377.92
## Nicaragua 0.65 12.21 74.28 6.80 4789.84
## India 0.65 12.35 69.42 6.45 6828.60
## Namibia 0.65 12.63 63.37 6.94 9682.66
## Timor-Leste 0.63 12.40 69.26 4.54 7526.66
## Honduras 0.62 10.21 75.09 6.60 4258.35
## Kiribati 0.62 11.81 68.12 7.87 3917.43
## Bhutan 0.62 12.13 71.46 3.14 8609.12
## Bangladesh 0.61 11.20 72.32 6.06 4057.25
## Micronesia (Federated States of) 0.61 11.55 67.75 7.72 3700.10
## Sao Tome and Principe 0.61 12.69 70.17 6.44 3024.43
## Congo 0.61 11.60 64.29 6.50 5803.88
## Eswatini (Kingdom of) 0.61 11.38 59.40 6.75 9359.11
## Lao People's Democratic Republic 0.60 11.06 67.61 5.20 6316.52
## Vanuatu 0.60 11.42 70.32 6.84 2807.86
## Ghana 0.60 11.52 63.78 7.18 4098.86
## Zambia 0.59 12.06 63.51 7.10 3581.89
## Equatorial Guinea 0.59 9.20 58.40 5.55 17795.54
## Myanmar 0.58 10.32 66.87 4.95 5763.94
## Cambodia 0.58 11.34 69.57 4.84 3597.40
## Kenya 0.58 11.06 66.34 6.56 3051.69
## Nepal 0.58 12.20 70.48 4.86 2748.20
## Angola 0.57 11.78 60.78 5.13 5554.70
## Cameroon 0.56 12.75 58.92 6.29 3291.13
## Zimbabwe 0.56 10.45 61.20 8.34 2661.07
## Pakistan 0.56 8.46 67.11 5.16 5190.08
## Solomon Islands 0.56 10.22 72.83 5.54 2026.72
## Syrian Arab Republic 0.55 8.85 71.78 5.10 2725.19
## Papua New Guinea 0.54 10.00 64.26 4.62 3685.80
## Comoros 0.54 11.24 64.12 4.91 2426.39
## Rwanda 0.54 11.17 68.70 4.42 1958.61
## Nigeria 0.53 9.75 54.33 6.46 5085.54
## Tanzania (United Republic of) 0.53 8.01 65.02 6.01 2805.12
## Uganda 0.53 11.24 62.97 6.09 1752.21
## Mauritania 0.53 8.47 64.70 4.61 3746.08
## Madagascar 0.52 10.41 66.68 6.10 1403.92
## Benin 0.52 12.61 61.47 3.77 2134.59
## Lesotho 0.52 10.74 53.70 6.35 3243.84
## Côte d'Ivoire 0.52 9.63 57.42 5.19 3589.41
## Senegal 0.51 8.97 67.67 3.07 3255.99
## Togo 0.51 12.57 60.76 4.95 1592.54
## Sudan 0.51 7.74 65.10 3.72 3961.62
## Haiti 0.50 9.50 63.66 5.44 1664.89
## Afghanistan 0.50 10.14 64.49 3.93 1745.67
## Djibouti 0.50 6.48 66.58 4.00 3600.71
## Malawi 0.49 10.95 63.80 4.63 1159.12
## Ethiopia 0.47 8.71 66.24 2.80 1781.76
## Gambia 0.47 9.48 61.74 3.67 1489.57
## Guinea 0.47 9.01 61.19 2.71 2211.00
## Liberia 0.46 9.58 63.73 4.67 1040.09
## Yemen 0.46 8.66 66.10 3.20 1433.30
## Guinea-Bissau 0.46 10.50 58.00 3.30 1593.18
## Congo (Democratic Republic of the) 0.46 9.70 60.37 6.76 800.02
## Mozambique 0.45 9.75 60.16 3.54 1153.70
## Sierra Leone 0.44 10.18 54.31 3.60 1381.30
## Burkina Faso 0.43 8.91 61.17 1.59 1705.49
## Eritrea 0.43 5.01 65.94 3.90 1707.71
## Mali 0.43 7.60 58.89 2.35 1965.39
## Burundi 0.42 11.30 61.25 3.12 659.73
## South Sudan 0.41 5.00 57.60 4.85 1455.23
## Chad 0.40 7.47 53.98 2.41 1715.57
## Central African Republic 0.38 7.57 52.80 4.28 776.68
## Niger 0.38 6.47 62.02 2.03 912.04
# The values are rounded to two digits only for readability. Above is the transformed data matrix without column labels, since clustering is performed on unlabelled data.
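The transposed matrix above shows its columns only as [,1] to [,5]. If readable column names are wanted, the dropped label column can be reused. The following is a minimal sketch on a toy stand-in, not the original workflow: the object names `wide` and `long` are hypothetical, and the values are copied from the first three rows and columns of the printed matrix.

```r
# Toy stand-in for the wide table: indicators in rows, countries in columns.
wide <- data.frame(
  indicator   = c("hd_index", "expct_sch", "life_exp"),
  Norway      = c(0.95, 18.06, 82.27),
  Switzerland = c(0.95, 16.21, 83.63),
  Ireland     = c(0.94, 18.79, 82.10),
  stringsAsFactors = FALSE
)

# Drop the label column, transpose so that countries become rows,
# and reuse the labels as column names.
long <- t(as.matrix(wide[, -1]))
colnames(long) <- wide$indicator

print(long)
```

Named columns do not change the clustering itself; they only make printed cluster means easier to read.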
Before proceeding, we scale the data and check whether it contains any missing values.
working_data <- scale(working_data)
sum(is.na(working_data))
## [1] 0
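Scaling matters here because GNI per capita is measured in tens of thousands while the HDI lies between 0 and 1, so unscaled Euclidean distances would be dominated by income. A small sketch of what scale() does by default (centre each column on its mean, divide by its sample standard deviation), using a made-up toy matrix `x`:

```r
# Two columns on very different scales.
x <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2)

# Manual z-scores: subtract the column mean, divide by the column sd.
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# The built-in function gives the same numbers.
scaled <- scale(x)
print(all.equal(manual, scaled, check.attributes = FALSE))
```

After scaling, every column has mean 0 and standard deviation 1, so each indicator contributes comparably to the distances.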
Pre-diagnostic Analysis to Check the Clustering Tendency
There are no missing values, so we can move forward. The next step is to check whether the data has a tendency to cluster at all. To check clusterability we use the get_clust_tendency function from the factoextra package, which returns the Hopkins statistic. If the value is higher than 0.75, we reject, at the 90% confidence level, the null hypothesis that the data is not clusterable.
clusterability <- get_clust_tendency(working_data, n = nrow(working_data)-1, graph = FALSE)
clusterability$hopkins_stat
## [1] 0.8596448
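To make the statistic less of a black box, here is a rough sketch of one common formulation of the Hopkins statistic in base R. It is an illustration under stated assumptions, not the code inside get_clust_tendency: the function name `hopkins_sketch` is hypothetical, the uniform points are drawn from the bounding box of the data, plain (unpowered) distances are used, and the convention is that values near 1 indicate clustered data while values near 0.5 indicate random data.

```r
hopkins_sketch <- function(x, m = 10, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)
  d_all <- as.matrix(dist(x))
  diag(d_all) <- Inf                      # ignore a point's distance to itself

  # w: distance from m sampled real points to their nearest real neighbour
  idx <- sample(n, m)
  w <- apply(d_all[idx, , drop = FALSE], 1, min)

  # u: distance from m uniform points in the bounding box to the nearest real point
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  u <- numeric(m)
  for (i in seq_len(m)) {
    p <- runif(ncol(x), lo, hi)
    u[i] <- min(sqrt(rowSums(sweep(x, 2, p, "-")^2)))
  }

  sum(u) / (sum(u) + sum(w))
}

# Toy data: two tight, well separated blobs.
set.seed(42)
blobs <- rbind(matrix(rnorm(60, mean = 0, sd = 0.1), ncol = 2),
               matrix(rnorm(60, mean = 5, sd = 0.1), ncol = 2))
h <- hopkins_sketch(blobs, m = 15)
print(h)
```

For strongly clustered data the real points sit much closer to each other than random points sit to the data, so the ratio approaches 1.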
For our data the Hopkins statistic is 0.86, which is above 0.75, so we reject the null hypothesis: the data has a clustering tendency. Another way to check the clustering tendency is to examine the distance matrix.
d <- dist(working_data)
fviz_dist(d, show_labels = FALSE) + labs(title = "working_data")
The image above shows the dissimilarity matrix; the more pronounced the dissimilarities, the better defined the clusters will be. The clustering tendency results are satisfactory, so we turn to the cluster analysis.
Optimal Number of Clusters: K-means and PAM (Silhouette Statistic)
In the next step, the optimal number of clusters is determined for each clustering method (K-means and PAM). Since the data set is small to moderate in size, there is no need to employ CLARA, which is best suited to large data sets. This study implements K-means, PAM and hierarchical clustering for a comparative analysis. The silhouette statistic is used to decide the optimal number of clusters.
f1 <- fviz_nbclust(working_data, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(working_data, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
For both clustering algorithms (K-means and PAM) the optimal number of clusters is 2. Furthermore, the average silhouette width for 2 clusters is the same in both cases, and the same holds for 3 clusters.
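The silhouette criterion used above can be sketched by hand. For each observation, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette width is (b - a) / max(a, b). The sketch below is a base R illustration on made-up toy data; the helper name `silhouette_widths` is hypothetical, and singleton clusters are given width 0 by assumption.

```r
silhouette_widths <- function(x, labels) {
  d <- as.matrix(dist(x))
  clusters <- sort(unique(labels))
  sapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE
    if (!any(own)) return(0)              # singleton cluster: width set to 0
    a <- mean(d[i, own])                  # cohesion within own cluster
    b <- min(sapply(setdiff(clusters, labels[i]),
                    function(k) mean(d[i, labels == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
}

# Toy data with two obvious groups.
set.seed(1)
toy <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
             matrix(rnorm(40, 3, 0.3), ncol = 2))

# Average silhouette width for k = 2, 3, 4; the largest value suggests k.
avg_width <- sapply(2:4, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette_widths(toy, fit$cluster))
})
names(avg_width) <- 2:4
best_k <- as.integer(names(avg_width)[which.max(avg_width)])
print(round(avg_width, 3))
print(best_k)
```

Widths close to 1 mean an observation sits well inside its cluster; negative widths mean it is on average closer to another cluster.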
Total Within-Cluster Sum of Squares
The stability of the results above can be checked with an alternative method, the total within-cluster sum of squares (WSS) statistic.
f3 <- fviz_nbclust(working_data, FUNcluster = kmeans, method = "wss") +
ggtitle("Optimal number of clusters \n K-means")
f4 <- fviz_nbclust(working_data, FUNcluster = cluster::pam, method = "wss") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f3, f4, ncol=2)
For both methods (K-means and PAM) a partition into 2 clusters seems reasonable. However, given the subject of the analysis and the results above, the case of 3 clusters is also considered.
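The WSS curve can be reproduced in base R without plotting helpers. The following is a sketch on made-up toy data, assuming kmeans with several random starts: the total within-cluster sum of squares is recorded for k = 1 to 6, and the largest drop between consecutive values marks the "elbow".

```r
# Toy data with two obvious groups.
set.seed(1)
toy <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
             matrix(rnorm(40, 3, 0.3), ncol = 2))

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Reduction in WSS gained by each additional cluster.
gain <- -diff(wss)
names(gain) <- paste0(1:5, "->", 2:6)
print(round(wss, 2))
print(round(gain, 2))
```

For k = 1 the WSS equals the total sum of squares around the column means; once further clusters bring only small gains, adding them is hard to justify.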
First, the partitions into 2 and 3 clusters are obtained with the K-means algorithm:
km2 <- eclust(working_data, k = 2, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)
c2 <- fviz_cluster(km2, data = working_data, ellipse.type = "convex", geom = c("point")) + ggtitle("K-means with 2 clusters")
s2 <- fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 117 0.45
## 2 2 72 0.53
grid.arrange(c2, s2, ncol=2)
km2
## K-means clustering with 2 clusters of sizes 117, 72
##
## Cluster means:
## [,1] [,2] [,3] [,4] [,5]
## 1 0.6631316 0.5921902 0.6077436 0.6421912 0.4400564
## 2 -1.0775889 -0.9623091 -0.9875833 -1.0435607 -0.7150916
##
## Clustering vector:
## Norway Switzerland
## 1 1
## Ireland Germany
## 1 1
## Hong Kong, China (SAR) Australia
## 1 1
## Iceland Sweden
## 1 1
## Singapore Netherlands
## 1 1
## Denmark Finland
## 1 1
## Canada New Zealand
## 1 1
## United Kingdom United States
## 1 1
## Belgium Liechtenstein
## 1 1
## Japan Austria
## 1 1
## Luxembourg Israel
## 1 1
## Korea (Republic of) Slovenia
## 1 1
## Spain Czechia
## 1 1
## France Malta
## 1 1
## Italy Estonia
## 1 1
## Cyprus Greece
## 1 1
## Poland Lithuania
## 1 1
## United Arab Emirates Andorra
## 1 1
## Saudi Arabia Slovakia
## 1 1
## Latvia Portugal
## 1 1
## Qatar Chile
## 1 1
## Brunei Darussalam Hungary
## 1 1
## Bahrain Croatia
## 1 1
## Oman Argentina
## 1 1
## Russian Federation Belarus
## 1 1
## Kazakhstan Bulgaria
## 1 1
## Montenegro Romania
## 1 1
## Palau Barbados
## 1 1
## Kuwait Uruguay
## 1 1
## Turkey Bahamas
## 1 1
## Malaysia Seychelles
## 1 1
## Serbia Trinidad and Tobago
## 1 1
## Iran (Islamic Republic of) Mauritius
## 1 1
## Panama Costa Rica
## 1 1
## Albania Georgia
## 1 1
## Sri Lanka Cuba
## 1 1
## Saint Kitts and Nevis Antigua and Barbuda
## 1 1
## Bosnia and Herzegovina Mexico
## 1 1
## Thailand Grenada
## 1 1
## Brazil Colombia
## 1 1
## Armenia Algeria
## 1 1
## North Macedonia Peru
## 1 1
## China Ecuador
## 1 1
## Azerbaijan Ukraine
## 1 1
## Dominican Republic Saint Lucia
## 1 1
## Tunisia Mongolia
## 1 1
## Lebanon Botswana
## 1 1
## Saint Vincent and the Grenadines Jamaica
## 1 1
## Venezuela (Bolivarian Republic of) Dominica
## 1 1
## Fiji Paraguay
## 1 1
## Suriname Jordan
## 1 1
## Belize Maldives
## 1 1
## Tonga Philippines
## 1 1
## Moldova (Republic of) Turkmenistan
## 1 2
## Uzbekistan Libya
## 1 1
## Indonesia Samoa
## 1 1
## South Africa Bolivia (Plurinational State of)
## 1 1
## Gabon Egypt
## 2 2
## Marshall Islands Viet Nam
## 1 1
## Palestine, State of Iraq
## 1 2
## Morocco Kyrgyzstan
## 2 1
## Guyana El Salvador
## 2 2
## Tajikistan Cabo Verde
## 2 2
## Guatemala Nicaragua
## 2 2
## India Namibia
## 2 2
## Timor-Leste Honduras
## 2 2
## Kiribati Bhutan
## 2 2
## Bangladesh Micronesia (Federated States of)
## 2 2
## Sao Tome and Principe Congo
## 2 2
## Eswatini (Kingdom of) Lao People's Democratic Republic
## 2 2
## Vanuatu Ghana
## 2 2
## Zambia Equatorial Guinea
## 2 2
## Myanmar Cambodia
## 2 2
## Kenya Nepal
## 2 2
## Angola Cameroon
## 2 2
## Zimbabwe Pakistan
## 2 2
## Solomon Islands Syrian Arab Republic
## 2 2
## Papua New Guinea Comoros
## 2 2
## Rwanda Nigeria
## 2 2
## Tanzania (United Republic of) Uganda
## 2 2
## Mauritania Madagascar
## 2 2
## Benin Lesotho
## 2 2
## Côte d'Ivoire Senegal
## 2 2
## Togo Sudan
## 2 2
## Haiti Afghanistan
## 2 2
## Djibouti Malawi
## 2 2
## Ethiopia Gambia
## 2 2
## Guinea Liberia
## 2 2
## Yemen Guinea-Bissau
## 2 2
## Congo (Democratic Republic of the) Mozambique
## 2 2
## Sierra Leone Burkina Faso
## 2 2
## Eritrea Mali
## 2 2
## Burundi South Sudan
## 2 2
## Chad Central African Republic
## 2 2
## Niger
## 2
##
## Within cluster sum of squares by cluster:
## [1] 280.6001 117.0650
## (between_SS / total_SS = 57.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "silinfo"
## [11] "nbclust" "data"
Countries with high GNI per capita, high life expectancy at birth and more years of schooling are grouped in one cluster, separate from the countries that score low on the human development indicators. Broadly, the EU-27 countries, the Arab states and a few South-East Asian countries are grouped together in cluster 1, while the least developed economies are grouped together in cluster 2. This can be seen in the clustering vector.
K-Means with 3 clusters
km3 <- eclust(working_data, k = 3, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)
c3 <- fviz_cluster(km3, data = working_data, ellipse.type = "convex", geom = c("point")) + ggtitle("K-means with 3 clusters")
s3 <- fviz_silhouette(km3)
## cluster size ave.sil.width
## 1 1 81 0.46
## 2 2 61 0.46
## 3 3 47 0.37
grid.arrange(c3, s3, ncol=2)
km3
## K-means clustering with 3 clusters of sizes 81, 61, 47
##
## Cluster means:
## [,1] [,2] [,3] [,4] [,5]
## 1 0.2263475 0.1568418 0.2281714 0.2943044 -0.2533303
## 2 -1.2256656 -1.0536857 -1.1445775 -1.1737835 -0.7605953
## 3 1.2006692 1.0972477 1.0922839 1.0162157 1.4237462
##
## Clustering vector:
## Norway Switzerland
## 3 3
## Ireland Germany
## 3 3
## Hong Kong, China (SAR) Australia
## 3 3
## Iceland Sweden
## 3 3
## Singapore Netherlands
## 3 3
## Denmark Finland
## 3 3
## Canada New Zealand
## 3 3
## United Kingdom United States
## 3 3
## Belgium Liechtenstein
## 3 3
## Japan Austria
## 3 3
## Luxembourg Israel
## 3 3
## Korea (Republic of) Slovenia
## 3 3
## Spain Czechia
## 3 3
## France Malta
## 3 3
## Italy Estonia
## 3 3
## Cyprus Greece
## 3 3
## Poland Lithuania
## 3 3
## United Arab Emirates Andorra
## 3 3
## Saudi Arabia Slovakia
## 3 3
## Latvia Portugal
## 3 3
## Qatar Chile
## 3 3
## Brunei Darussalam Hungary
## 3 3
## Bahrain Croatia
## 3 1
## Oman Argentina
## 3 1
## Russian Federation Belarus
## 1 1
## Kazakhstan Bulgaria
## 1 1
## Montenegro Romania
## 1 1
## Palau Barbados
## 1 1
## Kuwait Uruguay
## 3 1
## Turkey Bahamas
## 1 1
## Malaysia Seychelles
## 1 1
## Serbia Trinidad and Tobago
## 1 1
## Iran (Islamic Republic of) Mauritius
## 1 1
## Panama Costa Rica
## 1 1
## Albania Georgia
## 1 1
## Sri Lanka Cuba
## 1 1
## Saint Kitts and Nevis Antigua and Barbuda
## 1 1
## Bosnia and Herzegovina Mexico
## 1 1
## Thailand Grenada
## 1 1
## Brazil Colombia
## 1 1
## Armenia Algeria
## 1 1
## North Macedonia Peru
## 1 1
## China Ecuador
## 1 1
## Azerbaijan Ukraine
## 1 1
## Dominican Republic Saint Lucia
## 1 1
## Tunisia Mongolia
## 1 1
## Lebanon Botswana
## 1 1
## Saint Vincent and the Grenadines Jamaica
## 1 1
## Venezuela (Bolivarian Republic of) Dominica
## 1 1
## Fiji Paraguay
## 1 1
## Suriname Jordan
## 1 1
## Belize Maldives
## 1 1
## Tonga Philippines
## 1 1
## Moldova (Republic of) Turkmenistan
## 1 1
## Uzbekistan Libya
## 1 1
## Indonesia Samoa
## 1 1
## South Africa Bolivia (Plurinational State of)
## 1 1
## Gabon Egypt
## 1 1
## Marshall Islands Viet Nam
## 1 1
## Palestine, State of Iraq
## 1 1
## Morocco Kyrgyzstan
## 1 1
## Guyana El Salvador
## 1 1
## Tajikistan Cabo Verde
## 1 1
## Guatemala Nicaragua
## 1 1
## India Namibia
## 2 2
## Timor-Leste Honduras
## 2 2
## Kiribati Bhutan
## 2 2
## Bangladesh Micronesia (Federated States of)
## 2 2
## Sao Tome and Principe Congo
## 2 2
## Eswatini (Kingdom of) Lao People's Democratic Republic
## 2 2
## Vanuatu Ghana
## 2 2
## Zambia Equatorial Guinea
## 2 2
## Myanmar Cambodia
## 2 2
## Kenya Nepal
## 2 2
## Angola Cameroon
## 2 2
## Zimbabwe Pakistan
## 2 2
## Solomon Islands Syrian Arab Republic
## 2 2
## Papua New Guinea Comoros
## 2 2
## Rwanda Nigeria
## 2 2
## Tanzania (United Republic of) Uganda
## 2 2
## Mauritania Madagascar
## 2 2
## Benin Lesotho
## 2 2
## Côte d'Ivoire Senegal
## 2 2
## Togo Sudan
## 2 2
## Haiti Afghanistan
## 2 2
## Djibouti Malawi
## 2 2
## Ethiopia Gambia
## 2 2
## Guinea Liberia
## 2 2
## Yemen Guinea-Bissau
## 2 2
## Congo (Democratic Republic of the) Mozambique
## 2 2
## Sierra Leone Burkina Faso
## 2 2
## Eritrea Mali
## 2 2
## Burundi South Sudan
## 2 2
## Chad Central African Republic
## 2 2
## Niger
## 2
##
## Within cluster sum of squares by cluster:
## [1] 72.34430 81.80587 80.44265
## (between_SS / total_SS = 75.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "silinfo"
## [11] "nbclust" "data"
The average silhouette width differs only slightly: 0.44 for 3 clusters against 0.48 for 2 clusters. In the silhouette plots, K-means with 2 clusters shows no negative silhouette values, whereas K-means with 3 clusters does show some.
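The two overall averages can be checked from the per-cluster tables printed by fviz_silhouette. A short sketch, assuming the overall average silhouette width is the size-weighted mean of the per-cluster widths; the helper name `weighted_avg` is ours, and because the printed widths are already rounded the result is approximate.

```r
# Size-weighted mean of per-cluster average silhouette widths.
weighted_avg <- function(size, width) sum(size * width) / sum(size)

# Values taken from the K-means tables above.
km2_avg <- weighted_avg(size = c(117, 72),    width = c(0.45, 0.53))
km3_avg <- weighted_avg(size = c(81, 61, 47), width = c(0.46, 0.46, 0.37))

print(round(c(k2 = km2_avg, k3 = km3_avg), 2))
```

This gives roughly 0.48 for 2 clusters and 0.44 for 3 clusters, in line with the figures quoted in the text.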
In this part the partition is based on the PAM algorithm, again separately for 2 and 3 clusters.
pam2 <- eclust(working_data, k = 2, FUNcluster = "pam", hc_metric = "euclidean", graph = FALSE)
cp2 <- fviz_cluster(pam2, data = working_data, ellipse.type = "convex", geom = c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 124 0.44
## 2 2 65 0.57
grid.arrange(cp2, sp2, ncol=2)
PAM with 3 clusters
pam3 <- eclust(working_data, k = 3, FUNcluster = "pam", hc_metric = "euclidean", graph = FALSE)
cp3 <- fviz_cluster(pam3, data = working_data, ellipse.type = "convex", geom = c("point")) + ggtitle("PAM with 3 clusters")
pam3
## Medoids:
## ID
## Austria 20 1.3286246 1.03496723 1.1970915 1.2800465 1.4099650
## North Macedonia 83 0.3048589 0.07936403 0.4274238 0.3461972 -0.2825410
## Papua New Guinea 155 -1.1297791 -1.09450728 -1.1029371 -1.2954771 -0.7487326
## Clustering vector:
## Norway Switzerland
## 1 1
## Ireland Germany
## 1 1
## Hong Kong, China (SAR) Australia
## 1 1
## Iceland Sweden
## 1 1
## Singapore Netherlands
## 1 1
## Denmark Finland
## 1 1
## Canada New Zealand
## 1 1
## United Kingdom United States
## 1 1
## Belgium Liechtenstein
## 1 1
## Japan Austria
## 1 1
## Luxembourg Israel
## 1 1
## Korea (Republic of) Slovenia
## 1 1
## Spain Czechia
## 1 1
## France Malta
## 1 1
## Italy Estonia
## 1 1
## Cyprus Greece
## 1 1
## Poland Lithuania
## 1 1
## United Arab Emirates Andorra
## 1 1
## Saudi Arabia Slovakia
## 1 1
## Latvia Portugal
## 1 1
## Qatar Chile
## 1 2
## Brunei Darussalam Hungary
## 1 2
## Bahrain Croatia
## 1 2
## Oman Argentina
## 1 2
## Russian Federation Belarus
## 2 2
## Kazakhstan Bulgaria
## 2 2
## Montenegro Romania
## 2 2
## Palau Barbados
## 2 2
## Kuwait Uruguay
## 1 2
## Turkey Bahamas
## 2 2
## Malaysia Seychelles
## 2 2
## Serbia Trinidad and Tobago
## 2 2
## Iran (Islamic Republic of) Mauritius
## 2 2
## Panama Costa Rica
## 2 2
## Albania Georgia
## 2 2
## Sri Lanka Cuba
## 2 2
## Saint Kitts and Nevis Antigua and Barbuda
## 2 2
## Bosnia and Herzegovina Mexico
## 2 2
## Thailand Grenada
## 2 2
## Brazil Colombia
## 2 2
## Armenia Algeria
## 2 2
## North Macedonia Peru
## 2 2
## China Ecuador
## 2 2
## Azerbaijan Ukraine
## 2 2
## Dominican Republic Saint Lucia
## 2 2
## Tunisia Mongolia
## 2 2
## Lebanon Botswana
## 2 2
## Saint Vincent and the Grenadines Jamaica
## 2 2
## Venezuela (Bolivarian Republic of) Dominica
## 2 2
## Fiji Paraguay
## 2 2
## Suriname Jordan
## 2 2
## Belize Maldives
## 2 2
## Tonga Philippines
## 2 2
## Moldova (Republic of) Turkmenistan
## 2 2
## Uzbekistan Libya
## 2 2
## Indonesia Samoa
## 2 2
## South Africa Bolivia (Plurinational State of)
## 2 2
## Gabon Egypt
## 2 2
## Marshall Islands Viet Nam
## 2 2
## Palestine, State of Iraq
## 2 2
## Morocco Kyrgyzstan
## 2 2
## Guyana El Salvador
## 2 2
## Tajikistan Cabo Verde
## 2 2
## Guatemala Nicaragua
## 2 2
## India Namibia
## 3 3
## Timor-Leste Honduras
## 3 3
## Kiribati Bhutan
## 3 3
## Bangladesh Micronesia (Federated States of)
## 3 3
## Sao Tome and Principe Congo
## 3 3
## Eswatini (Kingdom of) Lao People's Democratic Republic
## 3 3
## Vanuatu Ghana
## 3 3
## Zambia Equatorial Guinea
## 3 3
## Myanmar Cambodia
## 3 3
## Kenya Nepal
## 3 3
## Angola Cameroon
## 3 3
## Zimbabwe Pakistan
## 3 3
## Solomon Islands Syrian Arab Republic
## 3 3
## Papua New Guinea Comoros
## 3 3
## Rwanda Nigeria
## 3 3
## Tanzania (United Republic of) Uganda
## 3 3
## Mauritania Madagascar
## 3 3
## Benin Lesotho
## 3 3
## Côte d'Ivoire Senegal
## 3 3
## Togo Sudan
## 3 3
## Haiti Afghanistan
## 3 3
## Djibouti Malawi
## 3 3
## Ethiopia Gambia
## 3 3
## Guinea Liberia
## 3 3
## Yemen Guinea-Bissau
## 3 3
## Congo (Democratic Republic of the) Mozambique
## 3 3
## Sierra Leone Burkina Faso
## 3 3
## Eritrea Mali
## 3 3
## Burundi South Sudan
## 3 3
## Chad Central African Republic
## 3 3
## Niger
## 3
## Objective function:
## build swap
## 1.054234 1.037102
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
## [11] "nbclust"
sp3 <- fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 45 0.38
## 2 2 83 0.45
## 3 3 61 0.47
grid.arrange(cp3, sp3, ncol=2)
For both clustering techniques (K-means and PAM) the silhouette statistics are approximately the same in the 2-cluster and in the 3-cluster scenario. In the silhouette plot of PAM with 2 clusters there are a couple of negative silhouette values, and in that of PAM with 3 clusters there is a single small negative value. To summarize, both methods give nearly identical results in terms of silhouette statistics for both 2 and 3 clusters.
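Cluster numbers are arbitrary labels, so the K-means and PAM partitions have to be compared up to a relabelling. A cross-tabulation does this directly. The sketch below uses a handful of countries whose 3-cluster labels are copied from the two clustering vectors printed above; the matching rule (each K-means cluster is paired with its most frequent PAM cluster) is a simplifying assumption, and all object names are hypothetical.

```r
# 3-cluster labels for a few countries, copied from the printed vectors.
km_labels  <- c(Latvia = 3, Portugal = 3, Qatar = 3, Chile = 3, Hungary = 3,
                Croatia = 1, Argentina = 1, India = 2, Namibia = 2)
pam_labels <- c(Latvia = 1, Portugal = 1, Qatar = 1, Chile = 2, Hungary = 2,
                Croatia = 2, Argentina = 2, India = 3, Namibia = 3)

# Rows: K-means clusters, columns: PAM clusters.
tab <- table(kmeans = km_labels, pam = pam_labels)
print(tab)

# Pair each K-means cluster with its most frequent PAM cluster.
mapping <- apply(tab, 1, function(r) as.integer(names(r)[which.max(r)]))

# Share of countries on which both partitions agree after relabelling.
agreement <- sum(apply(tab, 1, max)) / length(km_labels)

# Countries that end up in a different group under PAM.
moved <- names(km_labels)[mapping[as.character(km_labels)] != pam_labels]
print(agreement)
print(moved)
```

In this excerpt only Chile and Hungary change groups, which matches the small difference in cluster sizes between the two methods (47 against 45 for the most developed group).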
Finally, hierarchical clustering is applied. This method requires computing the dissimilarity matrix and specifying a linkage method. Other options exist, but I have decided to limit myself to complete linkage.
Complete linkage with 2 clusters
hc2 <- eclust(working_data, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
plot(hc2, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc2, k=2, border='blue')
Complete Linkage with 3 Clusters
hc3 <- eclust(working_data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="complete")
plot(hc3, cex=0.5, hang=-1)
rect.hclust(hc3, k=3, border='blue')
The results are quite similar to those of K-means and PAM. The division of countries does not change much, whichever clustering technique and whichever number of clusters (2 or 3) is used.
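The same complete-linkage procedure can be written with base R alone, which shows what happens underneath eclust. A minimal sketch on made-up toy data: dist() builds the Euclidean dissimilarity matrix, hclust() merges the closest groups step by step, and cutree() cuts the tree into the requested number of clusters.

```r
# Toy data: two groups of ten points each.
set.seed(1)
toy <- rbind(matrix(rnorm(20, 0, 0.3), ncol = 2),
             matrix(rnorm(20, 3, 0.3), ncol = 2))

# Complete linkage: the distance between two groups is the distance
# between their two most distant members.
tree <- hclust(dist(toy), method = "complete")

# Cut the dendrogram into 2 clusters and count the members.
groups <- cutree(tree, k = 2)
print(table(groups))
```

Calling plot(tree) followed by rect.hclust(tree, k = 2) would draw the dendrogram with the two clusters boxed, as in the figures above.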
To examine the consistency of the results above, one can use the clValid package. It provides several stability measures, which inspect the stability of a technique by comparing the clustering of the full data with the clusterings obtained after removing one column of the data at a time. They include:
The average proportion of non-overlap (APN)
The average distance (AD)
The average distance between means (ADM)
The figure of merit (FOM)
The smaller the values of these measures, the more consistent the clustering results are.
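To illustrate the idea behind these measures, here is a rough base R sketch of the first one, the average proportion of non-overlap (APN). It reflects our reading of the definition rather than the clValid source: for every removed column and every observation, we take the share of the observation's original cluster that is no longer in the same cluster after the column is removed, and average these shares. The names `apn_sketch` and `km_cluster` are hypothetical.

```r
apn_sketch <- function(x, k, cluster_fun) {
  x <- as.matrix(x)
  full <- cluster_fun(x, k)
  same_full <- outer(full, full, "==")        # co-membership on the full data
  per_column <- sapply(seq_len(ncol(x)), function(l) {
    reduced <- cluster_fun(x[, -l, drop = FALSE], k)
    same_red <- outer(reduced, reduced, "==") # co-membership without column l
    overlap <- rowSums(same_full & same_red) / rowSums(same_full)
    mean(1 - overlap)
  })
  mean(per_column)
}

# Clustering function returning only the label vector.
km_cluster <- function(x, k) kmeans(x, centers = k, nstart = 25)$cluster

# Toy data: two groups that are well separated in every column.
set.seed(1)
toy <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 3),
             matrix(rnorm(60, 4, 0.3), ncol = 3))
apn <- apn_sketch(toy, k = 2, cluster_fun = km_cluster)
print(apn)
```

Because co-membership is compared rather than cluster numbers, the measure does not depend on how the clusters happen to be labelled. A value of 0 means that removing any single column leaves the partition unchanged.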
clmethods <- c("hierarchical","kmeans","pam")
sty <- clValid(working_data, nClust=2:6, clMethods=clmethods, validation="stability", method="complete")
optimalScores(sty)
## Score Method Clusters
## APN 0.05616406 pam 3
## AD 1.29399625 pam 6
## ADM 0.14232400 pam 3
## FOM 0.50386420 kmeans 6
plot(sty)
summary(sty)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical APN 0.3508 0.2403 0.2849 0.2898 0.3811
## AD 2.4245 1.8205 1.6711 1.4618 1.4265
## ADM 1.5818 0.9299 0.8201 0.6399 0.7040
## FOM 0.6963 0.6251 0.5880 0.5574 0.5304
## kmeans APN 0.0781 0.0881 0.1769 0.1723 0.3291
## AD 1.9073 1.5011 1.4498 1.3151 1.3355
## ADM 0.2744 0.2265 0.3726 0.3255 0.5384
## FOM 0.6927 0.5658 0.5488 0.5252 0.5039
## pam APN 0.0614 0.0562 0.1416 0.2631 0.2871
## AD 1.8932 1.4744 1.3791 1.3447 1.2940
## ADM 0.2356 0.1423 0.2615 0.4584 0.5277
## FOM 0.6843 0.5541 0.5306 0.5131 0.5055
##
## Optimal Scores:
##
## Score Method Clusters
## APN 0.0562 pam 3
## AD 1.2940 pam 6
## ADM 0.1423 pam 3
## FOM 0.5039 kmeans 6
Based on the above measures, PAM is the most consistent algorithm in our case: it attains the best APN, AD and ADM scores, while K-means attains the best FOM. There is no consensus on the optimal number of clusters (APN and ADM favour 3, whereas AD and FOM favour 6); since we considered no more than 3 clusters, the choice remains ambiguous.
Dimension reduction techniques transform high-dimensional data into a low-dimensional representation that still retains meaningful information from the original data. With these techniques it is possible to replace the original variables with a smaller set of principal components, which helps to avoid overfitting the model. To begin, let us look at some basic statistics.
summary(working_data)
## V1 V2 V3 V4
## Min. :-2.23377 Min. :-2.78585 Min. :-2.6377 Min. :-2.2799
## 1st Qu.:-0.78102 1st Qu.:-0.64077 1st Qu.:-0.6906 1st Qu.:-0.7349
## Median : 0.09508 Median :-0.04222 Median : 0.1827 Median : 0.1312
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.77351 3rd Qu.: 0.67741 3rd Qu.: 0.7063 3rd Qu.: 0.8679
## Max. : 1.59308 Max. : 3.00420 Max. : 1.6328 Max. : 1.7906
## V5
## Min. :-0.9023
## 1st Qu.:-0.7347
## Median :-0.3466
## Mean : 0.0000
## 3rd Qu.: 0.4225
## Max. : 4.6704
Before implementing PCA, it is useful to inspect the correlation coefficients among the variables.
cor_mat <- cor(working_data, method="pearson")
print(cor_mat, digits= 1)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0 0.9 0.9 0.9 0.8
## [2,] 0.9 1.0 0.8 0.8 0.6
## [3,] 0.9 0.8 1.0 0.8 0.7
## [4,] 0.9 0.8 0.8 1.0 0.6
## [5,] 0.8 0.6 0.7 0.6 1.0
The above results suggest that the majority of the variables are strongly correlated with each other. To get a better overview, let us draw the correlation plot.
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_mat)
The correlation plot further supports the claim that there is indeed a strong positive correlation among the variables.
Choosing the optimal number of components
Kaiser’s stopping rule is a method for deciding how many components to retain: components with an eigenvalue greater than 1 should be kept. A scree plot complements this rule. The eigenvalues are plotted on the vertical axis and the components on the horizontal axis, in descending order from the largest to the smallest. The other approach is to consider the percentage of variance explained by each component. Normally, enough components are retained so that together they explain approximately 70%-90% of the variation.
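Both criteria can be computed directly from the eigenvalues of the correlation matrix. The sketch below uses hypothetical toy data (five noisy copies of one common factor; all names are assumptions) rather than the country data.

```r
set.seed(4)
# Hypothetical data: 5 variables driven by one common factor plus noise
f <- rnorm(200)
toy <- sapply(1:5, function(j) f + rnorm(200, sd = 0.5))

# Eigenvalues of the correlation matrix, returned in descending order
ev <- eigen(cor(toy))$values

# Kaiser's rule: number of components with eigenvalue greater than 1
n_kaiser <- sum(ev > 1)

# Variance criterion: smallest number of components reaching 70%
cum_var <- cumsum(ev) / sum(ev)
n_70 <- which(cum_var >= 0.70)[1]

# The same eigenvalues are the squared standard deviations from prcomp()
pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)
isTRUE(all.equal(pca_toy$sdev^2, ev))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 means the component carries more variance than a single standardized variable.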
pca <- prcomp(working_data, center=TRUE, scale.=TRUE)
eigen(cor(working_data))$values
## [1] 4.12116193 0.44470704 0.22883898 0.18702570 0.01826635
fviz_eig(pca, choice='eigenvalue')
In our case, based on Kaiser’s rule, only 1 component has an eigenvalue greater than 1, so this component should be retained. The scree plot leads to the same conclusion.
fviz_eig(pca)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 2.0301 0.66686 0.47837 0.43246 0.13515
## Proportion of Variance 0.8242 0.08894 0.04577 0.03741 0.00365
## Cumulative Proportion 0.8242 0.91317 0.95894 0.99635 1.00000
The above results show that only the first component (PC1) has an eigenvalue greater than 1, and it explains approximately 82% of the variation. The remaining components explain little variation, and their eigenvalues are all less than 1.
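The link between the eigenvalues and the summary table can be checked with a few lines of arithmetic. The sketch below only re-uses the five eigenvalues printed above; the object names are introduced for illustration.

```r
# Eigenvalues as printed by eigen(cor(working_data))$values above
ev <- c(4.12116193, 0.44470704, 0.22883898, 0.18702570, 0.01826635)

# Proportion of variance: each eigenvalue divided by their sum
prop <- ev / sum(ev)
cum_prop <- cumsum(prop)

# Standard deviation of each component: square root of its eigenvalue
sdev <- sqrt(ev)

# Number of components retained under Kaiser's rule
n_kaiser <- sum(ev > 1)

round(prop, 4)
round(cum_prop, 4)
round(sdev, 4)
```

The eigenvalues sum to 5, the number of standardized variables, so the first proportion is 4.1212 / 5, which is about 0.8242 and agrees with the "Proportion of Variance" row of the summary.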
pcavar <- get_pca_var(pca)
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)
Except for v5 (Gni_per_capita), which lies in the third quadrant, all other variables (v1, v2, v3 and v4) lie in the second quadrant. One might conclude that v1, v2, v3 and v4 contribute positively to v5, that is, to Gni_per_capita: the better a country performs on the human development indicators, the higher the Gni_per_capita it can expect to enjoy.
fviz_contrib(pca, choice = "var", axes = 1)
fviz_contrib(pca, choice = "var", axes = 2)
fviz_contrib(pca, choice = "var", axes = 3)
fviz_contrib(pca, choice = "var", axes = 4)
fviz_contrib(pca, choice = "var", axes = 5)
From the above plots, v1 (hd_index) is the major contributor in all components. We can also plot the individual observations. The graph below shows the observations and their quality of representation on the first two principal components.
fviz_pca_ind(pca, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
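The quantities behind these plots (variable contributions and cos2, the quality of representation) can be derived by hand from a `prcomp` object. The sketch below reflects my understanding of the standard definitions and uses hypothetical toy data; it is not taken from the factoextra source, and its values need not match the plots exactly.

```r
set.seed(5)
# Hypothetical data: 4 variables sharing one common factor
f <- rnorm(100)
toy <- sapply(1:4, function(j) f + rnorm(100, sd = 0.7))
colnames(toy) <- paste0("v", 1:4)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- sweep(pca_toy$rotation, 2, pca_toy$sdev, "*")

# cos2: squared coordinates (quality of representation of each variable)
var_cos2 <- var_coord^2

# Contribution (in %) of each variable to each component
var_contrib <- sweep(var_cos2, 2, colSums(var_cos2), "/") * 100

# cos2 of individuals: squared score divided by squared distance to the origin
ind_cos2 <- pca_toy$x^2 / rowSums(pca_toy$x^2)
```

Within each component the contributions sum to 100, and for each variable or observation the cos2 values sum to 1 across all components, so a cos2 close to 1 on the first two components means the point is well represented in the two-dimensional plot.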
In this paper, I examined the grouping of countries according to the similarities and dissimilarities in their human development indicators. Three clustering algorithms were implemented, each with two scenarios (2 and 3 clusters). All methods produced largely the same division of countries. The stability measures suggest that PAM is the most suitable technique for this type of data, while the optimal number of clusters remains ambiguous. Additionally, PCA was implemented to reduce the dimensions. One might conclude that a country that ranks high on the human development indicators can expect to enjoy a higher Gni_per_capita.
https://www.datanovia.com/en/lessons/choosing-the-best-clustering-algorithms/
https://en.wikipedia.org/wiki/Hierarchical_clustering
http://www.sthda.com/english/wiki/wiki.php?id_contents=7932#compute-clvalid
http://data.un.org/Explorer.aspx