The aim of the project is to perform clustering on the countries dataset to see any tendency of those countries. then, PCA dimension reduction method will be conducted to see how many variables are enough to explain the majority of information. After that, clustering will be performed again to see how the clusters changed
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.3.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tibble' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## Warning: package 'purrr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ggplot2)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Warning: package 'flexclust' was built under R version 4.3.2
## Loading required package: grid
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
## Warning: package 'fpc' was built under R version 4.3.2
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
library(cluster)
## Warning: package 'cluster' was built under R version 4.3.2
library(ClusterR)
## Warning: package 'ClusterR' was built under R version 4.3.2
library(cleandata)
## Warning: package 'cleandata' was built under R version 4.3.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.2
## [1] "C:/Users/Piotr/Downloads"
Data<-read.csv("countries/countries.csv")
The dataset was taken from kaggle (https://www.kaggle.com/datasets/fernandol/countries-of-the-world) with variables like infant mortality, deathrate, birthrate, gpd per capita - all valuable information that will be used for clustering
Let’s separate character variables(country, region) with numeric variables (the rest)
Data2<-Data
Data2<-Data2%>%relocate(Climate, .after=Service)
The dataset contains commas instead of dots as a separator, let’s swap commas with dots and then replace character with numeric in order to cluster
replace_commas <- function(column) {
gsub(",", ".", column)
}
Data3 <- Data2 %>%
mutate(across(where(is.character), replace_commas))
convert_chr_to_numeric <- function(column) {
if(is.character(column)) {
as.numeric(column)
} else {
column
}
}
Data4 <- as.data.frame(lapply(Data3[3:20], convert_chr_to_numeric))
We will add back countries and regions for easier comparison later
Data4 <- cbind(Data4,Data3[1:2])
Remove rows with NAs
Data4<-na.omit(Data4)
Let’s use z scaling on all the numerical variables (except column “climate” which takes values from 0 to 4)
Data5<-scale(Data4[1:17])
Add “Climate” back to our now scaled dataset - we’re not adding Country and Region because they are characters
Data5<-cbind(Data5,Data4$Climate)
The data has been prepared.
Before we start clustering, we need to check our clustering tendency(using hopkins statistic) and the optimal amount of clusters (using elbow and silhouette methods)
get_clust_tendency(Data5,2,graph=TRUE)
## $hopkins_stat
## [1] 0.8693032
##
## $plot
Our hopkin’s statistic is equal to 0.87 which indicates very high clusterability tendency. This is a good sign
opt_silhouette<-Optimal_Clusters_KMeans(Data5, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
opt_elbow<-Optimal_Clusters_KMeans(Data5, max_clusters=10, plot_clusters = TRUE)
Both silhouette and elbow methods imply that it is optimal to use 2 clusters
We’re ready to perform clustering We will use the k-means clustering
cluster_km<-eclust(Data5,FUNcluster="kmeans",2)
fviz_silhouette(cluster_km)
## cluster size ave.sil.width
## 1 1 64 0.31
## 2 2 115 0.18
Average Silhouette is equal to 0.23, which indicates low clustering quality.
Let’s see what story the clusters can tell us, we will choose 10 random countries and regions, some of their characteristics and their clusters
Data4$cluster<-cluster_km$cluster
Data_not_scaled<-Data4
Data_not_scaled[sample(nrow(Data_not_scaled),10),c(1, 3, 13, 14,19,20,21)]
## Population Pop..Density..per.sq..mi.. Birthrate Deathrate
## 214 60609153 247.6 10.71 10.13
## 58 9183984 188.5 23.22 5.73
## 150 4076140 15.2 13.76 7.53
## 49 4075261 79.8 18.32 4.36
## 17 147365352 1023.4 29.80 8.27
## 225 21456188 40.6 42.89 8.30
## 194 439117 2.7 18.02 7.27
## 108 34707817 59.6 39.72 14.02
## 110 23113019 191.8 15.54 7.13
## 215 298444215 31.0 14.14 8.26
## Country Region cluster
## 214 United Kingdom WESTERN EUROPE 2
## 58 Dominican Republic LATIN AMER. & CARIB 2
## 150 New Zealand OCEANIA 2
## 49 Costa Rica LATIN AMER. & CARIB 2
## 17 Bangladesh ASIA (EX. NEAR EAST) 1
## 225 Yemen NEAR EAST 1
## 194 Suriname LATIN AMER. & CARIB 2
## 108 Kenya SUB-SAHARAN AFRICA 1
## 110 Korea. North ASIA (EX. NEAR EAST) 2
## 215 United States NORTHERN AMERICA 2
Based on those results, we can deduce that the cluster 2 belongs to more developed countries, where deathrate and infant mortality are low, literacy is high - tendencies found in developed or almost developed countries Cluster 1 belongs to (but not always) developing and/or poorer countries.
Now, let’s perform PCA now and perform the same K-means clustering on it
Principal Component Analysis is used to decrease the number of dimensions while containing most information a tool used to combat the curse of high dimensionality
pca <- prcomp(Data5, center=FALSE, scale=FALSE)
Let’s see how much of variance do those components explain:
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4177 2.1311 1.5743 1.36346 1.23752 1.16819 0.95443
## Proportion of Variance 0.2662 0.2068 0.1129 0.08466 0.06974 0.06214 0.04148
## Cumulative Proportion 0.2662 0.4730 0.5859 0.67051 0.74025 0.80240 0.84388
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.89013 0.83215 0.68612 0.68024 0.61478 0.5526 0.40269
## Proportion of Variance 0.03608 0.03153 0.02144 0.02107 0.01721 0.0139 0.00738
## Cumulative Proportion 0.87996 0.91150 0.93294 0.95401 0.97122 0.9851 0.99251
## PC15 PC16 PC17 PC18
## Standard deviation 0.31329 0.25625 0.02618 0.000386
## Proportion of Variance 0.00447 0.00299 0.00003 0.000000
## Cumulative Proportion 0.99698 0.99997 1.00000 1.000000
fviz_eig(pca, choice='eigenvalue')
fviz_eig(pca)
fviz_pca_var(pca)
The two first principals explain 47.3% of the variance which is not a lot. We will perform Kaiser rule to establish how many dimensions are explain the majority of the dataset
We will now see the eigenvalues of our dataset. According to kaiser rule, a good stopping point is to choose only those who have value of 1 or higher.
Data5.cov<-cov(Data5)
Data5.eigen<-eigen(Data5.cov)
Data5.eigen$value
## [1] 5.454601e+00 2.520027e+00 1.875547e+00 1.565978e+00 1.411058e+00
## [6] 9.208962e-01 8.123057e-01 7.162087e-01 5.019030e-01 4.675885e-01
## [11] 3.782903e-01 3.178624e-01 2.232605e-01 1.579442e-01 9.727503e-02
## [16] 6.522894e-02 6.855649e-04 1.480658e-07
Only those above 1 should be chosen. According to the output PC1 to PC5 are retained.
In our case, the first 5 components explain 75% of the variance
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
PC1 <- fviz_contrib(pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pca, choice = "var", axes = 2)
PC3 <- fviz_contrib(pca, choice = "var", axes = 3)
PC4 <- fviz_contrib(pca, choice = "var", axes = 4)
PC5 <- fviz_contrib(pca, choice = "var", axes = 5)
grid.arrange(PC1, PC2)
grid.arrange(PC3, PC4)
grid.arrange(PC5)
Looking at both the graphs of Contribution of variables to Dim-1 and Dim-2 we can see that Infant mortality per 1000 births is the leading variable of the dataset with about 36% contribution for dim-1 and 60% for dim-2. For dim-1, some other variables also contriute like birthrate, phones per 1000 people, literacy, gdp per capita or agriculture.As for dim-2, the situation is similar but those variables contribute to a much less degree than infant mortality.
pca_after<-prcomp(Data5,center=FALSE, scale.=FALSE, rank. = 5)
Let’s perform the same steps as before, hopkins statistics as well as silhouette and elbow methods
get_clust_tendency(pca_after$x,2,graph=TRUE)
## $hopkins_stat
## [1] 0.5878839
##
## $plot
It’s equal to 0.59, way lower than before (0.87) and closer to 0.5 which is not a good sign
opt_silhouette_pca<-Optimal_Clusters_KMeans(pca_after$x, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
opt_elbow_pca<-Optimal_Clusters_KMeans(pca_after$x, max_clusters=10, plot_clusters = TRUE)
Based on the elbow and silhouette methods, we will choose 2 clusters again
cluster_pca_kmeans<-eclust(pca_after$x,FUNcluster="kmeans",2)
fviz_silhouette(cluster_pca_kmeans)
## cluster size ave.sil.width
## 1 1 67 0.38
## 2 2 112 0.28
Although the average silhouette width is higher (0.32) than it’s non-PCA counterpart (0.23) the low hopkins statistics and high number of PCs (even 5 PCs explain only 0.75 of the variance) might indicate that PCA is not the best dimension reduction method for this dataset
Clustering performed on this dataset helped with seeing some tendencies, like the division to poorer/developing countries (cluster 1) and developed/almost developed/richer countries (cluster 2) although there is a good tendency shown as well as hopkins statistics being high (0.87) the average width of the silhouette was low (0.23) which, in my opinion, might give some guidance but not clear conclusions Dimension reduction and clustering on that dataset was used to see if the results from this average width could be improved. While it did improve, hopkins statistic decreased significantly compared to its non-reduction Counterpart. This might indicate that while the dataset might be good to reduce dimensions, PCA might not be the appropriate method.
https://www.datanovia.com/en/lessons/assessing-clustering-tendency/
https://sanchitamangale12.medium.com/scree-plot-733ed72c8608
https://towardsdatascience.com/curse-of-dimensionality-an-intuitive-exploration-1fbf155e1411
ChatGPT (converting chr to numeric function and replace commas function)