Introduction

This paper is describing the growth of population in period between 2000 and 2015

setwd("C:/Users/user/Desktop/")
pop_data <- read.csv("population.csv", header = TRUE, sep = ",", dec = ".")
# install.packages("cluster")
# install.packages("factoextra")
# install.packages("flexclust")
# install.packages("clustertend")
library(factoextra)

## Loading required package: ggplot2

## Warning in as.POSIXlt.POSIXct(Sys.time()): unable to identify current timezone 'S':
## please set environment variable 'TZ'

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(FunCluster)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Loading required package: cluster

## Warning: package 'cluster' was built under R version 4.0.4

## 
## Attaching package: 'FunCluster'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(cluster)
library(NbClust)
library(gridExtra)

For this project dataset was taken from Kaggle. https://www.kaggle.com/ahmettezcantekin/beginner-datasets. You will see below 6 first rows of the data, summary for each observation and how many dimensions includes the data.

head(pop_data)

##                                  Country.Name      X2000      X2001      X2002
## 1                                  Arab World  2.1134699  2.1205188  2.1270139
## 2                      Caribbean small states  0.6711183  0.6622290  0.5402110
## 3              Central Europe and the Baltics -0.6299302 -0.5479269 -0.6331929
## 4                  Early-demographic dividend  1.8303092  1.7698788  1.7076974
## 5                         East Asia & Pacific  0.9481125  0.9061266  0.8517398
## 6 East Asia & Pacific (excluding high income)  1.0038555  0.9512335  0.9009874
##        X2003      X2004      X2005      X2006      X2007      X2008      X2009
## 1  2.1572385  2.2204703  2.2988855  2.3851588  2.4492546  2.4698099  2.4244690
## 2  0.5304045  0.5452544  0.5739497  0.6070699  0.6339550  0.6498168  0.6492271
## 3 -0.2985992 -0.2608217 -0.2588209 -0.2403414 -0.4667851 -0.3581315 -0.1924478
## 4  1.6695721  1.6399840  1.6170382  1.5962809  1.5763388  1.5573478  1.5387424
## 5  0.8047454  0.7691550  0.7470558  0.7261821  0.6872294  0.6899835  0.6671707
## 6  0.8589611  0.8270735  0.8055887  0.7656534  0.7250663  0.7093308  0.6993534
##        X2010      X2011      X2012      X2013      X2014      X2015
## 1  2.3353031  2.2427948  2.1523840  2.0882829  2.0500005  2.0299910
## 2  0.6365386  0.6145974  0.5859472  0.5570909  0.5318852  0.5148971
## 3 -0.3627098 -0.3536919 -0.2291550 -0.2132018 -0.2097572 -0.1715435
## 4  1.5287338  1.5079634  1.4761123  1.4573988  1.4389271  1.4193355
## 5  0.6587379  0.6575313  0.6681693  0.6707207  0.6733926  0.6679191
## 6  0.6953641  0.7035156  0.7142279  0.7199682  0.7244116  0.7162827

summary(pop_data)

##  Country.Name           X2000            X2001             X2002        
##  Length:255         Min.   :-4.075   Min.   :-3.6618   Min.   :-1.9110  
##  Class :character   1st Qu.: 0.624   1st Qu.: 0.5769   1st Qu.: 0.5439  
##  Mode  :character   Median : 1.458   Median : 1.4374   Median : 1.3604  
##                     Mean   : 1.423   Mean   : 1.4314   Mean   : 1.4324  
##                     3rd Qu.: 2.237   3rd Qu.: 2.1976   3rd Qu.: 2.1910  
##                     Max.   : 5.598   Max.   : 6.7098   Max.   : 7.4165  
##      X2003            X2004             X2005             X2006        
##  Min.   :-1.475   Min.   :-2.1227   Min.   :-2.7140   Min.   :-3.3761  
##  1st Qu.: 0.500   1st Qu.: 0.5602   1st Qu.: 0.5328   1st Qu.: 0.5604  
##  Median : 1.334   Median : 1.3101   Median : 1.3084   Median : 1.3178  
##  Mean   : 1.427   Mean   : 1.4550   Mean   : 1.4843   Mean   : 1.5229  
##  3rd Qu.: 2.137   3rd Qu.: 2.2144   3rd Qu.: 2.3487   3rd Qu.: 2.4030  
##  Max.   : 7.409   Max.   : 9.2188   Max.   :13.3822   Max.   :16.6403  
##      X2007             X2008             X2009             X2010        
##  Min.   :-4.0062   Min.   :-4.1804   Min.   :-3.6712   Min.   :-2.5951  
##  1st Qu.: 0.5624   1st Qu.: 0.5904   1st Qu.: 0.5144   1st Qu.: 0.4926  
##  Median : 1.3031   Median : 1.3027   Median : 1.2495   Median : 1.2474  
##  Mean   : 1.5286   Mean   : 1.5172   Mean   : 1.4505   Mean   : 1.4128  
##  3rd Qu.: 2.3666   3rd Qu.: 2.2449   3rd Qu.: 2.2554   3rd Qu.: 2.2490  
##  Max.   :17.6248   Max.   :16.3928   Max.   :13.5901   Max.   :10.3984  
##      X2011             X2012             X2013             X2014        
##  Min.   :-2.6287   Min.   :-3.7247   Min.   :-4.3997   Min.   :-4.1919  
##  1st Qu.: 0.4454   1st Qu.: 0.4539   1st Qu.: 0.5237   1st Qu.: 0.4809  
##  Median : 1.2490   Median : 1.2043   Median : 1.2281   Median : 1.2267  
##  Mean   : 1.3835   Mean   : 1.3427   Mean   : 1.3411   Mean   : 1.2956  
##  3rd Qu.: 2.2364   3rd Qu.: 2.1620   3rd Qu.: 2.1031   3rd Qu.: 2.0991  
##  Max.   : 8.6589   Max.   : 9.9320   Max.   : 9.7155   Max.   : 8.0886  
##      X2015        
##  Min.   :-3.2294  
##  1st Qu.: 0.4794  
##  Median : 1.1959  
##  Mean   : 1.2744  
##  3rd Qu.: 2.0703  
##  Max.   : 5.8340

dim(pop_data)

## [1] 255  17

Preparation of the data

It is can be useful to check if data include missing values.

apply(pop_data, 2, function(x) any(is.na(x)))

## Country.Name        X2000        X2001        X2002        X2003        X2004 
##        FALSE        FALSE        FALSE        FALSE        FALSE        FALSE 
##        X2005        X2006        X2007        X2008        X2009        X2010 
##        FALSE        FALSE        FALSE        FALSE        FALSE        FALSE 
##        X2011        X2012        X2013        X2014        X2015 
##        FALSE        FALSE        FALSE        FALSE        FALSE

Before starting the analysis it can be seen that first column cannot be putted into analysis. That is why we need to cut it off.

data <- pop_data[2:15]
country_name<- pop_data[,1]

Clustering tendency

Before we will start with clustering it is good to verify if data can be clusterable and then apply clustering algorithms to produce some meaningfull results.

#install.packages("ClusterR")
library(ClusterR)

## Loading required package: gtools

library(FunCluster)
library(cluster)
library(NbClust)
library(gridExtra)
get_clust_tendency(data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)

## $hopkins_stat
## [1] 0.9712952
## 
## $plot

From above results it can be seen that clusters are visible and subdivided into separate clearly defined squares. It means that data is highly clusterable. Hopkins statistics are far above 0.5.Result of hopkins statistic equal 0.9712952. Therefore we are rejecting null hypothesis that data is uniformly distributed. We will use only k-means and PAM clustering methods. We will not use Clara Method because we have rather on small data set.

Optimal amount of clusters

Before k-means, pam and hierarhical clusterization is performed it’s best to check an the amount of potential clusters which we can used. Therefore i choosed Optimal Clusters Kmeans methods using Silhouette criterion.

Silhuette method

K1<-Optimal_Clusters_KMeans(data, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

k1.1 = fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette") + theme_minimal()
k1.1

k2 = fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_minimal()
k2

k3 = fviz_nbclust(data, FUNcluster = hcut, method = "silhouette") + theme_minimal()
k3

Summary of results

k-means method - 2
pam method - 2
hierarchical methods -2

K-Means Clustering

#install.packages("fpc")
library(fpc)
km<-eclust(data, "kmeans", hc_metric="euclidean",k=2)

a1 = fviz_silhouette(km)

##   cluster size ave.sil.width
## 1       1  177          0.55
## 2       2   78          0.43

b1 = fviz_cluster(km, data = data, elipse.type = "convex", main = "2 clusters") + theme_minimal()
grid.arrange(a1, b1, ncol=2)

To summarize our data wass plitted into 2 clusters where red cluster gathered more countries with higher population .

PAM clustering

k2 = fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_minimal()
k2

For pam clustering the best to choose 2

cm = eclust(data, "pam", k = 2, graph = FALSE)
pam1 = fviz_silhouette(cm)

##   cluster size ave.sil.width
## 1       1   97          0.37
## 2       2  158          0.54

pam2 = fviz_cluster(cm, data = data, elipse.type = "convex", main = "2 clusters") + theme_minimal()
grid.arrange(pam1, pam2, ncol = 2)

By comparing PAM method (2 clusters) with Kmeans method (2clusters) it can be noticed that average silhoutte of PAM method (0,48)is lower comparing to the Kmeans method which silhouette is equal (0,51). The higher statistics the better. Overall, quality of clusters is good, but we see some negative values for cluster 1 using pam method which most likely caused by outliers from both clusters. However this problem does not appeer in k-means method.

Statistics in clustered groups

We can check a distance of every single observation from the centroid of cluster. If the bin is higher than more distant locations of points within given cluster is undesirable.

#install.packages("cclust")
library(cclust)
library(flexclust)

## Loading required package: grid

## Loading required package: modeltools

## Loading required package: stats4

## 
## Attaching package: 'flexclust'

## The following object is masked from 'package:cclust':
## 
##     cclust

a1<-cclust(data, 2, dist="euclidean")

## Found more than one class "kcca" in cache; using the first, from namespace 'kernlab'

## Also defined by 'flexclust'

## Found more than one class "kcca" in cache; using the first, from namespace 'kernlab'

## Also defined by 'flexclust'

stripes(a1)

a2<-cclust(data, 2, dist="manhattan")
stripes(a2)

Hierarchical Clustering

Now we will perform hierarchical clusterization.

library(stats)
library(ClustGeo)
fd = dist(data)
hc = hclust(fd, method = "complete")
plot(hc, main = "Cluster Dendogram")
rect.hclust(hc, k = 2)

clust.vec.2<-cutree(hc, k=2)
clust.vec.4<-cutree(hc, k=4)

Hierarhical clustering shows us 2 distincitive groups of observations. As we can see from the tree that it is quit hard to read the information, thus i will transpose the data set.

dm<-dist(t(data))
hc1<-hclust(dm, method="complete")
plot(hc1)
x<-rect.hclust(hc1, k=2)

We can see years within clusters.

library(factoextra)
fviz_cluster(list(data= data, cluster=clust.vec.2))

library(cluster)
plot(silhouette(clust.vec.2,fd))

Summary

Based on the results above we can see that gave us the best result so far when compared with others with average silhouette width of 0.82.

Sources:

1.Katarzyna Kopczewska. Unsupervised Learning. Presentation from the classes.

2.https://cran.r-project.org/web/packages/FunCluster/FunCluster.pdf

Clustering analysis

Tetiana Heorhiichuk