This paper is describing the growth of population in period between 2000 and 2015
setwd("C:/Users/user/Desktop/")
pop_data <- read.csv("population.csv", header = TRUE, sep = ",", dec = ".")
# install.packages("cluster")
# install.packages("factoextra")
# install.packages("flexclust")
# install.packages("clustertend")
library(factoextra)
## Loading required package: ggplot2
## Warning in as.POSIXlt.POSIXct(Sys.time()): unable to identify current timezone 'S':
## please set environment variable 'TZ'
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FunCluster)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: cluster
## Warning: package 'cluster' was built under R version 4.0.4
##
## Attaching package: 'FunCluster'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(cluster)
library(NbClust)
library(gridExtra)
For this project dataset was taken from Kaggle. https://www.kaggle.com/ahmettezcantekin/beginner-datasets. You will see below 6 first rows of the data, summary for each observation and how many dimensions includes the data.
head(pop_data)
## Country.Name X2000 X2001 X2002
## 1 Arab World 2.1134699 2.1205188 2.1270139
## 2 Caribbean small states 0.6711183 0.6622290 0.5402110
## 3 Central Europe and the Baltics -0.6299302 -0.5479269 -0.6331929
## 4 Early-demographic dividend 1.8303092 1.7698788 1.7076974
## 5 East Asia & Pacific 0.9481125 0.9061266 0.8517398
## 6 East Asia & Pacific (excluding high income) 1.0038555 0.9512335 0.9009874
## X2003 X2004 X2005 X2006 X2007 X2008 X2009
## 1 2.1572385 2.2204703 2.2988855 2.3851588 2.4492546 2.4698099 2.4244690
## 2 0.5304045 0.5452544 0.5739497 0.6070699 0.6339550 0.6498168 0.6492271
## 3 -0.2985992 -0.2608217 -0.2588209 -0.2403414 -0.4667851 -0.3581315 -0.1924478
## 4 1.6695721 1.6399840 1.6170382 1.5962809 1.5763388 1.5573478 1.5387424
## 5 0.8047454 0.7691550 0.7470558 0.7261821 0.6872294 0.6899835 0.6671707
## 6 0.8589611 0.8270735 0.8055887 0.7656534 0.7250663 0.7093308 0.6993534
## X2010 X2011 X2012 X2013 X2014 X2015
## 1 2.3353031 2.2427948 2.1523840 2.0882829 2.0500005 2.0299910
## 2 0.6365386 0.6145974 0.5859472 0.5570909 0.5318852 0.5148971
## 3 -0.3627098 -0.3536919 -0.2291550 -0.2132018 -0.2097572 -0.1715435
## 4 1.5287338 1.5079634 1.4761123 1.4573988 1.4389271 1.4193355
## 5 0.6587379 0.6575313 0.6681693 0.6707207 0.6733926 0.6679191
## 6 0.6953641 0.7035156 0.7142279 0.7199682 0.7244116 0.7162827
summary(pop_data)
## Country.Name X2000 X2001 X2002
## Length:255 Min. :-4.075 Min. :-3.6618 Min. :-1.9110
## Class :character 1st Qu.: 0.624 1st Qu.: 0.5769 1st Qu.: 0.5439
## Mode :character Median : 1.458 Median : 1.4374 Median : 1.3604
## Mean : 1.423 Mean : 1.4314 Mean : 1.4324
## 3rd Qu.: 2.237 3rd Qu.: 2.1976 3rd Qu.: 2.1910
## Max. : 5.598 Max. : 6.7098 Max. : 7.4165
## X2003 X2004 X2005 X2006
## Min. :-1.475 Min. :-2.1227 Min. :-2.7140 Min. :-3.3761
## 1st Qu.: 0.500 1st Qu.: 0.5602 1st Qu.: 0.5328 1st Qu.: 0.5604
## Median : 1.334 Median : 1.3101 Median : 1.3084 Median : 1.3178
## Mean : 1.427 Mean : 1.4550 Mean : 1.4843 Mean : 1.5229
## 3rd Qu.: 2.137 3rd Qu.: 2.2144 3rd Qu.: 2.3487 3rd Qu.: 2.4030
## Max. : 7.409 Max. : 9.2188 Max. :13.3822 Max. :16.6403
## X2007 X2008 X2009 X2010
## Min. :-4.0062 Min. :-4.1804 Min. :-3.6712 Min. :-2.5951
## 1st Qu.: 0.5624 1st Qu.: 0.5904 1st Qu.: 0.5144 1st Qu.: 0.4926
## Median : 1.3031 Median : 1.3027 Median : 1.2495 Median : 1.2474
## Mean : 1.5286 Mean : 1.5172 Mean : 1.4505 Mean : 1.4128
## 3rd Qu.: 2.3666 3rd Qu.: 2.2449 3rd Qu.: 2.2554 3rd Qu.: 2.2490
## Max. :17.6248 Max. :16.3928 Max. :13.5901 Max. :10.3984
## X2011 X2012 X2013 X2014
## Min. :-2.6287 Min. :-3.7247 Min. :-4.3997 Min. :-4.1919
## 1st Qu.: 0.4454 1st Qu.: 0.4539 1st Qu.: 0.5237 1st Qu.: 0.4809
## Median : 1.2490 Median : 1.2043 Median : 1.2281 Median : 1.2267
## Mean : 1.3835 Mean : 1.3427 Mean : 1.3411 Mean : 1.2956
## 3rd Qu.: 2.2364 3rd Qu.: 2.1620 3rd Qu.: 2.1031 3rd Qu.: 2.0991
## Max. : 8.6589 Max. : 9.9320 Max. : 9.7155 Max. : 8.0886
## X2015
## Min. :-3.2294
## 1st Qu.: 0.4794
## Median : 1.1959
## Mean : 1.2744
## 3rd Qu.: 2.0703
## Max. : 5.8340
dim(pop_data)
## [1] 255 17
It is can be useful to check if data include missing values.
apply(pop_data, 2, function(x) any(is.na(x)))
## Country.Name X2000 X2001 X2002 X2003 X2004
## FALSE FALSE FALSE FALSE FALSE FALSE
## X2005 X2006 X2007 X2008 X2009 X2010
## FALSE FALSE FALSE FALSE FALSE FALSE
## X2011 X2012 X2013 X2014 X2015
## FALSE FALSE FALSE FALSE FALSE
Before starting the analysis it can be seen that first column cannot be putted into analysis. That is why we need to cut it off.
data <- pop_data[2:15]
country_name<- pop_data[,1]
Before we will start with clustering it is good to verify if data can be clusterable and then apply clustering algorithms to produce some meaningfull results.
#install.packages("ClusterR")
library(ClusterR)
## Loading required package: gtools
library(FunCluster)
library(cluster)
library(NbClust)
library(gridExtra)
get_clust_tendency(data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.9712952
##
## $plot
From above results it can be seen that clusters are visible and subdivided into separate clearly defined squares. It means that data is highly clusterable. Hopkins statistics are far above 0.5.Result of hopkins statistic equal 0.9712952. Therefore we are rejecting null hypothesis that data is uniformly distributed. We will use only k-means and PAM clustering methods. We will not use Clara Method because we have rather on small data set.
Before k-means, pam and hierarhical clusterization is performed it’s best to check an the amount of potential clusters which we can used. Therefore i choosed Optimal Clusters Kmeans methods using Silhouette criterion.
Silhuette method
K1<-Optimal_Clusters_KMeans(data, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
k1.1 = fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette") + theme_minimal()
k1.1
k2 = fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_minimal()
k2
k3 = fviz_nbclust(data, FUNcluster = hcut, method = "silhouette") + theme_minimal()
k3
k-means method - 2
pam method - 2
hierarchical methods -2
#install.packages("fpc")
library(fpc)
km<-eclust(data, "kmeans", hc_metric="euclidean",k=2)
a1 = fviz_silhouette(km)
## cluster size ave.sil.width
## 1 1 177 0.55
## 2 2 78 0.43
b1 = fviz_cluster(km, data = data, elipse.type = "convex", main = "2 clusters") + theme_minimal()
grid.arrange(a1, b1, ncol=2)
To summarize our data wass plitted into 2 clusters where red cluster gathered more countries with higher population .
k2 = fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_minimal()
k2
For pam clustering the best to choose 2
cm = eclust(data, "pam", k = 2, graph = FALSE)
pam1 = fviz_silhouette(cm)
## cluster size ave.sil.width
## 1 1 97 0.37
## 2 2 158 0.54
pam2 = fviz_cluster(cm, data = data, elipse.type = "convex", main = "2 clusters") + theme_minimal()
grid.arrange(pam1, pam2, ncol = 2)
By comparing PAM method (2 clusters) with Kmeans method (2clusters) it can be noticed that average silhoutte of PAM method (0,48)is lower comparing to the Kmeans method which silhouette is equal (0,51). The higher statistics the better. Overall, quality of clusters is good, but we see some negative values for cluster 1 using pam method which most likely caused by outliers from both clusters. However this problem does not appeer in k-means method.
Statistics in clustered groups
We can check a distance of every single observation from the centroid of cluster. If the bin is higher than more distant locations of points within given cluster is undesirable.
#install.packages("cclust")
library(cclust)
library(flexclust)
## Loading required package: grid
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'flexclust'
## The following object is masked from 'package:cclust':
##
## cclust
a1<-cclust(data, 2, dist="euclidean")
## Found more than one class "kcca" in cache; using the first, from namespace 'kernlab'
## Also defined by 'flexclust'
## Found more than one class "kcca" in cache; using the first, from namespace 'kernlab'
## Also defined by 'flexclust'
stripes(a1)
a2<-cclust(data, 2, dist="manhattan")
stripes(a2)
Now we will perform hierarchical clusterization.
library(stats)
library(ClustGeo)
fd = dist(data)
hc = hclust(fd, method = "complete")
plot(hc, main = "Cluster Dendogram")
rect.hclust(hc, k = 2)
clust.vec.2<-cutree(hc, k=2)
clust.vec.4<-cutree(hc, k=4)
Hierarhical clustering shows us 2 distincitive groups of observations. As we can see from the tree that it is quit hard to read the information, thus i will transpose the data set.
Hierarhical clustering shows us 2 distincitive groups of observations. As we can see from the tree that it is quit hard to read the information, thus i will transpose the data set.
dm<-dist(t(data))
hc1<-hclust(dm, method="complete")
plot(hc1)
x<-rect.hclust(hc1, k=2)
We can see years within clusters.
library(factoextra)
fviz_cluster(list(data= data, cluster=clust.vec.2))
library(cluster)
plot(silhouette(clust.vec.2,fd))
Based on the results above we can see that gave us the best result so far when compared with others with average silhouette width of 0.82.
Sources:
1.Katarzyna Kopczewska. Unsupervised Learning. Presentation from the classes.
2.https://cran.r-project.org/web/packages/FunCluster/FunCluster.pdf