The data are the percentage employed in different industries in Europe countries during 1979. The purpose of examining this data is to get insight into patterns of employment (if any) amongst European countries in 1970s.
Variable Names:
The following packages have been used for the analysis:
library(ggplot2)
library(dplyr)
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(dendextend) # for comparing two dendrograms
library(corrplot)
The Country name column is changed to row name in the dataset.
employment <- read.csv("C:/Users/heman/Desktop/study/employment/employment.txt", sep="\t")
#get country name and scale variables
row.names(employment)<-employment$Country
employment <- employment[,-1]
The 9 variables are all numeric variables.
glimpse(employment)
## Observations: 26
## Variables: 9
## $ Agr <dbl> 3.3, 9.2, 10.8, 6.7, 23.2, 15.9, 7.7, 6.3, 2.7, 12.7, 13.0...
## $ Min <dbl> 0.9, 0.1, 0.8, 1.3, 1.0, 0.6, 3.1, 0.1, 1.4, 1.1, 0.4, 0.6...
## $ Man <dbl> 27.6, 21.8, 27.5, 35.8, 20.7, 27.6, 30.8, 22.5, 30.2, 30.2...
## $ PS <dbl> 0.9, 0.6, 0.9, 0.9, 1.3, 0.5, 0.8, 1.0, 1.4, 1.4, 1.3, 0.6...
## $ Con <dbl> 8.2, 8.3, 8.9, 7.3, 7.5, 10.0, 9.2, 9.9, 6.9, 9.0, 7.4, 8....
## $ SI <dbl> 19.1, 14.6, 16.8, 14.4, 16.8, 18.1, 18.5, 18.0, 16.9, 16.8...
## $ Fin <dbl> 6.2, 6.5, 6.0, 5.0, 2.8, 1.6, 4.6, 6.8, 5.7, 4.9, 5.5, 2.4...
## $ SPS <dbl> 26.6, 32.2, 22.6, 22.3, 20.8, 20.1, 19.2, 28.5, 28.3, 16.8...
## $ TC <dbl> 7.2, 7.1, 5.7, 6.1, 6.1, 5.7, 6.2, 6.8, 6.4, 7.0, 7.6, 6.7...
The summary of each variable is shown below.
summary(employment)
## Agr Min Man PS
## Min. : 2.70 Min. :0.100 Min. : 7.90 Min. :0.1000
## 1st Qu.: 7.70 1st Qu.:0.525 1st Qu.:23.00 1st Qu.:0.6000
## Median :14.45 Median :0.950 Median :27.55 Median :0.8500
## Mean :19.13 Mean :1.254 Mean :27.01 Mean :0.9077
## 3rd Qu.:23.68 3rd Qu.:1.800 3rd Qu.:30.20 3rd Qu.:1.1750
## Max. :66.80 Max. :3.100 Max. :41.20 Max. :1.9000
## Con SI Fin SPS
## Min. : 2.800 Min. : 5.20 Min. : 0.500 Min. : 5.30
## 1st Qu.: 7.525 1st Qu.: 9.25 1st Qu.: 1.225 1st Qu.:16.25
## Median : 8.350 Median :14.40 Median : 4.650 Median :19.65
## Mean : 8.165 Mean :12.96 Mean : 4.000 Mean :20.02
## 3rd Qu.: 8.975 3rd Qu.:16.88 3rd Qu.: 5.925 3rd Qu.:24.12
## Max. :11.500 Max. :19.10 Max. :11.300 Max. :32.40
## TC
## Min. :3.200
## 1st Qu.:5.700
## Median :6.700
## Mean :6.546
## 3rd Qu.:7.075
## Max. :9.400
employment <- scale(employment)
We notice that:
corrplot(cor(employment), order = "hclust")
On checking the distance matrix between the countries, Turkey and Yugoslavia are separated pretty far from other countries.
#distance matrix
distance <- get_dist(employment)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
We create clusters using K-means clustering.
We try different values of K in below plots.
Turkey and Yugoslavia seem to be pretty far away from other countries as we saw in distance matrix, creating a cluster of their own
#kmeans with diff k vals
k2 <- kmeans(employment, centers = 2, nstart = 25)
k3 <- kmeans(employment, centers = 3, nstart = 25)
k4 <- kmeans(employment, centers = 4, nstart = 25)
k5 <- kmeans(employment, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = employment) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = employment) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = employment) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = employment) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
Using the Total sum of Square method to check for in-cluster similarity, we see a bend at 3 clusters. Hence, we take 3 clusters according to this chart.
Using average silhouette width method, which gives how well separated the clusters are, we find maximum separation at cluster size 3.
Hence we proceed with making 3 clusters
#selecting number of clusters
set.seed(12420352)
fviz_nbclust(employment, kmeans, method = "wss")
fviz_nbclust(employment, kmeans, method = "silhouette")
We create 3 clusters on the dataset. the clusters created depends on the starting point hence we use 25 as the distinct number of starting points.
#creating final kmeans model
set.seed(12420352)
final <- kmeans(employment, 3, nstart = 25)
#comparing cluster wise
k_means_cluster <- cbind(employment, kmeans_cluster = final$cluster)
cluster <- aggregate(k_means_cluster,by=list(final$cluster),mean)
cluster
## Group.1 Agr Min Man PS Con
## 1 1 0.3526007 0.9496004 0.38775681 0.14568146 0.1121882
## 2 2 -0.4868128 -0.4549756 0.06757616 0.02939187 0.2717058
## 3 3 2.4840999 -0.1585972 -2.09163650 -0.81786082 -2.6223997
## SI Fin SPS TC kmeans_cluster
## 1 -0.9524484 -1.0332921 -0.4097020 0.50582974 1
## 2 0.6717788 0.4186614 0.4139257 0.01174741 2
## 3 -1.5644365 0.7838767 -1.6725978 -2.11729825 3
We also try to create clusters using hierarchical clustering method.
We get the following dendogram highlighting various clusters which can be created. Again Turkey and Yugoslavia have their own cluster.
# Dissimilarity matrix
d <- dist(employment, method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
Using average silhouette width method, which gives how well separated the clusters are, we find maximum separation at cluster size 2.
Using the Total sum of Square method to check for in-cluster similarity, we dont see a significant bend at until 3 or 4 clusters. But still its not decisive. Hence we proceed with making 2 clusters
#finding number of clusters
fviz_nbclust(employment, FUN = hcut, method = "wss")
fviz_nbclust(employment, FUN = hcut, method = "silhouette")
We see that one cluster conatins only 2 countries Yugoslavia and other cluster contains the rest.
#creating final model
seed.3clust = cutree(hc1,k=2)
table(seed.3clust)
## seed.3clust
## 1 2
## 24 2
We show the separation in the dendograms, and also show the plot of the 2 clusters.
#plots
plot(hc1, cex = 0.6, hang=-1)
rect.hclust(hc1, k = 2, border = 2:5)
fviz_cluster(list(data = employment, cluster = seed.3clust))