The data consist of the percentage of the workforce employed in different industries in European countries during 1979. The purpose of examining these data is to gain insight into patterns of employment (if any) in European countries in the 1970s.
Variable Names:
Country: Name of country
Agr: Percentage employed in agriculture
Min: Percentage employed in mining
Man: Percentage employed in manufacturing
PS: Percentage employed in power supply industries
Con: Percentage employed in construction
SI: Percentage employed in service industries
Fin: Percentage employed in finance
SPS: Percentage employed in social and personal services
TC: Percentage employed in transport and communications
The following packages have been used for the analysis:
library(ggplot2)
library(dplyr)
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(dendextend) # for comparing two dendrograms
library(corrplot)
The Country column is moved into the row names of the dataset.
employment <- read.csv("C:/Users/rohit/Documents/RPubs_uploads/proj_5/employment/employment.txt", sep="\t")
#use country names as row names, then drop the Country column (scaling is done later)
row.names(employment)<-employment$Country
employment <- employment[,-1]
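As a minimal sketch of the same step, the first column can also be turned into row names while reading the file. The two-row table and the country labels below are hypothetical stand-ins for employment.txt, and the text = argument is used only so that the example runs without the file.

```r
# Hypothetical tab-separated content standing in for employment.txt
raw_text <- "Country\tAgr\tMin\nCountryA\t3.3\t0.9\nCountryB\t9.2\t0.1\n"

# row.names = 1 tells read.delim to use the first column as row names,
# so no separate "assign row names, then drop the column" step is needed
toy <- read.delim(text = raw_text, row.names = 1)

print(toy)
print(rownames(toy))
```

With the real file, the same idea would be read.delim(path, row.names = 1), assuming the first column holds unique country names.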
All 9 variables in the data are numeric.
glimpse(employment)
## Observations: 26
## Variables: 9
## $ Agr <dbl> 3.3, 9.2, 10.8, 6.7, 23.2, 15.9, 7.7, 6.3, 2.7, 12.7, 13.0...
## $ Min <dbl> 0.9, 0.1, 0.8, 1.3, 1.0, 0.6, 3.1, 0.1, 1.4, 1.1, 0.4, 0.6...
## $ Man <dbl> 27.6, 21.8, 27.5, 35.8, 20.7, 27.6, 30.8, 22.5, 30.2, 30.2...
## $ PS <dbl> 0.9, 0.6, 0.9, 0.9, 1.3, 0.5, 0.8, 1.0, 1.4, 1.4, 1.3, 0.6...
## $ Con <dbl> 8.2, 8.3, 8.9, 7.3, 7.5, 10.0, 9.2, 9.9, 6.9, 9.0, 7.4, 8....
## $ SI <dbl> 19.1, 14.6, 16.8, 14.4, 16.8, 18.1, 18.5, 18.0, 16.9, 16.8...
## $ Fin <dbl> 6.2, 6.5, 6.0, 5.0, 2.8, 1.6, 4.6, 6.8, 5.7, 4.9, 5.5, 2.4...
## $ SPS <dbl> 26.6, 32.2, 22.6, 22.3, 20.8, 20.1, 19.2, 28.5, 28.3, 16.8...
## $ TC <dbl> 7.2, 7.1, 5.7, 6.1, 6.1, 5.7, 6.2, 6.8, 6.4, 7.0, 7.6, 6.7...
The summary of each variable is shown below.
summary(employment)
## Agr Min Man PS
## Min. : 2.70 Min. :0.100 Min. : 7.90 Min. :0.1000
## 1st Qu.: 7.70 1st Qu.:0.525 1st Qu.:23.00 1st Qu.:0.6000
## Median :14.45 Median :0.950 Median :27.55 Median :0.8500
## Mean :19.13 Mean :1.254 Mean :27.01 Mean :0.9077
## 3rd Qu.:23.68 3rd Qu.:1.800 3rd Qu.:30.20 3rd Qu.:1.1750
## Max. :66.80 Max. :3.100 Max. :41.20 Max. :1.9000
## Con SI Fin SPS
## Min. : 2.800 Min. : 5.20 Min. : 0.500 Min. : 5.30
## 1st Qu.: 7.525 1st Qu.: 9.25 1st Qu.: 1.225 1st Qu.:16.25
## Median : 8.350 Median :14.40 Median : 4.650 Median :19.65
## Mean : 8.165 Mean :12.96 Mean : 4.000 Mean :20.02
## 3rd Qu.: 8.975 3rd Qu.:16.88 3rd Qu.: 5.925 3rd Qu.:24.12
## Max. :11.500 Max. :19.10 Max. :11.300 Max. :32.40
## TC
## Min. :3.200
## 1st Qu.:5.700
## Median :6.700
## Mean :6.546
## 3rd Qu.:7.075
## Max. :9.400
#standardize each variable to mean 0 and standard deviation 1 before computing distances
employment <- scale(employment)
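To make the effect of scale() concrete, the sketch below standardizes a small hypothetical data frame and compares one column with the manual formula (value minus column mean, divided by column standard deviation). The column names a and b and their values are made up for illustration.

```r
# Hypothetical toy data with two variables on very different scales
toy <- data.frame(a = c(2, 4, 6), b = c(10, 20, 60))

scaled <- scale(toy)

# Manual standardization of column a: (x - mean) / sd
manual_a <- (toy$a - mean(toy$a)) / sd(toy$a)

print(scaled)
print(manual_a)            # -1 0 1
print(colMeans(scaled))    # both columns are centered at (numerically) zero
print(apply(scaled, 2, sd))
```

Without this step, variables with a large spread such as Agr would dominate the Euclidean distances used by the clustering methods.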
The correlations between the variables are plotted below, with the variables ordered by hierarchical clustering:
corrplot(cor(employment), order = "hclust")
The distance matrix between the countries shows that Turkey and Yugoslavia are far from the other countries.
#distance matrix
distance <- get_dist(employment)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
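For intuition about what the distance matrix contains, here is a small base R sketch of Euclidean distance on three hypothetical points. We assume get_dist() is using its default Euclidean method here, which to our knowledge matches dist(); the points A, B and C are made up.

```r
# Three hypothetical observations in two dimensions
toy <- matrix(c(0, 0,
                3, 4,
                6, 8),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("x", "y")))

d <- dist(toy, method = "euclidean")
d_mat <- as.matrix(d)
print(d_mat)

# Manual check for A and B: square root of the summed squared differences
manual_ab <- sqrt(sum((toy["A", ] - toy["B", ])^2))
print(manual_ab)  # 5
```

A row or column that is uniformly large in such a matrix, as for Turkey and Yugoslavia above, indicates an observation far from all others.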
Clusters are created using the K-means clustering algorithm. Different values of k are tried, and the corresponding results are plotted below.
As in the distance matrix, Turkey and Yugoslavia appear far from the other countries and form a cluster of their own.
#kmeans with diff k vals
k2 <- kmeans(employment, centers = 2, nstart = 25)
k3 <- kmeans(employment, centers = 3, nstart = 25)
k4 <- kmeans(employment, centers = 4, nstart = 25)
k5 <- kmeans(employment, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = employment) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = employment) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = employment) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = employment) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
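The nstart = 25 argument in the calls above asks kmeans() to try 25 random sets of starting centers and keep the solution with the lowest total within-cluster sum of squares. The sketch below compares one start with 25 starts on hypothetical simulated data; the data, the seeds and the choice of 3 centers are assumptions for illustration only.

```r
# Hypothetical toy data: three groups of 20 points in two dimensions
set.seed(7)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)

# One random start versus the best of 25 random starts
set.seed(123)
single <- kmeans(toy, centers = 3, nstart = 1)
set.seed(123)
multi <- kmeans(toy, centers = 3, nstart = 25)

# Lower total within-cluster sum of squares means tighter clusters
print(c(single = single$tot.withinss, multi = multi$tot.withinss))
print(multi$size)
```

On easy data like this the two runs often agree; multiple starts matter more when the groups overlap or k is larger.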
Using the total within-cluster sum of squares (elbow) method to check within-cluster similarity, we see a bend at 3 clusters, so this chart suggests 3 clusters.
The average silhouette width method, which measures how well separated the clusters are, also shows maximum separation at 3 clusters.
Hence we proceed with 3 clusters.
#selecting number of clusters
set.seed(12420246)
fviz_nbclust(employment, kmeans, method = "wss")
fviz_nbclust(employment, kmeans, method = "silhouette")
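The elbow chart can be approximated in base R by running kmeans() for a range of k and recording the total within-cluster sum of squares. This is a rough sketch on hypothetical simulated data, not factoextra's own implementation; the range 1:6 and nstart = 25 are assumptions.

```r
# Hypothetical toy data: three well-separated groups in two dimensions
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)

k_values <- 1:6

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# For k = 1 the value equals the total sum of squares around the overall mean
total_ss <- sum(scale(toy, scale = FALSE)^2)

print(data.frame(k = k_values, wss = wss))
print(total_ss)
```

The "bend" is the k after which adding another cluster reduces wss only slightly; reading it off a chart remains a judgment call.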
We create 3 clusters on the dataset. Because the clusters found depend on the random starting centers, we use 25 random starts (nstart = 25) and keep the best solution.
#creating final kmeans model
set.seed(12420246)
final <- kmeans(employment, 3, nstart = 25)
#comparing cluster wise
k_means_cluster <- cbind(employment, kmeans_cluster = final$cluster)
cluster <- aggregate(k_means_cluster,by=list(final$cluster),mean)
cluster
## Group.1 Agr Min Man PS Con
## 1 1 2.4840999 -0.1585972 -2.09163650 -0.81786082 -2.6223997
## 2 2 0.3526007 0.9496004 0.38775681 0.14568146 0.1121882
## 3 3 -0.4868128 -0.4549756 0.06757616 0.02939187 0.2717058
## SI Fin SPS TC kmeans_cluster
## 1 -1.5644365 0.7838767 -1.6725978 -2.11729825 1
## 2 -0.9524484 -1.0332921 -0.4097020 0.50582974 2
## 3 0.6717788 0.4186614 0.4139257 0.01174741 3
We also create clusters using hierarchical clustering.
The following dendrogram shows the clusters that can be formed. Again, Turkey and Yugoslavia form a cluster of their own.
# Dissimilarity matrix
d <- dist(employment, method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
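To illustrate how complete linkage builds the tree, the sketch below clusters five hypothetical one-dimensional points. Under complete linkage the distance between two clusters is the largest pairwise distance between their members, which is what the merge heights record.

```r
# Five hypothetical points on a line; the last one is an outlier
toy <- matrix(c(0, 1, 5, 6, 20), ncol = 1)

hc <- hclust(dist(toy, method = "euclidean"), method = "complete")

# Merge heights: {0,1} and {5,6} join at 1, the two pairs join at
# max distance |0 - 6| = 6, and the outlier joins last at |0 - 20| = 20
print(hc$height)  # 1 1 6 20

# Cutting into 2 clusters isolates the outlier
clusters <- cutree(hc, k = 2)
print(clusters)   # 1 1 1 1 2
```

An isolated point that joins only at the top of the tree behaves like Turkey and Yugoslavia in the dendrogram above.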
Using the average silhouette width method, which measures how well separated the clusters are, we find maximum separation at 2 clusters.
Using the total within-cluster sum of squares method to check within-cluster similarity, we do not see a significant bend until 3 or 4 clusters, and even then it is not decisive. Hence we proceed with 2 clusters.
#finding number of clusters
fviz_nbclust(employment, FUN = hcut, method = "wss")
fviz_nbclust(employment, FUN = hcut, method = "silhouette")
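As a rough cross-check, the average silhouette width can also be computed directly from a complete-linkage tree with cluster::silhouette(). Note that, to our knowledge, hcut() applies its own default linkage (ward.D2) rather than complete linkage, so its chart need not match the tree in hc1 exactly. The simulated data and the range 2:5 below are assumptions for illustration.

```r
library(cluster)  # provides silhouette()

# Hypothetical toy data: two tight groups plus one far-away point
set.seed(1)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
  c(10, 10)
)

d <- dist(toy)
hc <- hclust(d, method = "complete")

# Average silhouette width for each candidate number of clusters
k_values <- 2:5
avg_sil <- sapply(k_values, function(k) {
  cl <- cutree(hc, k = k)
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- k_values

print(avg_sil)
best_k <- as.integer(names(which.max(avg_sil)))
print(best_k)
```

Values close to 1 indicate well-separated clusters; values near 0 indicate overlapping ones.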
We see that one cluster contains only 2 countries, Turkey and Yugoslavia, and the other cluster contains the rest.
#creating final model: cut the complete-linkage tree into 2 clusters
seed.3clust = cutree(hc1,k=2)
table(seed.3clust)
## seed.3clust
## 1 2
## 24 2
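The table only gives cluster sizes. A short sketch of how the members of each cluster could be listed is shown below, using hypothetical labelled points; with the real data the same split() call would be applied to the vector returned by cutree(hc1, k = 2).

```r
# Hypothetical labelled observations; E is far from the rest
toy <- matrix(c(0, 1, 5, 6, 20), ncol = 1,
              dimnames = list(c("A", "B", "C", "D", "E"), "x"))

hc <- hclust(dist(toy), method = "complete")
cl <- cutree(hc, k = 2)

# cutree() returns a vector named by the row labels,
# so split() groups the labels by cluster number
members <- split(names(cl), cl)
print(members)
```

This makes it easy to confirm by name which observations form the small cluster.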
We show the separation on the dendrogram and also plot the 2 clusters.
#plots
plot(hc1, cex = 0.6, hang=-1)
rect.hclust(hc1, k = 2, border = 2:5)
fviz_cluster(list(data = employment, cluster = seed.3clust))