INTRODUCTION

The percentage of people employed in different industries across European countries during 1979 is analyzed to gain insight into patterns of employment (if any) among European countries in the 1970s. The data contains employment percentages for agriculture, mining, manufacturing, power supply, construction, service industries, finance, social and personal services, and transport and communication. If you are interested only in the report, you can find it on SlideShare.

# Load packages for data manipulation, clustering, and visualization
library(tidyverse)
library(cluster)
library(knitr)
library(kableExtra)
library(fpc)
library(factoextra)
library(DT)

# Read the 1979 European employment data; one row per country
jobs_data <- read.table("europeanJobs.txt", header = TRUE)
rownames(jobs_data) <- jobs_data$Country
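
A quick preview of the data (the column abbreviations Agr, Min, Man, PS, Con, SI, Fin, SPS, and TC correspond to the industries listed above):

# Preview the first few countries and their raw employment percentages
kable(head(jobs_data), row.names = FALSE)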

From figure 1, we can see that most people were employed in manufacturing, agriculture, service industries, and social and personal services during the 1970s.

# Reshape to long format: one row per (country, industry) pair
jobs_gathered <- gather(jobs_data, "industry", "percentage", -Country)
industry_boxplot <- jobs_gathered %>%
  ggplot(aes(x = industry, y = percentage)) +
  geom_boxplot() +
  xlab("Industry") + ylab("Percentage employed") +
  geom_jitter(alpha = 0.4)
industry_boxplot

The distance plot below starts to illustrate which countries have large dissimilarities (red) versus those that appear fairly similar (teal). We can see that Turkey, Greece, and Yugoslavia are very different from the rest of the group.

# Pairwise distances between countries (numeric columns only)
distance <- get_dist(jobs_data[, -1])
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

The data is standardized before clustering so that no single industry dominates the distance calculations.

# Standardize each numeric column to mean 0 and standard deviation 1
jobs_data[, -1] <- scale(jobs_data[, -1])
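
A quick sanity check confirms the effect of scale():

# After scaling, every column should have mean ~0 and sd 1
round(colMeans(jobs_data[, -1]), 3)
round(apply(jobs_data[, -1], 2, sd), 3)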

K-MEANS CLUSTERING

K-means is a simple unsupervised machine learning algorithm in which data points are grouped together based on similarity, with the number of groups denoted by k. Figure 2 illustrates which countries have large dissimilarities (red) versus those that appear fairly similar (teal); Turkey and Yugoslavia stand apart from the rest of the group. To start with, we try to create 5 groups.
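To make the mechanics concrete, here is a minimal sketch of the assignment and update steps that kmeans() iterates internally (illustrative only; the seed and the choice of 5 starting countries are arbitrary):

# One Lloyd iteration, written out by hand (kmeans() repeats this to convergence)
X <- as.matrix(jobs_data[, -1])
set.seed(1)
centers <- X[sample(nrow(X), 5), ]  # 5 random countries as initial centers
# Assignment step: each country joins its nearest center (squared Euclidean distance)
nearest <- apply(X, 1, function(p) which.min(colSums((t(centers) - p)^2)))
# Update step: each center moves to the mean of its assigned countries
# (a cluster left empty would yield NaN here; kmeans() handles that case)
centers <- t(sapply(1:5, function(k) colMeans(X[nearest == k, , drop = FALSE])))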

# Fit k-means with 5 clusters on the numeric columns
employment_cluster <- kmeans(jobs_data[,-1], 5)

The 5 clusters are visualized below.

# Plot the 5 clusters on the first two principal components
clusplot(jobs_data[,-1], employment_cluster$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0, main = "Clustering of employment percentage")

OPTIMAL NUMBER OF CLUSTERS

We need to find the number of clusters for which the model does not overfit but still groups the data according to its actual structure.

In the elbow method, the total within-cluster sum of squares is plotted as a function of the number of clusters. The first few clusters add much information, but at some point the marginal gain drops, producing an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. This “elbow” cannot always be unambiguously identified. In the plot below, we compute the within-cluster sum of squares for 1 to 10 clusters and choose 3.

# Elbow method for finding the optimal number of clusters:
# compute and plot the total within-cluster sum of squares for k = 1 to k = 10

k.max <- 10

wss <- sapply(1:k.max,
              function(k){ kmeans(jobs_data[,-1], k, nstart = 25)$tot.withinss })
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

We verify the number of clusters using the average silhouette width as well. The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. We choose the number of clusters that maximizes the average silhouette width, which again gives 3.

# Average silhouette width for a range of cluster counts
fviz_nbclust(jobs_data[,-1], kmeans, method = "silhouette")
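
As a cross-check, the same quantity can be computed directly with silhouette() from the cluster package (a minimal sketch of what fviz_nbclust() reports; the range k = 2 to 6 is an arbitrary choice):

# Average silhouette width for k = 2..6, computed by hand
d <- dist(jobs_data[, -1])
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(jobs_data[, -1], centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil  # expect the maximum at k = 3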

SELECTED CLUSTERING MODEL

We create a clustering model with 3 groups.

# Final model: 3 clusters, with 25 random starts for stability
k3_model <- kmeans(jobs_data[,-1], centers = 3, nstart = 25)
fviz_cluster(k3_model, geom = c("point", "text"), data = jobs_data[,-1]) +
  ggtitle("Clustering with 3 groups")

k3_model
## K-means clustering with 3 clusters of sizes 8, 16, 2
## 
## Cluster means:
##          Agr        Min         Man          PS        Con         SI
## 1  0.3526007  0.9496004  0.38775681  0.14568146  0.1121882 -0.9524484
## 2 -0.4868128 -0.4549756  0.06757616  0.02939187  0.2717058  0.6717788
## 3  2.4840999 -0.1585972 -2.09163650 -0.81786082 -2.6223997 -1.5644365
##          Fin        SPS          TC
## 1 -1.0332921 -0.4097020  0.50582974
## 2  0.4186614  0.4139257  0.01174741
## 3  0.7838767 -1.6725978 -2.11729825
## 
## Clustering vector:
##        Belgium        Denmark         France       WGermany        Ireland 
##              2              2              2              2              2 
##          Italy     Luxembourg    Netherlands             UK        Austria 
##              2              2              2              2              2 
##        Finland         Greece         Norway       Portugal          Spain 
##              2              1              2              2              2 
##         Sweden    Switzerland         Turkey       Bulgaria Czechoslovakia 
##              2              2              3              1              1 
##       EGermany        Hungary         Poland        Rumania           USSR 
##              1              1              1              1              1 
##     Yugoslavia 
##              3 
## 
## Within cluster sum of squares by cluster:
## [1] 39.14856 65.48018 13.44198
##  (between_SS / total_SS =  47.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
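
For a more readable view of the membership, the countries can be listed per cluster:

# List the countries belonging to each of the 3 clusters
split(jobs_data$Country, k3_model$cluster)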

HIERARCHICAL CLUSTERING

Hierarchical clustering is performed using Euclidean distance and complete linkage. Hierarchical clustering is an alternative approach to k-means for identifying groups in the dataset. It does not require us to pre-specify the number of clusters, as the k-means approach does. Maximum or complete linkage clustering computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the largest (i.e., maximum) of these dissimilarities as the distance between the two clusters. It tends to produce compact clusters.
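As a small illustration of the linkage rule (a sketch only; hclust() applies this rule at every merge, and the two country groups here are picked arbitrarily):

# Complete-linkage distance between two groups = largest pairwise distance
dmat <- as.matrix(dist(jobs_data[, -1]))
complete_link <- function(a, b) max(dmat[a, b])
complete_link(c("Turkey", "Yugoslavia"), c("Belgium", "Denmark"))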

# Hierarchical clustering
# Dissimilarity matrix on the numeric columns
euclidian_dist <- dist(jobs_data[,-1], method = "euclidean")

# Hierarchical clustering using complete linkage
hc1 <- hclust(euclidian_dist, method = "complete")
fviz_nbclust(jobs_data[,-1], FUN = hcut, method = "wss")

fviz_nbclust(jobs_data[,-1], FUN = hcut, method = "silhouette")

The elbow method suggests 3 clusters and the silhouette method suggests 2. We have chosen 2 clusters, and the figure below shows the cluster dendrogram. Both clustering methods produce similar results; in particular, Turkey and Yugoslavia are grouped together in both (the cross-tabulation after the dendrogram makes this explicit).

fviz_dend(hc1, k = 2,
          cex = 0.5,                 # label size
          k_colors = c("#00AFBB", "#E7B800"),
          color_labels_by_k = TRUE,  # color labels by group
          rect = TRUE,               # add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800"),
          rect_fill = TRUE)
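
To make the comparison with k-means explicit, we can cut the dendrogram at 2 groups and cross-tabulate against the k-means labels (cluster numbers themselves are arbitrary):

# Cut the tree into 2 groups and compare with the 3-cluster k-means labels
hc_groups <- cutree(hc1, k = 2)
table(hierarchical = hc_groups, kmeans = k3_model$cluster)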