Unsupervised learning

In this mini-project on unsupervised learning with the European jobs dataset, I explore machine learning methodologies such as classic k-means and hierarchical clustering.

1.0 Clustering

What is Clustering?

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Types of Clustering

Broadly speaking, clustering can be divided into two subgroups (a short sketch follows the list):

  • Hard clustering: each data point either belongs to a cluster completely or does not.
  • Soft clustering: instead of assigning each data point to exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned.
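As a minimal illustration of the two flavours (my own sketch, not part of the project's analysis), the code below contrasts a hard assignment from kmeans() with the soft membership probabilities returned by fanny() from the cluster package, using the built-in USArrests data; the variable names are hypothetical.

# Hard vs. soft clustering on a built-in dataset (illustration only)
library(cluster)        # fanny() provides fuzzy (soft) clustering

x <- scale(USArrests)   # standardize the variables

# Hard clustering: each state gets exactly one cluster label
hard <- kmeans(x, centers = 2, nstart = 25)
head(hard$cluster)

# Soft clustering: each state gets a membership probability per cluster
soft <- fanny(x, k = 2)
head(round(soft$membership, 2))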

1.1 Data Import

eurjb<-read.csv("europeanJobsData.csv")
str(eurjb)
## 'data.frame':    26 obs. of  10 variables:
##  $ Country: Factor w/ 26 levels "Austria","Belgium",..: 2 5 8 25 11 12 13 14 23 1 ...
##  $ Agr    : num  3.3 9.2 10.8 6.7 23.2 15.9 7.7 6.3 2.7 12.7 ...
##  $ Min    : num  0.9 0.1 0.8 1.3 1 0.6 3.1 0.1 1.4 1.1 ...
##  $ Man    : num  27.6 21.8 27.5 35.8 20.7 27.6 30.8 22.5 30.2 30.2 ...
##  $ PS     : num  0.9 0.6 0.9 0.9 1.3 0.5 0.8 1 1.4 1.4 ...
##  $ Con    : num  8.2 8.3 8.9 7.3 7.5 10 9.2 9.9 6.9 9 ...
##  $ SI     : num  19.1 14.6 16.8 14.4 16.8 18.1 18.5 18 16.9 16.8 ...
##  $ Fin    : num  6.2 6.5 6 5 2.8 1.6 4.6 6.8 5.7 4.9 ...
##  $ SPS    : num  26.6 32.2 22.6 22.3 20.8 20.1 19.2 28.5 28.3 16.8 ...
##  $ TC     : num  7.2 7.1 5.7 6.1 6.1 5.7 6.2 6.8 6.4 7 ...
rownames(eurjb)<-eurjb$Country
kable(eurjb)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
Country Agr Min Man PS Con SI Fin SPS TC
Belgium Belgium 3.3 0.9 27.6 0.9 8.2 19.1 6.2 26.6 7.2
Denmark Denmark 9.2 0.1 21.8 0.6 8.3 14.6 6.5 32.2 7.1
France France 10.8 0.8 27.5 0.9 8.9 16.8 6.0 22.6 5.7
WGermany WGermany 6.7 1.3 35.8 0.9 7.3 14.4 5.0 22.3 6.1
Ireland Ireland 23.2 1.0 20.7 1.3 7.5 16.8 2.8 20.8 6.1
Italy Italy 15.9 0.6 27.6 0.5 10.0 18.1 1.6 20.1 5.7
Luxembourg Luxembourg 7.7 3.1 30.8 0.8 9.2 18.5 4.6 19.2 6.2
Netherlands Netherlands 6.3 0.1 22.5 1.0 9.9 18.0 6.8 28.5 6.8
UK UK 2.7 1.4 30.2 1.4 6.9 16.9 5.7 28.3 6.4
Austria Austria 12.7 1.1 30.2 1.4 9.0 16.8 4.9 16.8 7.0
Finland Finland 13.0 0.4 25.9 1.3 7.4 14.7 5.5 24.3 7.6
Greece Greece 41.4 0.6 17.6 0.6 8.1 11.5 2.4 11.0 6.7
Norway Norway 9.0 0.5 22.4 0.8 8.6 16.9 4.7 27.6 9.4
Portugal Portugal 27.8 0.3 24.5 0.6 8.4 13.3 2.7 16.7 5.7
Spain Spain 22.9 0.8 28.5 0.7 11.5 9.7 8.5 11.8 5.5
Sweden Sweden 6.1 0.4 25.9 0.8 7.2 14.4 6.0 32.4 6.8
Switzerland Switzerland 7.7 0.2 37.8 0.8 9.5 17.5 5.3 15.4 5.7
Turkey Turkey 66.8 0.7 7.9 0.1 2.8 5.2 1.1 11.9 3.2
Bulgaria Bulgaria 23.6 1.9 32.3 0.6 7.9 8.0 0.7 18.2 6.7
Czechoslovakia Czechoslovakia 16.5 2.9 35.5 1.2 8.7 9.2 0.9 17.9 7.0
EGermany EGermany 4.2 2.9 41.2 1.3 7.6 11.2 1.2 22.1 8.4
Hungary Hungary 21.7 3.1 29.6 1.9 8.2 9.4 0.9 17.2 8.0
Poland Poland 31.1 2.5 25.7 0.9 8.4 7.5 0.9 16.1 6.9
Rumania Rumania 34.7 2.1 30.1 0.6 8.7 5.9 1.3 11.7 5.0
USSR USSR 23.7 1.4 25.8 0.6 9.2 6.1 0.5 23.6 9.3
Yugoslavia Yugoslavia 48.7 1.5 16.8 1.1 4.9 6.4 11.3 5.3 4.0

1.2 Test/Train Split

An 80-20 train/test split of the data:

#scale
eurjb.s<-scale(eurjb[,2:10])
#test train data
set.seed(13383645)
train_percent <-0.80
index <- sample(nrow(eurjb.s),nrow(eurjb.s)*train_percent)
eurjb.s.train <- eurjb.s[index,]
eurjb.s.test <- eurjb.s[-index,]

Graphical depiction of distance matrix

fviz_dist helps visualize the dissimilarity matrix. In the plot below, similar objects are close to one another. Blue corresponds to a small distance and orange indicates a large distance between observations.

get_dist: computes a distance matrix between the rows of a data matrix. The default distance is the Euclidean; however, get_dist also supports other distances such as Manhattan, Pearson correlation distance, Spearman correlation distance, and Kendall correlation distance.
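For instance (a small sketch of my own, not part of the original analysis), a correlation-based distance can be requested via the method argument:

# Spearman correlation distance between countries (illustration only)
distance_spearman <- get_dist(eurjb.s.train, method = "spearman")
round(as.matrix(distance_spearman)[1:4, 1:4], 2)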

distance <- get_dist(eurjb.s.train)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

1.3 K-means Clustering

The basic Idea

The basic idea of k-means clustering is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation) is minimized. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:

\[W(C_k) = \sum_{x_i \in C_k}(x_i - \mu_k)^2,\] where:

  • \(x_i\) is a data point belonging to the cluster \(C_k\)
  • \(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\)
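As a quick sanity check on this formula (a sketch I added, not part of the original analysis), the total within-cluster variation can be computed by hand from a kmeans() fit and compared with its tot.withinss component; km and wss_manual are hypothetical names, and eurjb.s.train is the scaled training matrix created in section 1.2.

# Verify the within-cluster SS formula against kmeans() output
set.seed(13383645)
km <- kmeans(eurjb.s.train, centers = 2, nstart = 25)

# Sum of W(C_k) over all clusters: squared Euclidean distance of each
# point to the centroid of its assigned cluster, summed over all points
wss_manual <- sum(sapply(seq_len(nrow(eurjb.s.train)), function(i) {
  mu_k <- km$centers[km$cluster[i], ]
  sum((eurjb.s.train[i, ] - mu_k)^2)
}))

c(manual = wss_manual, kmeans = km$tot.withinss)  # the two values should match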

Packages used: factoextra, cluster, fpc. Method: kmeans(). Distance metric: Euclidean distance.

How it works?

Since k-means cluster analysis starts with k randomly chosen centroids, a different solution can be obtained each time the function is invoked. So, use the set.seed() function to guarantee that the results are reproducible. Additionally, this clustering approach can be sensitive to the initial selection of centroids.

The kmeans() function has an nstart option that attempts multiple initial configurations and reports on the best one.

For example, adding nstart=25 will generate 25 initial configurations. This approach is often recommended.
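For example (a small sketch, not part of the original run; fit25 is a hypothetical name), the call below asks kmeans() to try 25 random starting configurations and keep the one with the lowest total within-cluster sum of squares:

set.seed(13383645)
# 25 random starts; kmeans() keeps the best configuration (lowest tot.withinss)
fit25 <- kmeans(eurjb.s.train, centers = 2, nstart = 25)
fit25$tot.withinss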

Algorithm steps: the k-means algorithm can be summarized as follows (a minimal sketch of the iteration follows the list):

  • Specify the number of clusters (K) to be created (by the analyst)
  • Randomly select k objects from the data set as the initial cluster centers or means
  • Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid
  • For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in the kth cluster; p is the number of variables.
  • Iteratively minimize the total within sum of squares. That is, iterate steps 3 and 4 until the cluster assignments stop changing (the within sum of squares changes by less than a threshold in two consecutive iterations) or the maximum number of iterations is reached. By default, R uses 10 as the maximum number of iterations.
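To make these steps concrete, here is a minimal sketch of the iteration (a plain Lloyd-style k-means rather than the Hartigan-Wong variant used by kmeans(); simple_kmeans is my own illustrative function, with no handling of empty clusters):

# Minimal k-means iteration (Lloyd's algorithm), for illustration only
simple_kmeans <- function(X, k, max_iter = 10, seed = 13383645) {
  set.seed(seed)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # step 2: random initial centers
  assign <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # step 3: assign each observation to its closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_assign <- max.col(-d)
    if (all(new_assign == assign)) break            # step 5: stop once assignments settle
    assign <- new_assign
    # step 4: recompute each centroid as the mean of its members
    centers <- apply(X, 2, function(col) tapply(col, assign, mean))
  }
  list(cluster = assign, centers = centers)
}

simple_kmeans(eurjb.s.train, k = 2)$cluster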

Function returns: the function returns the cluster memberships, centroids, sums of squares (within, between, total), and cluster sizes.

Random test for k=2

# K-Means Cluster Analysis
fit <- kmeans(eurjb.s.train, 2) #2 cluster solution
# Display the number of observations in each cluster
table(fit$cluster, dnn = "Clusters")
## Clusters
##  1  2 
##  5 15
fit
## K-means clustering with 2 clusters of sizes 5, 15
## 
## Cluster means:
##          Agr        Min        Man         PS        Con         SI
## 1 -0.1859426  1.6969895  0.7923082 0.83012873  0.1547263 -0.3929165
## 2 -0.1126145 -0.5228419 -0.2075354 0.03271443 -0.2422954  0.4507527
##          Fin        SPS         TC
## 1 -0.8195075 -0.2230130  0.5417630
## 2  0.3705599  0.1879467 -0.2296043
## 
## Clustering vector:
##     Luxembourg         France       Portugal    Netherlands        Finland 
##              1              2              2              2              2 
##    Switzerland       EGermany         Sweden        Belgium        Hungary 
##              2              1              2              2              1 
##       WGermany Czechoslovakia         Norway        Austria         Poland 
##              2              1              2              2              1 
##          Italy        Ireland         Turkey     Yugoslavia             UK 
##              2              2              2              2              2 
## 
## Within cluster sum of squares by cluster:
## [1]  17.8707 121.0600
##  (between_SS / total_SS =  20.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centres.
  • totss: The total sum of squares.
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
  • size: The number of points in each cluster.
  • iter: The number of (outer) iterations.
  • ifault: Integer indicator of a possible algorithm problem, for experts.

The cluster means for each attribute are shown in the summary above; individual components can also be pulled straight from the fitted object, as in the short sketch below.
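For instance (a small sketch of my own, not part of the original output):

# Accessing individual components of the kmeans fit
fit$size                   # number of countries in each cluster
fit$centers                # cluster means for each (scaled) attribute
fit$tot.withinss           # total within-cluster sum of squares
fit$betweenss / fit$totss  # proportion of variance that lies between clusters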

1.4 Visualization of Clusters

Plotting the clusters using fpc

#plot in k means
plotcluster(eurjb.s.train, fit$cluster)

Plotting the clusters using fviz_cluster

To view the result we can use fviz_cluster, which provides a nice graph of the clusters. Since we usually have more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.

fviz_cluster(fit, data = eurjb.s.train)

k3 <- kmeans(eurjb.s.train, centers = 3, nstart = 25)
k4 <- kmeans(eurjb.s.train, centers = 4, nstart = 25)
k5 <- kmeans(eurjb.s.train, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(fit, geom = "point", data = eurjb.s.train) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = eurjb.s.train) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = eurjb.s.train) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = eurjb.s.train) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)

Although this visual assessment tells us where true delineations occur (or do not occur, such as clusters 2 and 4 in the k = 5 graph) between clusters, it does not tell us what the optimal number of clusters is.

1.5 Determining Optimal Clusters

As you may recall, the analyst specifies the number of clusters to use, and preferably would like to use the optimal number. To aid the analyst, the following explains the three most popular methods for determining the optimal number of clusters:

  • Elbow method
  • Silhouette method
  • Gap statistic

(a) Elbow Method

Steps:

1) Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.

2) For each k, calculate the total within-cluster sum of squares (WSS) and plot the curve of WSS against the number of clusters k.

3) The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

We can implement this in R with the following code. The results suggest that 4 is the optimal number of clusters, as that is where the bend (knee, or elbow) appears.

set.seed(13383645)

# function to compute total within-cluster sum of square 
wss <- function(k) {
  kmeans(eurjb.s.train, k, nstart = 10 )$tot.withinss
}

# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15

# extract wss for k = 1 to 15 clusters
wss_values <- sapply(k.values, wss)

plot(k.values, wss_values,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")

A shortcut to the above code is fviz_nbclust:

#finding optimal number of clusters
set.seed(13383645)
fviz_nbclust(eurjb.s.train, kmeans, method = "wss")

(b) Silhouette Method

The average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.

We can use the silhouette() function in the cluster package to compute the average silhouette width. The following code computes this approach for 2 to 15 clusters. The results show that 2 clusters maximize the average silhouette value, with 4 clusters coming in as the second-best option.

# function to compute average silhouette for k clusters
avg_sil <- function(k) {
  km.res <- kmeans(eurjb.s.train, centers = k, nstart = 25)
  
  ss <- silhouette(km.res$cluster, dist(eurjb.s.train))
  mean(ss[, 3])
}

# Compute and plot the average silhouette for k = 2 to k = 15
k.values <- 2:15

# extract avg silhouette for 2-15 clusters
avg_sil_values <- sapply(k.values, avg_sil)

plot(k.values, avg_sil_values,
       type = "b", pch = 19, frame = FALSE, 
       xlab = "Number of clusters K",
       ylab = "Average Silhouettes")

#finding optimal number of clusters
set.seed(13383645)
fviz_nbclust(eurjb.s.train, kmeans, method = "silhouette")

(c) Gap statistic

The approach can be applied to any clustering method (e.g., k-means clustering, hierarchical clustering).

  • The gap statistic compares the total intra-cluster variation for different values of k with their expected values under a null reference distribution of the data (i.e. a distribution with no obvious clustering). The reference dataset is generated using Monte Carlo simulations of the sampling process. That is, for each variable \(x_i\) in the data set we compute its range \([\min(x_i), \max(x_i)]\) and generate values for the n points uniformly from that interval.

For the observed data and the reference data, the total intra-cluster variation is computed using different values of k. The gap statistic for a given k is defined as follows:

\[\mathrm{Gap}_n(k) = E^*_n\{\log(W_k)\} - \log(W_k),\]

where \(E^*_n\) denotes the expectation under a sample of size n from the reference distribution. \(E^*_n\) is defined via bootstrapping (B), by generating B copies of the reference dataset and computing the average \(\log(W^*_k)\).

The gap statistic measures the deviation of the observed \(W_k\) value from its expected value under the null hypothesis. The estimate of the optimal number of clusters \(\hat{k}\) is the value that maximizes \(\mathrm{Gap}_n(k)\). This means that the clustering structure is far away from the uniform distribution of points.

# compute gap statistic
set.seed(13383645)
gap_stat <- clusGap(eurjb.s.train, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 200)

#gap_stat
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = eurjb.s.train, FUNcluster = kmeans, K.max = 10, B = 200,     nstart = 25)
## B=200 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 1
##           logW   E.logW       gap     SE.sim
##  [1,] 2.925022 3.118381 0.1933588 0.05681814
##  [2,] 2.704118 2.848873 0.1447555 0.05494364
##  [3,] 2.481677 2.687948 0.2062709 0.05414343
##  [4,] 2.366974 2.547671 0.1806978 0.05563873
##  [5,] 2.211313 2.419279 0.2079659 0.05776668
##  [6,] 2.093985 2.297379 0.2033948 0.05867080
##  [7,] 1.980976 2.176374 0.1953982 0.06077398
##  [8,] 1.868596 2.054389 0.1857931 0.06411078
##  [9,] 1.734347 1.926873 0.1925257 0.06780857
## [10,] 1.607846 1.794276 0.1864299 0.07064066
#visualize
fviz_gap_stat(gap_stat)

The gap statistic suggests that the optimal number of clusters is 3.

However, since this method relies on bootstrap sampling and averages over the bootstrap samples to compute the gap, and since here we want clusters with the least intra-cluster variance, we choose the optimal k suggested by the elbow and silhouette methods.

We choose k=2

newCL<-data.frame(eurjb.s.train)%>%
  mutate(Cluster = fit$cluster,
         country = row.names(eurjb.s.train))

#rownames(newCL)<-rownames(eurjb.s.train)

The cluster labels can also be plotted against individual variables to see where each country stands for a pair of occupations; here I compare only Agriculture and Mining.

We can clearly see the difference and separation of countries based on these two occupations.

newCL %>%
  as_tibble() %>%
  ggplot(aes(Agr, Min, color = factor(Cluster), label = country)) +
  geom_text()

Cost-benefit analysis of k-means:

Benefits:
  • Can handle larger datasets than hierarchical clustering
  • Observations are not permanently committed to a cluster; assignments are improved during the algorithm by re-assignment

Costs:
  • Only works on continuous variables
  • Requires the number of clusters K to be provided
  • Severely affected by outliers

Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K. It has an added advantage over k-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.

2.0 Hierarchical Clustering

Hierarchical clustering can be divided into two main types: agglomerative and divisive.

Agglomerative clustering: It’s also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new bigger cluster (nodes). This procedure is iterated until all points are member of just one single big cluster (root) (see figure below). The result is a tree which can be plotted as a dendrogram.

Divisive hierarchical clustering: it’s also known as DIANA (Divisive Analysis) and works in a top-down manner. The algorithm is the inverse of AGNES. It begins with the root, in which all objects are included in a single cluster. At each step of the iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster (see figure below).

Note that agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering is good at identifying large clusters.

How do we measure the dissimilarity between two clusters of observations?

A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer this question. The most common methods are:

Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.

Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.

Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.

Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.

Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged.

[Ward.D is generally accepted in most cases]
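As a quick sketch (added for illustration; d, linkages, and trees are my own variable names), the linkage methods described above can be compared directly with hclust() on the same dissimilarity matrix:

# Build one tree per linkage method on the same distance matrix (illustration only)
d <- dist(eurjb.s.train)
linkages <- c("single", "complete", "average", "centroid", "ward.D")
trees <- lapply(linkages, function(m) hclust(d, method = m))
names(trees) <- linkages

# e.g. compare the 2-cluster memberships produced by each linkage
sapply(trees, function(tr) cutree(tr, k = 2))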

Data preprocessing steps to take care

  • No missing values: we dont have any
  • Scalled data: we scalled it above
#Calculate the distance matrix

#get_dist or dist both can be used here

# Dissimilarity matrix
seed.dist=dist(eurjb.s.train)
#Obtain clusters using the Wards method
seed.hclust.w=hclust(seed.dist, method="ward.D")
plot(seed.hclust.w,cex = 0.6, hang = -1, main="The Ward Method")

#Obtain clusters using the complete method
seed.hclust.c=hclust(seed.dist, method="complete")

plot(seed.hclust.c,cex = 0.6, hang = -1,main="The Complete Method")


Alternatively, we can also use the agnes() function for this:

2.1 Agglomerative or Agnes

  • Bottom-up approach
  • Agglomerative coefficient: measures the amount of clustering structure found (values closer to 1 suggest a strong clustering structure)
  • Specify the agglomeration method to be used (i.e. "complete", "average", "single", "ward")
# Compute with agnes
hc1 <- agnes(eurjb.s.train, method = "ward")
hc2 <- agnes(eurjb.s.train, method = "complete")


# Agglomerative coefficient (values closer to 1 indicate stronger structure)
hc1$ac  # Ward: 0.789 (better)
## [1] 0.7895138
hc2$ac  # complete: 0.725
## [1] 0.7253516

We can clearly see that Ward performs better with agnes; let's check the other linkage methods as well.

# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

# function to compute coefficient
ac <- function(x) {
  agnes(eurjb.s.train, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.6483567 0.5760301 0.7253516 0.7895138

The Ward method gives the best score, so we choose it.

2.2 Divisive or Diana

  • Top-down approach
  • Unlike agnes, diana does not take a linkage method argument
# compute divisive hierarchical clustering
hc4 <- diana(eurjb.s.train)

# Divise coefficient; amount of clustering structure found
hc4$dc
## [1] 0.7175321

# plot dendrogram
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")

2.3 Working with Dendrograms

The height of the cut to the dendrogram controls the number of clusters obtained. It plays the same role as k in k-means clustering. In order to identify sub-groups (i.e. clusters), we can cut the dendrogram with cutree().

Among the dendrograms we built, the Ward method gave the best score, so as the next step we cut that dendrogram.

Ward’s method, k = 4

# Ward's method
# Cut tree into 4 groups
#hc1 <- agnes(eurjb.s.train, method = "ward")
sub_grp4 <- cutree(hc1, k = 4)

# Number of members in each cluster
table(sub_grp4)
## sub_grp4
##  1  2  3  4 
## 14  4  1  1

Cut the dendrogram: we choose the Ward method.

plot(seed.hclust.w,cex = 0.6, hang = -1, main="The Ward Method")
rect.hclust(hc1, k = 4, border = 2:5)

Visualize these groups:

fviz_cluster(list(data = eurjb.s.train, cluster = sub_grp4))

Ward’s method, k = 2

hc1 <- agnes(eurjb.s.train, method = "ward")
sub_grp2 <- cutree(hc1, k = 2)

# Number of members in each cluster
table(sub_grp2)
## sub_grp2
##  1  2 
## 18  2

Cut the dendrogram: we choose the Ward method.

plot(seed.hclust.w,cex = 0.6, hang = -1, main="The Ward Method")
rect.hclust(hc1, k = 2, border = 2:5)

Visualize these groups:

fviz_cluster(list(data = eurjb.s.train, cluster = sub_grp2))

Determining Optimal Clusters

Similar to how we determined the optimal number of clusters with k-means, we can apply the same approaches for hierarchical clustering:

Elbow Method

fviz_nbclust(eurjb.s.train, FUN = hcut, method = "wss")

Average Silhouette Method

fviz_nbclust(eurjb.s.train, FUN = hcut, method = "silhouette")

Gap Statistic Method

gap_stat <- clusGap(eurjb.s.train, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

Hence, we choose k=2

Apply the subgroups to data

e<-data.frame(eurjb.s.train) %>%  
  rownames_to_column('country') %>%
  mutate(cluster = sub_grp2) 

cluster1<-e%>%
  filter(cluster==1) 
  
cluster2<-e%>%
  filter(cluster==2) 

c1<-apply(cluster1[,2:10],2,mean) 
c2<-apply(cluster2[,2:10],2,mean) 

rbind(c1,c2)
##           Agr         Min        Man         PS        Con         SI
## c1 -0.4215072  0.05330626  0.2795435  0.3487268  0.1324444  0.4403101
## c2  2.4840999 -0.15859715 -2.0916365 -0.8178608 -2.6223997 -1.5644365
##            Fin        SPS         TC
## c1 -0.00593846  0.2805184  0.1944082
## c2  0.78387673 -1.6725978 -2.1172983

Observation: here we can see that the mean of each attribute in one group is clearly different from the corresponding mean in the other group.

Conclusion: Clustering can be a very useful tool for data analysis in the unsupervised setting. However, there are a number of issues that arise in performing clustering. In the case of hierarchical clustering, we need to be concerned about:

  • Which dissimilarity measure should be used?
  • Which linkage method should be used?
  • Where should we cut the dendrogram in order to obtain clusters?

Each of these decisions can have a strong impact on the results obtained. In practice, we try several different choices, and look for the one with the most useful or interpretable solution.

No answer in clustering is strictly right or wrong; whichever interpretation makes more sense to the analyst can be adopted.

2.4 Comparing Dendrograms

Yes! There is a way to compare two dendrograms side by side, using the tanglegram() function.

Comparison of the dendrograms we created from agnes with the Ward and complete methods:

# Create two dendrograms
dend1 <- as.dendrogram (hc1)
dend2 <- as.dendrogram (hc2)

tanglegram(dend1, dend2)

entg<-entanglement(dend1,dend2)

Here we can see that Poland, Hungary, Czechoslovakia, East Germany, Italy, Portugal, West Germany, and Switzerland are not aligned in the two dendrograms.

Also, the quality of the alignment of the two trees can be measured using the entanglement() function; here the entanglement is 0.528837 (lower values indicate better alignment).
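If needed, a sketch like the one below (my own addition, assuming dendextend's dendlist() and untangle() helpers are available) can search for rotations of the two trees that lower the entanglement before plotting:

# Try to reduce entanglement by rotating the trees before plotting (dendextend)
dl <- dendlist(dend1, dend2)
dl_untangled <- untangle(dl, method = "step2side")

tanglegram(dl_untangled)
entanglement(dl_untangled)  # typically lower than the original 0.528837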

3.0 Conclusion

Clustering can be a very useful tool for data analysis in the unsupervised setting. However, both types discussed here have some benefits and drawbacks.

On one hand, k-means requires the number of clusters K to be provided and is sensitive to outliers, but it can handle larger datasets and refines cluster assignments as it iterates.

Hierarchical clustering, on the other hand, is easy to implement and does not require a number of clusters to start with, although it involves many important decisions, such as which dissimilarity measure to use, how to build the tree, and where to cut it.

With these methods, there is no single right answer - any solution that exposes some interesting aspects of the data should be considered.