Machine learning is effective at identifying structures and patterns in data. The aim of this project is to apply unsupervised learning to the US arrests data. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973, along with the percent of the population living in urban areas. The main objective of the analysis is to group the states into clusters based on these statistics. The model used is hierarchical clustering.
Hierarchical clustering is an alternative to k-means clustering for identifying groups in a dataset. It is an algorithm that groups similar objects into clusters. The endpoint is a set of clusters, where each cluster is distinct from the others and the objects within each cluster are broadly similar to one another. Hierarchical clustering can be divided into two main types: agglomerative and divisive.
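The two approaches can be sketched as follows (a minimal sketch on the scaled USArrests data, using base R's hclust() for the agglomerative approach and cluster::diana() for the divisive one):
# agglomerative: start with every observation in its own cluster and merge upwards
hc_agg <- hclust(dist(scale(USArrests)), method = "complete")
# divisive (DIANA): start with one cluster containing everything and split downwards
hc_div <- cluster::diana(scale(USArrests), metric = "euclidean")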
# Packages
library(readr)
library(dplyr)
library(dendextend)
library(factoextra)
library(cluster)
The data is related to US arrests. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
Format: a data frame with 50 observations on 4 variables.
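A quick look at the first few rows shows one observation per state and the four numeric variables:
head(USArrests)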
Initial exploratory data analysis is shown below. We check for missing values in the summary and then scale the data, since the variables are not all on the same scale.
# EDA and cleaning/transformation
data <- USArrests
summary(data)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
sum(is.na(data))
## [1] 0
data2 <- scale(data)
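As a quick check on the transformation, each scaled column should now have mean 0 and standard deviation 1:
# verify scaling: column means ~ 0, standard deviations = 1
round(colMeans(data2), 3)
apply(data2, 2, sd)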
Determine optimal number of clusters
We use the fviz_nbclust() function to determine the optimal number of clusters using the silhouette, within-cluster sum of squares (wss), and gap statistic (gap_stat) methods.
#Clustering
set.seed(123)
#Determining the number of optimal clusters
#Determining optimal number of Clusters (Cluster Evaluation Method 1)
fviz_nbclust(data2, FUN = hcut, method = "silhouette")
#Determining optimal number of Clusters (Cluster Evaluation Method 2)
fviz_nbclust(data2, FUN = hcut, method = "wss")
#Determining optimal number of Clusters (Cluster Evaluation method 3)
fviz_nbclust(data2, FUN = hcut, method = "gap_stat")
As seen above, the optimal number of clusters suggested by the silhouette method was 2, but the wss and gap statistic methods both suggested 3. Hence, we go with 3 clusters.
We now use the Manhattan distance to create a distance matrix. With Manhattan distance, the silhouette plots obtained had a higher average silhouette coefficient, which is why it is used here.
# calculate the Manhattan distance matrix
data2di <- dist(data2, method = "manhattan")
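The Manhattan distance between two states is simply the sum of the absolute differences across the four scaled variables; a quick sanity check against dist() for the first two rows:
# Manhattan distance = sum of absolute coordinate differences
manual <- sum(abs(data2[1, ] - data2[2, ]))
as.matrix(data2di)[1, 2]  # should equal manual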
Now that we have created our distance matrix, we can build the hierarchical clustering with the optimal number of clusters set to 3. We try three linkage methods: complete, single, and average.
# complete linkage
data2hc <- hclust(data2di, method = "complete")  # build the tree
data2as <- cutree(data2hc, k = 3)                # cut the tree into 3 clusters
dend_data <- as.dendrogram(data2hc)
cc <- color_branches(dend_data, k = 3)           # colour dendrogram branches by cluster
plot(cc)
sil <- silhouette(data2as, data2di)              # silhouette width for each state
fviz_silhouette(sil, palette = "jco", ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 7 0.45
## 2 2 12 0.33
## 3 3 31 0.38
As seen in the plot above, the average silhouette score is higher and only the grey cluster shows a slightly negative score. A negative score denotes that a few observations may not be in the right cluster.
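To see exactly which observations these are, we can list the states with a negative silhouette width (data2as holds the complete-linkage assignments at this point):
# states with a negative silhouette width, i.e. likely assigned to the wrong cluster
names(data2as)[sil[, "sil_width"] < 0]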
#single
data2hc <- hclust(data2di, method = "single")
data2as <- cutree(data2hc, k = 3)
dend_data <- as.dendrogram(data2hc)
cc <- color_branches(dend_data, k=3)
plot(cc)
sil <- silhouette(data2as, data2di)
fviz_silhouette(sil,palette= "jco",ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 48 0.14
## 2 2 1 0.00
## 3 3 1 0.00
As seen in the plot above, the average silhouette score is lower and most of the observations have gone into the blue cluster. There are also strongly negative scores, which denote that a large number of observations are not in the right cluster.
#average
data2hc <- hclust(data2di, method = "average")
data2as <- cutree(data2hc, k = 3)
dend_data <- as.dendrogram(data2hc)
cc <- color_branches(dend_data, k=3)
plot(cc)
sil <- silhouette(data2as, data2di)
fviz_silhouette(sil,palette= "jco",ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 7 0.45
## 2 2 12 0.33
## 3 3 31 0.38
As seen in the plot above, the average silhouette score is again higher and only the grey cluster shows a slightly negative score, denoting a few observations that are likely not in the right cluster.
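To make the comparison explicit, the average silhouette width for each linkage method can be computed in one pass (a minimal sketch using the same Manhattan distance matrix and k = 3):
# average silhouette width per linkage method
sapply(c("complete", "single", "average"), function(m) {
  cl <- cutree(hclust(data2di, method = m), k = 3)
  mean(silhouette(cl, data2di)[, "sil_width"])
})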
As seen in the plots above, Manhattan distance with complete or average linkage gives the widest average silhouette width coefficient and the least amount of negative silhouette score. The overall average silhouette of 0.38 still indicates a fairly weak cluster structure. We create our final clustering with these parameters.
res.hc <- eclust(data2, "hclust",hc_metric = "manhattan",hc_method = "complete",k=3)
fviz_silhouette(res.hc,palette= "jco",ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 7 0.45
## 2 2 12 0.33
## 3 3 31 0.38
We then plot the final cluster visualization using the fviz_cluster() function.
fviz_cluster(res.hc)
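The state-to-cluster assignments behind this plot can also be listed directly (assuming, as factoextra does, that eclust() stores the cut assignments in the cluster component of the result):
# list the states belonging to each of the three final clusters
split(rownames(data2), res.hc$cluster)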
Based on the results from the three hierarchical models above, and by comparing the silhouette score of each model, we conclude that hierarchical clustering groups the states reasonably well when using Manhattan distance with complete or average linkage. Even so, based on the average silhouette width, the hierarchical clustering model is at best a weak fit for this data. We should try other clustering algorithms, such as k-means, to see if we get a better clustering fit.
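As a possible starting point for that follow-up, a k-means fit on the same scaled data could be evaluated with the same silhouette criterion (a minimal sketch, not part of this analysis):
# k-means with 3 centers on the scaled data, compared via average silhouette width
set.seed(123)
km <- kmeans(data2, centers = 3, nstart = 25)
mean(silhouette(km$cluster, dist(data2))[, "sil_width"])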