Machine learning is effective at identifying structures and patterns in data. The aim of this project is to apply unsupervised learning to the US arrests data. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973, along with the percent of the population living in urban areas. The main objective of the analysis is to group the states into clusters based on these statistics. The model used is hierarchical clustering.
Hierarchical clustering is an alternative to k-means clustering for identifying groups in a dataset. It is an algorithm that groups similar objects into clusters. The endpoint is a set of clusters, where each cluster is distinct from the others and the objects within each cluster are broadly similar to one another. Hierarchical clustering can be divided into two main types: agglomerative and divisive.
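The two approaches can be sketched as follows (a minimal sketch on the scaled USArrests data, using base R's hclust() for the agglomerative approach and cluster::diana() for the divisive one):
# agglomerative: start with every observation in its own cluster and merge upwards
hc_agg <- hclust(dist(scale(USArrests)), method = "complete")
# divisive (DIANA): start with one cluster containing everything and split downwards
hc_div <- cluster::diana(scale(USArrests), metric = "euclidean")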
# Packages
library(readr)
library(dplyr)
library(dendextend)
library(factoextra)
library(cluster)
The data is related to US arrests. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
Format: a data frame with 50 observations on 4 variables.
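A quick look at the first few rows shows one observation per state and the four numeric variables:
head(USArrests)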
Initial exploratory data analysis is shown below. We check for missing values in the summary and then scale the data, since the variables are not all on the same scale.
# EDA and cleaning/transformation
data <- USArrests
summary(data)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
sum(is.na(data))
## [1] 0
data2 <- scale(data)
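As a quick check on the transformation, each scaled column should now have mean 0 and standard deviation 1:
# verify scaling: column means ~ 0, standard deviations = 1
round(colMeans(data2), 3)
apply(data2, 2, sd)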
Determine optimal number of clusters
We use the fviz_nbclust() function to determine the optimal number of clusters using the silhouette, within-cluster sum of squares (wss), and gap statistic (gap_stat) methods.
#Clustering
set.seed(123)
#Determining the number of optimal clusters
#Determining optimal number of Clusters (Cluster Evaluation Method 1)
fviz_nbclust(data2, FUN = hcut, method = "silhouette")
#Determining optimal number of Clusters (Cluster Evaluation Method 2)
fviz_nbclust(data2, FUN = hcut, method = "wss")
#Determining optimal number of Clusters (Cluster Evaluation method 3)
fviz_nbclust(data2, FUN = hcut, method = "gap_stat")
As seen above, the optimal number of clusters suggested by the silhouette method was 2, but the wss and gap statistic methods both suggested 3. Hence, we go with 3 clusters.
We now use the Manhattan distance to create a distance matrix. With Manhattan distance, the silhouette plots obtained had a higher average silhouette coefficient, which is why it is used here.
# calculate the Manhattan distance matrix
data2di <- dist(data2, method = "manhattan")
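The Manhattan distance between two states is simply the sum of the absolute differences across the four scaled variables; a quick sanity check against dist() for the first two rows:
# Manhattan distance = sum of absolute coordinate differences
manual <- sum(abs(data2[1, ] - data2[2, ]))
as.matrix(data2di)[1, 2]  # should equal manual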
Now that we have created our distance matrix, we can build the hierarchical clustering with the optimal number of clusters set to 3. We try three linkage methods: complete, single, and average.
# complete linkage
data2hc <- hclust(data2di, method = "complete")  # build the tree
data2as <- cutree(data2hc, k = 3)                # cut the tree into 3 clusters
dend_data <- as.dendrogram(data2hc)
cc <- color_branches(dend_data, k = 3)           # colour dendrogram branches by cluster
plot(cc)
sil <- silhouette(data2as, data2di)              # silhouette width for each state
fviz_silhouette(sil, palette = "jco", ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 7 0.45
## 2 2 12 0.33
## 3 3 31 0.38
As seen in the plot above, the average silhouette score is higher and only the grey cluster shows a slightly negative score. A negative score denotes that a few observations may not be in the right cluster.
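To see exactly which observations these are, we can list the states with a negative silhouette width (data2as holds the complete-linkage assignments at this point):
# states with a negative silhouette width, i.e. likely assigned to the wrong cluster
names(data2as)[sil[, "sil_width"] < 0]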
#single
data2hc <- hclust(data2di, method = "single")
data2as <- cutree(data2hc, k = 3)
dend_data <- as.dendrogram(data2hc)
cc <- color_branches(dend_data, k=3)
plot(cc)
sil <- silhouette(data2as, data2di)
fviz_silhouette(sil,palette= "jco",ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 48 0.14
## 2 2 1 0.00
## 3 3 1 0.00
As seen in the plot above, the average silhouette score is lower and most of the observations have gone into the blue cluster. There are also strongly negative scores, which denote that a large number of observations are not in the right cluster.
#average
data2hc <- hclust(data2di, method = "average")
data2as <- cutree(data2hc, k = 3)
dend_data <- as.dendrogram(data2hc)
cc <- color_branches(dend_data, k=3)
plot(cc)
sil <- silhouette(data2as, data2di)
fviz_silhouette(sil,palette= "jco",ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 7 0.45
## 2 2 12 0.33
## 3 3 31 0.38
As seen in the plot above, the average silhouette score is again higher and only the grey cluster shows a slightly negative score, denoting a few observations that are likely not in the right cluster.
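To make the comparison explicit, the average silhouette width for each linkage method can be computed in one pass (a minimal sketch using the same Manhattan distance matrix and k = 3):
# average silhouette width per linkage method
sapply(c("complete", "single", "average"), function(m) {
  cl <- cutree(hclust(data2di, method = m), k = 3)
  mean(silhouette(cl, data2di)[, "sil_width"])
})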
As seen in the plots above, Manhattan distance with complete or average linkage gives the widest average silhouette width coefficient and the least amount of negative silhouette score. The overall average silhouette of 0.38 still indicates a fairly weak cluster structure. We create our final clustering with these parameters.
res.hc <- eclust(data2, "hclust",hc_metric = "manhattan",hc_method = "complete",k=3)
fviz_silhouette(res.hc,palette= "jco",ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 7 0.45
## 2 2 12 0.33
## 3 3 31 0.38
We then plot the final cluster visualization using the fviz_cluster() function.
fviz_cluster(res.hc)
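The state-to-cluster assignments behind this plot can also be listed directly (assuming, as factoextra does, that eclust() stores the cut assignments in the cluster component of the result):
# list the states belonging to each of the three final clusters
split(rownames(data2), res.hc$cluster)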
Based on the results from the three hierarchical models above, and by comparing the silhouette score of each model, we conclude that hierarchical clustering groups the states reasonably well when using Manhattan distance with complete or average linkage. Even so, based on the average silhouette width, the hierarchical clustering model is at best a weak fit for this data. We should try other clustering algorithms, such as k-means, to see if we get a better clustering fit.
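As a possible starting point for that follow-up, a k-means fit on the same scaled data could be evaluated with the same silhouette criterion (a minimal sketch, not part of this analysis):
# k-means with 3 centers on the scaled data, compared via average silhouette width
set.seed(123)
km <- kmeans(data2, centers = 3, nstart = 25)
mean(silhouette(km$cluster, dist(data2))[, "sil_width"])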