Basic - Hierarchical clustering on US Arrests

Introduction

Clustering is the process of grouping observations with similar features in a large dataset. Observations in the same cluster (subgroup) are similar to each other, while observations in different clusters differ markedly from each other.

Hierarchical Clustering:

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.
Agglomerative clustering, the approach implemented by R's hclust() function, works bottom-up: each observation starts in its own cluster, and clusters are successively merged together.
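
The bottom-up merging described above can be seen on a tiny example. This is a minimal sketch on made-up data: the vector x and the object hc_toy are hypothetical names used only for illustration and are not part of the analysis that follows.

```r
# Toy data (hypothetical): five points on a line, labelled a to e
x <- c(a = 1, b = 2, c = 6, d = 7, e = 15)

# Agglomerative clustering: every point starts as its own cluster
hc_toy <- hclust(dist(x), method = "complete")

# Each row of $merge is one merge step; negative entries are single
# observations, positive entries refer to clusters formed in earlier rows
hc_toy$merge

# Heights (distances) at which the merges happened
hc_toy$height

# Cutting the tree into two groups separates the outlying point e
cutree(hc_toy, k = 2)
```

With n observations there are always n - 1 merge steps, so the tree for five points has four rows in $merge.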

Understanding the dataset

head(USArrests, 5)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6

summary(USArrests)

##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

print('Dimension of the dataset is as follows:')

## [1] "Dimension of the dataset is as follows:"

dim(USArrests)

## [1] 50  4

print('Column names in the dataset are:')

## [1] "Column names in the dataset are:"

colnames(USArrests)

## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

library(Hmisc)  # describe() comes from the Hmisc package
describe(USArrests)

## USArrests 
## 
##  4  Variables      50  Observations
## --------------------------------------------------------------------------------
## Murder 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       50        0       43        1    7.788    5.022    2.145    2.560 
##      .25      .50      .75      .90      .95 
##    4.075    7.250   11.250   13.320   15.400 
## 
## lowest :  0.8  2.1  2.2  2.6  2.7, highest: 13.2 14.4 15.4 16.1 17.4
## --------------------------------------------------------------------------------
## Assault 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       50        0       45        1    170.8    96.44    50.25    56.90 
##      .25      .50      .75      .90      .95 
##   109.00   159.00   249.00   279.60   297.30 
## 
## lowest :  45  46  48  53  56, highest: 285 294 300 335 337
## --------------------------------------------------------------------------------
## UrbanPop 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       50        0       36    0.999    65.54    16.74    44.00    45.00 
##      .25      .50      .75      .90      .95 
##    54.50    66.00    77.75    83.20    86.55 
## 
## lowest : 32 39 44 45 48, highest: 85 86 87 89 91
## --------------------------------------------------------------------------------
## Rape 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       50        0       48        1    21.23    10.48     8.75    10.67 
##      .25      .50      .75      .90      .95 
##    15.08    20.10    26.17    32.40    39.74 
## 
## lowest :  7.3  7.8  8.3  9.3  9.5, highest: 35.1 38.7 40.6 44.5 46.0
## --------------------------------------------------------------------------------

# checking for missing values
sum(is.na(USArrests))

## [1] 0

# is.null() tests whether the object itself is NULL, not its individual cells
sum(is.null(USArrests))

## [1] 0
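
The total above does not say where missing values would be if there were any. A short sketch of a per-column check, using only base R functions (the name na_per_column is introduced here for illustration):

```r
# Count missing values per column rather than for the whole table
na_per_column <- colSums(is.na(USArrests))
na_per_column

# TRUE only if every row is free of NA values
all(complete.cases(USArrests))
```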

Hierarchical clustering with complete linkage and Euclidean distance

Linkage: Linkage defines how the distance between two clusters containing multiple data points is calculated. Common linkage methods are:

* Complete: largest distance between elements of the two clusters
* Single: smallest distance between elements of the two clusters
* Average: average dissimilarity between all pairs of elements from the two clusters
* Centroid: dissimilarity between the centroids of the two clusters
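
To make these definitions concrete, the four distances can be computed by hand for two small clusters. This is an illustrative sketch on made-up points; cluster_a, cluster_b and the other names are hypothetical and not part of the USArrests analysis.

```r
# Two small hypothetical clusters of 2-D points
cluster_a <- rbind(c(0, 0), c(1, 0))
cluster_b <- rbind(c(4, 0), c(6, 0))

# All pairwise Euclidean distances between members of A and members of B
n_a <- nrow(cluster_a)
all_d <- as.matrix(dist(rbind(cluster_a, cluster_b)))
between <- all_d[seq_len(n_a), -seq_len(n_a)]

complete_d <- max(between)    # largest pairwise distance
single_d   <- min(between)    # smallest pairwise distance
average_d  <- mean(between)   # mean of all pairwise distances
# Distance between the two cluster centroids
centroid_d <- sqrt(sum((colMeans(cluster_a) - colMeans(cluster_b))^2))

c(complete = complete_d, single = single_d,
  average = average_d, centroid = centroid_d)
```

Here the pairwise distances are 4, 6, 3 and 5, so complete linkage reports 6, single linkage 3, and both average and centroid linkage 4.5.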

h_dist <- dist(USArrests,method = "euclidean")
hc.complete <- hclust(h_dist, method = "complete")
hc.average <- hclust(h_dist, method = "average")
hc.single <- hclust(h_dist, method = "single")


plot(hc.complete, main = "Complete Linkage",
    xlab = "", sub = "", cex = .9)

plot(hc.average, main = "Average Linkage",
     xlab = "", sub = "", cex = .9)

plot(hc.single, main = "Single Linkage",
     xlab = "", sub = "", cex = .9)

The plots above show the states clustered hierarchically using Euclidean distance with complete, average, and single linkage.
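
The dendrograms can also be compared numerically. The sketch below rebuilds the three trees in a list (the names d, linkages and trees are introduced here for illustration) and looks at the heights of the final merges and at the group sizes a three-way cut would produce.

```r
d <- dist(USArrests, method = "euclidean")
linkages <- c("complete", "average", "single")
trees <- lapply(linkages, function(m) hclust(d, method = m))
names(trees) <- linkages

# Height of the last three merges under each linkage
sapply(trees, function(tr) tail(tr$height, 3))

# Cluster sizes when each tree is cut into three groups
lapply(trees, function(tr) table(cutree(tr, k = 3)))
```

Under complete linkage the height of the final merge equals the largest pairwise distance in the data, which is why that dendrogram is the tallest of the three.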

Clustering

In this step, we cut the complete-linkage tree into three clusters using the cutree() function. The states fall into the three clusters as follows:

* Cluster 1 (16 states): Nevada, Michigan, New York, Illinois, California, New Mexico, Arizona, Maryland, South Carolina, Mississippi, Alaska, Louisiana, Alabama, Delaware, North Carolina, Florida.
* Cluster 2 (14 states): New Jersey, Massachusetts, Washington, Virginia, Oklahoma, Oregon, Wyoming, Rhode Island, Texas, Colorado, Georgia, Tennessee, Arkansas, Missouri.
* Cluster 3 (20 states): Vermont, North Dakota, South Dakota, Maine, West Virginia, New Hampshire, Iowa, Wisconsin, Minnesota, Hawaii, Kansas, Indiana, Idaho, Montana, Kentucky, Nebraska, Pennsylvania, Connecticut, Utah, Ohio.

hc_comp_tree <- cutree(hc.complete, k = 3)

hc_comp_tree

##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
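
Rather than reading the membership off the printed vector, the cluster sizes and members can be tabulated directly. A minimal sketch that rebuilds the same tree so it runs on its own (hc and clusters are names introduced here for illustration):

```r
# Recreate the complete-linkage tree and cut it into three clusters
hc <- hclust(dist(USArrests, method = "euclidean"), method = "complete")
clusters <- cutree(hc, k = 3)

# Number of states per cluster
table(clusters)

# State names grouped by cluster label
split(names(clusters), clusters)

# Average of each variable within each cluster
aggregate(USArrests, by = list(cluster = clusters), FUN = mean)
```

The per-cluster means give a quick profile of each group, for example which cluster has the highest average assault rate.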

library(cluster)  # clusplot() comes from the cluster package
clusplot(x = USArrests,
         clus = hc_comp_tree,
         lines = 0,
         shade = FALSE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = "US Arrests States Clusters")

Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one

The dataset is standardized, i.e. each variable is scaled to have a mean of 0 and a standard deviation of 1.

# Scaling
df_us <- scale(USArrests)
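# sd() treats the whole matrix as one vector of 200 values, so the result is
# slightly below 1; each individual column has standard deviation exactly 1
sd(df_us)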

## [1] 0.9924337
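
To confirm that standardization worked column by column, the mean and standard deviation can be checked per variable. A small sketch using base R only (the name scaled is introduced here for illustration):

```r
scaled <- scale(USArrests)

# Per-column checks: means are (numerically) 0 and standard deviations are 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# Pooling all 200 values gives a slightly smaller figure, because the
# pooled variance is 4 * 49 / 199 rather than 1
sd(scaled)
sqrt(4 * 49 / 199)
```

Each of the four columns contributes a sum of squares of 49 (that is, n - 1 for 50 states), and the pooled calculation divides the total by 199, which explains the value of about 0.992.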

h_scale_dist <- dist(df_us,method = "euclidean")
hc_scale.complete <- hclust(h_scale_dist, method = "complete")
hc_scale.average <- hclust(h_scale_dist, method = "average")
hc_scale.single <- hclust(h_scale_dist, method = "single")

plot(hc_scale.complete, main = "Complete Linkage - After Scaling",
    xlab = "", sub = "", cex = .9)

plot(hc_scale.average, main = "Average Linkage - After Scaling",
     xlab = "", sub = "", cex = .9)

plot(hc_scale.single, main = "Single Linkage - After Scaling",
     xlab = "", sub = "", cex = .9)

hc_Scale_comp_tree <- cutree(hc_scale.complete, 3)

hc_Scale_comp_tree

##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              2              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              3              1              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              2              3              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              3              3              3

clusplot(x = df_us,
         clus = hc_Scale_comp_tree,
         lines = 0,
         shade = FALSE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = "US Arrests States Clusters After Scaling")

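In the output above, sd(df_us) returns about 0.992 rather than exactly 1 because it pools all 200 scaled values into a single vector; each individual column has a standard deviation of exactly 1. Comparing this plot with the unscaled one shows how the cluster assignments change after scaling.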

Effect of Scaling on the hierarchical clustering obtained

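Scaling changes the clustering noticeably. On the original scale, Assault has by far the largest spread (45 to 337), so it dominates the Euclidean distances and largely determines the clusters. After standardization every variable contributes equally, which matters here because UrbanPop is a percentage while the other three variables are arrest rates per 100,000 residents.
In the cluster plots, the groups overlap before scaling, while after scaling each state's cluster can be identified without ambiguity. Standardizing the variables is therefore advisable before clustering, so that the result reflects the structure of all variables rather than the one measured on the largest scale.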

clusplot(x = USArrests,
         clus = hc_comp_tree,
         lines = 0,
         shade = FALSE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = "US Arrests States Clusters before scaling")

clusplot(x = df_us,
         clus = hc_Scale_comp_tree,
         lines = 0,
         shade = FALSE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = "US Arrests States Clusters After Scaling")
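
The effect of scaling discussed in this section can also be checked numerically. The sketch below, which uses names introduced here for illustration (raw_var, before, after), shows how unevenly the variance is spread across the unscaled variables and cross-tabulates the two three-cluster solutions.

```r
# Variance of each variable on its original scale
raw_var <- apply(USArrests, 2, var)
round(raw_var, 1)

# Share of the total variance carried by each variable before scaling
round(raw_var / sum(raw_var), 3)

# Compare three-cluster assignments before and after scaling
before <- cutree(hclust(dist(USArrests), method = "complete"), k = 3)
after  <- cutree(hclust(dist(scale(USArrests)), method = "complete"), k = 3)
table(before = before, after = after)
```

Cluster labels are arbitrary, so the cross-table is read by looking at how the states of each unscaled cluster are distributed over the scaled clusters, not by comparing label numbers.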