Clustering is the process of grouping observations with similar features in a large dataset. Data points in the same cluster (subgroup) behave alike; that is, their observations are similar to each other. Conversely, data points from different clusters differ markedly from each other.
head(USArrests, 5)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
print('Dimension of dataset is as follows:')
## [1] "Dimension of dataset is as follows:"
dim(USArrests)
## [1] 50 4
print('Column names in dataset are:')
## [1] "Column names in dataset are:"
colnames(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
library(Hmisc)  # provides describe()
describe(USArrests)
## USArrests
##
## 4 Variables 50 Observations
## --------------------------------------------------------------------------------
## Murder
## n missing distinct Info Mean Gmd .05 .10
## 50 0 43 1 7.788 5.022 2.145 2.560
## .25 .50 .75 .90 .95
## 4.075 7.250 11.250 13.320 15.400
##
## lowest : 0.8 2.1 2.2 2.6 2.7, highest: 13.2 14.4 15.4 16.1 17.4
## --------------------------------------------------------------------------------
## Assault
## n missing distinct Info Mean Gmd .05 .10
## 50 0 45 1 170.8 96.44 50.25 56.90
## .25 .50 .75 .90 .95
## 109.00 159.00 249.00 279.60 297.30
##
## lowest : 45 46 48 53 56, highest: 285 294 300 335 337
## --------------------------------------------------------------------------------
## UrbanPop
## n missing distinct Info Mean Gmd .05 .10
## 50 0 36 0.999 65.54 16.74 44.00 45.00
## .25 .50 .75 .90 .95
## 54.50 66.00 77.75 83.20 86.55
##
## lowest : 32 39 44 45 48, highest: 85 86 87 89 91
## --------------------------------------------------------------------------------
## Rape
## n missing distinct Info Mean Gmd .05 .10
## 50 0 48 1 21.23 10.48 8.75 10.67
## .25 .50 .75 .90 .95
## 15.08 20.10 26.17 32.40 39.74
##
## lowest : 7.3 7.8 8.3 9.3 9.5, highest: 35.1 38.7 40.6 44.5 46.0
## --------------------------------------------------------------------------------
# checking for missing values
sum(is.na(USArrests))
## [1] 0
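# is.null() tests whether the whole object is NULL, not individual cells, so it always gives 0 here
sum(is.null(USArrests))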
## [1] 0
Linkage: Linkage defines how the distance between two clusters that contain multiple data points is calculated. The linkage methods are as follows:
* Complete: largest distance between elements of the two clusters
* Single: smallest distance between elements of the two clusters
* Average: average dissimilarity between all pairs of elements of the two clusters
* Centroid: dissimilarity between the centroids of the two clusters
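The four definitions above can be checked by hand on a small example. The sketch below picks two hypothetical groups of states (the groupings `cluster_a` and `cluster_b` are chosen only for illustration and do not come from the analysis) and computes each linkage distance directly from the pairwise Euclidean distances.

```r
# Two small, hypothetical clusters of states taken from USArrests
cluster_a <- c("Alabama", "Alaska")
cluster_b <- c("Iowa", "Maine", "Vermont")

# Full pairwise Euclidean distance matrix, then keep only the A-to-B block
d <- as.matrix(dist(USArrests, method = "euclidean"))
between <- d[cluster_a, cluster_b]

complete_link <- max(between)   # largest pairwise distance
single_link   <- min(between)   # smallest pairwise distance
average_link  <- mean(between)  # mean of all pairwise distances

# Centroid linkage: distance between the two cluster mean vectors
centroid_a <- colMeans(USArrests[cluster_a, ])
centroid_b <- colMeans(USArrests[cluster_b, ])
centroid_link <- sqrt(sum((centroid_a - centroid_b)^2))

round(c(complete = complete_link, single = single_link,
        average = average_link, centroid = centroid_link), 2)
```

By construction the single-linkage distance can never exceed the average, and the average can never exceed the complete-linkage distance, which is one reason the three dendrograms merge clusters at different heights.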
h_dist <- dist(USArrests,method = "euclidean")
hc.complete <- hclust(h_dist, method = "complete")
hc.average <- hclust(h_dist, method = "average")
hc.single <- hclust(h_dist, method = "single")
plot(hc.complete, main = "Complete Linkage",
xlab = "", sub = "", cex = .9)
plot(hc.average, main = "Average Linkage",
xlab = "", sub = "", cex = .9)
plot(hc.single, main = "Single Linkage",
xlab = "", sub = "", cex = .9)
The dendrograms above show the states clustered hierarchically on Euclidean distance, using complete, average, and single linkage respectively.
In this step, we cut the complete-linkage tree into 3 clusters using the cutree() function. The states fall into the 3 clusters as follows:
* Cluster 1 (16 states): Nevada, Michigan, New York, Illinois, California, New Mexico, Arizona, Maryland, South Carolina, Mississippi, Alaska, Louisiana, Alabama, Delaware, North Carolina, Florida.
* Cluster 2 (14 states): New Jersey, Massachusetts, Washington, Virginia, Oklahoma, Oregon, Wyoming, Rhode Island, Texas, Colorado, Georgia, Tennessee, Arkansas, Missouri.
* Cluster 3 (20 states): Vermont, North Dakota, South Dakota, Maine, West Virginia, New Hampshire, Iowa, Wisconsin, Minnesota, Hawaii, Kansas, Indiana, Idaho, Montana, Kentucky, Nebraska, Pennsylvania, Connecticut, Utah, Ohio.
hc_comp_tree <- cutree(hc.complete, 3)
hc_comp_tree
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
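Reading cluster sizes off the printed vector is error-prone, so it can help to tabulate them. The sketch below rebuilds the same complete-linkage tree under its own names (`hc_complete` and `clusters` are introduced here only so the block is self-contained) and lists the count and the member states for each cluster label.

```r
# Rebuild the 3-cluster assignment from complete linkage on the raw data
hc_complete <- hclust(dist(USArrests, method = "euclidean"), method = "complete")
clusters <- cutree(hc_complete, k = 3)

# Number of states per cluster label
table(clusters)

# State names grouped by cluster label
split(names(clusters), clusters)
```

The labels 1, 2, 3 are assigned by cutree() in order of first appearance in the data, so they carry no meaning beyond identifying the groups.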
library(cluster)  # provides clusplot()
clusplot(x = USArrests,
clus = hc_comp_tree,
lines = 0,
shade = FALSE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = "US Arrests States Clusters")
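The variables are measured on very different scales (Assault ranges from 45 to 337, Murder from 0.8 to 17.4), so Euclidean distance on the raw data is dominated by Assault. To give each variable equal weight, the dataset is standardized: each variable is scaled to have a mean of 0 and a standard deviation of 1.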
# Scaling
df_us <- scale(USArrests)
sd(df_us)
## [1] 0.9924337
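Calling sd() on the whole scaled matrix pools all 200 values into one vector, which is why the result is slightly below 1: each of the 4 columns contributes a sum of squares of 49, so the pooled variance is 196/199. A per-column check is more informative. The sketch below (the names `x`, `scaled`, and `manual` are introduced only for this illustration) verifies the column-wise means and standard deviations and reproduces scale() by hand.

```r
x <- as.matrix(USArrests)
scaled <- scale(x)

# Manual standardization: subtract the column mean, divide by the column sd
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# Per-column means are 0 (up to rounding) and per-column sds are 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# scale() and the manual computation agree
all.equal(manual, scaled, check.attributes = FALSE)

# Pooled sd over all 200 values equals sqrt(196 / 199)
c(pooled = sd(scaled), expected = sqrt(196 / 199))
```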
h_scale_dist <- dist(df_us,method = "euclidean")
hc_scale.complete <- hclust(h_scale_dist, method = "complete")
hc_scale.average <- hclust(h_scale_dist, method = "average")
hc_scale.single <- hclust(h_scale_dist, method = "single")
plot(hc_scale.complete, main = "Complete Linkage - After Scaling",
xlab = "", sub = "", cex = .9)
plot(hc_scale.average, main = "Average Linkage - After Scaling",
xlab = "", sub = "", cex = .9)
plot(hc_scale.single, main = "Single Linkage - After Scaling",
xlab = "", sub = "", cex = .9)
hc_Scale_comp_tree <- cutree(hc_scale.complete, 3)
hc_Scale_comp_tree
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
clusplot(x = df_us,
clus = hc_Scale_comp_tree,
lines = 0,
shade = FALSE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = "US Arrests States Clusters After Scaling")
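The sd(df_us) output above is close to 1 (about 0.992) because it pools all 200 scaled values into a single vector; each column individually has a standard deviation of exactly 1. Comparing the cluster plots before and after scaling shows how the cluster assignments change once every variable contributes on the same scale. The two plots are repeated below for comparison.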
clusplot(x = USArrests,
clus = hc_comp_tree,
lines = 0,
shade = FALSE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = "US Arrests States Clusters before scaling")
clusplot(x = df_us,
clus = hc_Scale_comp_tree,
lines = 0,
shade = FALSE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = "US Arrests States Clusters After Scaling")
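The change in cluster membership can also be quantified rather than read off the plots. Because cluster labels are arbitrary, comparing label numbers directly is misleading; a cross-tabulation of the two assignments is a safer sketch. The names `raw_clusters` and `scaled_clusters` below are introduced only so the block runs on its own.

```r
# 3-cluster complete-linkage assignments on raw and on standardized data
raw_clusters    <- cutree(hclust(dist(USArrests), method = "complete"), k = 3)
scaled_clusters <- cutree(hclust(dist(scale(USArrests)), method = "complete"), k = 3)

# Rows: clusters from raw data; columns: clusters from scaled data
cross <- table(raw = raw_clusters, scaled = scaled_clusters)
cross

# Cluster sizes after scaling
table(scaled_clusters)
```

Each row of the cross-tabulation shows how one raw-data cluster is split across the scaled-data clusters; a row with more than one non-zero cell means that group of states was broken up by scaling.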