Sushant Gote 17PHD1054
# Install additional packages if required
install.packages("cluster")
install.packages("dendextend")
install.packages("factoextra")
# Load the packages
library('cluster')
library('dendextend')
library('factoextra')
print('Done')
Hierarchical Clustering
Hierarchical clustering can be divided into two main types: agglomerative and divisive.
Agglomerative clustering: It’s also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner: each object is initially considered a single-element cluster (leaf). At each step of the algorithm, the two most similar clusters are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of a single big cluster (root) (see figure below). The result is a tree which can be plotted as a dendrogram.
Divisive hierarchical clustering: It’s also known as DIANA (DIvisive ANAlysis) and it works in a top-down manner, in the inverse order of AGNES. It begins with the root, in which all objects are included in a single cluster. At each iteration, the most heterogeneous cluster is divided into two. The process is iterated until every object is in its own cluster.
Note that agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering is good at identifying large clusters.
The merging or division of clusters is performed according to some (dis)similarity measure. In R, the Euclidean distance is used by default to measure the dissimilarity between each pair of observations.
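As a small illustration of the default mentioned above, the sketch below compares dist() called without a method argument against an explicit Euclidean computation. The two points a and b are made-up values chosen only for this example.

```r
# Two made-up observations in two dimensions (illustrative values only)
x <- rbind(a = c(0, 0), b = c(3, 4))

# dist() with no method argument
d_default <- dist(x)

# dist() with the Euclidean method named explicitly
d_euclid <- dist(x, method = "euclidean")

# The same quantity computed by hand: square root of the summed squared differences
d_manual <- sqrt(sum((x["a", ] - x["b", ])^2))

print(d_default)
print(d_euclid)
print(d_manual)
```

All three values should agree, which is what "Euclidean by default" means in practice.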
It’s easy to compute the dissimilarity between a pair of observations. As mentioned above, the two most similar clusters are fused into a new, bigger cluster.
A natural question is: how do we measure the dissimilarity between two clusters of observations?
A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer this question. The most common methods are:
Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance is merged.
Complete linkage and Ward’s method are generally preferred.
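The linkage definitions above can be made concrete with a minimal sketch that computes the complete, single, average and centroid distances between two tiny clusters by hand. The matrices cluster1 and cluster2 are made-up points, not part of the USArrests analysis, and Ward’s method is left out because it is defined through the change in within-cluster variance rather than a single pairwise summary.

```r
# Two small, made-up clusters of 2-D points (illustrative values only)
cluster1 <- matrix(c(0, 0,
                     1, 0), ncol = 2, byrow = TRUE)
cluster2 <- matrix(c(4, 0,
                     6, 0), ncol = 2, byrow = TRUE)

# All pairwise Euclidean distances, then keep only the cluster1-vs-cluster2 block
n1 <- nrow(cluster1)
all_d <- as.matrix(dist(rbind(cluster1, cluster2)))
between <- all_d[seq_len(n1), -seq_len(n1), drop = FALSE]

linkage <- c(
  complete = max(between),   # largest pairwise dissimilarity
  single   = min(between),   # smallest pairwise dissimilarity
  average  = mean(between),  # mean of all pairwise dissimilarities
  # distance between the two cluster centroids (column means)
  centroid = sqrt(sum((colMeans(cluster1) - colMeans(cluster2))^2))
)
print(linkage)
```

The same pair of clusters gets a different "distance" under each criterion, which is why the choice of linkage changes the shape of the resulting tree.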
Data preparation and descriptive statistics
We use the built-in R dataset USArrests, which contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percent of the population living in urban areas.
It contains 50 observations on 4 variables:
[,1] Murder    numeric  Murder arrests (per 100,000)
[,2] Assault   numeric  Assault arrests (per 100,000)
[,3] UrbanPop  numeric  Percent urban population
[,4] Rape      numeric  Rape arrests (per 100,000)
# Load the data set
data("USArrests")
# Remove any missing value (i.e, NA values for not available)
# That might be present in the data
df <- na.omit(USArrests)
# View the first 6 rows of the data
head(df, n = 6)
Before hierarchical clustering, we can compute some descriptive statistics:
desc_stats <- data.frame(
  Min  = apply(df, 2, min),    # Minimum
  Med  = apply(df, 2, median), # Median
  Mean = apply(df, 2, mean),   # Mean
  SD   = apply(df, 2, sd),     # Standard deviation
  Max  = apply(df, 2, max)     # Maximum
)
desc_stats <- round(desc_stats, 1)
head(desc_stats)
Note that the variables have very different means and variances. This is because they are measured in different units: Murder, Rape, and Assault are measured as the number of occurrences per 100,000 people, and UrbanPop is the percentage of the state’s population that lives in an urban area.
They must be standardized (i.e., scaled) to make them comparable. Recall that standardization consists of transforming the variables such that they have mean zero and standard deviation one.
df <- scale(df)
head(df)
Murder Assault UrbanPop Rape
Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
Arizona 0.07163341 1.4788032 0.9989801 1.042878388
Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144 1.7589234 2.067820292
Colorado 0.02571456 0.3988593 0.8608085 1.864967207
print('Look @ this head')
[1] "Look @ this head"
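The standardization step can be checked by hand. The sketch below recomputes (x - mean) / sd for each column and compares it with the output of scale(); the names raw, manual and scaled are introduced here only for illustration.

```r
# Reload the raw data so this check is self-contained
data("USArrests")
raw <- na.omit(USArrests)

# Manual standardization: subtract each column mean, divide by each column sd
centered <- sweep(as.matrix(raw), 2, colMeans(raw), "-")
manual <- sweep(centered, 2, apply(raw, 2, sd), "/")

# Built-in standardization
scaled <- scale(raw)

# The two results should agree (attributes added by scale() are ignored)
print(all.equal(manual, scaled, check.attributes = FALSE))

# After scaling, each column has mean (numerically) zero and sd one
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```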
R functions for hierarchical clustering
There are different functions available in R for computing hierarchical clustering. The commonly used functions are:
hclust() [in stats package] and agnes() [in cluster package] for agglomerative hierarchical clustering (HC)
diana() [in cluster package] for divisive HC
hclust() function
hclust() is the built-in R function [in stats package] for computing hierarchical clustering.
The simplified format is:
hclust(d, method = "complete")
d: a dissimilarity structure as produced by the dist() function.
method: the agglomeration method to be used. Allowed values are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid".
The dist() function is used to compute the Euclidean distance between observations. The observations are then clustered using Ward’s method.
# Dissimilarity matrix
d <- dist(df, method = "euclidean")
# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )
# Plot the obtained dendrogram
plot(res.hc, cex = 0.6, hang = -1)
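To see what hclust() actually returns, the following sketch inspects the merge history of the fitted tree and cuts it into groups with cutree(). The choice of k = 4 groups is an assumption made purely for illustration, and the names df_scaled, hc and groups are introduced here so the block runs on its own.

```r
# Rebuild the scaled data and the Ward tree so this block is self-contained
data("USArrests")
df_scaled <- scale(na.omit(USArrests))
d <- dist(df_scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# With n observations there are n - 1 merges, each with an associated height
print(nrow(hc$merge))
print(length(hc$height))

# Leaf labels in the left-to-right order used by the dendrogram
print(head(hc$labels[hc$order]))

# Cut the tree into k groups (k = 4 is an arbitrary, illustrative choice)
groups <- cutree(hc, k = 4)
print(table(groups))
```

The vector groups is named by state, so it can be attached to the original data frame for further summaries.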

agnes() and diana() functions
The R function agnes() [in cluster package] can also be used to compute agglomerative hierarchical clustering. The R function diana() [in cluster package] computes divisive hierarchical clustering.
Agglomerative Nesting (Hierarchical Clustering)
agnes(x, metric = "euclidean", stand = FALSE, method = "average")
DIvisive ANAlysis Clustering
diana(x, metric = "euclidean", stand = FALSE)
x: data matrix, data frame or dissimilarity matrix. For a matrix or data frame, rows are observations and columns are variables. For a dissimilarity matrix, x is typically the output of daisy() or dist().
metric: the metric to be used for calculating dissimilarities between observations. Possible values are "euclidean" and "manhattan".
stand: if TRUE, the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column) by subtracting the variable’s mean value and dividing by the variable’s mean absolute deviation.
method: the clustering method. Possible values include "average", "single", "complete" and "ward".
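The stand argument uses the mean absolute deviation rather than the standard deviation, so it is not the same as calling scale() with its defaults. A minimal sketch, assuming the description of stand given above, standardizes the raw data by hand and compares the two fits; mean_abs_dev, x_std, fit_stand and fit_manual are names introduced only for this example.

```r
library("cluster")

data("USArrests")
x <- as.matrix(na.omit(USArrests))

# Mean absolute deviation of a numeric vector around its mean
mean_abs_dev <- function(v) mean(abs(v - mean(v)))

# Manual standardization: subtract the mean, divide by the mean absolute deviation
x_std <- scale(x, center = TRUE, scale = apply(x, 2, mean_abs_dev))

# Let agnes() standardize internally ...
fit_stand <- agnes(x, metric = "euclidean", stand = TRUE, method = "average")
# ... versus passing the manually standardized data
fit_manual <- agnes(x_std, metric = "euclidean", stand = FALSE, method = "average")

# Both fits should give the same agglomerative coefficient and merge heights
print(all.equal(fit_stand$ac, fit_manual$ac))
print(all.equal(sort(fit_stand$height), sort(fit_manual$height)))
```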
The function agnes() returns an object of class “agnes” (see ?agnes.object) which has methods for the functions print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree(). The function diana() returns an object of class “diana” (see ?diana.object) which also has methods for the functions print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().
Compared to other agglomerative clustering functions such as hclust(), agnes() has the following features:
It yields the agglomerative coefficient (see agnes.object), which measures the amount of clustering structure found.
Apart from the usual tree, it also provides the banner, a novel graphical display (see plot.agnes).
library("cluster")
# Compute agnes()
res.agnes <- agnes(df, method = "ward")
# Agglomerative coefficient
res.agnes$ac
[1] 0.934621
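Because the agglomerative coefficient measures the amount of clustering structure found, it can be used to compare linkage methods on the same data. The sketch below is one way to do that, looping over the four method values listed earlier; the names df_scaled, methods and ac are introduced here for illustration.

```r
library("cluster")

# Rebuild the scaled data so this block is self-contained
data("USArrests")
df_scaled <- scale(na.omit(USArrests))

# Linkage methods to compare
methods <- c("average", "single", "complete", "ward")

# Fit agnes() once per method and keep only the agglomerative coefficient
ac <- sapply(methods, function(m) agnes(df_scaled, method = m)$ac)

print(round(ac, 3))
```

The "ward" entry should match the coefficient printed above for res.agnes; values closer to 1 suggest stronger clustering structure.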
# Plot the tree using pltree()
pltree(res.agnes, cex = 0.6, hang = -1, main = "Dendrogram of Agnes")

print('Done')
[1] "Done"
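Since agnes objects can be converted with as.hclust(), the Ward fit from agnes() can be compared with the one from hclust(). The help page for hclust() notes that agnes(*, method = "ward") corresponds to hclust(*, "ward.D2"); the sketch below checks this on the scaled USArrests data. The names used here and the choice of k = 4 groups are assumptions for illustration.

```r
library("cluster")

# Rebuild the scaled data so this block is self-contained
data("USArrests")
df_scaled <- scale(na.omit(USArrests))

# Ward clustering via agnes() and via hclust()
fit_agnes <- agnes(df_scaled, method = "ward")
fit_hclust <- hclust(dist(df_scaled, method = "euclidean"), method = "ward.D2")

# Convert the agnes object so both trees have the same class
agnes_as_hclust <- as.hclust(fit_agnes)

# Compare merge heights (sorted, since merge order may be stored differently)
heights_match <- all.equal(sort(agnes_as_hclust$height), sort(fit_hclust$height))
print(heights_match)

# Compare the partitions obtained by cutting both trees into k = 4 groups
groups_agnes <- cutree(agnes_as_hclust, k = 4)
groups_hclust <- cutree(fit_hclust, k = 4)
cross_tab <- table(groups_agnes, groups_hclust)
print(cross_tab)
```

If the two trees agree, each row of the cross-tabulation has a single non-zero cell.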
It’s also possible to draw the AGNES dendrogram using the function plot.hclust() and the function plot.dendrogram() as follows:
# plot.hclust()
plot(as.hclust(res.agnes), cex = 0.6, hang = -1)

# plot.dendrogram()
plot(as.dendrogram(res.agnes), cex = 0.6,
     horiz = TRUE)

DIANA
