install.packages("cluster")
install?packages("dendextend")
install.packages('factoextra')
#Load the packages :
### Installling addtional Packages if required
install.packages('kohonen')
Warning in install.packages :
unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/kohonen_3.0.8.zip'
Content type 'application/zip' length 2589429 bytes (2.5 MB)
downloaded 2.5 MB
package 'kohonen' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\user\AppData\Local\Temp\RtmpKCKoK5\downloaded_packages
# Load the packages
library('kohonen')
package 'kohonen' was built under R version 3.5.3
library('cluster')
library('dendextend')
library('factoextra')
print('Done')
Hierarchical clustering can be divided into two main types: agglomerative and divisive.
Agglomerative clustering: It’s also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of just one single big cluster (root). The result is a tree which can be plotted as a dendrogram.
Divisive hierarchical clustering: It’s also known as DIANA (DIvisive ANAlysis) and it works in a top-down manner. The algorithm is the inverse order of AGNES. It begins with the root, in which all objects are included in a single cluster. At each step of the iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster.
Note that agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering is good at identifying large clusters.
The merging or the division of clusters is performed according to some (dis)similarity measure. In R software, the Euclidean distance is used by default to measure the dissimilarity between each pair of observations.
It’s easy to compute a dissimilarity measure between a pair of observations. It’s mentioned above that the two clusters that are most similar are fused into a new big cluster. A natural question is: how do we measure the dissimilarity between two clusters of observations?
A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer this question. The most common methods are:
Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as the linkage criterion. It tends to produce long, “loose” clusters.
Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.
Complete linkage and Ward’s method are generally preferred; a short comparison sketch follows below.
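As a quick, purely illustrative comparison (not part of the original analysis), the sketch below fits the same standardized USArrests distance matrix, introduced in the next section, with four linkage methods and plots the resulting trees side by side:
# Illustrative comparison of linkage methods on the same distance matrix
# (USArrests, standardized; the data set is introduced just below).
d_link <- dist(scale(USArrests))
par(mfrow = c(2, 2))
plot(hclust(d_link, method = "complete"), cex = 0.5, main = "Complete linkage")
plot(hclust(d_link, method = "single"),   cex = 0.5, main = "Single linkage")
plot(hclust(d_link, method = "average"),  cex = 0.5, main = "Average linkage")
plot(hclust(d_link, method = "ward.D2"),  cex = 0.5, main = "Ward's method")
par(mfrow = c(1, 1))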
We use the built-in R dataset USArrests, which contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percent of the population living in urban areas.
It contains 50 observations on 4 variables:
[,1] Murder   numeric  Murder arrests (per 100,000)
[,2] Assault  numeric  Assault arrests (per 100,000)
[,3] UrbanPop numeric  Percent urban population
[,4] Rape     numeric  Rape arrests (per 100,000)
# Load the data set
data("USArrests")
# Remove any missing value (i.e, NA values for not available)
# That might be present in the data
df <- na.omit(USArrests)
# View the first 6 rows of the data
head(df, n = 6)
desc_stats <- data.frame(
Min = apply(df, 2, min), # minimum
Med = apply(df, 2, median), # median
Mean = apply(df, 2, mean), # mean
SD = apply(df, 2, sd), # Standard deviation
Max = apply(df, 2, max) # Maximum
)
desc_stats <- round(desc_stats, 1)
head(desc_stats)
Note that the variables have largely different means and variances. This is explained by the fact that the variables are measured in different units; Murder, Rape, and Assault are measured as the number of occurrences per 100,000 people, and UrbanPop is the percentage of the state’s population that lives in an urban area.
They must be standardized (i.e., scaled) to make them comparable. Recall that, standardization consists of transforming the variables such that they have mean zero and standard deviation one.
df <- scale(df)
head(df)
Murder Assault UrbanPop Rape
Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
Arizona 0.07163341 1.4788032 0.9989801 1.042878388
Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144 1.7589234 2.067820292
Colorado 0.02571456 0.3988593 0.8608085 1.864967207
print('Look @ this head')
[1] "Look @ this head"
There are different functions available in R for computing hierarchical clustering. The commonly used functions are:
hclust() [in stats package] and agnes() [in cluster package] for agglomerative hierarchical clustering (HC)
diana() [in cluster package] for divisive HC
#### hclust() function
hclust() is the built-in R function [in stats package] for computing hierarchical clustering.
The simplified format is: hclust(d, method = "complete")
d: a dissimilarity structure as produced by the dist() function.
method: the agglomeration method to be used. Allowed values are one of "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid".
The dist() function is used to compute the Euclidean distance between observations. Finally, observations are clustered using Ward’s method.
The R function agnes() [in cluster package] can also be used to compute agglomerative hierarchical clustering. The R function diana() [in cluster package] is an example of divisive hierarchical clustering.
agnes(x, metric = "euclidean", stand = FALSE, method = "average")
diana(x, metric = "euclidean", stand = FALSE)
x: data matrix, data frame, or dissimilarity matrix. In case of a matrix or data frame, rows are observations and columns are variables. In case of a dissimilarity matrix, x is typically the output of daisy() or dist().
metric: the metric to be used for calculating dissimilarities between observations. Possible values are "euclidean" and "manhattan".
stand: if TRUE, the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.
method: the clustering method. Possible values include "average", "single", "complete", "ward".
The function agnes() returns an object of class "agnes" (see ?agnes.object) which has methods for the functions print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree(). The function diana() returns an object of class "diana" (see ?diana.object) which also has methods for the same functions. Compared to other agglomerative clustering functions such as hclust(), agnes() has the following features:
It yields the agglomerative coefficient (see agnes.object) which measures the amount of clustering structure found.
Apart from the usual tree, it also provides the banner, a novel graphical display (see plot.agnes).
# Dissimilarity matrix
d <- dist(df, method = "euclidean")
# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )
# Plot the obtained dendrogram
plot(res.hc, cex = 0.6, hang = -1)
library("cluster")
# Compute agnes()
res.agnes <- agnes(df, method = "ward")
# Agglomerative coefficient
res.agnes$ac
[1] 0.934621
# Plot the tree using pltree()
pltree(res.agnes, cex = 0.6, hang = -1, main = "Dendrogram of Agnes")
print('Done')
[1] "Done"
It’s also possible to draw the AGNES dendrogram using the functions plot.hclust() and plot.dendrogram(), as follows:
# plot.hclust()
plot(as.hclust(res.agnes), cex = 0.6, hang = -1)
# plot.dendrogram()
plot(as.dendrogram(res.agnes), cex = 0.6,
horiz = TRUE)
# Compute diana()
res.diana <- diana(df)
print('Diana Computed')
[1] "Diana Computed"
# Plot the tree
pltree(res.diana, cex = 0.6, hang = -1,
main = "Dendrogram of Diana")
# Plot the tree as a horizontal dendrogram
plot(as.dendrogram(res.diana), cex = 0.6,
     horiz = TRUE)
# Divise coefficient; amount of clustering structure found
res.diana$dc
[1] 0.8514345
In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.
The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fusion, the less similar the observations are.
Note that conclusions about the proximity of two observations can be drawn only from the height at which the branches containing those two observations are first fused. We cannot use the proximity of two observations along the horizontal axis as a criterion of their similarity.
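To make the point about fusion heights concrete, these heights can be read off programmatically as cophenetic distances. The short sketch below is not part of the original walkthrough; the Alabama/Alaska pair is just an illustrative choice, and it assumes d and res.hc from the code above.
# Cophenetic distances: the height at which each pair of observations is first fused
coph <- cophenetic(res.hc)
# Fusion height for an illustrative pair of states
as.matrix(coph)["Alabama", "Alaska"]
# Correlation between the original distances and the cophenetic distances
# indicates how faithfully the dendrogram preserves the pairwise dissimilarities
cor(d, coph)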
In order to identify sub-groups (i.e., clusters), we cut the dendrogram at a certain height, as described in the next section.
The height of the cut to the dendrogram controls the number of clusters obtained. It plays the same role as the k in k-means clustering.
The function cutree() is used for this; it returns a vector containing the cluster number of each observation.
# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)
# Number of members in each cluster
table(grp)
grp
1 2 3 4
7 12 19 12
# Get the names for the members of cluster 1
rownames(df)[grp == 1]
It’s also possible to draw the dendrogram with a border around the 4 clusters. The argument border is used to specify the border colors for the rectangles:
plot(res.hc, cex = 0.6)
rect.hclust(res.hc, k = 4, border = 2:5)
Using the function fviz_cluster() [in factoextra], we can visualize the result in a scatter plot. Observations are represented by points in the plot, using principal components. A frame is drawn around each cluster.
library(factoextra)
fviz_cluster(list(data = df, cluster = grp))
The function cutree() can also be used to cut the trees generated with agnes() and diana(), as follows:
# Cut agnes() tree into 4 groups
cutree(res.agnes, k = 4)
Alabama Alaska Arizona Arkansas
1 2 2 3
California Colorado Connecticut Delaware
2 2 3 3
Florida Georgia Hawaii Idaho
2 1 3 4
Illinois Indiana Iowa Kansas
2 3 4 3
Kentucky Louisiana Maine Maryland
3 1 4 2
Massachusetts Michigan Minnesota Mississippi
3 2 4 1
Missouri Montana Nebraska Nevada
3 4 4 2
New Hampshire New Jersey New Mexico New York
4 3 2 2
North Carolina North Dakota Ohio Oklahoma
1 4 3 3
Oregon Pennsylvania Rhode Island South Carolina
3 3 3 1
South Dakota Tennessee Texas Utah
4 1 2 3
Vermont Virginia Washington West Virginia
4 3 3 4
Wisconsin Wyoming
4 3
# Cut diana() tree into 4 groups
cutree(as.hclust(res.diana), k = 4)
Alabama Alaska Arizona Arkansas
1 2 2 3
California Colorado Connecticut Delaware
2 2 3 3
Florida Georgia Hawaii Idaho
2 1 3 4
Illinois Indiana Iowa Kansas
2 3 4 3
Kentucky Louisiana Maine Maryland
4 1 4 2
Massachusetts Michigan Minnesota Mississippi
3 2 4 1
Missouri Montana Nebraska Nevada
2 4 4 2
New Hampshire New Jersey New Mexico New York
4 3 2 2
North Carolina North Dakota Ohio Oklahoma
1 4 3 3
Oregon Pennsylvania Rhode Island South Carolina
3 3 3 1
South Dakota Tennessee Texas Utah
4 1 2 3
Vermont Virginia Washington West Virginia
4 3 3 4
Wisconsin Wyoming
4 3
The kohonen package allows us to visualise the count of how many samples are mapped to each node on the map. This metric can be used as a measure of map quality - ideally the sample distribution is relatively uniform. Large values in some map areas suggest that a larger map would be beneficial. Empty nodes indicate that your map size is too big for the number of samples. Aim for at least 5-10 samples per node when choosing the map size.
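As a rough, purely illustrative check of that guideline (not part of the original code), the average number of samples per node for a couple of candidate grid sizes can be computed directly:
# 50 observations spread over candidate grids (average samples per node)
nrow(df) / (5 * 5)   # 5 x 5 grid: 2 samples per node
nrow(df) / (3 * 3)   # 3 x 3 grid: about 5.6 samples per node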
# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process.
data_train_matrix <- as.matrix(scale(df))
# Create the SOM grid - you generally have to specify the size of the
# training grid prior to training the SOM. Hexagonal and rectangular
# topologies are possible. With only 50 observations, a small grid
# (here 5 x 5) keeps several samples per node.
som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")
# Finally, train the SOM; options for the number of iterations,
# the learning rates, and the neighbourhood are available
som_model <- som(data_train_matrix, grid = som_grid)
plot(som_model, type="changes")
# Node count plot
plot(som_model, type="count", main="Node Counts")
print('many samples are mapped to each node on the map. This metric can be used as a measure of map quality')
[1] "many samples are mapped to each node on the map. This metric can be used as a measure of map quality"
Often referred to as the “U-Matrix”, this visualisation shows the distance between each node and its neighbours. Typically viewed with a grayscale palette, areas of low neighbour distance indicate groups of nodes that are similar. Areas with large distances indicate that the nodes are much more dissimilar, and mark natural boundaries between node clusters. The U-Matrix can be used to identify clusters within the SOM map.
#### U-matrix visualisation
plot(som_model, type="dist.neighbours", main = "SOM neighbour distances")
The node weight vectors, or “codes”, are made up of normalised values of the original variables used to generate the SOM. Each node’s weight vector is representative of, and similar to, the samples mapped to that node. By visualising the weight vectors across the map, we can see patterns in the distribution of samples and variables. The default visualisation of the weight vectors is a “fan diagram”, where an individual fan representation of the magnitude of each variable in the weight vector is shown for each node.
# Weight Vector View
plot(som_model, type="codes")
print('Weight vector View')
[1] "Weight vector View"
Heatmaps are perhaps the most important visualisation possible for Self-Organising Maps. The fan-diagram weight-space view above, which tries to show all dimensions on one diagram, is unsuitable for a high-dimensional (>7 variable) SOM. A SOM heatmap allows the visualisation of the distribution of a single variable across the map. Typically, a SOM investigative process involves the creation of multiple heatmaps, and then the comparison of these heatmaps to identify interesting areas on the map. It is important to remember that the individual sample positions do not move from one visualisation to another; the map is simply coloured by different variables. The Kohonen heatmap is created by using the plot type "property" and providing one of the variables from the set of node weights. In this case we visualise the fourth variable, Rape arrests, on the SOM.
# Kohonen Heatmap creation
plot(som_model, type = "property", property = getCodes(som_model)[,4], main=colnames(getCodes(som_model))[4], palette.name=coolBlueHotRed)
Clustering can be performed on the SOM nodes to isolate groups of samples with similar metrics. Manual identification of clusters is completed by exploring the heatmaps for a number of variables and drawing up a “story” about the different areas on the map. An estimate of a suitable number of clusters can be obtained by running a k-means algorithm and looking for an “elbow point” in the plot of the within-cluster sum of squares. The kohonen package documentation shows how a map can be clustered using hierarchical clustering, and the results of the clustering can be visualised using the SOM plot function again, as sketched below.
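A minimal sketch of that workflow, assuming the som_model trained above; the choice of k = 4, the 1:8 range of candidate cluster counts, and the colour palette are illustrative only:
# Estimate a sensible number of clusters from the within-cluster sum of squares
codes <- getCodes(som_model)
wss <- sapply(1:8, function(k) kmeans(codes, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Within-cluster sum of squares")

# Cluster the codebook vectors hierarchically and cut the tree into 4 groups
som_cluster <- cutree(hclust(dist(codes)), k = 4)

# Colour the map by cluster membership and overlay the cluster boundaries
plot(som_model, type = "mapping", bgcol = rainbow(4)[som_cluster],
     main = "SOM nodes clustered with hclust()")
add.cluster.boundaries(som_model, som_cluster)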