Please make sure the following libraries are installed prior to class:
library(tidyverse) # v2.0.0
library(kohonen) # v3.0.10
library(aweSOM) # v2.0.3
library(dslabs) # v0.7.4
Our topic today is based on Chapter 14.4 of The Elements of Statistical Learning (ESL). Self-organizing maps (SOMs) are a type of unsupervised neural network useful for visualizing and clustering high-dimensional data. It works as an unsupervised neural network where we map neurons (nodes) to a 2D grid, which can take the shape of either rectangles or hexagons. SOM works on twp stages: Training and mapping.
To better understand the concepts from SOM, we will use an example with colors to visualize how the clustering works here.
First, let’s generate a data frame of random RGB vectors
sample.size <- 10000
sample.rgb <- as.data.frame(matrix(nrow = sample.size, ncol = 3))
colnames(sample.rgb) <- c('R', 'G', 'B')
sample.rgb$R <- sample(0:255, sample.size, replace = T)
sample.rgb$G <- sample(0:255, sample.size, replace = T)
sample.rgb$B <- sample(0:255, sample.size, replace = T)
grid.size <- ceiling(sample.size ^ (1/2.5))
som.grid <- somgrid(xdim = grid.size, ydim = grid.size, topo = 'hexagonal', toroidal = T)
som.model <- som(data.matrix(sample.rgb), grid = som.grid)
library(kohonen)
## extract some data to make it easier to use
som.events <- som.model$codes[[1]]
som.events.colors <- rgb(som.events[,1], som.events[,2], som.events[,3], maxColorValue = 255)
som.dist <- as.matrix(dist(som.events))
plot(som.model,
type = 'mapping',
bg = som.events.colors[sample.int(length(som.events.colors), size = length(som.events.colors))],
keepMargins = F,
col = NA,
main = '')
plot(som.model,
type = 'mapping',
bg = som.events.colors,
keepMargins = F,
col = NA,
main = '')
clusterMeanDist <- function(clusters){
cluster.means = c()
for(c in unique(clusters)){
temp.members <- which(clusters == c)
if(length(temp.members) > 1){
temp.dist <- som.dist[temp.members,]
temp.dist <- temp.dist[,temp.members]
cluster.means <- append(cluster.means, mean(temp.dist))
}else(cluster.means <- 0)
}
return(mean(cluster.means))
}
library(vegan)
try.k <- 2:100
cluster.dist.eval <- as.data.frame(matrix(ncol = 3, nrow = (length(try.k))))
colnames(cluster.dist.eval) <- c('k', 'kmeans', 'hclust')
for(i in 1:length(try.k)) {
cluster.dist.eval[i, 'k'] <- try.k[i]
cluster.dist.eval[i, 'kmeans'] <- clusterMeanDist(kmeans(som.events, centers = try.k[i], iter.max = 20)$cluster)
cluster.dist.eval[i, 'hclust'] <- clusterMeanDist(cutree(hclust(vegdist(som.events)), k = try.k[i]))
}
plot(cluster.dist.eval[, 'kmeans'] ~ try.k,
type = 'l')
lines(cluster.dist.eval[, 'hclust'] ~ try.k,
col = 'red')
legend('topright',
legend = c('k-means', 'hierarchical'),
col = c('black', 'red'),
lty = c(1, 1))
We will be using the USArrest dataset.
Let’s take USArrests as an example.
# Prepare data
arrests_mat = data.matrix(USArrests)
arrests_train = scale(arrests_mat)
First, we need to initialize the grid. We can use a shortcut \(k = 5 \times \sqrt{n}\) to use the number
of observations n to estimate the number of nodes
k. Can you think of other ways to do that?
n = nrow(arrests_train)
(k = round(5 * sqrt(n)))
## [1] 35
(xdim = round(sqrt(n)))
## [1] 7
(ydim = k / round(sqrt(n)))
## [1] 5
Since our k is rounded to 35, we can factor it to
determine the dimensions of our nodes — in this case, 7 and 5. Use
somInit to build the model. Consult the documentation to
use kohonen::som to train the model and
aweSOMplotto visualise it. What patterns can you
notice?
# Initialize and train SOM
init = somInit(arrests_train, xdim, ydim)
arrests_som = kohonen::som(arrests_train, grid = kohonen::somgrid(xdim, ydim, "hexagonal"),
rlen = 100, alpha = c(0.05, 0.01), radius = c(2.65,-2.65),
dist.fcts = "sumofsquares", init = init)
# Evaluate and visualize
aweSOMplot(som = arrests_som, type ="Circular", data = arrests_mat)
somQuality provides metrics to evaluate the model. What
do these metric mean? How can we interpret the model?
somQuality(arrests_som, arrests_train)
##
## ## Quality measures:
## * Quantization error : 0.1593698
## * (% explained variance) : 95.93
## * Topographic error : 0.08
## * Kaski-Lagus error : 1.255052
##
## ## Number of obs. per map cell:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 2 1 3 2 1 1 3 2 0 1 2 1 3 1 3 1 3 1 0 0 1 1 0 1 1 2
## 27 28 29 30 31 32 33 34 35
## 2 2 2 2 1 0 2 1 1
Quantization error: Average squared distance between the data points and the map’s prototypes to which they are mapped. Lower is better. Percentage of explained variance: Similar to other clustering methods, the share of total variance that is explained by the clustering (equal to 1 minus the ratio of quantization error to total variance). Higher is better. Topographic error: Measures how well the topographic structure of the data is preserved on the map. It is computed as the share of observations for which the best-matching node is not a neighbor of the second-best matching node on the map. Lower is better: 0 indicates excellent topographic representation (all best and second-best matching nodes are neighbors), 1 is the maximum error (best and second-best nodes are never neighbors). Kaski-Lagus error: Combines aspects of the quantization and topographic error. It is the sum of the mean distance between points and their best-matching prototypes, and of the mean geodesic distance (pairwise prototype distances following the SOM grid) between the points and their second-best matching prototype.*
While our observations are within neurons, we can cluster the neurons
into superclasses. Use aweSOMscreeplot to estimate a number
of clusters.
aweSOMscreeplot(som = arrests_som, method = "pam")
From the screeplot we can estimate the number of superclasses to be 2. We can incorporate them into the visualisation.
Use hclust and cutree to perform
hierarchical clustering on the neurons.
superclust_hclust <- hclust(dist(arrests_som$codes[[1]]), "complete")
superclasses_hclust <- cutree(superclust_hclust, 2)
aweSOMplot(som = arrests_som, type ="Circular", data = arrests_mat, superclass = superclasses_hclust)
The clustering is interesting. The graph below shows the distance between neighboring neurons. How does the quality of the model change after incorporating superclasses?
aweSOMplot(som = arrests_som, type ="Hitmap", data = arrests_mat, superclass = superclasses_hclust)
somQuality(arrests_som, arrests_train)
##
## ## Quality measures:
## * Quantization error : 0.1593698
## * (% explained variance) : 95.93
## * Topographic error : 0.08
## * Kaski-Lagus error : 1.255052
##
## ## Number of obs. per map cell:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 2 1 3 2 1 1 3 2 0 1 2 1 3 1 3 1 3 1 0 0 1 1 0 1 1 2
## 27 28 29 30 31 32 33 34 35
## 2 2 2 2 1 0 2 1 1
Change the number of nodes k. What impact does it have on the clustering output and visualization?
Perform clustering using pam() or hierarchical clustering on the SOM vectors. How do the results compare?
How do previous models compare to SOM (PCA, K-means)