R Libraries and Versions

Please make sure the following libraries are installed prior to class:

library(tidyverse)   # v2.0.0
library(kohonen)     # v3.0.10
library(aweSOM)      # v2.0.3
library(dslabs)      # v0.7.4

Introduction

Our topic today is based on Chapter 14.4 of The Elements of Statistical Learning (ESL). Self-organizing maps (SOMs) are a type of unsupervised neural network useful for visualizing and clustering high-dimensional data. It works as an unsupervised neural network where we map neurons (nodes) to a 2D grid, which can take the shape of either rectangles or hexagons. SOM works on twp stages: Training and mapping.

Example with colors

To better understand the concepts from SOM, we will use an example with colors to visualize how the clustering works here.

First, let’s generate a data frame of random RGB vectors

sample.size <- 10000

sample.rgb <- as.data.frame(matrix(nrow = sample.size, ncol = 3))
colnames(sample.rgb) <- c('R', 'G', 'B')

sample.rgb$R <- sample(0:255, sample.size, replace = T)
sample.rgb$G <- sample(0:255, sample.size, replace = T)
sample.rgb$B <- sample(0:255, sample.size, replace = T)

grid.size <- ceiling(sample.size ^ (1/2.5))
som.grid <- somgrid(xdim = grid.size, ydim = grid.size, topo = 'hexagonal', toroidal = T)
som.model <- som(data.matrix(sample.rgb), grid = som.grid)
library(kohonen)

## extract some data to make it easier to use

som.events <- som.model$codes[[1]]
som.events.colors <- rgb(som.events[,1], som.events[,2], som.events[,3], maxColorValue = 255)
som.dist <- as.matrix(dist(som.events))

plot(som.model,
     type = 'mapping',
     bg = som.events.colors[sample.int(length(som.events.colors), size = length(som.events.colors))],
     keepMargins = F,
     col = NA,
     main = '')

plot(som.model,
     type = 'mapping',
     bg = som.events.colors,
     keepMargins = F,
     col = NA,
     main = '')

clusterMeanDist <- function(clusters){
  cluster.means = c()

  for(c in unique(clusters)){
    temp.members <- which(clusters == c)

    if(length(temp.members) > 1){
      temp.dist <- som.dist[temp.members,]
      temp.dist <- temp.dist[,temp.members]
      cluster.means <- append(cluster.means, mean(temp.dist))
    }else(cluster.means <- 0)
  }

  return(mean(cluster.means))

}

library(vegan)

try.k <- 2:100

cluster.dist.eval <- as.data.frame(matrix(ncol = 3, nrow = (length(try.k))))
colnames(cluster.dist.eval) <- c('k', 'kmeans', 'hclust')

for(i in 1:length(try.k)) {
  cluster.dist.eval[i, 'k'] <- try.k[i]
  cluster.dist.eval[i, 'kmeans'] <- clusterMeanDist(kmeans(som.events, centers = try.k[i], iter.max = 20)$cluster)
  cluster.dist.eval[i, 'hclust'] <- clusterMeanDist(cutree(hclust(vegdist(som.events)), k = try.k[i]))
}

plot(cluster.dist.eval[, 'kmeans'] ~ try.k,
     type = 'l')

lines(cluster.dist.eval[, 'hclust'] ~ try.k,
      col = 'red')

legend('topright',
       legend = c('k-means', 'hierarchical'),
       col = c('black', 'red'),
       lty = c(1, 1))

We will be using the USArrest dataset.

Example using USArrest

Let’s take USArrests as an example.

# Prepare data
arrests_mat = data.matrix(USArrests)
arrests_train = scale(arrests_mat)

First, we need to initialize the grid. We can use a shortcut \(k = 5 \times \sqrt{n}\) to use the number of observations n to estimate the number of nodes k. Can you think of other ways to do that?

n = nrow(arrests_train)
(k = round(5 * sqrt(n)))
## [1] 35
(xdim = round(sqrt(n)))
## [1] 7
(ydim = k / round(sqrt(n)))
## [1] 5

Since our k is rounded to 35, we can factor it to determine the dimensions of our nodes — in this case, 7 and 5. Use somInit to build the model. Consult the documentation to use kohonen::som to train the model and aweSOMplotto visualise it. What patterns can you notice?

# Initialize and train SOM

init = somInit(arrests_train, xdim, ydim)
arrests_som = kohonen::som(arrests_train, grid = kohonen::somgrid(xdim, ydim, "hexagonal"), 
                         rlen = 100, alpha = c(0.05, 0.01), radius = c(2.65,-2.65), 
                         dist.fcts = "sumofsquares", init = init)

# Evaluate and visualize

aweSOMplot(som = arrests_som, type ="Circular", data = arrests_mat)

somQuality provides metrics to evaluate the model. What do these metric mean? How can we interpret the model?

somQuality(arrests_som, arrests_train)
## 
## ## Quality measures:
##  * Quantization error     :  0.1593698 
##  * (% explained variance) :  95.93 
##  * Topographic error      :  0.08 
##  * Kaski-Lagus error      :  1.255052 
##  
## ## Number of obs. per map cell:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  2  1  3  2  1  1  3  2  0  1  2  1  3  1  3  1  3  1  0  0  1  1  0  1  1  2 
## 27 28 29 30 31 32 33 34 35 
##  2  2  2  2  1  0  2  1  1

Quantization error: Average squared distance between the data points and the map’s prototypes to which they are mapped. Lower is better. Percentage of explained variance: Similar to other clustering methods, the share of total variance that is explained by the clustering (equal to 1 minus the ratio of quantization error to total variance). Higher is better. Topographic error: Measures how well the topographic structure of the data is preserved on the map. It is computed as the share of observations for which the best-matching node is not a neighbor of the second-best matching node on the map. Lower is better: 0 indicates excellent topographic representation (all best and second-best matching nodes are neighbors), 1 is the maximum error (best and second-best nodes are never neighbors). Kaski-Lagus error: Combines aspects of the quantization and topographic error. It is the sum of the mean distance between points and their best-matching prototypes, and of the mean geodesic distance (pairwise prototype distances following the SOM grid) between the points and their second-best matching prototype.*

While our observations are within neurons, we can cluster the neurons into superclasses. Use aweSOMscreeplot to estimate a number of clusters.

aweSOMscreeplot(som = arrests_som, method = "pam")

From the screeplot we can estimate the number of superclasses to be 2. We can incorporate them into the visualisation.

Use hclust and cutree to perform hierarchical clustering on the neurons.

superclust_hclust <- hclust(dist(arrests_som$codes[[1]]), "complete")
superclasses_hclust <- cutree(superclust_hclust, 2)

aweSOMplot(som = arrests_som, type ="Circular", data = arrests_mat, superclass = superclasses_hclust)

The clustering is interesting. The graph below shows the distance between neighboring neurons. How does the quality of the model change after incorporating superclasses?

aweSOMplot(som = arrests_som, type ="Hitmap", data = arrests_mat, superclass = superclasses_hclust)

somQuality(arrests_som, arrests_train)
## 
## ## Quality measures:
##  * Quantization error     :  0.1593698 
##  * (% explained variance) :  95.93 
##  * Topographic error      :  0.08 
##  * Kaski-Lagus error      :  1.255052 
##  
## ## Number of obs. per map cell:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  2  1  3  2  1  1  3  2  0  1  2  1  3  1  3  1  3  1  0  0  1  1  0  1  1  2 
## 27 28 29 30 31 32 33 34 35 
##  2  2  2  2  1  0  2  1  1

Possible questions

  1. Change the number of nodes k. What impact does it have on the clustering output and visualization?

  2. Perform clustering using pam() or hierarchical clustering on the SOM vectors. How do the results compare?

  3. How do previous models compare to SOM (PCA, K-means)