This file adds to r4ds/extra/clustering.Rmd. Modified by Daisuke Adachi.

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ purrr   0.2.5  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.1       ✔ stringr 1.3.1  
## ✔ readr   1.1.1       ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Clusters

Clustering algorithms are automated tools that seek out clusters in n-dimensional space for you. Base R provides two easy-to-use clustering algorithms: hierarchical clustering and k-means clustering.

Hierarchical clustering

Hierarchical clustering uses a simple algorithm to locate groups of points that are near each other in n-dimensional space:

  1. Identify the two points that are closest to each other
  2. Combine these points into a cluster
  3. Treat the new cluster as a point
  4. Repeat until all of the points are grouped into a single cluster

You can visualize the results of the algorithm as a dendrogram, and you can use the dendrogram to divide your data into any number of clusters. The figure below demonstrates how the algorithm would proceed in a two-dimensional dataset.

To use hierarchical clustering in R, begin by selecting the numeric columns from your data; you can only apply hierarchical clustering to numeric data. Then apply the dist() function to the data and pass the results to hclust(). dist() computes the distances between your points in the n-dimensional space defined by your numeric vectors. hclust() performs the clustering algorithm.

small_iris <- sample_n(iris, 50)
  
iris_hclust <- small_iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% 
  dist() %>% 
  hclust(method = "complete")
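
If you are curious about the individual steps listed above, the hclust object records them: its merge component lists which observations or earlier clusters were combined at each step, and its height component gives the distance at which each merge happened. A quick way to peek at them:

# each row of $merge is one step of the algorithm: negative entries are
# original observations, positive entries are clusters formed at earlier steps
head(iris_hclust$merge)

# $height gives the distance at which each of those merges happened
head(iris_hclust$height)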

Here Daisuke examines the distance matrix a bit more closely. The output of the dist() function is an object of class dist.

# naming this object "dist" works, but a more distinctive name (e.g. iris_dist)
# would avoid confusion with the dist() function itself
dist <- small_iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% 
  dist()
str(dist)
##  'dist' num [1:1225] 0.346 4.158 3.184 1.952 3.65 ...
##  - attr(*, "Size")= int 50
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
##  - attr(*, "method")= chr "euclidean"
##  - attr(*, "call")= language dist(x = .)

Since I’d like to calculate the distance matrix myself, it would be great if I could convert a plain matrix into a dist object by hand.

dist.mat <- dist %>% as.matrix()
str(dist.mat)
##  num [1:50, 1:50] 0 0.346 4.158 3.184 1.952 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:50] "1" "2" "3" "4" ...
##   ..$ : chr [1:50] "1" "2" "3" "4" ...
dist.mat.as.dist <- dist.mat %>% as.dist()
str(dist.mat.as.dist)
##  'dist' num [1:1225] 0.346 4.158 3.184 1.952 3.65 ...
##  - attr(*, "Labels")= chr [1:50] "1" "2" "3" "4" ...
##  - attr(*, "Size")= int 50
##  - attr(*, "call")= language as.dist.default(m = .)
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
dist.mat.as.dist %>% hclust(method = "average")
## 
## Call:
## hclust(d = ., method = "average")
## 
## Cluster method   : average 
## Number of objects: 50
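
As a sketch of that do-it-yourself workflow, you can build the distance matrix with ordinary R code, convert it with as.dist(), and cluster as before. The Manhattan metric, the explicit loops, and the object names here are just for illustration:

iris_mat <- small_iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% 
  as.matrix()

# compute pairwise Manhattan (L1) distances by hand, row by row
n <- nrow(iris_mat)
manual_dist <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    manual_dist[i, j] <- sum(abs(iris_mat[i, ] - iris_mat[j, ]))
  }
}

# convert the plain matrix to a dist object and cluster it
manual_dist %>% as.dist() %>% hclust(method = "average")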

Use plot() to visualize the results as a dendrogram. Each observation in the dataset will appear at the bottom of the dendrogram labeled by its row name. You can use the labels argument to set the labels to something more informative.

plot(iris_hclust, labels = small_iris$Species)

To see how near two data points are to each other, trace the paths of the data points up through the tree until they intersect. The y value of the intersection displays how far apart the points are in n-dimensional space. Points that are close to each other will intersect at a small y value; points that are far from each other will intersect at a large y value. Groups of points that are near each other will look like “leaves” that all grow on the same “branch.” The ordering of the x axis in the dendrogram is somewhat arbitrary (think of the tree as a mobile: each horizontal branch can spin around meaninglessly).
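
You can also read those intersection heights off numerically: the base stats function cophenetic() returns, for every pair of observations, the height at which their paths first meet in the tree. A minimal sketch, reusing the objects created above:

# pairwise merge heights ("cophenetic distances") for the tree above
coph <- cophenetic(iris_hclust)

# the cophenetic correlation: how closely the merge heights track the
# original Euclidean distances stored in the dist object computed earlier
cor(coph, dist)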

You can split your data into any number of clusters by drawing a horizontal line across the tree. Each vertical branch that the line crosses will represent a cluster that contains all of the points downstream from the branch. Move the line up the y axis to intersect fewer branches (and create fewer clusters); move the line down the y axis to intersect more branches (and create more clusters).

cutree() provides a useful way to split data points into clusters. Give cutree() the output of hclust() as well as the number of clusters that you want to split the data into. cutree() will return a vector of cluster labels for your dataset. To visualize the results, map the output of cutree() to an aesthetic.

(clusters <- cutree(iris_hclust, 3))
##  [1] 1 1 2 2 1 2 2 3 1 2 2 2 3 1 1 1 2 2 2 2 3 1 2 1 1 3 2 2 2 2 1 3 2 2 2
## [36] 1 1 1 2 1 3 1 2 2 3 3 1 1 1 2
ggplot(small_iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = factor(clusters)))
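
cutree() can also express the horizontal-line idea directly: instead of asking for a fixed number of clusters with k, pass a cut height with the h argument and it returns whichever clusters lie below that line. For example (the height of 3 is an arbitrary choice):

# cut the dendrogram at height 3 rather than asking for k clusters
clusters_by_height <- cutree(iris_hclust, h = 3)
table(clusters_by_height)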

You can modify the hierarchical clustering algorithm by setting the method argument of hclust to one of “complete”, “single”, “average”, or “centroid”. The method determines how to measure the distance between two clusters or a lone point and a cluster, a measurement that affects the outcome of the algorithm.

  • complete - Measures the greatest distance between any two points in the separate clusters. Tends to create distinct clusters and subclusters.

  • single - Measures the smallest distance between any two points in the separate clusters. Tends to add points one at a time to existing clusters, creating ambiguously defined clusters.

  • average - Measures the average distance between all combinations of points in the separate clusters. Tends to add points one at a time to existing clusters.

  • centroid - Measures the distance between the average location of the points in each cluster.

small_iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% 
  dist() %>% 
  hclust(method = "single") %>% 
  plot(labels = small_iris$Species)
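
To get a feel for how much the linkage choice matters on the same data, you can cluster once per method and cross-tabulate the resulting labels. A rough sketch (the three-cluster cut and the pair of methods compared are arbitrary choices):

iris_dist <- small_iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% 
  dist()

# three-cluster labels under each linkage method
labels_by_method <- sapply(
  c("complete", "single", "average", "centroid"),
  function(m) cutree(hclust(iris_dist, method = m), k = 3)
)

# where do complete and single linkage agree on the grouping?
table(complete = labels_by_method[, "complete"],
      single   = labels_by_method[, "single"])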

Asking questions about clustering

Ask the same questions about clusters that you find with hclust() and kmeans() that you would ask about clusters that you find with a graph. Ask yourself:

  • Do the clusters seem to identify real differences between your points? How can you tell?

  • Are the points within each cluster similar in some way?

  • Are the points in separate clusters different in some way?

  • Might there be a mismatch between the number of clusters that you found and the number that exist in real life? Are only a couple of the clusters meaningful? Are there more clusters in the data than you found?

  • How stable are the clusters if you rerun the algorithm?

Keep in mind that both algorithms will always return a set of clusters, whether or not your data appears clustered. As a result, you should always be skeptical about the results. They can be quite insightful, but there is no reason to treat them as fact without doing further research.
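
One concrete way to probe the stability question above is to draw a fresh random sample, repeat the clustering, and check whether the clusters still line up with a known grouping such as Species. A rough sketch, reusing the pipeline from earlier (hclust itself is deterministic, so any instability comes from the sample):

other_iris <- sample_n(iris, 50)

other_clusters <- other_iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% 
  dist() %>% 
  hclust(method = "complete") %>% 
  cutree(k = 3)

# if this table shifts a lot from sample to sample, the clusters are not very stable
table(other_clusters, other_iris$Species)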