These notes are primarily taken from the Cluster Analysis in R and Unsupervised Learning in R DataCamp courses.
Unsupervised machine learning searches for structure in unlabeled data (data without a response variable). The goal of unsupervised learning is to find homogeneous subgroups (clusters) and to find patterns (usually through dimensionality reduction). Examples of unsupervised machine learning include k-means clustering, hierarchical cluster analysis (HCA), and principal component analysis (PCA). Unsupervised machine learning often has no single goal beyond gaining insight.
Supervised machine learning makes predictions from labeled data. Supervised learning uses regression for quantitative outcomes and classification for qualitative outcomes; examples include decision trees, random forests, and lasso regression. Reinforcement machine learning is a third type of learning, in which the machine learns by operating in an environment.
Central to clustering is the concept of distance. Two observations are similar if the distance between their features is relatively small. There are many ways to define distance (see the options in the Distance Matrix Computation (dist) documentation), but the two most common are Euclidean and binary.
Euclidean distance is the distance between two quantitative vectors: \(d = \sqrt{\sum{(x_i - y_i)^2}}\). Binary distance, also called the Jaccard distance (see the Wikipedia article), is the distance between two binary vectors: 1 minus the proportion of shared features.
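To make the two definitions concrete, here is a small sketch with made-up vectors (the numbers are illustrative only):

# Euclidean distance between two quantitative vectors
x <- rbind(c(1, 2, 3),
           c(4, 6, 3))
dist(x, method = "euclidean")   # sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5

# Binary (Jaccard) distance between two binary vectors
y <- rbind(c(1, 1, 0, 0),
           c(1, 0, 1, 0))
dist(y, method = "binary")      # 3 positions have at least one 1; 1 is shared, so 1 - 1/3 = 0.667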
In R, the dist(df, method = c("euclidean", "binary", ...)) function calculates the distances between observations. When calculating a Euclidean distance, the features should be on similar scales. If they are not, standardize their values as \((x - \bar{x}) / sd(x)\) so that each feature has a mean of 0 and a standard deviation of 1 (not only should the features be in the same units, but their means and standard deviations should also be similar). A quick way to check whether scaling is necessary is colMeans() and apply(df, 2, sd). The scale() function is a generic function that scales the columns of a numeric matrix. When calculating a binary distance, the categorical features should be binary. If they are not, create dummy variables. The dummy.data.frame() function from the dummies package converts factor variables into dummy representations. (Question: what if observations contain both quantitative and categorical features?) In practice, the clustering algorithm will calculate the distances, so you will not call dist() directly; however, the conditions on scaling and binary features still apply.
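A minimal sketch of both preprocessing steps. The data frames df_num and df_cat are made up for illustration, and the second step assumes the dummies package is installed:

# Quantitative features: check scales, then standardize before a Euclidean distance
df_num <- data.frame(height_cm = c(150, 160, 170, 180),
                     weight_kg = c(50, 65, 80, 95))
colMeans(df_num)
apply(df_num, 2, sd)
dist(scale(df_num), method = "euclidean")

# Categorical features: convert factors to dummies before a binary distance
library(dummies)  # assumes the dummies package is available
df_cat <- data.frame(color = factor(c("red", "blue", "red")),
                     size  = factor(c("small", "large", "large")))
dist(dummy.data.frame(df_cat), method = "binary")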
Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which builds a hierarchy of clusters. One usually presents the HCA results in a dendrogram. The HCA process is:
1. Calculate the distance between each pair of observations with dist(df, method = c("euclidean", "binary")). dist() returns an object of class dist.
2. Cluster the distances with hclust(dist, method = c("complete", "single", "average", "centroid")). hclust() groups the two closest observations into a cluster, then calculates the distance from that cluster to the remaining observations. If the shortest remaining distance is between two unclustered observations, hclust() defines a second cluster; otherwise it merges the nearest observation into the existing cluster. The process repeats until all observations belong to a single cluster. The "distance" to a cluster requires definition: the "complete" distance is the distance to the furthest member of the cluster, the "single" distance is the distance to the closest member, the "average" distance is the average distance to all members, and the "centroid" distance is the distance between the centroids of each cluster. (As a rule of thumb, "complete" and "average" tend to produce more balanced trees and are the most common. Pruning an unbalanced tree can result in most observations assigned to one cluster and only a few observations assigned to other clusters, which is useful for identifying outliers.) hclust() returns an object of class hclust.
3. Evaluate the hclust tree with a dendrogram, principal component analysis (PCA), and/or summary statistics. The vertical lines in a dendrogram indicate the distance between nodes and their associated clusters. Clusters are difficult to visualize when the number of features is greater than two (one work-around is to plot just two dimensions at a time).
4. "Cut" the hierarchical tree into the desired number of clusters (k) or at a height h with cutree(hclust, k = NULL, h = NULL). cutree() returns a vector of cluster memberships. Attach this vector back to the original data frame for visualization and summary statistics.
5. Calculate summary statistics and draw conclusions. Useful summary statistics are typically the membership count and the feature averages (or proportions) of each cluster. (A compact sketch of the full process follows this list.)
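The sketch below runs the five steps end to end on the built-in mtcars data, which stands in here only as a placeholder for any unlabeled numeric data set; the choice of k = 3 is arbitrary:

# 1. Distance matrix on standardized features
d <- dist(scale(mtcars), method = "euclidean")

# 2. Hierarchical clustering
hc <- hclust(d, method = "complete")

# 3. Inspect the dendrogram
plot(hc)

# 4. Cut the tree into k clusters (k = 3 is arbitrary here)
clusters <- cutree(hc, k = 3)

# 5. Summarize: membership counts and feature averages per cluster
table(clusters)
aggregate(mtcars, by = list(cluster = clusters), FUN = mean)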
The pokemon
dataset contains observations of 800 pokemons7 More information on the dataset at https://www.kaggle.com/abcsds/pokemon on 6 dimensions. The data is unlabeled, meaning there is no response variable, just features. The features here are six pokeon ability measures.
library(readr)
pokemon <- read_csv(url("https://assets.datacamp.com/production/course_1815/datasets/Pokemon.csv"))
pokemon$Name <- NULL
pokemon$Type1 <- NULL
pokemon$Type2 <- NULL
pokemon$Total <- NULL
pokemon$Generation <- NULL
pokemon$Legendary <- NULL
head(pokemon)
## # A tibble: 6 x 7
## Number HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## <int> <int> <int> <int> <int> <int> <int>
## 1 1 45 49 49 65 65 45
## 2 2 60 62 63 80 80 60
## 3 3 80 82 83 100 100 80
## 4 3 80 100 123 122 120 80
## 5 4 39 52 43 60 50 65
## 6 5 58 64 58 80 65 80
Before clustering, check whether any preprocessing is required: Are there any NAs? If so, drop those observations or impute values. Are all of the features on comparable scales? If not, standardize them. Are any features multinomial? If so, create binary dummy variables. In this case, the means and standard deviations are similar, but I am scaling anyway for the exercise.
library(dendextend) # for color_branches()
# Means and SDs
colMeans(pokemon[, -c(1)])
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
apply(pokemon[, -c(1)], MARGIN = 2, FUN = sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
# Scale the data
pokemon.scaled <- scale(pokemon)
# Create the full tree
hc_model <- hclust(dist(pokemon.scaled), method = "complete")
# Inspect the tree to choose a size.
plot(color_branches(as.dendrogram(hc_model),
k = 7))
abline(h = 7, col = "red")
The dendrogram suggests the optimal number of clusters is seven. Cut the tree at k = 7 and attach the cluster assignment vector back to the original data frame for visualization and/or summary statistics.
library(dplyr) # for mutate()
library(tidyr) # for gather()
library(ggplot2)
pokemon <- mutate(pokemon, cluster = cutree(hc_model, k = 7))
# View the resulting model
pokemon %>%
group_by(cluster) %>%
summarise_all(mean) %>%
select(-c(2)) %>%
knitr::kable(caption = "Cluster Centers")
Cluster Centers
cluster | HitPoints | Attack | Defense | SpecialAttack | SpecialDefense | Speed |
---|---|---|---|---|---|---|
1 | 52.02711 | 56.78614 | 54.91867 | 51.34337 | 53.05422 | 51.15663 |
2 | 78.06884 | 85.93478 | 80.06522 | 85.77899 | 85.34420 | 83.23188 |
3 | 83.79268 | 119.20732 | 85.53659 | 117.71951 | 94.53659 | 105.04878 |
4 | 79.86316 | 101.75789 | 114.28421 | 72.09474 | 78.25263 | 54.41053 |
5 | 167.41667 | 70.66667 | 47.66667 | 57.75000 | 72.25000 | 48.16667 |
6 | 20.00000 | 10.00000 | 230.00000 | 10.00000 | 230.00000 | 5.00000 |
7 | 50.00000 | 165.00000 | 35.00000 | 165.00000 | 35.00000 | 150.00000 |
pokemon %>%
gather(key = "Ability",
value = "Score",
c(HitPoints, Attack, Defense,
SpecialAttack, SpecialDefense, Speed)) %>%
ggplot(aes(x = factor(Ability), y = Score, color = factor(cluster))) +
geom_point(aes(group = Number))
Cluster 1 has the lowest values in all features. Cluster 5 has very high HitPoints. Cluster 3 has very high Special Attack.
Cluster analysis is useful for spatial data, qualitative data, and time-series data. With time-series data, the time periods are the features. Typically this requires transposing the data set so that the dates become columns; otherwise, the same rules apply.
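For example, a minimal sketch with made-up monthly sales for three stores, clustering the stores on their time profiles:

# Rows are months, columns are stores (made-up numbers)
sales <- data.frame(store_a = c(10, 12, 11, 13),
                    store_b = c(11, 13, 12, 14),
                    store_c = c(50, 40, 45, 42),
                    row.names = c("2020-01", "2020-02", "2020-03", "2020-04"))

# Transpose so each store is an observation and each month is a feature
sales_t <- t(sales)

# Then the usual workflow applies
hc_ts <- hclust(dist(sales_t), method = "average")
cutree(hc_ts, k = 2)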
Hierarchical clustering has some advantages over k-means. It can use any distance method, not just Euclidean. Its results are stable, whereas k-means can produce different results on each run. While both can be evaluated with silhouette and elbow plots, hierarchical clustering can also be evaluated with a dendrogram. But hierarchical clustering has one significant drawback: it is computationally expensive compared to k-means. For this last reason, k-means is more common.
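A sketch of the two shared evaluation tools mentioned above, reusing pokemon.scaled and hc_model from the example and assuming the cluster package is installed; the elbow plot uses k-means while the silhouette width evaluates the hierarchical solution:

library(cluster)  # for silhouette()

# Elbow plot: total within-cluster sum of squares for k-means with k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(pokemon.scaled, centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")

# Average silhouette width for the hierarchical solution with k = 7
sil <- silhouette(cutree(hc_model, k = 7), dist(pokemon.scaled))
summary(sil)$avg.width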