These notes are primarily taken from the Cluster Analysis in R and Unsupervised Learning in R DataCamp courses.
Unsupervised machine learning searches for structure in unlabeled data (data without a response variable). The goal of unsupervised learning is to find homogeneous subgroups (clusters) and to find patterns (usually through dimensionality reduction). Examples of unsupervised machine learning include k-means clustering, hierarchical cluster analysis (HCA), and principal component analysis (PCA). Unsupervised machine learning often has no single goal beyond gaining insight.
Supervised machine learning makes predictions from labeled data. Supervised learning uses regression for quantitative outcomes and classification for qualitative outcomes; examples include decision trees, random forests, and lasso regression. Reinforcement machine learning is a third type of learning, in which the machine learns by operating in an environment.
Central to clustering is the concept of distance. Two observations are similar if the distance between their features is relatively small. There are many ways to define distance (see the options in the Distance Matrix Computation (dist) documentation), but the two most common are Euclidean and binary.
Euclidean distance is the distance between two quantitative vectors: \(d = \sqrt{\sum{(x_i - y_i)^2}}\). Binary distance, also called the Jaccard distance (see the Wikipedia article), is the distance between two binary vectors: 1 minus the proportion of shared features.
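To make the two definitions concrete, here is a small sketch with made-up vectors (the numbers are illustrative only):

# Euclidean distance between two quantitative vectors
x <- rbind(c(1, 2, 3),
           c(4, 6, 3))
dist(x, method = "euclidean")   # sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5

# Binary (Jaccard) distance between two binary vectors
y <- rbind(c(1, 1, 0, 0),
           c(1, 0, 1, 0))
dist(y, method = "binary")      # 3 positions have at least one 1; 1 is shared, so 1 - 1/3 = 0.667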
In R, the dist(df, method = c("euclidean", "binary", ...)) function calculates the distances between observations. When calculating a Euclidean distance, the features should be on similar scales. If they are not, standardize their values as \((x - \bar{x}) / sd(x)\) so that each feature has a mean of 0 and a standard deviation of 1 (not only should the features be in the same units, but their means and standard deviations should also be similar). A quick way to check whether scaling is necessary is colMeans() and apply(df, 2, sd). The scale() function is a generic function that scales the columns of a numeric matrix. When calculating a binary distance, the categorical features should be binary. If they are not, create dummy variables. The dummy.data.frame() function from the dummies package converts factor variables into dummy representations. (Question: what if observations contain both quantitative and categorical features?) In practice, the clustering algorithm will calculate the distances, so you will not call dist() directly; however, the conditions on scaling and binary features still apply.
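A minimal sketch of both preprocessing steps. The data frames df_num and df_cat are made up for illustration, and the second step assumes the dummies package is installed:

# Quantitative features: check scales, then standardize before a Euclidean distance
df_num <- data.frame(height_cm = c(150, 160, 170, 180),
                     weight_kg = c(50, 65, 80, 95))
colMeans(df_num)
apply(df_num, 2, sd)
dist(scale(df_num), method = "euclidean")

# Categorical features: convert factors to dummies before a binary distance
library(dummies)  # assumes the dummies package is available
df_cat <- data.frame(color = factor(c("red", "blue", "red")),
                     size  = factor(c("small", "large", "large")))
dist(dummy.data.frame(df_cat), method = "binary")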
Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which builds a hierarchy of clusters. One usually presents the HCA results in a dendrogram. The HCA process is:
1. Calculate the distance between each pair of observations with dist(df, method = c("euclidean", "binary")). dist() returns an object of class dist.
2. Cluster the distances with hclust(dist, method = c("complete", "single", "average", "centroid")). hclust() groups the two closest observations into a cluster, then calculates the distance from that cluster to the remaining observations. If the shortest remaining distance is between two unclustered observations, hclust() defines a second cluster; otherwise it merges the nearest observation into the existing cluster. The process repeats until all observations belong to a single cluster. The "distance" to a cluster requires definition: the "complete" distance is the distance to the furthest member of the cluster, the "single" distance is the distance to the closest member, the "average" distance is the average distance to all members, and the "centroid" distance is the distance between the centroids of each cluster. (As a rule of thumb, "complete" and "average" tend to produce more balanced trees and are the most common. Pruning an unbalanced tree can result in most observations assigned to one cluster and only a few observations assigned to other clusters, which is useful for identifying outliers.) hclust() returns an object of class hclust.
3. Evaluate the hclust tree with a dendrogram, principal component analysis (PCA), and/or summary statistics. The vertical lines in a dendrogram indicate the distance between nodes and their associated clusters. Clusters are difficult to visualize when the number of features is greater than two (one work-around is to plot just two dimensions at a time).
4. "Cut" the hierarchical tree into the desired number of clusters (k) or at a height h with cutree(hclust, k = NULL, h = NULL). cutree() returns a vector of cluster memberships. Attach this vector back to the original data frame for visualization and summary statistics.
5. Calculate summary statistics and draw conclusions. Useful summary statistics are typically the membership count and the feature averages (or proportions) of each cluster. (A compact sketch of the full process follows this list.)
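The sketch below runs the five steps end to end on the built-in mtcars data, which stands in here only as a placeholder for any unlabeled numeric data set; the choice of k = 3 is arbitrary:

# 1. Distance matrix on standardized features
d <- dist(scale(mtcars), method = "euclidean")

# 2. Hierarchical clustering
hc <- hclust(d, method = "complete")

# 3. Inspect the dendrogram
plot(hc)

# 4. Cut the tree into k clusters (k = 3 is arbitrary here)
clusters <- cutree(hc, k = 3)

# 5. Summarize: membership counts and feature averages per cluster
table(clusters)
aggregate(mtcars, by = list(cluster = clusters), FUN = mean)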
The pokemon
dataset contains observations of 800 pokemons7 More information on the dataset at https://www.kaggle.com/abcsds/pokemon on 6 dimensions. The data is unlabeled, meaning there is no response variable, just features. The features here are six pokeon ability measures.
library(readr)
pokemon <- read_csv(url("https://assets.datacamp.com/production/course_1815/datasets/Pokemon.csv"))
pokemon$Name <- NULL
pokemon$Type1 <- NULL
pokemon$Type2 <- NULL
pokemon$Total <- NULL
pokemon$Generation <- NULL
pokemon$Legendary <- NULL
head(pokemon)
## # A tibble: 6 x 7
## Number HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## <int> <int> <int> <int> <int> <int> <int>
## 1 1 45 49 49 65 65 45
## 2 2 60 62 63 80 80 60
## 3 3 80 82 83 100 100 80
## 4 3 80 100 123 122 120 80
## 5 4 39 52 43 60 50 65
## 6 5 58 64 58 80 65 80
Before clustering, check whether any preprocessing is required: Are there any NAs? If so, drop those observations or impute values. Are all of the features on comparable scales? If not, standardize them. Are any features multinomial? If so, create binary dummy variables. In this case, the means and standard deviations are similar, but I am scaling anyway for the exercise.
library(dendextend) # for color_branches()
# Means and SDs
colMeans(pokemon[, -c(1)])
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
apply(pokemon[, -c(1)], MARGIN = 2, FUN = sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
# Scale the data
pokemon.scaled <- scale(pokemon)
# Create the full tree
hc_model <- hclust(dist(pokemon.scaled), method = "complete")
# Inspect the tree to choose a size.
plot(color_branches(as.dendrogram(hc_model),
k = 7))
abline(h = 7, col = "red")
The dendrogram suggests the optimal number of clusters is seven. Cut the tree at k = 7 and attach the cluster assignment vector back to the original data frame for visualization and/or summary statistics.
library(dplyr) # for mutate()
library(tidyr) # for gather()
library(ggplot2)
pokemon <- mutate(pokemon, cluster = cutree(hc_model, k = 7))
# View the resulting model
pokemon %>%
group_by(cluster) %>%
summarise_all(mean) %>%
select(-c(2)) %>%
knitr::kable(caption = "Cluster Centers")
Cluster Centers
cluster | HitPoints | Attack | Defense | SpecialAttack | SpecialDefense | Speed |
---|---|---|---|---|---|---|
1 | 52.02711 | 56.78614 | 54.91867 | 51.34337 | 53.05422 | 51.15663 |
2 | 78.06884 | 85.93478 | 80.06522 | 85.77899 | 85.34420 | 83.23188 |
3 | 83.79268 | 119.20732 | 85.53659 | 117.71951 | 94.53659 | 105.04878 |
4 | 79.86316 | 101.75789 | 114.28421 | 72.09474 | 78.25263 | 54.41053 |
5 | 167.41667 | 70.66667 | 47.66667 | 57.75000 | 72.25000 | 48.16667 |
6 | 20.00000 | 10.00000 | 230.00000 | 10.00000 | 230.00000 | 5.00000 |
7 | 50.00000 | 165.00000 | 35.00000 | 165.00000 | 35.00000 | 150.00000 |
pokemon %>%
gather(key = "Ability",
value = "Score",
c(HitPoints, Attack, Defense,
SpecialAttack, SpecialDefense, Speed)) %>%
ggplot(aes(x = factor(Ability), y = Score, color = factor(cluster))) +
geom_point(aes(group = Number))
Cluster 1 has the lowest values in all features. Cluster 5 has very high HitPoints. Cluster 3 has very high Special Attack.
Cluster analysis is useful for spatial data, qualitative data, and time-series data. With time-series data, the time periods are the features. Typically this requires transposing the data set so that the dates become columns; otherwise, the same rules apply.
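For example, a minimal sketch with made-up monthly sales for three stores, clustering the stores on their time profiles:

# Rows are months, columns are stores (made-up numbers)
sales <- data.frame(store_a = c(10, 12, 11, 13),
                    store_b = c(11, 13, 12, 14),
                    store_c = c(50, 40, 45, 42),
                    row.names = c("2020-01", "2020-02", "2020-03", "2020-04"))

# Transpose so each store is an observation and each month is a feature
sales_t <- t(sales)

# Then the usual workflow applies
hc_ts <- hclust(dist(sales_t), method = "average")
cutree(hc_ts, k = 2)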
Hierarchical clustering has some advantages over k-means. It can use any distance method, not just Euclidean. Its results are stable, whereas k-means can produce different results on each run. While both can be evaluated with silhouette and elbow plots, hierarchical clustering can also be evaluated with a dendrogram. But hierarchical clustering has one significant drawback: it is computationally expensive compared to k-means. For this last reason, k-means is more common.
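A sketch of the two shared evaluation tools mentioned above, reusing pokemon.scaled and hc_model from the example and assuming the cluster package is installed; the elbow plot uses k-means while the silhouette width evaluates the hierarchical solution:

library(cluster)  # for silhouette()

# Elbow plot: total within-cluster sum of squares for k-means with k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(pokemon.scaled, centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")

# Average silhouette width for the hierarchical solution with k = 7
sil <- silhouette(cutree(hc_model, k = 7), dist(pokemon.scaled))
summary(sil)$avg.width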