These notes are primarily taken from the Cluster Analysis in R DataCamp course.
Machine learning is the use of statistical methods to search for patterns (unsupervised machine learning) or make predictions (supervised machine learning). Examples of supervised machine learning include decision trees, random forests, and lasso regression. Examples of unsupervised machine learning include cluster analysis, of which the two most common methods are hierarchical clustering and k-means clustering.
Cluster analysis is a form of exploratory data analysis which divides observations into meaningful groups that share features related to an observed metric. Central to clustering is the concept of distance. There are many ways to define distance (see the options in the Distance Matrix Computation (dist) documentation), but the two most common are Euclidean and binary.
Euclidean distance is the usual distance between two quantitative vectors: \(d = \sqrt{\sum{(x_i - y_i)^2}}\). Two observations are similar if the Euclidean distance between their features is relatively small. Binary distance is the distance between two binary vectors: 1 minus the number of features present in both observations divided by the number of features present in either observation. (The binary distance is also called the Jaccard distance; see the Wikipedia article.) Two observations are similar if their binary distance is relatively small.
In R, the dist(df, method = c("euclidean", "binary", ...)) function calculates the distances between observations. When calculating a Euclidean distance, the features should be on similar scales. If they are not, standardize their values as \((x - \bar{x}) / sd(x)\) so that each feature has a mean of 0 and a standard deviation of 1. The scale() function is a generic function that scales the columns of a matrix. When calculating a binary distance, the categorical features should be binary. If they are not, “dummify” them into dummy variables. The dummy.data.frame() function from the dummies package converts factor variables into dummy representations. (Question: what if observations contain both quantitative and categorical features?)
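For instance, a minimal sketch of both distance types, using the built-in iris data as a stand-in (an illustration only, not the course data):
# Euclidean distance on standardized quantitative features
iris_scaled <- scale(iris[, 1:4])                         # each column: mean 0, sd 1
dist_euclidean <- dist(iris_scaled, method = "euclidean")
# Binary (Jaccard) distance on a dummified categorical feature
library(dummies)                                          # for dummy.data.frame()
iris_dummy <- dummy.data.frame(iris[, "Species", drop = FALSE])
dist_binary <- dist(iris_dummy, method = "binary")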
Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which builds a hierarchy of clusters. One usually presents the HCA results in a dendrogram. The HCA process is:
Calculate the distance between each observation with dist(df, method = c("euclidean", "binary")). dist() returns an object of class dist.
Cluster the distances with hclust(dist, method = c("complete", "single", "average")). hclust() groups the two closest observations into a cluster, then calculates the cluster's distance to the remaining observations. If the shortest distance is between two observations, hclust() defines a second cluster; otherwise it adds the observation as a new level to the existing cluster. The process repeats until all observations belong to a cluster. The “distance” to a cluster requires definition. The “complete” distance is the distance to the furthest member of the cluster, the “single” distance is the distance to the closest member, and the “average” distance is the average distance to all members. (Question: under what circumstances is each method most appropriate? A small comparison is sketched after this list.) hclust() returns a value of class hclust.
Evaluate the hclust tree with a dendrogram, principal component analysis (PCA), and/or summary statistics. The vertical lines in a dendrogram indicate the distance between nodes and their associated cluster. The clusters themselves are difficult to visualize when the number of features is greater than two (one work-around is to plot just two dimensions at a time).
“Cut” the hierarchical tree into the desired number of clusters (k) or at a height h with cutree(hclust, k = NULL, h = NULL). cutree() returns a vector of cluster memberships. One usually attaches this vector back to the original dataframe for visualization and summary statistics.
Calculate summary statistics and draw conclusions. Useful summary statistics are typically membership count, and feature averages (or proportions).
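To get a feel for how the linkage choice changes the tree (the question raised above), a minimal sketch using the built-in mtcars data (an illustration only, not the course data):
d_mtcars <- dist(scale(mtcars), method = "euclidean")
hc_complete <- hclust(d_mtcars, method = "complete")
hc_single <- hclust(d_mtcars, method = "single")
hc_average <- hclust(d_mtcars, method = "average")
# Three-cluster memberships under each linkage, side by side
sapply(list(complete = hc_complete, single = hc_single, average = hc_average),
       function(hc) cutree(hc, k = 3))
As a rough rule of thumb, complete linkage tends to produce compact clusters of similar size, single linkage tends to chain observations into elongated clusters, and average linkage falls somewhere in between.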
A wholesale distributor summarizes n = 45 customers’ spending on Milk, Grocery, and Frozen food. Assign the customers into meaningful clusters using HCA.
customers_spend <- readRDS(url("https://assets.datacamp.com/production/repositories/1219/datasets/3558d2b5564714d85120cb77a904a2859bb3d03e/ws_customers.rds"))
Before conducting HCA, check whether any preprocessing is required: Are there any NAs? If so, drop these observations or impute values. Are all of the features comparable? If not, standardize the variables. Are the features multinomial? If so, dummify them into binary variables. In this case, there are no NAs, and all of the features are quantitative with a similar scale (dollars spent). No pre-processing is required.
summary(customers_spend)
## Milk Grocery Frozen
## Min. : 333 Min. : 1330 Min. : 264
## 1st Qu.: 1375 1st Qu.: 2743 1st Qu.: 824
## Median : 2335 Median : 5332 Median : 1455
## Mean : 4831 Mean : 7830 Mean : 2870
## 3rd Qu.: 5302 3rd Qu.:10790 3rd Qu.: 3046
## Max. :25071 Max. :26839 Max. :15601
Calculate the Euclidean distance between customers.
dist_customers <- dist(customers_spend, method = "euclidean")
Generate a complete linkage analysis. Visually inspect and choose a number of clusters k or tree height h that makes the most sense. A tree height of $15,000 seems to make sense here.
library(dendextend) # for color_branches()
hc_customers <- hclust(dist_customers, method = "complete")
plot(color_branches(as.dendrogram(hc_customers),
h = 15000))
Cut the tree to a height of h = 15000. Attach the cluster assignment vector back to the original dataframe for visualization and/or summary statistics.
library(dplyr)
segment_customers <- mutate(customers_spend,
cluster = cutree(hc_customers,
h = 15000))
Use summary statistics to explore the cluster characteristics. Cluster 1 spends more on Milk than the other clusters. Cluster 3 spends more on Grocery, and cluster 4 spends more on Frozen. Cluster 2 is the largest cluster and does not spend excessively in any category.
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
# observations per cluster
count(segment_customers, cluster)
## # A tibble: 4 x 2
## cluster n
## <int> <int>
## 1 1 5
## 2 2 29
## 3 3 5
## 4 4 6
segment_customers %>%
group_by(cluster) %>%
summarise_all(funs(mean(.)))
## # A tibble: 4 x 4
## cluster Milk Grocery Frozen
## <int> <dbl> <dbl> <dbl>
## 1 1 16950. 12891. 991.
## 2 2 2513. 5229. 1796.
## 3 3 10452. 22551. 1355.
## 4 4 1250. 3917. 10889.
segment_customers %>%
mutate(cust_id = 1:45) %>%
gather(key = "product",
value = "spending",
c(Milk, Grocery, Frozen)) %>%
ggplot(aes(x = factor(product), y = spending, color = factor(cluster))) +
geom_point(aes(group = cust_id))
The k-means clustering algorithm randomly selects k centroids within the data space. Each observation is assigned to its closest (Euclidean distance) centroid. The centroids are then re-positioned at the centers of their clusters. The observations are reassigned and centroids repositioned repeatedly until the clusters stabilize. The k-means process is:
Determine the appropriate number of means with an elbow plot or with the silhouette method. The elbow method calculates the total within-cluster sum of squared distances for all possible values of k. As k increases, the sum of squares decreases, but at a declining rate. The elbow method chooses the “elbow” in the curve - the point at which the curve flattens. The silhouette method calculates the within-cluster distance \(C(i)\) for each observation and the distance to the observation’s closest neighboring cluster \(N(i)\). The silhouette width is \(S = 1 - C(i) / N(i)\) for \(C(i) < N(i)\) and \(S = N(i) / C(i) - 1\) for \(C(i) > N(i)\). A value close to 1 means the observation is well-matched to its current cluster, a value near 0 means it sits on the border between the two clusters, and a value close to -1 means it is better matched to the neighboring cluster. (A direct computation of the average silhouette width is sketched after this list.)
Cluster the observations using the selected number k of clusters with kmeans(df, centers = k). kmeans() returns a value of class kmeans, a list that includes the vector of cluster assignments (cluster). One usually attaches this vector back to the original dataframe for visualization and summary statistics.
Calculate summary statistics and draw conclusions. Useful summary statistics are typically membership count, and feature averages (or proportions).
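The average silhouette width can also be computed directly from a kmeans result with silhouette() from the cluster package. A minimal sketch, reusing customers_spend from the example above and two centers chosen only for illustration:
library(cluster)                                    # for silhouette()
km <- kmeans(customers_spend, centers = 2)
si <- silhouette(km$cluster, dist(customers_spend))
summary(si)$avg.width                               # average silhouette width across observations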
From the example above, a wholesale distributor summarizes n = 45 customers’ spending on Milk, Grocery, and Frozen food. Assign the customers into meaningful clusters using k-means.
As with HCA, before conducting k-means, check whether any preprocessing is required: Are there any NAs? If so, drop these observations or impute values. Are all of the features comparable? If not, standardize the variables. Are the features multinomial? If so, dummify them into binary variables. In this case, there are no NAs, and all of the features are quantitative with a similar scale (dollars spent). No pre-processing is required.
Calculate the within-cluster sum of squared distances for k = 1:10 clusters. kmeans() returns an object of class kmeans, a list in which one of the components is the model sum of squares tot.withinss. To calculate the sum of squares for all possible models, use map_dbl(). Here, the elbow occurs at k = 2 clusters.
library(purrr) # for map_dbl
## Warning: package 'purrr' was built under R version 3.4.4
tot_withinss <- map_dbl(1:10, function(k) {
model <- kmeans(customers_spend, centers = k)
model$tot.withinss
})
data.frame(k = 1:10,
tot_withinss = tot_withinss) %>%
ggplot(aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
Try the silhouette method. pam() returns an object of class pam, a list in which one of the components is the average silhouette width silinfo$avg.width (S = -1 (poor match), S = 0 (border), S = 1 (good match)). Here, the largest average silhouette width occurs at k = 2 clusters.
library(cluster) # for pam
sil_width <- map_dbl(2:10, function(k){
model <- pam(x = customers_spend, k = k)
model$silinfo$avg.width
})
data.frame(k = 2:10,
sil_width = sil_width) %>%
ggplot(aes(x = k, y = sil_width)) +
geom_col() +
scale_x_continuous(breaks = 2:10)
The maximum width is at k = 2, so build a cluster with k = 2 means. Attach the cluster assignment vector back to the original dataframe for visualization and/or summary statistics.
set.seed(42) # kmeans is unstable. Make result consistent.
model_customers <- kmeans(customers_spend, centers = 2)
segment_customers <- mutate(customers_spend,
cluster = model_customers$cluster)
Use summary statistics to explore the cluster characteristics. Cluster 1 spends more on Frozen than the other cluster. Cluster 2 spends more on Milk and Grocery.
# Calculate the size of each cluster
count(segment_customers, cluster)
## # A tibble: 2 x 2
## cluster n
## <int> <int>
## 1 1 35
## 2 2 10
# Calculate the mean for each category
segment_customers %>%
group_by(cluster) %>%
summarise_all(funs(mean(.)))
## # A tibble: 2 x 4
## cluster Milk Grocery Frozen
## <int> <dbl> <dbl> <dbl>
## 1 1 2296. 5004. 3354.
## 2 2 13701. 17721. 1173.
segment_customers %>%
mutate(cust_id = 1:45) %>%
gather(key = "product",
value = "spending",
c(Milk, Grocery, Frozen)) %>%
ggplot(aes(x = factor(product), y = spending, color = factor(cluster))) +
geom_point(aes(group = cust_id))
Both the HCA and the k-means results are valid, but which one is appropriate requires the input of a subject matter expert. Generating clusters is a science, but interpreting them is an art.
Cluster analysis is useful for spatial data, qualitative data, and, as explored next, time-series data. With time-series data, the time periods are the features. Typically, this requires transposing the data set so that the dates are columns. Otherwise, the same rules apply.
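A minimal sketch of the transposition with a tiny toy matrix (the names and values below are made up for illustration):
salary_by_year <- matrix(c(50, 52, 55,                     # hypothetical mean salaries
                           30, 31, 33),
                         nrow = 3,
                         dimnames = list(c("2001", "2002", "2003"),
                                         c("OccupationA", "OccupationB")))
salary_by_occupation <- t(salary_by_year)                  # rows become occupations, columns become years
dist(salary_by_occupation, method = "euclidean")           # occupations are now the observations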
Data set oes contains the average incomes of 22 occupations from 2001 to 2016. Which occupations cluster together in their trends?
The data set is conveniently defined with the dates as columns.
oes <- readRDS(url("https://assets.datacamp.com/production/repositories/1219/datasets/1e1ec9f146a25d7c71a6f6f0f46c3de7bcefd36c/oes.rds"))
First decide whether the data requires any preprocessing (scaling, dummifying, imputing or removing NAs). No preprocessing is required here.
Perform an HCA first. Create a Euclidean distance matrix. Perform hierarchical clustering using average linkage. It looks reasonable to start with 3 clusters formed at a height of 100,000.
dist_oes <- dist(oes, method = "euclidean")
hc_oes <- hclust(dist_oes, method = "average")
plot(color_branches(as.dendrogram(hc_oes),
h = 100000))
Extract clusters from the dendrogram.
library(tibble) # for rownames_to_column
# Move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')
clust_oes <- mutate(df_oes,
cluster = cutree(hc_oes,
h = 100000))
# Create a tidy data frame by gathering the year and values into two columns
gathered_oes <- gather(data = clust_oes,
key = year,
value = mean_salary, -occupation, -cluster)
Explore the clusters. It looks like the Management and Legal professions (cluster 1) experienced the most rapid growth over these 15 years.
# View the clustering assignments by sorting the cluster assignment vector
knitr::kable(sort(cutree(hc_oes, h = 100000)))
occupation | cluster |
---|---|
Management | 1 |
Legal | 1 |
Business Operations | 2 |
Computer Science | 2 |
Architecture/Engineering | 2 |
Life/Physical/Social Sci. | 2 |
Healthcare Practitioners | 2 |
Community Services | 3 |
Education/Training/Library | 3 |
Arts/Design/Entertainment | 3 |
Healthcare Support | 3 |
Protective Service | 3 |
Food Preparation | 3 |
Grounds Cleaning & Maint. | 3 |
Personal Care | 3 |
Sales | 3 |
Office Administrative | 3 |
Farming/Fishing/Forestry | 3 |
Construction | 3 |
Installation/Repair/Maint. | 3 |
Production | 3 |
Transportation/Moving | 3 |
# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
geom_line(aes(group = occupation))
Now perform k-means clustering. We’ve already determined that no preprocessing is necessary. Determine the appropriate number of means with an elbow plot and the silhouette method. The elbow plot reveals a kink at k = 2.
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = oes, centers = k)
model$tot.withinss
})
data.frame(k = 1:10,
tot_withinss = tot_withinss) %>%
ggplot(aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
The maximum average silhouette width is at k = 7 (although k = 2 is close).
sil_width <- map_dbl(2:10, function(k){
model <- pam(oes, k = k)
model$silinfo$avg.width
})
data.frame(k = 2:10,
sil_width = sil_width) %>%
ggplot(aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
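To finish the k-means pass in the same way as the customer example, one could fit the chosen model and attach the assignments. A minimal sketch, using k = 2 from the elbow plot (k = 7 could be substituted per the silhouette method):
set.seed(42)                                    # k-means is unstable; make the result consistent
model_oes <- kmeans(oes, centers = 2)
clust_km_oes <- mutate(df_oes, cluster = model_oes$cluster)
count(clust_km_oes, cluster)                    # observations per cluster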
Hierarchical clustering has some advantages over k-means. It can use any distance method, not just Euclidean. Its results are stable, whereas k-means can produce different results each time it runs. While both can be evaluated with silhouette and elbow plots, hierarchical clustering can also be evaluated with a dendrogram. But hierarchical clustering has one significant drawback: it is computationally complex compared to k-means. For this last reason, k-means is more common.