Quite often, data analysis involves comparing different entities using a set of measurements on several variables. For example, patients in a hospital can be compared based on their length of stay in the hospital, their diagnoses, the number of medications they are taking, etc.
I was recently looking into a problem that called for clustering patients into groups. K-means clustering - possibly the most widely known clustering algorithm - only works when all variables are numeric. However, we often want to cluster observations on both numeric and categorical variables. For relatively small datasets, this can be done with hierarchical clustering methods using Gower’s similarity coefficient. For larger datasets, the computational costs of hierarchical clustering become prohibitive, and an alternative clustering method such as k-prototypes should be considered.1
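As a quick illustration of the latter, here’s a minimal k-prototypes sketch using the clustMixType package; the data frame df and the choice k = 3 are placeholders, not part of the analysis below:

library(clustMixType)  # provides kproto()

# df: any data frame whose columns are numeric or factor.
# k-prototypes combines a k-means-style distance on the numeric columns
# with simple matching on the factor columns:
kp <- kproto(x = df, k = 3)
kp$cluster  # integer vector: cluster assignment for each row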
For details on the mechanics of Gower’s measure, please see Gower, 1971.2 Here’s the gist of it:
To compare 2 people, A and B, on a variable X1, you first have to check whether a comparison is possible - i.e. whether measurements of X1 exist for both person A and person B. If a comparison is possible, you then assign a score for how similar the 2 people are on that variable. For example, if you want to compare Amy and Bob in terms of the variable “Nationality”, you could assign a value of 1 if they have the same nationality and 0 otherwise. Call this X1_similarity.
The overall similarity score between Amy and Bob is the sum of all the individual variable similarities (X1_similarity + X2_similarity + …) divided by the total possible comparisons (the number of variables for which data exists for both Amy and Bob).
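To make this concrete, here is a small sketch that computes the Gower similarity between two hypothetical people by hand (the names, variables, and values are made up for illustration):

# hypothetical measurements for Amy and Bob (NA = missing):
amy <- list(nationality = "American", age = 30, smoker = NA)
bob <- list(nationality = "American", age = 40, smoker = "no")

age.range <- 50  # assumed range of age across the whole dataset

# per-variable similarities (NA where a comparison isn't possible):
s.nationality <- ifelse(amy$nationality == bob$nationality, 1, 0)  # 1
s.age         <- 1 - abs(amy$age - bob$age) / age.range            # 0.8
s.smoker      <- NA  # no comparison possible: Amy's value is missing

scores <- c(s.nationality, s.age, s.smoker)

# overall similarity = sum of comparable scores / number of comparisons:
sum(scores, na.rm = TRUE) / sum(!is.na(scores))
## [1] 0.9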
Let’s examine the Auto dataset from the package ISLR. Here’s a sample of rows to show what it looks like:
# load required packages:
library(dplyr)       # data manipulation
library(cluster)     # daisy() for Gower dissimilarities
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

df1.auto <- ISLR::Auto %>%
  # recode origin as a factor:
  mutate(origin = case_when(origin == 1 ~ "American",
                            origin == 2 ~ "European",
                            origin == 3 ~ "Japanese") %>%
           as.factor())
# str(df1.auto)
# summary(df1.auto)
df1.auto %>%
sample_n(15) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
|     | mpg  | cylinders | displacement | horsepower | weight | acceleration | year | origin   | name |
|-----|------|-----------|--------------|------------|--------|--------------|------|----------|------|
| 248 | 19.9 | 8 | 260 | 110 | 3365 | 15.5 | 78 | American | oldsmobile cutlass salon brougham |
| 138 | 14.0 | 8 | 302 | 140 | 4638 | 16.0 | 74 | American | ford gran torino (sw) |
| 195 | 24.5 | 4 | 98 | 60 | 2164 | 22.1 | 76 | American | chevrolet woody |
| 208 | 19.0 | 4 | 120 | 88 | 3270 | 21.9 | 76 | European | peugeot 504 |
| 254 | 25.1 | 4 | 140 | 88 | 2720 | 15.4 | 78 | American | ford fairmont (man) |
| 343 | 32.3 | 4 | 97 | 67 | 2065 | 17.8 | 81 | Japanese | subaru |
| 206 | 20.0 | 4 | 130 | 102 | 3150 | 15.7 | 76 | European | volvo 245 |
| 351 | 33.7 | 4 | 107 | 75 | 2210 | 14.4 | 81 | Japanese | honda prelude |
| 87  | 13.0 | 8 | 350 | 145 | 3988 | 13.0 | 73 | American | chevrolet malibu |
| 60  | 20.0 | 4 | 140 | 90 | 2408 | 19.5 | 72 | American | chevrolet vega |
| 199 | 18.0 | 6 | 250 | 78 | 3574 | 21.0 | 76 | American | ford granada ghia |
| 238 | 30.0 | 4 | 97 | 67 | 1985 | 16.4 | 77 | Japanese | subaru dl |
| 219 | 33.5 | 4 | 85 | 70 | 1945 | 16.8 | 77 | Japanese | datsun f-10 hatchback |
| 221 | 17.0 | 8 | 260 | 110 | 4060 | 19.0 | 77 | American | oldsmobile cutlass supreme |
| 217 | 36.0 | 4 | 79 | 58 | 1825 | 18.6 | 77 | European | renault 5 gtl |
For simplicity, let’s focus on a small subset of the data:
set.seed(3)
df2.auto.subset <- df1.auto %>%
select(name,
mpg,
cylinders,
displacement) %>%
sample_n(3)
# print:
df2.auto.subset %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
|     | name | mpg | cylinders | displacement |
|-----|------|-----|-----------|--------------|
| 66  | amc ambassador sst | 17.0 | 8 | 304 |
| 316 | audi 4000 | 34.3 | 4 | 97 |
| 151 | plymouth valiant custom | 19.0 | 6 | 225 |
Now we’ll compare Gower’s dissimilarity measure with Euclidean distances:
# create matrix object for convenience:
m1.auto.numeric <- select(df2.auto.subset,
-name) %>%
as.matrix()
rownames(m1.auto.numeric) <- df2.auto.subset %>% pull(name)
m1.euclid.dis <- dist(m1.auto.numeric,
method = "euclidean")
m1.euclid.dis
## amc ambassador sst audi 4000
## audi 4000 207.76017
## plymouth valiant custom 79.05062 128.92668
# str(m1.euclid.dis)
So, in terms of Euclidean distance, the amc ambassador sst and the plymouth valiant custom are the most similar pair (distance 79.1), while the amc and the audi 4000 are the least similar (distance 207.8).
Let’s compare this with the results from Gower’s dissimilarity:
m1.gower.dis <- daisy(m1.auto.numeric,
metric = "gower")
m1.gower.dis
## Dissimilarities :
## amc ambassador sst audi 4000
## audi 4000 1.0000000
## plymouth valiant custom 0.3324165 0.6675835
##
## Metric : mixed ; Types = I, I, I
## Number of objects : 3
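We can reproduce daisy()’s numbers by hand. For interval variables, the Gower dissimilarity between two cars is the average of their absolute differences, each scaled by that variable’s range within the data:

# ranges of each variable across the 3 cars (what daisy uses to scale):
ranges <- apply(m1.auto.numeric, 2, function(x) diff(range(x)))

# amc vs plymouth by hand: mean of range-scaled absolute differences
mean(abs(m1.auto.numeric["amc ambassador sst", ] -
         m1.auto.numeric["plymouth valiant custom", ]) / ranges)
## [1] 0.3324165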
Same conclusion: the amc and the plymouth are the most similar pair (dissimilarity 0.33), and the amc and the audi are the least similar (dissimilarity 1).
First, we’ll add a categorical column: the origin of the cars.
set.seed(3)
df3.auto.subset.cat <- df1.auto %>%
select(name,
mpg,
cylinders,
displacement,
origin) %>%
sample_n(3)
# print:
df3.auto.subset.cat %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
|     | name | mpg | cylinders | displacement | origin |
|-----|------|-----|-----------|--------------|--------|
| 66  | amc ambassador sst | 17.0 | 8 | 304 | American |
| 316 | audi 4000 | 34.3 | 4 | 97 | European |
| 151 | plymouth valiant custom | 19.0 | 6 | 225 | American |
Now we’ll use Gower distances on mixed data:
# for convenience, add rownames:
rownames(df3.auto.subset.cat) <- df3.auto.subset.cat %>%
pull(name) %>%
substr(1,10)
m2.gower.dis <- daisy(df3.auto.subset.cat,
metric = "gower")
m2.gower.dis
## Dissimilarities :
## amc ambass audi 4000
## audi 4000 1.0000000
## plymouth v 0.3994499 0.8005501
##
## Metric : mixed ; Types = N, I, I, I, N
## Number of objects : 3
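Again, the arithmetic can be checked by hand. A nominal variable contributes 0 when the two cars match and 1 when they differ; note from the “Types” line above that daisy() is also treating the name column as a nominal variable (it differs for every pair of cars):

# amc vs plymouth by hand:
num.part <- c(abs(17.0 - 19.0) / (34.3 - 17.0),  # mpg, scaled by range
              abs(8 - 6)       / (8 - 4),        # cylinders
              abs(304 - 225)   / (304 - 97))     # displacement
name.part   <- 1  # the names differ
origin.part <- 0  # both cars are American

(sum(num.part) + name.part + origin.part) / 5
## [1] 0.3994499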
As expected, adding the origin column hasn’t changed the dissimilarities between the cars by much.
Let’s mess with the data a bit. We’ll add a lot of categorical fields that make the audi and the amc similar. Then we’ll calculate Gower’s dissimilarity again.
set.seed(3)
df4.auto.new <- df1.auto %>%
select(name,
mpg,
cylinders,
displacement,
origin) %>%
sample_n(3) %>%
mutate(category.var.1 = as.factor(c("yes",
"yes",
"no")),
category.var.2 = as.factor(c("yes",
"yes",
"no")),
category.var.3 = as.factor(c("yes",
"yes",
"no")),
category.var.4 = as.factor(c("yes",
"yes",
"no")),
category.var.5 = as.factor(c("yes",
"yes",
"no")))
# add rownames for convenience:
rownames(df4.auto.new) <- df4.auto.new %>%
pull(name) %>%
substr(1,10)
# Calculate Gower's similarity indexes:
m3.gower.newdata <- daisy(df4.auto.new,
metric = "gower")
m3.gower.newdata
## Dissimilarities :
## amc ambass audi 4000
## audi 4000 0.5000000
## plymouth v 0.6997249 0.9002751
##
## Metric : mixed ; Types = N, I, I, I, N, N, N, N, N, N
## Number of objects : 3
As expected, we have managed to make the audi and the amc the most similar. This means that Gower’s measure is behaving how we would intuitively expect it to.
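If we didn’t want the added categorical fields to swamp the numeric variables like this, daisy() also accepts a weights argument (one weight per variable) for the Gower metric. A sketch that down-weights the 5 made-up fields:

# columns of df4.auto.new: name, mpg, cylinders, displacement, origin,
# followed by the 5 made-up category variables:
m3b.gower.weighted <- daisy(df4.auto.new,
                            metric = "gower",
                            weights = c(rep(1, 5), rep(0.2, 5)))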
Now let’s apply hierarchical clustering to the full dataset. There are too many rows to show all the car names in a single tree, but it’s still useful to see the vertical distances between clusters.3
# assign rownames for convenience:
rownames.start <- df1.auto %>%
pull(name) %>%
substr(1,6)
rownames.new <- paste(rownames.start,
                      1:nrow(df1.auto),
                      sep = "-")
rownames(df1.auto) <- rownames.new
# find distance matrix:
m4.auto.gower.dist <- daisy(df1.auto,
metric = "gower")
# now use hclust (complete linkage by default):
c1.auto.cluster <- hclust(m4.auto.gower.dist)
# plot dendrogram:
# Far too many rows to visualize in one tree
# plot(c1.auto.cluster,
# hang = -1,
# cex = 0.6,
# cex.lab = 0.1)
# That's not easy to interpret. Let's try an alternative plotting method:
d1.auto.dendrogram <- as.dendrogram(c1.auto.cluster)
# Far too many rows to visualize in one tree
plot(d1.auto.dendrogram,
horiz = FALSE,
leaflab = "none")
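To eyeball candidate cut heights, one option is to redraw the tree with the base plot method for hclust objects and overlay boxes using rect.hclust (a quick sketch; leaf labels are suppressed because there are far too many rows):

plot(c1.auto.cluster,
     labels = FALSE,
     hang = -1)
rect.hclust(c1.auto.cluster,
            k = 3,           # try k = 2, 3, 4
            border = "red")  # boxes around the k-cluster solution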
Now let’s take a look at the list of cars at different levels of the dendrogram.
Given the vertical distances in the dendrogram, it looks like 2 to 4 clusters is probably most meaningful. Let’s start by looking at 2 clusters.
groups.2 <- cutree(c1.auto.cluster, 2)
# str(groups.2)
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.
list1.2clusters <- sapply(unique(groups.2),
function(x){df1.auto$name[groups.2 == x]})
# list1.2clusters[[1]]
# list1.2clusters[[2]]
df5.2cl.summary <- table(groups.2,
df1.auto$origin) %>%
as.data.frame.matrix() %>%
mutate(cluster = c("cluster1",
"cluster2")) %>%
select(cluster,
everything())
# print:
df5.2cl.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster | American | European | Japanese |
---|---|---|---|
cluster1 | 104 | 0 | 0 |
cluster2 | 141 | 68 | 79 |
So all the cars in Cluster 1 are American, while those in Cluster 2 are a mix of American, European and Japanese. How do these clusters differ in terms of the other variables?
df6.auto.numeric <- df1.auto %>%
select(mpg:acceleration)
aggregate(df6.auto.numeric,
list(groups.2),
median) %>%
rename(cluster.num = Group.1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster.num | mpg | cylinders | displacement | horsepower | weight | acceleration |
---|---|---|---|---|---|---|
1 | 14.25 | 8 | 350.0 | 150 | 4137.5 | 13.00 |
2 | 26.00 | 4 | 120.5 | 86 | 2503.0 | 16.15 |
It’s fairly clear that Cluster 1 consists of large, fuel-inefficient cars (all American), whereas Cluster 2 contains all the other cars.
Next, let’s cut the tree into 3 clusters.
groups.3 <- cutree(c1.auto.cluster, 3)
# str(groups.3)
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.
list2.3clusters <- sapply(unique(groups.3),
function(x){df1.auto$name[groups.3 == x]})
df7.3cl.summary <- table(groups.3,
df1.auto$origin) %>%
as.data.frame.matrix() %>%
mutate(cluster = c("cluster1",
"cluster2",
"cluster3")) %>%
select(cluster,
everything())
# print:
df7.3cl.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster | American | European | Japanese |
---|---|---|---|
cluster1 | 104 | 0 | 0 |
cluster2 | 141 | 0 | 79 |
cluster3 | 0 | 68 | 0 |
# Let's examine numeric variables by cluster:
aggregate(df6.auto.numeric,
list(groups.3),
median) %>%
rename(cluster.num = Group.1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster.num | mpg | cylinders | displacement | horsepower | weight | acceleration |
---|---|---|---|---|---|---|
1 | 14.25 | 8 | 350 | 150.0 | 4137.5 | 13.0 |
2 | 25.50 | 4 | 135 | 88.0 | 2568.5 | 16.4 |
3 | 26.00 | 4 | 105 | 76.5 | 2240.0 | 15.6 |
Very interesting. Cluster 1 is the large American cars, Cluster 3 is the European cars, and Cluster 2 is the Japanese cars plus Japanese-like American cars.
Additionally, note that Clusters 2 and 3 are very similar, but Cluster 3 seems to have slightly smaller, more fuel-efficient cars.
Now let’s move on to 4 clusters.
groups.4 <- cutree(c1.auto.cluster, 4)
# str(groups.4)
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.
list3.4clusters <- sapply(unique(groups.4),
function(x){df1.auto$name[groups.4 == x]})
df8.4cl.summary <- table(groups.4,
df1.auto$origin) %>%
as.data.frame.matrix() %>%
mutate(cluster = c("cluster1",
"cluster2",
"cluster3",
"cluster4")) %>%
select(cluster,
everything())
# print:
df8.4cl.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster | American | European | Japanese |
---|---|---|---|
cluster1 | 104 | 0 | 0 |
cluster2 | 0 | 0 | 79 |
cluster3 | 141 | 0 | 0 |
cluster4 | 0 | 68 | 0 |
# Let's examine numeric variables by cluster:
aggregate(df6.auto.numeric,
list(groups.4),
median) %>%
rename(cluster.num = Group.1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster.num | mpg | cylinders | displacement | horsepower | weight | acceleration |
---|---|---|---|---|---|---|
1 | 14.25 | 8 | 350 | 150.0 | 4137.5 | 13.0 |
2 | 31.60 | 4 | 97 | 75.0 | 2155.0 | 16.4 |
3 | 22.40 | 6 | 171 | 90.0 | 2790.0 | 16.4 |
4 | 26.00 | 4 | 105 | 76.5 | 2240.0 | 15.6 |
At this point, we’ve managed to split apart the Japanese cars from the Japanese-like American cars.
Finally, just for fun, let’s look at the breakdown into 5 clusters.
groups.5 <- cutree(c1.auto.cluster, 5)
# str(groups.5)
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.
list4.5clusters <- sapply(unique(groups.5),
function(x){df1.auto$name[groups.5 == x]})
df9.5cl.summary <- table(groups.5,
df1.auto$origin) %>%
as.data.frame.matrix() %>%
mutate(cluster = c("cluster1",
"cluster2",
"cluster3",
"cluster4",
"cluster5")) %>%
select(cluster,
everything())
# print:
df9.5cl.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster | American | European | Japanese |
---|---|---|---|
cluster1 | 58 | 0 | 0 |
cluster2 | 46 | 0 | 0 |
cluster3 | 0 | 0 | 79 |
cluster4 | 141 | 0 | 0 |
cluster5 | 0 | 68 | 0 |
# Let's examine numeric variables by cluster:
aggregate(df6.auto.numeric,
list(groups.5),
median) %>%
rename(cluster.num = Group.1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster.num | mpg | cylinders | displacement | horsepower | weight | acceleration |
---|---|---|---|---|---|---|
1 | 15.5 | 8 | 318.0 | 145.0 | 4057.5 | 13.6 |
2 | 13.5 | 8 | 371.5 | 175.0 | 4365.0 | 11.5 |
3 | 31.6 | 4 | 97.0 | 75.0 | 2155.0 | 16.4 |
4 | 22.4 | 6 | 171.0 | 90.0 | 2790.0 | 16.4 |
5 | 26.0 | 4 | 105.0 | 76.5 | 2240.0 | 15.6 |

Going from 4 to 5 clusters splits the large American cars of the earlier Cluster 1 into two groups, with the new Cluster 2 holding the very largest, least fuel-efficient vehicles.
Huang, Zhexue. Extensions to the k-Means algorithm for clustering large datasets with categorical values. Data Mining and Knowledge Discovery, 1998.↩
Gower, J.C. A general coefficient of similarity and some of its properties. Biometrics, 1971.↩
See http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning#plot.hclust-r-base-function↩