Quite often, data analysis involves comparing different entities using a set of measurements on several variables. For example, patients in a hospital can be compared based on their length of stay in the hospital, their diagnoses, the number of medications they are taking, etc.
I was recently looking into a problem that called for clustering patients into groups. K-means clustering - possibly the most widely known clustering algorithm - only works when all variables are numeric. However, we often want to cluster observations on both numeric and categorical variables. For relatively small datasets, this can be done with hierarchical clustering methods using Gower’s similarity coefficient. For larger datasets, the computational costs of hierarchical clustering become prohibitive, and an alternative clustering method such as k-prototypes should be considered.1
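As a quick illustration of the latter, here’s a minimal k-prototypes sketch using the clustMixType package; the data frame df and the choice k = 3 are placeholders, not part of the analysis below:

library(clustMixType)  # provides kproto()

# df: any data frame whose columns are numeric or factor.
# k-prototypes combines a k-means-style distance on the numeric columns
# with simple matching on the factor columns:
kp <- kproto(x = df, k = 3)
kp$cluster  # integer vector: cluster assignment for each row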
For details on the mechanics of Gower’s measure, please see Gower, 1971.2 Here’s the gist of it:
To compare 2 people, A and B, on a variable X1, you first have to check whether a comparison is possible - i.e. whether measurements of X1 exist for both person A and person B. If a comparison is possible, you then assign a score for how similar the 2 people are on that variable. For example, if you want to compare Amy and Bob in terms of the variable “Nationality”, you could assign a value of 1 if they have the same nationality and 0 otherwise. Call this X1_similarity.
The overall similarity score between Amy and Bob is the sum of all the individual variable similarities (X1_similarity + X2_similarity + …) divided by the total possible comparisons (the number of variables for which data exists for both Amy and Bob).
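To make this concrete, here is a small sketch that computes the Gower similarity between two hypothetical people by hand (the names, variables, and values are made up for illustration):

# hypothetical measurements for Amy and Bob (NA = missing):
amy <- list(nationality = "American", age = 30, smoker = NA)
bob <- list(nationality = "American", age = 40, smoker = "no")

age.range <- 50  # assumed range of age across the whole dataset

# per-variable similarities (NA where a comparison isn't possible):
s.nationality <- ifelse(amy$nationality == bob$nationality, 1, 0)  # 1
s.age         <- 1 - abs(amy$age - bob$age) / age.range            # 0.8
s.smoker      <- NA  # no comparison possible: Amy's value is missing

scores <- c(s.nationality, s.age, s.smoker)

# overall similarity = sum of comparable scores / number of comparisons:
sum(scores, na.rm = TRUE) / sum(!is.na(scores))
## [1] 0.9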
Let’s examine the Auto dataset from the package ISLR. Here’s a sample of rows to show what it looks like:
# load required packages:
library(dplyr)       # data manipulation
library(cluster)     # daisy() for Gower dissimilarities
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

df1.auto <- ISLR::Auto %>%
  # recode origin as a factor:
  mutate(origin = case_when(origin == 1 ~ "American",
                            origin == 2 ~ "European",
                            origin == 3 ~ "Japanese") %>%
           as.factor())
# str(df1.auto)
# summary(df1.auto)
df1.auto %>%
sample_n(15) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
|     | mpg  | cylinders | displacement | horsepower | weight | acceleration | year | origin   | name |
|-----|------|-----------|--------------|------------|--------|--------------|------|----------|------|
| 248 | 19.9 | 8 | 260 | 110 | 3365 | 15.5 | 78 | American | oldsmobile cutlass salon brougham |
| 138 | 14.0 | 8 | 302 | 140 | 4638 | 16.0 | 74 | American | ford gran torino (sw) |
| 195 | 24.5 | 4 | 98 | 60 | 2164 | 22.1 | 76 | American | chevrolet woody |
| 208 | 19.0 | 4 | 120 | 88 | 3270 | 21.9 | 76 | European | peugeot 504 |
| 254 | 25.1 | 4 | 140 | 88 | 2720 | 15.4 | 78 | American | ford fairmont (man) |
| 343 | 32.3 | 4 | 97 | 67 | 2065 | 17.8 | 81 | Japanese | subaru |
| 206 | 20.0 | 4 | 130 | 102 | 3150 | 15.7 | 76 | European | volvo 245 |
| 351 | 33.7 | 4 | 107 | 75 | 2210 | 14.4 | 81 | Japanese | honda prelude |
| 87  | 13.0 | 8 | 350 | 145 | 3988 | 13.0 | 73 | American | chevrolet malibu |
| 60  | 20.0 | 4 | 140 | 90 | 2408 | 19.5 | 72 | American | chevrolet vega |
| 199 | 18.0 | 6 | 250 | 78 | 3574 | 21.0 | 76 | American | ford granada ghia |
| 238 | 30.0 | 4 | 97 | 67 | 1985 | 16.4 | 77 | Japanese | subaru dl |
| 219 | 33.5 | 4 | 85 | 70 | 1945 | 16.8 | 77 | Japanese | datsun f-10 hatchback |
| 221 | 17.0 | 8 | 260 | 110 | 4060 | 19.0 | 77 | American | oldsmobile cutlass supreme |
| 217 | 36.0 | 4 | 79 | 58 | 1825 | 18.6 | 77 | European | renault 5 gtl |
For simplicity, let’s focus on a small subset of the data:
set.seed(3)
df2.auto.subset <- df1.auto %>%
select(name,
mpg,
cylinders,
displacement) %>%
sample_n(3)
# print:
df2.auto.subset %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
|     | name | mpg | cylinders | displacement |
|-----|------|-----|-----------|--------------|
| 66  | amc ambassador sst | 17.0 | 8 | 304 |
| 316 | audi 4000 | 34.3 | 4 | 97 |
| 151 | plymouth valiant custom | 19.0 | 6 | 225 |
Now we’ll compare Gower’s dissimilarity measure with Euclidean distances:
# create matrix object for convenience:
m1.auto.numeric <- select(df2.auto.subset,
-name) %>%
as.matrix()
rownames(m1.auto.numeric) <- df2.auto.subset %>% pull(name)
m1.euclid.dis <- dist(m1.auto.numeric,
method = "euclidean")
m1.euclid.dis
## amc ambassador sst audi 4000
## audi 4000 207.76017
## plymouth valiant custom 79.05062 128.92668
# str(m1.euclid.dis)
So, in terms of Euclidean distance, the amc ambassador sst and the plymouth valiant custom are the most similar pair (distance 79.1), while the amc and the audi 4000 are the least similar (distance 207.8).
Let’s compare this with the results from Gower’s dissimilarity:
m1.gower.dis <- daisy(m1.auto.numeric,
metric = "gower")
m1.gower.dis
## Dissimilarities :
## amc ambassador sst audi 4000
## audi 4000 1.0000000
## plymouth valiant custom 0.3324165 0.6675835
##
## Metric : mixed ; Types = I, I, I
## Number of objects : 3
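We can reproduce daisy()’s numbers by hand. For interval variables, the Gower dissimilarity between two cars is the average of their absolute differences, each scaled by that variable’s range within the data:

# ranges of each variable across the 3 cars (what daisy uses to scale):
ranges <- apply(m1.auto.numeric, 2, function(x) diff(range(x)))

# amc vs plymouth by hand: mean of range-scaled absolute differences
mean(abs(m1.auto.numeric["amc ambassador sst", ] -
         m1.auto.numeric["plymouth valiant custom", ]) / ranges)
## [1] 0.3324165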
Same conclusion: the amc and the plymouth are the most similar pair (dissimilarity 0.33), and the amc and the audi are the least similar (dissimilarity 1).
First, we’ll add a categorical column: the origin of the cars.
set.seed(3)
df3.auto.subset.cat <- df1.auto %>%
select(name,
mpg,
cylinders,
displacement,
origin) %>%
sample_n(3)
# print:
df3.auto.subset.cat %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
|     | name | mpg | cylinders | displacement | origin |
|-----|------|-----|-----------|--------------|--------|
| 66  | amc ambassador sst | 17.0 | 8 | 304 | American |
| 316 | audi 4000 | 34.3 | 4 | 97 | European |
| 151 | plymouth valiant custom | 19.0 | 6 | 225 | American |
Now we’ll use Gower distances on mixed data:
# for convenience, add rownames:
rownames(df3.auto.subset.cat) <- df3.auto.subset.cat %>%
pull(name) %>%
substr(1,10)
m2.gower.dis <- daisy(df3.auto.subset.cat,
metric = "gower")
m2.gower.dis
## Dissimilarities :
## amc ambass audi 4000
## audi 4000 1.0000000
## plymouth v 0.3994499 0.8005501
##
## Metric : mixed ; Types = N, I, I, I, N
## Number of objects : 3
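Again, the arithmetic can be checked by hand. A nominal variable contributes 0 when the two cars match and 1 when they differ; note from the “Types” line above that daisy() is also treating the name column as a nominal variable (it differs for every pair of cars):

# amc vs plymouth by hand:
num.part <- c(abs(17.0 - 19.0) / (34.3 - 17.0),  # mpg, scaled by range
              abs(8 - 6)       / (8 - 4),        # cylinders
              abs(304 - 225)   / (304 - 97))     # displacement
name.part   <- 1  # the names differ
origin.part <- 0  # both cars are American

(sum(num.part) + name.part + origin.part) / 5
## [1] 0.3994499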
As expected, adding the origin column hasn’t changed the dissimilarities between the cars by much.
Let’s mess with the data a bit. We’ll add a lot of categorical fields that make the audi and the amc similar. Then we’ll calculate Gower’s dissimilarity again.
set.seed(3)
df4.auto.new <- df1.auto %>%
select(name,
mpg,
cylinders,
displacement,
origin) %>%
sample_n(3) %>%
mutate(category.var.1 = as.factor(c("yes",
"yes",
"no")),
category.var.2 = as.factor(c("yes",
"yes",
"no")),
category.var.3 = as.factor(c("yes",
"yes",
"no")),
category.var.4 = as.factor(c("yes",
"yes",
"no")),
category.var.5 = as.factor(c("yes",
"yes",
"no")))
# add rownames for convenience:
rownames(df4.auto.new) <- df4.auto.new %>%
pull(name) %>%
substr(1,10)
# Calculate Gower's similarity indexes:
m3.gower.newdata <- daisy(df4.auto.new,
metric = "gower")
m3.gower.newdata
## Dissimilarities :
## amc ambass audi 4000
## audi 4000 0.5000000
## plymouth v 0.6997249 0.9002751
##
## Metric : mixed ; Types = N, I, I, I, N, N, N, N, N, N
## Number of objects : 3
As expected, we have managed to make the audi and the amc the most similar. This means that Gower’s measure is behaving how we would intuitively expect it to.
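If we didn’t want the added categorical fields to swamp the numeric variables like this, daisy() also accepts a weights argument (one weight per variable) for the Gower metric. A sketch that down-weights the 5 made-up fields:

# columns of df4.auto.new: name, mpg, cylinders, displacement, origin,
# followed by the 5 made-up category variables:
m3b.gower.weighted <- daisy(df4.auto.new,
                            metric = "gower",
                            weights = c(rep(1, 5), rep(0.2, 5)))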
Now let’s apply hierarchical clustering to the full dataset. There are too many rows to show all the car names in a single tree, but it’s still useful to see the vertical distances between clusters.3
# assign rownames for convenience:
rownames.start <- df1.auto %>%
pull(name) %>%
substr(1,6)
rownames.new <- paste(rownames.start,
                      1:nrow(df1.auto),
                      sep = "-")
rownames(df1.auto) <- rownames.new
# find distance matrix:
m4.auto.gower.dist <- daisy(df1.auto,
metric = "gower")
# now use hclust (complete linkage by default):
c1.auto.cluster <- hclust(m4.auto.gower.dist)
# plot dendrogram:
# Far too many rows to visualize in one tree
# plot(c1.auto.cluster,
# hang = -1,
# cex = 0.6,
# cex.lab = 0.1)
# That's not easy to interpret. Let's try an alternative plotting method:
d1.auto.dendrogram <- as.dendrogram(c1.auto.cluster)
# Far too many rows to visualize in one tree
plot(d1.auto.dendrogram,
horiz = FALSE,
leaflab = "none")
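To eyeball candidate cut heights, one option is to redraw the tree with the base plot method for hclust objects and overlay boxes using rect.hclust (a quick sketch; leaf labels are suppressed because there are far too many rows):

plot(c1.auto.cluster,
     labels = FALSE,
     hang = -1)
rect.hclust(c1.auto.cluster,
            k = 3,           # try k = 2, 3, 4
            border = "red")  # boxes around the k-cluster solution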
Now let’s take a look at the list of cars at different levels of the dendrogram.
Given the vertical distances in the dendrogram, it looks like 2 to 4 clusters is probably most meaningful. Let’s start by looking at 2 clusters.
groups.2 <- cutree(c1.auto.cluster, 2)
# str(groups.2)
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.
list1.2clusters <- sapply(unique(groups.2),
function(x){df1.auto$name[groups.2 == x]})
# list1.2clusters[[1]]
# list1.2clusters[[2]]
df5.2cl.summary <- table(groups.2,
df1.auto$origin) %>%
as.data.frame.matrix() %>%
mutate(cluster = c("cluster1",
"cluster2")) %>%
select(cluster,
everything())
# print:
df5.2cl.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster | American | European | Japanese |
---|---|---|---|
cluster1 | 104 | 0 | 0 |
cluster2 | 141 | 68 | 79 |
So all the cars in Cluster 1 are American, while those in Cluster 2 are a mix of American, European and Japanese. How do these clusters differ in terms of the other variables?
df6.auto.numeric <- df1.auto %>%
select(mpg:acceleration)
aggregate(df6.auto.numeric,
list(groups.2),
median) %>%
rename(cluster.num = Group.1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster.num | mpg | cylinders | displacement | horsepower | weight | acceleration |
---|---|---|---|---|---|---|
1 | 14.25 | 8 | 350.0 | 150 | 4137.5 | 13.00 |
2 | 26.00 | 4 | 120.5 | 86 | 2503.0 | 16.15 |
It’s fairly clear that Cluster 1 consists of large, fuel-inefficient cars (all American), whereas Cluster 2 contains all the other cars.
Next, let’s cut the tree into 3 clusters.
groups.3 <- cutree(c1.auto.cluster, 3)
# str(groups.3)
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.
list2.3clusters <- sapply(unique(groups.3),
function(x){df1.auto$name[groups.3 == x]})
df7.3cl.summary <- table(groups.3,
df1.auto$origin) %>%
as.data.frame.matrix() %>%
mutate(cluster = c("cluster1",
"cluster2",
"cluster3")) %>%
select(cluster,
everything())
# print:
df7.3cl.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster | American | European | Japanese |
---|---|---|---|
cluster1 | 104 | 0 | 0 |
cluster2 | 141 | 0 | 79 |
cluster3 | 0 | 68 | 0 |
# Let's examine numeric variables by cluster:
aggregate(df6.auto.numeric,
list(groups.3),
median) %>%
rename(cluster.num = Group.1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster.num | mpg | cylinders | displacement | horsepower | weight | acceleration |
---|---|---|---|---|---|---|
1 | 14.25 | 8 | 350 | 150.0 | 4137.5 | 13.0 |
2 | 25.50 | 4 | 135 | 88.0 | 2568.5 | 16.4 |
3 | 26.00 | 4 | 105 | 76.5 | 2240.0 | 15.6 |
Very interesting. Cluster 1 is the large American cars, Cluster 3 is the European cars, and Cluster 2 is the Japanese cars plus Japanese-like American cars.
Additionally, note that Clusters 2 and 3 are very similar, but Cluster 3 seems to have slightly smaller, more fuel-efficient cars.
Now let’s move on to 4 clusters.
groups.4 <- cutree(c1.auto.cluster, 4)
# str(groups.4)
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.
list3.4clusters <- sapply(unique(groups.4),
function(x){df1.auto$name[groups.4 == x]})
df8.4cl.summary <- table(groups.4,
df1.auto$origin) %>%
as.data.frame.matrix() %>%
mutate(cluster = c("cluster1",
"cluster2",
"cluster3",
"cluster4")) %>%
select(cluster,
everything())
# print:
df8.4cl.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster | American | European | Japanese |
---|---|---|---|
cluster1 | 104 | 0 | 0 |
cluster2 | 0 | 0 | 79 |
cluster3 | 141 | 0 | 0 |
cluster4 | 0 | 68 | 0 |
# Let's examine numeric variables by cluster:
aggregate(df6.auto.numeric,
list(groups.4),
median) %>%
rename(cluster.num = Group.1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster.num | mpg | cylinders | displacement | horsepower | weight | acceleration |
---|---|---|---|---|---|---|
1 | 14.25 | 8 | 350 | 150.0 | 4137.5 | 13.0 |
2 | 31.60 | 4 | 97 | 75.0 | 2155.0 | 16.4 |
3 | 22.40 | 6 | 171 | 90.0 | 2790.0 | 16.4 |
4 | 26.00 | 4 | 105 | 76.5 | 2240.0 | 15.6 |
At this point, we’ve managed to split apart the Japanese cars from the Japanese-like American cars.
Finally, just for fun, let’s look at the breakdown into 5 clusters.
groups.5 <- cutree(c1.auto.cluster, 5)
# str(groups.5)
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.
list4.5clusters <- sapply(unique(groups.5),
function(x){df1.auto$name[groups.5 == x]})
df9.5cl.summary <- table(groups.5,
df1.auto$origin) %>%
as.data.frame.matrix() %>%
mutate(cluster = c("cluster1",
"cluster2",
"cluster3",
"cluster4",
"cluster5")) %>%
select(cluster,
everything())
# print:
df9.5cl.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster | American | European | Japanese |
---|---|---|---|
cluster1 | 58 | 0 | 0 |
cluster2 | 46 | 0 | 0 |
cluster3 | 0 | 0 | 79 |
cluster4 | 141 | 0 | 0 |
cluster5 | 0 | 68 | 0 |
# Let's examine numeric variables by cluster:
aggregate(df6.auto.numeric,
list(groups.5),
median) %>%
rename(cluster.num = Group.1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
cluster.num | mpg | cylinders | displacement | horsepower | weight | acceleration |
---|---|---|---|---|---|---|
1 | 15.5 | 8 | 318.0 | 145.0 | 4057.5 | 13.6 |
2 | 13.5 | 8 | 371.5 | 175.0 | 4365.0 | 11.5 |
3 | 31.6 | 4 | 97.0 | 75.0 | 2155.0 | 16.4 |
4 | 22.4 | 6 | 171.0 | 90.0 | 2790.0 | 16.4 |
5 | 26.0 | 4 | 105.0 | 76.5 | 2240.0 | 15.6 |

Going from 4 to 5 clusters splits the large American cars of the earlier Cluster 1 into two groups, with the new Cluster 2 holding the very largest, least fuel-efficient vehicles.
Huang, Zhexue. Extensions to the k-Means algorithm for clustering large datasets with categorical values. Data Mining and Knowledge Discovery, 1998.↩
Gower, J.C. A general coefficient of similarity and some of its properties. Biometrics, 1971.↩
See http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning#plot.hclust-r-base-function↩