Motivation for Gower’s dissimilarity measure

Quite often, data analysis involves comparing different entities using measurements on several variables. For example, patients in a hospital can be compared based on their length of stay, their diagnoses, the number of medications they are taking, and so on.

I was recently looking into a problem that called for clustering patients into groups. K-means clustering - possibly the most widely known clustering algorithm - only works when all variables are numeric. However, we often want to cluster observations on both numeric and categorical variables. For relatively small datasets, this can be done with hierarchical clustering methods using Gower’s similarity coefficient. For larger datasets, the computational cost of hierarchical clustering is too high, and an alternative clustering method such as k-prototypes should be considered.1
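
For a sense of the scale problem: hierarchical clustering needs the full pairwise dissimilarity matrix, which has n(n-1)/2 entries for n observations. Here’s a minimal k-prototypes sketch using the clustMixType package (the package and the toy data frame are just for illustration, and aren’t used in the analysis below):

library(clustMixType)

set.seed(1)

# toy data frame with one numeric and one categorical variable: 
df.toy <- data.frame(x = rnorm(100), 
                     y = factor(sample(c("a", "b"), 100, replace = TRUE)))

# k-prototypes: k-means-style means for the numeric columns, 
# modes for the factor columns: 
kp <- kproto(df.toy, k = 2)

# kp$cluster holds the cluster assignments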

For details on the mechanics of Gower’s measure, please see Gower, 1971.2 Here’s the gist of it:

To compare 2 people, A and B, on a variable X1, you first have to check whether a comparison is possible - i.e. whether measurements of X1 exist for both person A and person B. If a comparison is possible, you assign a score for how similar the 2 people are on that variable. For example, if you want to compare Amy and Bob in terms of the variable “Nationality”, you could assign a value of 1 if they are both American, and 0 otherwise. Call this X1_similarity.

The overall similarity score between Amy and Bob is then the sum of the individual variable similarities (X1_similarity + X2_similarity + …) divided by the number of possible comparisons (the number of variables for which data exist for both Amy and Bob).
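
Here’s a minimal sketch of that arithmetic in R (Amy, Bob, the variables and the age range are all made up for illustration):

# toy records; num.medications is missing for Amy: 
amy <- list(nationality = "American", age = 30, num.medications = NA)
bob <- list(nationality = "American", age = 40, num.medications = 2)

age.range <- 50   # assumed range of age across the whole dataset

# per-variable similarities (numeric variables are rescaled by the range): 
nationality.similarity <- as.numeric(amy$nationality == bob$nationality)  # 1
age.similarity <- 1 - abs(amy$age - bob$age) / age.range                  # 0.8

# num.medications can't be compared, so it's dropped from both the 
# numerator and the denominator: 
(nationality.similarity + age.similarity) / 2
## [1] 0.9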

     

Numeric-only data: Comparing Gower’s measure with Euclidean distances

Let’s examine the Auto dataset from the package ISLR. Here’s a sample of rows to show what it looks like:

# packages used throughout this section: 
library(dplyr)       # mutate(), case_when(), pipes
library(cluster)     # daisy() for Gower's measure
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

df1.auto <- ISLR::Auto %>% 
    
    # recode origin as a factor: 
    mutate(origin = case_when(
        origin == 1 ~ "American",
        origin == 2 ~ "European",
        origin == 3 ~ "Japanese"
    ) %>% as.factor())
# str(df1.auto)
# summary(df1.auto)
df1.auto %>% 
      sample_n(15) %>% 
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
mpg cylinders displacement horsepower weight acceleration year origin name
248 19.9 8 260 110 3365 15.5 78 American oldsmobile cutlass salon brougham
138 14.0 8 302 140 4638 16.0 74 American ford gran torino (sw)
195 24.5 4 98 60 2164 22.1 76 American chevrolet woody
208 19.0 4 120 88 3270 21.9 76 European peugeot 504
254 25.1 4 140 88 2720 15.4 78 American ford fairmont (man)
343 32.3 4 97 67 2065 17.8 81 Japanese subaru
206 20.0 4 130 102 3150 15.7 76 European volvo 245
351 33.7 4 107 75 2210 14.4 81 Japanese honda prelude
87 13.0 8 350 145 3988 13.0 73 American chevrolet malibu
60 20.0 4 140 90 2408 19.5 72 American chevrolet vega
199 18.0 6 250 78 3574 21.0 76 American ford granada ghia
238 30.0 4 97 67 1985 16.4 77 Japanese subaru dl
219 33.5 4 85 70 1945 16.8 77 Japanese datsun f-10 hatchback
221 17.0 8 260 110 4060 19.0 77 American oldsmobile cutlass supreme
217 36.0 4 79 58 1825 18.6 77 European renault 5 gtl

     

For simplicity, let’s focus on a small subset of the data:

set.seed(3)

df2.auto.subset <- df1.auto %>% 
    select(name,
           mpg, 
           cylinders, 
           displacement) %>%  
    sample_n(3)

# print: 
df2.auto.subset %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
name mpg cylinders displacement
66 amc ambassador sst 17.0 8 304
316 audi 4000 34.3 4 97
151 plymouth valiant custom 19.0 6 225

     

Now we’ll compare Gower’s dissimilarity measure with Euclidean distances:

     

Euclidean distances:

# create matrix object for convenience: 
m1.auto.numeric <- select(df2.auto.subset, 
                              -name) %>% 
    as.matrix()

rownames(m1.auto.numeric) <- df2.auto.subset %>% pull(name)

m1.euclid.dis <- dist(m1.auto.numeric,
                       method = "euclidean")

m1.euclid.dis
##                         amc ambassador sst audi 4000
## audi 4000                        207.76017          
## plymouth valiant custom           79.05062 128.92668
# str(m1.euclid.dis)

So, in terms of similarity:

  • the plymouth valiant and amc ambassador are most similar
  • the audi is more similar to the plymouth than to the amc
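
One caveat: since the variables aren’t standardized, these Euclidean distances are driven almost entirely by displacement, the variable with the largest scale. A quick check for the amc vs. audi pair:

# the Euclidean distance is essentially just the displacement gap: 
sqrt((17.0 - 34.3)^2 + (8 - 4)^2 + (304 - 97)^2)
## [1] 207.7602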

Let’s compare this with the results from Gower’s dissimilarity:

     

Gower’s dissimilarity:

m1.gower.dis <- daisy(m1.auto.numeric, 
                      metric = "gower")

m1.gower.dis
## Dissimilarities :
##                         amc ambassador sst audi 4000
## audi 4000                        1.0000000          
## plymouth valiant custom          0.3324165 0.6675835
## 
## Metric :  mixed ;  Types = I, I, I 
## Number of objects : 3

Same conclusion:

  • the plymouth valiant and amc ambassador are most similar
  • the audi is more similar to the plymouth than to the amc
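
It’s worth pausing on why the amc vs. audi dissimilarity is exactly 1: Gower’s measure rescales each numeric variable by its range, and in this 3-car subset the amc and the audi sit at opposite extremes of all three variables. Every rescaled difference is therefore 1, and so is their average:

# verify the amc vs. audi value by hand: 
ranges <- apply(m1.auto.numeric, 2, function(x) diff(range(x)))

# rescaled differences between row 1 (amc) and row 2 (audi): 
abs(m1.auto.numeric[1, ] - m1.auto.numeric[2, ]) / ranges
##          mpg    cylinders displacement 
##            1            1            1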

     

Mixed numeric/categorical data

First, we’ll add a categorical column: the origin of the cars.

set.seed(3)
df3.auto.subset.cat <- df1.auto %>% 
    select(name,
           mpg, 
           cylinders, 
           displacement, 
           origin) %>%  
    sample_n(3)

# print: 
df3.auto.subset.cat %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
name mpg cylinders displacement origin
66 amc ambassador sst 17.0 8 304 American
316 audi 4000 34.3 4 97 European
151 plymouth valiant custom 19.0 6 225 American

     

Now we’ll use Gower distances on mixed data:

# for convenience, add rownames: 
rownames(df3.auto.subset.cat) <- df3.auto.subset.cat %>%
    pull(name) %>% 
    substr(1,10)



# note that daisy() uses every column passed to it - including name, 
# which it treats as just another nominal variable: 
m2.gower.dis <- daisy(df3.auto.subset.cat, 
                      metric = "gower")

m2.gower.dis
## Dissimilarities :
##            amc ambass audi 4000
## audi 4000   1.0000000          
## plymouth v  0.3994499 0.8005501
## 
## Metric :  mixed ;  Types = N, I, I, I, N 
## Number of objects : 3

As expected, this hasn’t changed the relative similarities between the cars: the plymouth and the amc are still the most similar pair. Note from the Types = N, I, I, I, N line that daisy() treated name as well as origin as a nominal variable, which is why the absolute dissimilarities are a little larger than before.
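
We can check the plymouth vs. amc value by hand: each numeric variable contributes its absolute difference divided by its range in this subset, and each nominal variable (name and origin) contributes 1 for a mismatch and 0 for a match:

# manual check of the plymouth vs. amc dissimilarity: 
name.diff   <- 1                           # names differ
mpg.diff    <- abs(19.0 - 17.0) / (34.3 - 17.0)
cyl.diff    <- abs(6 - 8) / (8 - 4)
disp.diff   <- abs(225 - 304) / (304 - 97)
origin.diff <- 0                           # both American

(name.diff + mpg.diff + cyl.diff + disp.diff + origin.diff) / 5
## [1] 0.3994499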

     

Changing the data to examine the effect on Gower’s measure

Let’s mess with the data a bit. We’ll add five categorical fields that make the audi and the amc similar. Then we’ll calculate Gower’s dissimilarity again.

set.seed(3)
df4.auto.new <- df1.auto %>% 
    select(name,
           mpg, 
           cylinders, 
           displacement, 
           origin) %>%  
    sample_n(3) %>% 
    mutate(category.var.1 = as.factor(c("yes", 
                                        "yes", 
                                        "no")), 
           category.var.2 = as.factor(c("yes", 
                                        "yes", 
                                        "no")), 
           category.var.3 = as.factor(c("yes", 
                                        "yes", 
                                        "no")), 
           category.var.4 = as.factor(c("yes", 
                                        "yes", 
                                        "no")), 
           category.var.5 = as.factor(c("yes", 
                                        "yes", 
                                        "no")))
# add rownames for convenience: 
rownames(df4.auto.new) <- df4.auto.new %>%
    pull(name) %>% 
    substr(1,10)


# calculate Gower's dissimilarities: 
m3.gower.newdata <- daisy(df4.auto.new, 
                          metric = "gower")

m3.gower.newdata
## Dissimilarities :
##            amc ambass audi 4000
## audi 4000   0.5000000          
## plymouth v  0.6997249 0.9002751
## 
## Metric :  mixed ;  Types = N, I, I, I, N, N, N, N, N, N 
## Number of objects : 3

     

As expected, we have managed to make the audi and the amc the most similar pair: they now agree on the five dummy variables and disagree on the other five columns (name, the three numeric variables, and origin), giving a dissimilarity of 5/10 = 0.5. Gower’s measure is behaving just as we would intuitively expect.
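
Incidentally, if you ever need to keep a handful of extra fields from dominating like this, daisy() accepts a weights argument, one weight per column. A sketch - the 0.2 here is an arbitrary choice, not a recommendation:

# down-weight the five dummy variables; the weights line up with the 
# columns: name, mpg, cylinders, displacement, origin, then the dummies: 
m3b.gower.weighted <- daisy(df4.auto.new, 
                            metric = "gower", 
                            weights = c(rep(1, 5), rep(0.2, 5)))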

     

Clustering on mixed data

Visualizing clusters

There are too many rows to visualize all car names in a single tree, but it’s still useful to see the vertical distances between clusters.3

# assign rownames for convenience: 
rownames.start <- df1.auto %>%
    pull(name) %>% 
    substr(1,6)

rownames <- paste(rownames.start, 
                  1:nrow(df1.auto), 
                  sep = "-")

rownames(df1.auto) <- rownames

# find the distance matrix (as before, every column is used, 
# including name as a nominal variable): 
m4.auto.gower.dist <- daisy(df1.auto, 
                            metric = "gower")


# now use hclust (the default linkage method is "complete"): 
c1.auto.cluster <- hclust(m4.auto.gower.dist)

# plot dendrogram: 
# Far too many rows to visualize in one tree
# plot(c1.auto.cluster, 
#      hang = -1,
#      cex = 0.6,
#      cex.lab = 0.1)

# That's not easy to interpret. Let's try an alternative plotting method: 

d1.auto.dendrogram <- as.dendrogram(c1.auto.cluster)

# Far too many rows to visualize in one tree
plot(d1.auto.dendrogram,
     horiz = FALSE, 
     leaflab = "none")

     

Examining split into 2 clusters

Now let’s take a look at the list of cars at different levels of the dendrogram.4

Given the vertical distances in the dendrogram, it looks like 2 to 4 clusters is probably most meaningful. Let’s start by looking at 2 clusters.

groups.2 <- cutree(c1.auto.cluster, 2)

# str(groups.2) 
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to. 

list1.2clusters <- sapply(unique(groups.2), 
                          function(x){df1.auto$name[groups.2 == x]})

# list1.2clusters[[1]]
# list1.2clusters[[2]]

df5.2cl.summary <- table(groups.2, 
                         df1.auto$origin) %>% 
      as.data.frame.matrix() %>% 
      mutate(cluster = c("cluster1", 
                         "cluster2")) %>% 
      select(cluster, 
             everything())

# print: 
df5.2cl.summary %>% 
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cluster American European Japanese
cluster1 104 0 0
cluster2 141 68 79

So all the cars in Cluster 1 are American, while those in Cluster 2 are a mix of American, European and Japanese. How do these clusters differ in terms of the other variables?

df6.auto.numeric <- df1.auto %>% 
      select(mpg:acceleration)

aggregate(df6.auto.numeric, 
          list(groups.2), 
          median) %>% 
      
      rename(cluster.num = Group.1) %>% 
      
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cluster.num mpg cylinders displacement horsepower weight acceleration
1 14.25 8 350.0 150 4137.5 13.00
2 26.00 4 120.5 86 2503.0 16.15

It’s fairly clear that Cluster 1 contains the large, fuel-inefficient cars (all of them American), whereas Cluster 2 contains everything else.

     

Examining split into 3 clusters

groups.3 <- cutree(c1.auto.cluster, 3)

# str(groups.3) 
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.

list2.3clusters <- sapply(unique(groups.3), 
                          function(x){df1.auto$name[groups.3 == x]})


df7.3cl.summary <- table(groups.3, 
                         df1.auto$origin) %>% 
      as.data.frame.matrix() %>% 
      mutate(cluster = c("cluster1", 
                         "cluster2", 
                         "cluster3")) %>% 
      select(cluster, 
             everything())

# print: 
df7.3cl.summary %>% 
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cluster American European Japanese
cluster1 104 0 0
cluster2 141 0 79
cluster3 0 68 0
# Let's examine numeric variables by cluster: 
aggregate(df6.auto.numeric, 
          list(groups.3), 
          median) %>% 
      
      rename(cluster.num = Group.1) %>% 
      
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cluster.num mpg cylinders displacement horsepower weight acceleration
1 14.25 8 350 150.0 4137.5 13.0
2 25.50 4 135 88.0 2568.5 16.4
3 26.00 4 105 76.5 2240.0 15.6

Very interesting. Cluster 1 is the large American cars, Cluster 3 is the European cars, and Cluster 2 is the Japanese cars plus the Japanese-like American cars.

Additionally, note that Clusters 2 and 3 are very similar, but Cluster 3 seems to have slightly smaller, more fuel-efficient cars.

     

Examining split into 4 clusters

groups.4 <- cutree(c1.auto.cluster, 4)

# str(groups.4) 
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.

list3.4clusters <- sapply(unique(groups.4), 
                          function(x){df1.auto$name[groups.4 == x]})


df8.4cl.summary <- table(groups.4, 
                         df1.auto$origin) %>% 
      as.data.frame.matrix() %>% 
      mutate(cluster = c("cluster1", 
                         "cluster2", 
                         "cluster3", 
                         "cluster4")) %>% 
      select(cluster, 
             everything())

# print: 
df8.4cl.summary %>% 
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cluster American European Japanese
cluster1 104 0 0
cluster2 0 0 79
cluster3 141 0 0
cluster4 0 68 0
# Let's examine numeric variables by cluster: 
aggregate(df6.auto.numeric, 
          list(groups.4), 
          median) %>% 
      
      rename(cluster.num = Group.1) %>% 
      
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cluster.num mpg cylinders displacement horsepower weight acceleration
1 14.25 8 350 150.0 4137.5 13.0
2 31.60 4 97 75.0 2155.0 16.4
3 22.40 6 171 90.0 2790.0 16.4
4 26.00 4 105 76.5 2240.0 15.6

At this point, we’ve managed to split apart the Japanese cars from the Japanese-like American cars.

Finally, just for fun, let’s look at the breakdown into 5 clusters.

     

Examining split into 5 clusters

groups.5 <- cutree(c1.auto.cluster, 5)

# str(groups.5) 
# this is just a numeric vector: each entry is the cluster number that the corresponding element should be assigned to.

list4.5clusters <- sapply(unique(groups.5), 
                          function(x){df1.auto$name[groups.5 == x]})


df9.5cl.summary <- table(groups.5, 
                         df1.auto$origin) %>% 
      as.data.frame.matrix() %>% 
      mutate(cluster = c("cluster1", 
                         "cluster2", 
                         "cluster3", 
                         "cluster4", 
                         "cluster5")) %>% 
      select(cluster, 
             everything())

# print: 
df9.5cl.summary %>% 
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cluster American European Japanese
cluster1 58 0 0
cluster2 46 0 0
cluster3 0 0 79
cluster4 141 0 0
cluster5 0 68 0
# Let's examine numeric variables by cluster: 
aggregate(df6.auto.numeric, 
          list(groups.5), 
          median) %>% 
      
      rename(cluster.num = Group.1) %>% 
      
      kable() %>% 
      kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cluster.num mpg cylinders displacement horsepower weight acceleration
1 15.5 8 318.0 145.0 4057.5 13.6
2 13.5 8 371.5 175.0 4365.0 11.5
3 31.6 4 97.0 75.0 2155.0 16.4
4 22.4 6 171.0 90.0 2790.0 16.4
5 26.0 4 105.0 76.5 2240.0 15.6

This final split divides the large American cars of the earlier Cluster 1 into two groups: Cluster 2 now holds the very heaviest, most powerful 8-cylinder cars, while Cluster 1 holds the somewhat smaller ones.

     

References


  1. Huang, Zhexue. Extensions to the k-Means algorithm for clustering large datasets with categorical values. Data Mining and Knowledge Discovery, 1998.

  2. Gower, J.C. A general coefficient of similarity and some of its properties. Biometrics, 1971.

  3. See http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning#plot.hclust-r-base-function

  4. See https://www.stat.berkeley.edu/~spector/s133/Clus.html