Metrics Used for Similarity and Dissimilarity Measurements

Ho Huu Binh

2022-06-11

Introduction

Similarity metrics and their counterpart, dissimilarity metrics, are essential and integral components of many algorithms emanating from different fields - unsupervised machine learning (clustering), hierarchical modeling of time series, or summarization techniques in natural language processing, just to name a few.

While at first glance distance metrics seem easy to handle, the moment you embark on extracting insights from your business data there are many decisions to consider. It is all the more arduous because you have to bear in mind the following:

  • Little research or theoretical guidance is available.
  • The result can change based on your choices and decisions.

As a result, this post is an attempt to give you a brief introduction and a basic understanding of concepts such as distance metrics and (dis)similarity metrics, and some perspective on why kernels are popular choices for similarity metrics.

Similarity Metrics

Some notable (dis)similarity metrics that you may already be accustomed to are the Euclidean distance and the Manhattan distance. Both are special cases of the Minkowski distance; thus, we briefly write down the general one instead: \[ d(X,Y) = \left(\sum_{i = 1}^{k} \left|x_i - y_i\right|^{p}\right)^{\frac{1}{p}} \] We can think of \(d(X,Y)\) as a number representing how “far/close” observation \(X\) is with respect to observation \(Y\) (you may also view it as a “proximity score” between the two). For example, \(X\) is an average observation (say, an average house with \(k\) characteristics) and \(Y\) is some other observation. If \(p=2\), we have the Euclidean distance; \(p=1\) gives the Manhattan distance. Based on either of these distance metrics, we could determine how far a house is from the average house. Brilliant! But what do these distance values mean? How should we interpret them? And which one should we consider first?
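To make the formula concrete, here is a minimal sketch with a throwaway helper and two made-up vectors (minkowski(), x and y are purely illustrative and not part of the dataset used later):

minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1/p)
x <- c(1, 2, 3)
y <- c(2, 0, 3)
minkowski(x, y, p = 2)  # Euclidean distance: sqrt(1 + 4 + 0) ~ 2.24
minkowski(x, y, p = 1)  # Manhattan distance: 1 + 2 + 0 = 3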

To build some intuition around these questions, think of a house at the centre of any major city. It is enough to step out to the suburbs for a serious price drop. For that kind of reality the Euclidean distance is more reasonable than the Manhattan distance; a larger exponent for a sharper decrease.

The whole, and only, difference between various (dis)similarity metrics lies in their behaviour: how quickly the “proximity score” declines as you move two observations apart, and how quickly it rises as you move them closer together.

In reality, coming up with an ideal metric is somewhat intricate. You want observations that cluster together to score very close to each other. But if an observation sits outside the cluster, you want it to score as far away as possible. In sharp contrast to classical machine learning textbooks, real data looks different in different regions, so finding an intelligent similarity metric is a far cry from a one-size-fits-all situation. We will therefore explore characteristics of similarity metrics that can surmount such hurdles.

Illustration of Simple Similarity Metrics

The Boston housing dataset will be used in this post. The first thing we need to do is scale all the features (crime rate, number of rooms, etc.) so that they are on the same scale. Our target is to quantify the degree of difference between each house and the average house. The quantification begins with a distance measure: we start with the Euclidean distance and the Manhattan distance.

library(MASS)     # for the Boston housing data
library(dplyr)
library(ggplot2)

# Glimpse at the Boston dataset
# Boston %>% head(10)

DIM <- NROW(Boston)

# Standardize every column so all features live on the same scale
df <- Boston %>% 
  scale

# After scaling, the "average house" is simply the zero vector
mean_house <- rep(0, NCOL(df))
dist_manhat <- dist_euclid <- numeric(DIM)

# Distance of each house from the average house
for (i in 1:DIM){
  dist_euclid[i] <- sqrt(sum((mean_house - df[i, ])^2))
  dist_manhat[i] <- sum(abs(mean_house - df[i, ]))
}

# Normalize, rescale for readability, and sort
temp1 <- (1000 * dist_euclid / sum(dist_euclid)) %>% sort
temp2 <- (1000 * dist_manhat / sum(dist_manhat)) %>% sort

temp <- tibble(euclid = temp1, manhat = temp2)

ggplot(temp, aes(y = euclid, x = manhat)) + 
  geom_point(colour = "red", size = 2.5) + 
  geom_abline(intercept = 0, slope = 1, size = 1.5) +
  labs(title = "", x = "Manhattan", y = "Euclidean")

I normalized the distances and multiplied them by 1000 for better readability. What is plotted are the sorted relative distances of each house from the average house. The two metrics coincide for all but a handful of houses, which is probably a result of our choice to uniformly standardize the columns of the data matrix. In practice you must also consider whether all your variables (columns) are “born equal”, so to speak.
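If you suspect the columns are not “born equal”, one simple option is to reweight them before computing the distance. The sketch below reuses df and mean_house from above; the weight vector w and the doubling of the crime-rate column are arbitrary choices for illustration, not a recommendation:

# Hypothetical column weights: every feature counts once, except the
# crime rate which (arbitrarily) counts twice as much
w <- rep(1, NCOL(df))
w[which(colnames(df) == "crim")] <- 2
dist_weighted <- apply(df, 1, function(house) sqrt(sum(w * (mean_house - house)^2)))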

Now, these are distances, and hence dissimilarities. If we computed all pairwise distances we would have a dissimilarity matrix. How do we get from dissimilarities to similarities? Is asking how similar two observations are the same question as asking how dissimilar they are? We know how far each house is from the average house, in terms of its (scaled) characteristics. It would be nice to get values between 0 and 1: ideally an index-like value for how similar two houses are, where identical houses get a value of 1 and completely different houses get a value close to zero. How could we achieve this? Well, we could apply a transformation as follows:

  1. Apply \(f(x) = 1-x\), with \(x = \frac{d(X,Y)}{\sum d(X,Y)}\) for every pair \((X,Y)\) (sketched below)
  2. Apply a different function which reverses the relation between distance and proximity score
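A minimal sketch of the first option, reusing the dist_euclid vector computed above:

# Option 1: normalize the distances and flip them, so identical houses
# would score 1 and very distant houses score closer to 0
sim_linear <- 1 - dist_euclid / sum(dist_euclid)
range(sim_linear)  # everything lands inside [0, 1]

With over 500 houses each normalized distance is tiny, so these scores bunch up very close to 1; that lack of spread is part of the motivation for the second option, which the kernel functions of the next section provide.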

Kernel Functions

Kernel functions are continuous, symmetric and positive everywhere. There are many such functions around, but here we talk about just one: the Gaussian kernel. Extensions to other kernels are straightforward once you get the concept. In moving from distances (dissimilarities) to similarities, you can think of the Gaussian kernel as a particular function which reverses the relation between the distance and the proximity score (the higher the distance, the lower the proximity score). The Gaussian kernel (with the same notation as above) takes the form \[ K(X,Y) = \exp\left(-\frac{\left\|X-Y\right\|^{2}}{2 \sigma^2}\right) = \exp\left(-\frac{D(X,Y)}{2 \sigma^2}\right) \] where \(D(X,Y) = d(X,Y)^2\) with \(p=2\), i.e. the squared Euclidean distance. The reasons the Gaussian kernel produces such an ideal result are listed below:

  • If the distance \(D\) is zero, the two observations are identical and the proximity score is exactly 1. Conversely, the larger \(D\) is, the closer the proximity score gets to zero.
  • As the distance \(D\) increases, the score quickly drops towards zero (a quick numerical check follows this list).
  • The Gaussian kernel provides some flexibility in the form of the hyperparameter \(\sigma\).
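As a quick numerical check of the first two points (gauss_kernel() is a throwaway helper, with D standing for the squared distance and \(\sigma = 1\) chosen arbitrarily):

gauss_kernel <- function(D, sigma = 1) exp(-D / (2 * sigma^2))
gauss_kernel(0)            # identical observations: the score is exactly 1
gauss_kernel(c(1, 5, 20))  # the score drops quickly towards 0 as D grows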

To illustrate the last point, we compute the Gaussian kernel with three different values of \(\sigma\):

library(tidyverse)
library(latex2exp)
library(patchwork)

# Three settings for the kernel bandwidth (entering as the multiplier
# on the squared Euclidean distance)
low_sig <- 0.2
med_sig <- 0.5
high_sig <- 1

# Proximity score of every house relative to the average house
dist_kern_low <- exp(-dist_euclid^2 * low_sig)
dist_kern_med <- exp(-dist_euclid^2 * med_sig)
dist_kern_high <- exp(-dist_euclid^2 * high_sig)

# Plot the three kernel scores, one line per bandwidth
X <- seq(0, 505, by = 1)
temp_kern <- tibble(X = X, Low = dist_kern_low, 
                    Med = dist_kern_med,
                    High = dist_kern_high) %>%
  pivot_longer(cols = Low:High, names_to = "sigma_type",
               values_to = "score")

p1 <- ggplot(temp_kern) + 
  geom_line(aes(x = X, y = score, colour = sigma_type)) +
  scale_color_hue(labels = 
    unname(TeX(c("$\\sigma^2 = 1$", "$\\sigma^2 = 0.2$", "$\\sigma^2 = 0.5$")))) + 
  labs(x = "", y = "")

# Now the relation with the Gaussian curve: the (right) tail of a
# normal density for three values of the variance
tail_gauss <- tibble(x = seq(0, 4, by = .1),
                     y_low = dnorm(x, mean = 0, sd = sqrt(1)),
                     y_med = dnorm(x, mean = 0, sd = sqrt(2)),
                     y_high = dnorm(x, mean = 0, sd = sqrt(5))) %>%
  pivot_longer(cols = y_low:y_high, names_to = "sigma_type",
               values_to = "density")

p2 <- ggplot(tail_gauss, aes(x = x, y = density, color = sigma_type)) +
  geom_line() + 
  scale_color_manual(values = c("#d11141", "#f37735", "#00b159")) + 
  labs(x = "", y = "")

# Stack the two plots on top of each other
layout <- "AAA
           BBB"
p1 + p2 + plot_layout(design = layout)

At the bottom, the (right) tail of the normal distribution with the corresponding \(\sigma\) values is shown. A high \(\sigma\) value means that we are lenient in what we consider far; put differently, the proximity score decays more slowly with respect to the distance. A low \(\sigma\) value means faster decay, so observations are quickly considered different with every small drift apart.

To conclude, there are many options to consider: the distance metric, the way it is reversed into a similarity score, and the rate at which that score changes with the distance. Tailoring your own measure to your application or business context is a flexible way to deal with such problems, rather than being too rigid about textbook recipes. Issues like this one merit more attention, I think, given how fast the downstream machine learning tasks that rely on them are developing.
