Introduction

Goal of the paper

The idea for this paper came to me when I first learned about the different distance metrics. During that class it was mentioned that there isn’t a well-known use case for the manhattan distance metric. That got me curious, so I searched the internet for papers and Stack Overflow posts that could tell me more about this topic. One of the few ideas I was able to find was that it may lead to increased performance, because it uses the absolute value instead of exponentiation. In this paper, I would like to test this hypothesis by comparing the manhattan distance to the euclidean distance and a few other metrics, to see if it makes a difference in both synthetic benchmarks and real-world applications.

Theoretical introduction

Euclidean distance

The euclidean distance is the “straight line” distance, calculated as the square root of the sum of squared differences along each axis. It is by far the most popular metric, and many libraries use it as the default option.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

The equation for the euclidean distance. Source: Wikipedia

Manhattan distance

The manhattan distance is calculated as the sum of the absolute differences along each axis. Because it uses absolute values instead of squaring and taking a square root, it is said to be more robust to outliers. Additionally, because the absolute value is much cheaper to compute than exponentiation, it may give a performance increase in certain applications.
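For two points p and q in n dimensions, this distance can be written as:

$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$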

Red, blue and yellow lines represent the same distance in manhattan geometry, and the green line represents the euclidean distance. Source: Wikipedia

Canberra distance

The canberra distance is a weighted version of the manhattan distance. It is much more robust to outliers than other metrics, but it is very sensitive to values close to 0. This property makes it a good tool for detecting some kinds of outliers, which is why it is used mostly in spam-detection software. Since it is more complex than the manhattan distance, it may be slower in some applications.

$$d(p, q) = \sum_{i=1}^{n} \frac{|p_i - q_i|}{|p_i| + |q_i|}$$

The equation for the canberra distance. Source: Wikipedia

Data set overview

Libraries

library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(NbClust)
library(microbenchmark)
library(multcomp)
library(knitr)

options(scipen=999) # avoid scientific notation (e.g. 1e+10)

Importing the data set

This data was cleaned and prepared as part of another project. It is a scrape of Google Play Store apps and can be found on Kaggle.

# Importing the main data table; rows with missing values (NA) are omitted (about 10% of all data)
mainTable <- na.omit(read.csv("data/clean_googleplaystore.csv",header=TRUE))

Data overview

After loading the data set, we can look at its shape.

dim(mainTable)
[1] 7671   11

The data consists of 7 671 observations and 11 variables. Overall, this gives us 84 381 data points.

Next, we can see what variables this data set contains.

str(mainTable)
'data.frame':   7671 obs. of  11 variables:
 $ App            : Factor w/ 8196 levels "- Free Comics - Comic Apps",..: 6019 2085 7602 6805 6078 5929 6857 4564 3988 4773 ...
 $ Rating         : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
 $ Reviews.Count  : int  159 967 87510 215644 967 167 178 36815 13791 121 ...
 $ Size           : num  19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
 $ Installs       : Factor w/ 19 levels "1,000,000,000+",..: 6 18 11 14 9 15 15 2 2 6 ...
 $ Type           : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
 $ Price          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Content.Rating : Factor w/ 5 levels "Adults only 18+",..: 2 2 2 5 2 2 2 2 2 2 ...
 $ Genres         : Factor w/ 48 levels "Action","Adventure",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Last.Updated   : Factor w/ 1300 levels "April 1 2016",..: 528 451 107 779 713 853 69 684 1243 631 ...
 $ Android.Version: num  4 4 4 4.2 4.4 2.3 4 4.2 3 4 ...
 - attr(*, "na.action")= 'omit' Named int  37 42 52 67 68 73 85 88 89 92 ...
  ..- attr(*, "names")= chr  "37" "42" "52" "67" ...

There are 11 columns; 5 of them are numeric and the rest are factors. We’ll focus on the numeric variables, since they are the easiest to cluster.

We can take a look at a summary of these numeric variables.

a <- summary(mainTable[,c(2,3,4,7,11)])
kable(a)
Rating          Reviews.Count        Size              Price             Android.Version
Min.   :1.000   Min.   :       1     Min.   :  1.00    Min.   :  0.000   Min.   :1.000
1st Qu.:4.000   1st Qu.:     106     1st Qu.:  6.10    1st Qu.:  0.000   1st Qu.:4.000
Median :4.300   Median :    2250     Median : 16.00    Median :  0.000   Median :4.100
Mean   :4.173   Mean   :  294338     Mean   : 37.41    Mean   :  1.134   Mean   :3.854
3rd Qu.:4.500   3rd Qu.:   38252     3rd Qu.: 37.00    3rd Qu.:  0.000   3rd Qu.:4.200
Max.   :5.000   Max.   :44893888     Max.   :994.00    Max.   :400.000   Max.   :8.000

The reviews count, app size and price variables have some serious outliers, but this shouldn’t affect the results of the performance analysis in a significant way.
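As a rough check, we could count how many observations lie above the usual 1.5 × IQR upper whisker for each of these variables. This is only a quick sketch; the exact counts depend on the data.

# Count observations above the 1.5 * IQR upper whisker for the most skewed variables
sapply(mainTable[, c("Reviews.Count", "Size", "Price")], function(v) {
  q <- quantile(v, c(0.25, 0.75))
  sum(v > q[2] + 1.5 * diff(q))
})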

Empirical study

Data preparation

First, we want to extract the two variables that we’ll be working with: the average rating and the number of reviews.

clusteringData <- mainTable[,c(2,3)]
dim(clusteringData)
[1] 7671    2

Then, we run NbClust to find an optimal number of clusters. Since our main goal is to evaluate performance, the number of clusters will stay the same across the different clustering methods and distance metrics.

c3<-NbClust(clusteringData , distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
c3$Best.nc
Number_clusters     Value_Index 
            6.0         40838.9 

It turns out that for this data, the optimal number of clusters is 6.

First performance comparison - distance metrics

Sample distance matrix

Since we want to compare the performance of different metrics, the most significant differences should occur when calculating the distance matrices. This will allow us to ignore other factors, such as the implementation of the different metrics in the clustering algorithms (we’ll get back to that later).

head(dist(clusteringData, method = "euclidean"))
[1]    808.000025  87351.000002 215485.000000    808.000025      8.005623
[6]     19.002368

Performance evaluation

To evaluate performance, I’ll be using the microbenchmark function from the microbenchmark package to calculate the distance matrices, 300 times each. The result is a table containing the times for each run. Microbenchmark is a great tool for this application, because it is optimized to exclude any overhead, such as the time spent checking the system clock. It is said to be able to record differences on the scale of nanoseconds.

timeDistanceMatrix <- microbenchmark(
  dist(clusteringData, method = "euclidean"),
  dist(clusteringData, method = "manhattan"),
  dist(clusteringData, method = "canberra"),
  dist(clusteringData, method = "minkowski"),
  times=300)

Results

Once we run the tests, we can display the results as a table and as a graph.

timeDistanceMatrix$expr <- revalue(timeDistanceMatrix$expr, c('dist(clusteringData, method = "euclidean")' = "Euclidean",
                                                              'dist(clusteringData, method = "manhattan")' = "Manhattan",
                                                              'dist(clusteringData, method = "canberra")' = 'Canberra',
                                                              'dist(clusteringData, method = "minkowski")' = 'Minkowski'))
timeDistanceMatrixResults <- print(timeDistanceMatrix, unit = "s", order = 'median', signif = 3)
a <- timeDistanceMatrixResults
kable(a)
expr min lq mean median uq max neval cld
2 Manhattan 0.390 0.401 0.411 0.405 0.411 0.638 300 a
1 Euclidean 0.425 0.440 0.451 0.444 0.451 0.722 300 b
3 Canberra 0.499 0.514 0.524 0.519 0.525 0.712 300 c
4 Minkowski 1.680 1.730 1.760 1.750 1.770 2.710 300 d

The results are promising - the manhattan distance is noticeably faster than all the other distance metrics. Comparing medians, the euclidean distance is 9.63% slower and the canberra distance is 28.15% slower. By far the slowest is the Minkowski distance - it is 332.10% slower than the manhattan distance. The results of this “synthetic” benchmark show that, in a best-case scenario, we can expect clustering algorithms to run about 10% faster when using the manhattan distance instead of the euclidean one.
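The relative slowdowns quoted above can be computed directly from the median column of the summary table. This is a small sketch, assuming timeDistanceMatrixResults has the columns shown above.

# Median run time of each metric relative to the fastest one (manhattan), in percent
medians <- setNames(timeDistanceMatrixResults$median, timeDistanceMatrixResults$expr)
round(100 * (medians / medians["Manhattan"] - 1), 2)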

ggplot2::autoplot(timeDistanceMatrix)

We can see the differences visually. There are strange “tails” in each test - these are outliers that will be analysed later. I compared medians in the interpretation above to filter out the effect of this strange behaviour.

T-test

x <- timeDistanceMatrix %>% filter(timeDistanceMatrix$expr == 'Euclidean') %>% transmute(time = time/1000000000)
y <- timeDistanceMatrix %>% filter(timeDistanceMatrix$expr == 'Manhattan') %>% transmute(time = time/1000000000)
t.test(x$time,y$time)

    Welch Two Sample t-test

data:  x$time and y$time
t = 16.333, df = 589.49, p-value < 0.00000000000000022
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.03537677 0.04504727
sample estimates:
mean of x mean of y 
0.4514761 0.4112641 

According to the T-test, we can reject the null hypothesis, which states that the means of these two tests are equal. The resulting difference is statistically significant, with a very low p-value.
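To put this difference in relative terms, the confidence interval of the mean difference can be divided by the mean manhattan time. This is a short sketch reusing the x and y objects defined above.

# 95% CI of the euclidean - manhattan mean difference, as a percentage of the manhattan mean
tt <- t.test(x$time, y$time)
round(100 * tt$conf.int / mean(y$time), 2)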

Investigating the outliers - how are they distributed?

From the graph above, we can see that there are some outliers in each method. Maybe these are the first runs, after which the dist() function does some caching or other optimization? To test this, we can look at the run times in order.

timeDistanceMatrix %>% 
  filter(expr == 'Euclidean') %>% 
  droplevels.data.frame() %>%
  ggplot(aes(x = 1:300, y = time)) + geom_point() + xlab('Run number') + ylab('Run time')

It doesn’t seem like that’s the case - the outliers are scattered quite evenly across the runs. I wasn’t able to find any information on this - it might be caused by a bug in the dist() function, or by a background app taking processor time for a couple of milliseconds (e.g. update software or cloud-syncing apps).

Second performance comparison - CClust

Sample clustering in CClust

After the synthetic benchmark on the distance matrices, we can move on to the first real-world example. The cclust function from the flexclust package performs k-means clustering. As of today, it supports two distance metrics - euclidean and manhattan. To show what it looks like, we can run a sample clustering on our data. Since this package does not support ggplot2, the resulting graph leaves much to be desired.

sampleCClust <- cclust(clusteringData, k = 6, dist = "euclidean", simple = FALSE, save.data=TRUE)
plot(sampleCClust)

Performance evaluation

Now we can evaluate the performance. Both tests were run 200 times, with the number of clusters set to 6 (as suggested by NbClust). The results are then saved for further analysis.

timeCClust <- microbenchmark(
  cclust(clusteringData, k = 6, dist = "euclidean", simple = FALSE, save.data=TRUE),
  cclust(clusteringData, k = 6, dist = "manhattan", simple = FALSE, save.data=TRUE),
  times=200)

Results

timeCClust$expr <- revalue(timeCClust$expr, c('cclust(clusteringData, k = 6, dist = \"euclidean\", simple = FALSE,      save.data = TRUE)' = "Euclidean",
                                          'cclust(clusteringData, k = 6, dist = \"manhattan\", simple = FALSE,      save.data = TRUE)' = "Manhattan"))
timeCClustResults <- print(timeCClust, unit = "s", order = 'median', signif = 3)
kable(timeCClustResults)
expr min lq mean median uq max neval cld
Euclidean 0.917 0.945 0.964 0.959 0.975 1.18 200 a
Manhattan 0.989 1.010 1.040 1.030 1.040 1.39 200 b

The results are somewhat surprising. Contrary to the experiment on the distance matrices, the manhattan distance turns out to be slower in the cclust algorithm, by 7.40%. This may be caused by a difference in implementation - with the euclidean distance, the cluster means are used as centroids, while with the manhattan distance the column-wise cluster medians are used instead.
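One way to make this explanation plausible is to compare the cost of computing column means and column medians on data of a similar size. This is only an illustrative sketch and does not reproduce the actual cclust internals.

# Illustrative only: column medians tend to be more expensive to compute than column means
testMatrix <- matrix(rnorm(7671 * 2), ncol = 2)
microbenchmark(colMeans(testMatrix), apply(testMatrix, 2, median), times = 100)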

ggplot2::autoplot(timeCClust)

Strange outliers can be observed, similar to those in the distance matrices. After a short analysis of the cclust source code, I found that it indeed uses the dist() function, which may have caused the outliers.

T-test

x <- timeCClust %>% filter(timeCClust$expr == 'Euclidean') %>% transmute(time = time/1000000000)
y <- timeCClust %>% filter(timeCClust$expr == 'Manhattan') %>% transmute(time = time/1000000000)
t.test(x$time,y$time)

    Welch Two Sample t-test

data:  x$time and y$time
t = -16.765, df = 327.07, p-value < 0.00000000000000022
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.08579500 -0.06777472
sample estimates:
mean of x mean of y 
 0.963783  1.040568 

According to the T-test, we can say that the resulting difference is statistically significant, with a low p-value.

Third performance comparison - Clara

Sample clustering in Clara

Below, we can see a sample clustering done with clara. The number of samples and the sample size were increased from the defaults, as suggested in the documentation.
The silhouette plot indicates that the resulting clustering is quite good, with an average silhouette width of 0.82 (this value can also be read directly from the fitted object, as shown after the silhouette table below).

sampleClara<-clara(clusteringData, 6, metric="euclidean", stand=FALSE, samples=50,
                   sampsize=200, trace=0, medoids.x=TRUE,
                   rngR=TRUE, pamLike=FALSE, correct.d=TRUE) #cluster::
fviz_cluster(sampleClara, ellipse.type = "t", geom = "point", pointsize = 1 )

fviz_silhouette(sampleClara)
  cluster size ave.sil.width
1       1  166          0.91
2       2   23          0.35
3       3    5          0.39
4       4    4          0.64
5       5    1          0.00
6       6    1          0.00
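The overall average silhouette width mentioned above can also be read directly from the fitted object, assuming the clara object stores its silhouette information in silinfo, as in the cluster package.

# Average silhouette width of the best sample
sampleClara$silinfo$avg.width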

Performance evaluation

Similarly to cclust, clara supports two distance metrics - euclidean and manhattan, the former being the default. Since clara is noticeably faster than cclust, this test was run 600 times for each metric. Overall, the benchmark took around 10 minutes to complete.

timeClara <- microbenchmark(
  clara(clusteringData, 6, metric="euclidean", stand=FALSE, samples=50,
        sampsize=200, trace=0, medoids.x=TRUE,
        rngR=TRUE, pamLike=FALSE, correct.d=TRUE),
  clara(clusteringData, 6, metric="manhattan", stand=FALSE, samples=50,
        sampsize=200, trace=0, medoids.x=TRUE,
        rngR=TRUE, pamLike=FALSE, correct.d=TRUE),
  times=600)

Results

timeClara$expr <- revalue(timeClara$expr, c('clara(clusteringData, 6, metric = "euclidean", stand = FALSE,      samples = 50, sampsize = 200, trace = 0, medoids.x = TRUE,      rngR = TRUE, pamLike = FALSE, correct.d = TRUE)' = "Euclidean",
                                          'clara(clusteringData, 6, metric = "manhattan", stand = FALSE,      samples = 50, sampsize = 200, trace = 0, medoids.x = TRUE,      rngR = TRUE, pamLike = FALSE, correct.d = TRUE)' = "Manhattan"))
timeClaraResults <- print(timeClara, unit = "s", order = 'median', signif = 3)
kable(timeClaraResults)
expr min lq mean median uq max neval cld
2 Manhattan 0.243 0.264 0.274 0.273 0.283 0.322 600 a
1 Euclidean 0.251 0.274 0.284 0.282 0.294 0.342 600 b

This time, the results are more in line with those from the synthetic benchmark. When using the euclidean distance instead of the manhattan distance in the clara algorithm, we can expect it to be around 3.30% slower.
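As before, this relative difference can be computed from the medians in the summary table. This short sketch assumes timeClaraResults has the same columns as the earlier summaries.

# Median euclidean run time relative to manhattan for the clara benchmark, in percent
mediansClara <- setNames(timeClaraResults$median, timeClaraResults$expr)
round(100 * (mediansClara["Euclidean"] / mediansClara["Manhattan"] - 1), 2)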

ggplot2::autoplot(timeClara)

This time, the outliers are not present. This would suggest that the outliers are an issue with the dist() function, and not with the microbenchmarking process or some other outside influence.

T-test

x <- timeClara %>% filter(timeClara$expr == 'Euclidean') %>% transmute(time = time/1000000000)
y <- timeClara %>% filter(timeClara$expr == 'Manhattan') %>% transmute(time = time/1000000000)
t.test(x$time,y$time)

    Welch Two Sample t-test

data:  x$time and y$time
t = 12.779, df = 1187.4, p-value < 0.00000000000000022
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.009035688 0.012313547
sample estimates:
mean of x mean of y 
0.2844278 0.2737531 

As in the previous experiments, the resulting difference is statistically significant.

Conclusions

The conclusions of these experiments are not straightforward. The synthetic benchmark - the distance matrices - showed that using the manhattan distance metric can lead to a performance increase of around 10%. This positive effect was also present when using the clara algorithm, although to a smaller extent (around 3.3%). On the contrary, in cclust the use of the manhattan distance led to a slowdown of about 7.4%.

Generally speaking, much bigger performance differences can be achieved by using a different algorithm or implementation (e.g. one using precompiled C code). In the tests above, the difference between cclust and clara is quite big - about 1.03 seconds for cclust versus 0.28 seconds for clara. But if the most efficient algorithm and implementation has already been chosen, the results from this paper suggest that testing the performance of different distance metrics may be beneficial, especially if the algorithm is expected to run more than once.

Bibliography