Introduction

The curse of dimensionality affects many distance metrics (Aggarwal, Hinneburg, and Keim 2001). This analysis explores how commonly used distance and similarity measures behave in high-dimensional spaces, and finds that our usual intuition about "closeness" breaks down as dimensionality increases. Specifically, the difference between the nearest and farthest points from any given observation shrinks, making all points appear almost equally distant. This phenomenon, known as the concentration of distances, undermines many distance-based algorithms used in machine learning and data analysis, such as clustering, nearest-neighbor search, and anomaly detection, because they rely on the assumption that "closer" implies more similar. We evaluate several measures: Euclidean (L2), Manhattan (L1), a fractional norm, and cosine similarity. Euclidean distance and cosine similarity degrade rapidly in high dimensions, meaning their ability to distinguish between near and far points weakens significantly. Surprisingly, cosine similarity, which is often used for high-dimensional data like text, suffers even more from the curse of dimensionality than some norm-based distances.

This is relevant to the work we have done quantifying similarity between NOCs on the basis of the ONET skills, abilities, knowledge, and work activities: there are 161 dimensions to this data. If these numerous attributes are measured with error, then even the ranking of NOCs by similarity may be unreliable, as it becomes increasingly likely that the closest neighbour is not actually the most similar (measurement error dominates the signal). Fortunately, most of the ONET data is redundant, as many of the attributes are highly correlated: the first 8 principal components contain over 80 percent of the variation.
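The redundancy claim can be checked with a principal components decomposition along the following lines. This is only a sketch: onet_matrix is a placeholder name for the NOC-by-attribute matrix (161 columns), not an object defined in this document.

# hypothetical check: onet_matrix stands in for the 161-attribute ONET data
pca <- prcomp(onet_matrix, center = TRUE, scale. = TRUE)
cumulative_share <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
cumulative_share[8]  # proportion of variation captured by the first 8 components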

Functions utilized

We start by describing the functions we will use to generate data, calculate distances, and evaluate the concentration of distances. We will then run simulations across different dimensions and visualize the results.
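The code below assumes the tidyverse and proxy packages are attached; a minimal setup chunk, in case this is not already handled elsewhere, would be:

library(tidyverse) # crossing, mutate, map, pivot_longer, separate, ggplot
library(proxy)     # dist and simil, used for the custom and cosine methods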

generate_data <- function(dim) {
  runif(10 * dim, min = 0, max = 1) |> 
    matrix(ncol = dim) 
}

The above function generates a matrix of 10 points in a hyper-cube of a specified dimension dim. Each point is drawn from a uniform distribution between 0 and 1, resulting in a matrix with 10 rows (points) and dim columns (dimensions). As an aside, dimension reduction via principal components analysis (PCA) would not be effective on this simulated data, as the attributes are independent and uniformly distributed: there is no correlation to exploit.
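A quick usage sketch (the seed is only to make this example reproducible): calling the function with dim = 3 returns a 10 by 3 matrix, one row per point.

set.seed(123)    # illustrative only
pts <- generate_data(3)
dim(pts)         # 10 rows (points), 3 columns (dimensions)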

fractional_minkowski <- function(x, y, p) {
  sum(abs(x - y)^p)^(1/p)
}

The above function calculates the fractional Minkowski distance between two vectors x and y using a parameter p. It generalizes the Minkowski distance, allowing different values of p to adjust the sensitivity of the distance measure. When p=2 it is equivalent to the Euclidean distance, and when p=1 it is equivalent to the Manhattan distance. We also investigate p=0.5, a fractional norm that can be useful in high-dimensional spaces (strictly speaking, for p<1 the result is not a true metric because the triangle inequality fails, but it still serves as a dissimilarity measure).
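A quick sanity check on a familiar case: the distance between the points (0, 0) and (3, 4) is 5 under p=2, 7 under p=1, and roughly 13.9 under p=0.5.

fractional_minkowski(c(0, 0), c(3, 4), p = 2)   # 5 (Euclidean)
fractional_minkowski(c(0, 0), c(3, 4), p = 1)   # 7 (Manhattan)
fractional_minkowski(c(0, 0), c(3, 4), p = 0.5) # ~13.93 (fractional)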

get_distances <- function(mat, p){
  mat|> 
    proxy::dist(method = function(x, y) fractional_minkowski(x, y, p)) 
}

The above wrapper function takes a matrix of ten points in a hyper-cube of some dimension and returns the 45 pairwise distances between the 10 points using the fractional Minkowski distance.
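For example, applying it to any 10-row matrix, such as one produced by generate_data, yields 45 distances.

d <- get_distances(generate_data(4), p = 2)
length(d)   # 45 pairwise distances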

cv <- function(distances) {
  sd(distances) / mean(distances)
}

A unit-less measure of the variability of distances, the coefficient of variation (CV) is calculated as the standard deviation divided by the mean. A higher CV indicates the metric is informative, while a lower CV suggests that the distances are basically the same for all pairs of points, and thus the metric is not very informative.

ratio <- function(distances) {
  min(distances) / max(distances) 
}

An alternative measure of the concentration of distances is the ratio between the minimum and maximum distances. A ratio close to 1 indicates that all points are almost equally distant from each other, while a ratio significantly less than 1 suggests that there is a meaningful difference between the nearest and farthest points.
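A toy example (the numbers are made up purely for illustration) shows the two summaries moving together: a spread-out set of distances has a high CV and a low min/max ratio, while a concentrated set has the opposite.

spread_out   <- c(1, 5, 10)
concentrated <- c(9.8, 10.0, 10.2)
cv(spread_out)      # ~0.85
ratio(spread_out)   # 0.1
cv(concentrated)    # 0.02
ratio(concentrated) # ~0.96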

Constants

dimensions <- 2^(1:10) 
sims <- 1:1000 

We are going to investigate the behavior of distance metrics across a range of dimensions, from 2 to 1024, and run 1000 simulations for each dimension. This will allow us to observe how the metrics behave as dimensionality increases.

The simulation

tbbl <- crossing(
  dimensions = dimensions,
  sims = sims
)|>
  mutate(data = map(dimensions, generate_data),                # 10 uniform points per run
         manhattan = map(data, get_distances, p = 1),
         euclidean = map(data, get_distances, p = 2),
         minkowski = map(data, get_distances, p = .5),         # fractional norm
         cosine = map(data, proxy::simil, method = "cosine"),  # cosine similarity for each vector pair
         `manhattan.Coefficient of Variation` = map_dbl(manhattan, cv),
         `euclidean.Coefficient of Variation` = map_dbl(euclidean, cv),
         `minkowski.Coefficient of Variation` = map_dbl(minkowski, cv),
         `cosine.Coefficient of Variation` = map_dbl(cosine, cv),
         `manhattan.Ratio of smallest to largest` = map_dbl(manhattan, ratio),
         `euclidean.Ratio of smallest to largest` = map_dbl(euclidean, ratio),
         `minkowski.Ratio of smallest to largest` = map_dbl(minkowski, ratio),
         `cosine.Ratio of smallest to largest` = map_dbl(cosine, ratio)
         )|>
  select(dimensions, sims, contains("Coef"), contains("Ratio"))|>  # keep only the summary columns
  pivot_longer(cols = -c("dimensions", "sims"))|>                  # one row per (dimension, sim, summary)
  separate(name, into = c("distance_type", "metric"), sep = "\\.") # e.g. "manhattan.Coefficient of Variation"

Results

The plot below summarizes the results of the simulation, showing how the coefficient of variation and the ratio of smallest to largest distances change as dimensionality increases. Note that the points are "jittered" to avoid over-plotting, as there are 4000 points (1000 simulations times 4 distance measures) above each x-axis tick (dimension). What does each tiny point represent? In each run of the simulation we generate 10 points in a hyper-cube of the specified dimension and calculate the pairwise distances between them. So a single run of the simulation yields \(\binom{10}{2}=45\) distances, which are used to calculate the coefficient of variation \(CV=\frac{\sigma_d}{\mu_d}\) and the ratio \(\frac{\min(d)}{\max(d)}\).

The main result is that, regardless of metric, the distances between points become more uniform as dimensionality increases: the coefficient of variation approaches 0, indicating that the distances are becoming more similar, and the ratio of smallest to largest distance approaches 1, indicating that all points are roughly the same distance apart. A secondary result is that the metrics can be ranked in terms of their ability to distinguish between points in high-dimensional space. The fractional Minkowski distance (p=0.5) performs best, followed by Manhattan (L1), then Euclidean (L2), with cosine similarity the worst (given the dense data used in this simulation). Cosine similarity is used extensively in text mining, where data is typically very sparse (a document contains a small proportion of all words in a vocabulary).

ggplot(tbbl, aes(x = dimensions, y = value, colour=distance_type)) +
  geom_jitter(shape = 16, size = 0.1) +  # dot-like points
  guides(color = guide_legend(override.aes = list(size = 3)))+  # bigger in legend
  scale_x_continuous(trans="log2", labels = dimensions, breaks=dimensions)+
  scale_colour_brewer(palette = "Dark2") +
  labs(
    x = "Dimensions",
    y = NULL,
    colour= "Metric",
    title = "All vectors roughly same distance apart in high dimension space, regardless of metric."
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )+
  facet_wrap(~metric, scales = "free_y")
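For readers who prefer a table to jittered points, a summary along these lines (a sketch, not part of the original analysis) collapses each dimension and measure to its median across the 1000 simulations:

tbbl |>
  group_by(dimensions, distance_type, metric) |>
  summarise(median_value = median(value), .groups = "drop") |>
  pivot_wider(names_from = metric, values_from = median_value)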

References

Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. 2001. β€œOn the Surprising Behavior of Distance Metrics in High Dimensional Space.” International Conference on Database Theory, 420–34.