The curse of dimensionality affects many distance metrics (Aggarwal, Hinneburg, and Keim 2001). This paper explores how commonly used distance and similarity measures behave in high-dimensional spaces, and finds that our usual intuition about "closeness" breaks down as dimensionality increases. Specifically, it shows that the difference between the nearest and farthest points to any given observation shrinks, making all points appear almost equally distant. This phenomenon, known as the concentration of distances, undermines many distance-based algorithms used in machine learning and data analysis, such as clustering, nearest-neighbor search, and anomaly detection, because they rely on the assumption that "closer" implies more similar. The paper evaluates several metrics, including Euclidean (L2), Manhattan (L1), fractional norms, and cosine similarity. It finds that Euclidean distance and cosine similarity degrade rapidly in high dimensions, meaning their ability to distinguish between near and far points weakens significantly. Surprisingly, cosine similarity, which is often used for high-dimensional data like text, suffers even more from the curse of dimensionality than some norm-based distances.
This is relevant to the work we have done on quantifying similarity between NOCs on the basis of the ONET skills, abilities, knowledge and work activities: there are 161 dimensions to this data. If these numerous attributes are measured with error, then even the ranking of NOCs by similarity may be unreliable, as it becomes increasingly likely that the closest neighbour is not actually the most similar (measurement error dominates the signal). Fortunately, most of the ONET data is redundant, as many of the attributes are highly correlated: the first 8 principal components contain over 80 percent of the variation.
We start by describing the functions we will use to generate data, calculate distances, and evaluate the concentration of distances. We will then run simulations across different dimensions and visualize the results.
generate_data <- function(dim) {
  runif(10 * dim, min = 0, max = 1) |>
    matrix(ncol = dim)
}
The above function generates a matrix of 10 points in a hyper-cube of a specified dimension dim. Each point is drawn from a uniform distribution between 0 and 1, resulting in a matrix with 10 rows (points) and dim columns (dimensions). As an aside, dimension reduction via principal components analysis (PCA) would not be effective on this simulated data, as the attributes are independent and uniformly distributed: there is no correlation to exploit.
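To make the aside concrete, here is a minimal sketch (not part of the original analysis) that applies PCA to a larger sample of independent uniform data with the same number of attributes as the ONET data. With no correlation to exploit, the first 8 components capture only about 8/161 of the variance, in contrast to the 80+ percent mentioned above for the (highly correlated) ONET attributes. The sample size and seed below are illustrative choices.
# Illustrative sketch: PCA offers little compression when attributes are independent.
set.seed(123)
independent_data <- matrix(runif(1000 * 161), ncol = 161)
pca <- prcomp(independent_data, scale. = TRUE)
cumulative_share <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
cumulative_share[8] # roughly 8/161, i.e. around 5 percent of the variation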
fractional_minkowski <- function(x, y, p) {
  sum(abs(x - y)^p)^(1/p)
}
The above function calculates the fractional Minkowski distance between two vectors x and y using a parameter p, i.e. \(d_p(x,y)=\left(\sum_i |x_i - y_i|^p\right)^{1/p}\). It generalizes the Minkowski distance, allowing different values of p to adjust the sensitivity of the distance measure. When p=2 it is equivalent to the Euclidean distance, and when p=1 it is equivalent to the Manhattan distance. We also investigate a fractional Minkowski distance with p=0.5, a fractional norm that can be useful in high-dimensional spaces.
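As a quick sanity check (illustrative, not from the original post), the function should agree with base R's Euclidean and Manhattan distances when p = 2 and p = 1:
x <- c(0, 0, 0)
y <- c(1, 2, 2)
fractional_minkowski(x, y, p = 2)   # 3, matches dist(rbind(x, y), method = "euclidean")
fractional_minkowski(x, y, p = 1)   # 5, matches dist(rbind(x, y), method = "manhattan")
fractional_minkowski(x, y, p = 0.5) # (1 + sqrt(2) + sqrt(2))^2, roughly 14.66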
get_distances <- function(mat, p){
  mat |>
    proxy::dist(method = function(x, y) fractional_minkowski(x, y, p))
}
The above wrapper function takes a matrix of ten points in a hyper-cube of some dimension, and returns the 45 pairwise distances between the 10 points using the fractional Minkowski distance.
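For example (a hypothetical call, not in the original post), ten points in a 4-dimensional hyper-cube yield 45 pairwise distances:
d <- get_distances(generate_data(4), p = 2)
length(d) # 45 = choose(10, 2) pairwise distances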
cv <- function(distances) {
  sd(distances) / mean(distances)
}
The coefficient of variation (CV) is a unit-less measure of the variability of distances, calculated as the standard deviation divided by the mean. A higher CV indicates that the metric is informative (distances differ meaningfully across pairs of points), while a lower CV suggests that the metric is basically the same for all points, and thus not very informative.
ratio <- function(distances) {
  min(distances) / max(distances)
}
An alternative measure of the concentration of distances is the ratio between the minimum and maximum distances. A ratio close to 1 indicates that all points are almost equally distant from each other, while a ratio significantly less than 1 suggests that there is a meaningful difference between the nearest and farthest points.
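To illustrate both summary measures on made-up numbers (these toy distances are not from the simulation):
concentrated <- c(9.8, 10.0, 10.1, 10.2) # nearly identical pairwise distances
spread_out   <- c(1, 4, 7, 10)           # clearly different pairwise distances
cv(concentrated)    # close to 0: distances carry little information
ratio(concentrated) # close to 1: nearest and farthest points look alike
cv(spread_out)      # well above 0: distances are informative
ratio(spread_out)   # well below 1: nearest and farthest points differ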
dimensions <- 2^(1:10)
sims <- 1:1000
We are going to investigate the behavior of distance metrics across a range of dimensions, from 2 to 1024, and run 1000 simulations for each dimension. This will allow us to observe how the metrics behave as dimensionality increases.
tbbl <- crossing(
  dimensions = dimensions,
  sims = sims
) |>
  mutate(data = map(dimensions, generate_data),
         manhattan = map(data, get_distances, p = 1),
         euclidean = map(data, get_distances, p = 2),
         minkowski = map(data, get_distances, p = .5),
         cosine = map(data, proxy::simil, method = "cosine"), # cosine similarity for each pair of points
         `manhattan.Coefficient of Variation` = map_dbl(manhattan, cv),
         `euclidean.Coefficient of Variation` = map_dbl(euclidean, cv),
         `minkowski.Coefficient of Variation` = map_dbl(minkowski, cv),
         `cosine.Coefficient of Variation` = map_dbl(cosine, cv),
         `manhattan.Ratio of smallest to largest` = map_dbl(manhattan, ratio),
         `euclidean.Ratio of smallest to largest` = map_dbl(euclidean, ratio),
         `minkowski.Ratio of smallest to largest` = map_dbl(minkowski, ratio),
         `cosine.Ratio of smallest to largest` = map_dbl(cosine, ratio)
  ) |>
  select(dimensions, sims, contains("Coef"), contains("Ratio")) |>
  pivot_longer(cols = -c("dimensions", "sims")) |>
  separate(name, into = c("distance_type", "metric"), sep = "\\.")
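As a quick structural check (illustrative, not part of the original post), the long-format table should contain one row per combination of dimension, simulation, distance measure and summary metric:
nrow(tbbl) # 10 dimensions x 1000 simulations x 4 distance measures x 2 metrics = 80,000
head(tbbl) # columns: dimensions, sims, distance_type, metric, value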
The plot below summarizes the results of the simulation, showing how the coefficient of variation and the ratio of smallest to largest distances change as dimensionality increases. Note that the points are "jittered" to avoid over-plotting, as there are 4000 points (1000 simulations times 4 distance measures) above each x-axis tick (dimension). What does each tiny point represent? In each run of the simulation we generate 10 points in a hyper-cube of the specified dimension and calculate the pairwise distances between them. So a single run of the simulation yields \(\binom{10}{2}=45\) distances, which are used to calculate the coefficient of variation \(CV=\frac{\sigma_d}{\mu_d}\) and the ratio \(\frac{\min(d)}{\max(d)}\).
The main result is that, regardless of metric, the distances between points become more uniform as dimensionality increases. The coefficient of variation approaches 0, indicating that the distances are becoming more similar, and the ratio of smallest to largest distance approaches 1, indicating that all points are roughly the same distance apart. A secondary result is that the metrics can be ranked in terms of their ability to distinguish between points in high-dimensional space: the fractional Minkowski distance (p=0.5) performs best, followed by Manhattan (L1) and Euclidean (L2), with cosine similarity the worst (given the dense data used in this simulation). Cosine similarity is used extensively in text mining, where the data is typically very sparse (a document contains a small proportion of all words in a vocabulary).
ggplot(tbbl, aes(x = dimensions, y = value, colour = distance_type)) +
  geom_jitter(shape = 16, size = 0.1) + # dot-like points
  guides(color = guide_legend(override.aes = list(size = 3))) + # bigger points in the legend
  scale_x_continuous(trans = "log2", labels = dimensions, breaks = dimensions) +
  scale_colour_brewer(palette = "Dark2") +
  labs(
    x = "Dimensions",
    y = NULL,
    colour = "Metric",
    title = "All vectors roughly same distance apart in high dimension space, regardless of metric."
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  facet_wrap(~metric, scales = "free_y")
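As a complementary check (a hypothetical follow-up, not part of the original post), the same pattern can be summarized numerically, for example by tabulating the median coefficient of variation by dimension and distance measure: the medians shrink toward 0 as the dimension grows, for every distance measure.
# Median CV by dimension and distance measure, one column per measure.
tbbl |>
  filter(metric == "Coefficient of Variation") |>
  group_by(dimensions, distance_type) |>
  summarise(median_cv = median(value), .groups = "drop") |>
  pivot_wider(names_from = distance_type, values_from = median_cv)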