This so-called curse is essentially an exponential relationship between the dimensionality of the feature space and the number of possible configurations: as the number of dimensions increases, the number of configurations grows exponentially, and therefore the number of observations per configuration decreases. For example, with ten possible values per feature, ten features already yield \(10^{10}\) configurations, far more than any realistic dataset can cover. This makes it difficult for data scientists to draw conclusions on their own without getting lost in the weeds of the data. Yet, at the same time, this is an ideal opportunity to train machine learning models.
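To make that growth concrete, here is a minimal Python sketch that counts configurations as features are added. The ten discrete values per feature and the budget of 1,000 observations are illustrative assumptions, not figures from the sources discussed here:

```python
# Illustrative sketch: how the number of possible configurations explodes with dimensionality.
# bins_per_feature and n_observations are arbitrary values chosen for illustration.

bins_per_feature = 10
n_observations = 1_000

for n_features in (1, 2, 3, 5, 10):
    n_configurations = bins_per_feature ** n_features
    coverage = n_observations / n_configurations
    print(f"{n_features:2d} features -> {n_configurations:>14,} configurations, "
          f"at most {min(coverage, 1.0):.2%} of them observed")
```

With these assumed numbers, coverage collapses from 100% of configurations at one or two features to a vanishingly small fraction at ten, which is the sparsity the curse refers to.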
According to the Hughes phenomenon, as the number of features increases, a model's performance improves until an optimal number of features is reached. It is worth noting that if the number of features approaches the size of the training set, the model's performance can be expected to falter.
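The following is a hedged sketch of the Hughes phenomenon on synthetic data: with a fixed, small training set, test accuracy often rises and then degrades as more features are added. The data generator, classifier, sample sizes, and feature counts below are illustrative assumptions, not choices made by the sources cited here:

```python
# Sketch of the Hughes phenomenon: fixed training set size, growing feature count.
# Only a handful of features carry signal; the rest are noise dimensions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

for n_features in (2, 5, 10, 20, 50, 100, 200):
    X, y = make_classification(
        n_samples=250,                         # small pool of observations
        n_features=n_features,
        n_informative=min(5, n_features),      # at most 5 informative features
        n_redundant=0,
        random_state=0,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=100, random_state=0   # fixed, small training set
    )
    acc = GaussianNB().fit(X_train, y_train).score(X_test, y_test)
    print(f"{n_features:3d} features -> test accuracy {acc:.3f}")
```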
A real-life example of this is described in the paper “Gene-gene interaction: the curse of dimensionality” by Amrita Chattopadhyay and Tzu-Pin Lu. The authors describe research on genetic variants that show only “modest effects” on disease risk, which leads to the “missing heritability” problem. One way to account for these shortcomings is to evaluate epistasis: gene-gene interaction. This approach elucidates effects on complex diseases and can even identify gene functions, pathways, and drug targets. The evaluation, however, requires analyzing all possible genetic interactions among millions of single nucleotide polymorphisms, which inevitably leads to the curse of dimensionality.
According to an article by Badreesh Shetty published on builtin.com, an increase in the number of dimensions in a dataset means more entries in the feature vector that represents each observation in the corresponding Euclidean space; the separation between such vectors is measured via the Euclidean distance.
Take two \(n\)-dimensional vectors with Cartesian coordinates \(p = (p_1, p_2, p_3, \ldots, p_n)\) and \(q = (q_1, q_2, q_3, \ldots, q_n)\). Their Euclidean distance can be calculated as:
\(\text{distance}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}\)
Here, each dimension introduced adds another non-negative term to the sum, and the number of features grows while the number of observations stays fixed, leaving the feature space increasingly sparse.
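A minimal NumPy sketch of the distance above, showing how adding dimensions only ever adds non-negative terms to the sum; the random uniform data and the particular dimension counts are purely illustrative:

```python
# Euclidean distance as defined above, plus a quick look at how distances
# between random points grow as dimensions are added.
import numpy as np

def euclidean_distance(p, q):
    """distance(p, q) = sqrt(sum_i (p_i - q_i)^2)"""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

rng = np.random.default_rng(0)
for n_dims in (2, 10, 100, 1_000):
    p = rng.uniform(0, 1, size=n_dims)
    q = rng.uniform(0, 1, size=n_dims)
    print(f"{n_dims:5d} dimensions -> distance {euclidean_distance(p, q):.3f}")
```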
This hinders the effectiveness of traditional parametric statistical methods such as linear discriminant analysis, naive Bayes, and the perceptron; the pitfalls of parametric methods stem from making the wrong assumptions about the data, such as treating it as linear when it truly isn't. Non-parametric methods, like support vector machines and k-nearest neighbors (KNN), are more versatile and can pave the way for better model performance, since no such assumptions are made.
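As a hedged illustration of that contrast, the sketch below compares a parametric model that assumes a linear boundary (linear discriminant analysis) with a non-parametric one (KNN) on data that is not linearly separable. The dataset, noise level, and neighbor count are illustrative assumptions:

```python
# Parametric (linearity-assuming) vs non-parametric classifier on non-linear data.
from sklearn.datasets import make_moons
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)          # assumes a linear boundary
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # makes no such assumption

print(f"LDA test accuracy: {lda.score(X_test, y_test):.3f}")
print(f"KNN test accuracy: {knn.score(X_test, y_test):.3f}")
```

On this kind of curved, two-class data the linear assumption typically costs LDA accuracy, while KNN, which simply votes among nearby points, tends to fit the boundary more faithfully.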