library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The curse of dimensionality is the idea that as dimensionality (the number of features your model is trained on) increases, the number of observations needed in your data set increases exponentially as well. I want to attempt to explain this by focusing on the distribution of each variable within your model, since having features that follow some sort of known distribution is important for most parametric methods.
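
To see why the growth is exponential rather than linear, here is a quick back-of-the-envelope sketch of my own (not part of the simulations below): roughly 68% of normally distributed values fall within one standard deviation of the mean, so with d independent normal features only about 0.68^d of the rows land within one standard deviation on every feature at once, and the sample size needed to keep a fixed number of such rows grows like 1/0.68^d.

d <- 1:6                                # number of independent normal features
joint_share <- 0.68^d                   # share of rows within 1 SD on every feature
needed_n <- ceiling(100 / joint_share)  # sample size to keep ~100 such rows

tibble(dimensions = d,
       share_within_1sd = round(joint_share, 3),
       n_for_100_rows = needed_n)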

We’ll start by assuming the response variable comes from a normally distributed population with a mean of 0 and a standard deviation of 100.

set.seed(123)

u <- 0
std <- 100

Since we will always be working with samples from a population rather than the population itself, we will define a sample size to draw, which we will increase as we go. We’ll also convert the values to integers so they are easier to visualize without precise binning.

samp <- 10

y <- as.integer(rnorm(samp, u, std))

df <- tibble(y)

hist(df$y)

Even looking at just the response variable itself, if the sample size is sufficiently small we won’t have a good estimate of the population, as we can see in this histogram of only 10 random samples. If a model is fit to data that does not actually reflect the population, it is never going to fit the overall population. If we simulate more random samples, say 100, we will get a distribution that is more representative of our population.

samp <- 100

y <- as.integer(rnorm(samp, u, std))

df <- tibble(y)

hist(df$y)

With our sample size upped to 100, our response variable now much better approximates the true distribution of the population we want to model. Let’s extend this by adding a predictor variable as well.

#y <- as.integer(rnorm(samp, u, std))
x1 <- as.integer(rnorm(samp, u, std))

#df <- tibble(y)
df$x1 <- x1

hist(df$x1)

Looking at just the histogram of this new predictor variable, x1, on its own, we are again able to approximate the true normal population. However, we must consider that each predictor variable is paired with the other predictors and the response variable, since a parametric model takes each row into account.

df |>
  filter(-100 < y & y < 100) |>
  select(x1) |>
  pull() |>
  hist(main = "Distribution of x1 Within 1SD of Y", xlab = "x1")

If we look at the subset of points whose y value is within a standard deviation of the mean, and then look at their x1 values, we notice the distribution has shifted. It is still approximately normal, but what happens when we add another dimension?

#y <- as.integer(rnorm(samp, u, std))
#x1 <- as.integer(rnorm(samp, u, std))
x2 <- as.integer(rnorm(samp, u, std))

#df <- tibble(y)
#df$x1 <- x1
df$x2 <- x2

hist(df$x2)

Looking at the histogram of the next predictor variable, x2, on its own, we are once again able to approximate the true normal population.

Now we subset to the data points that are within a standard deviation of the mean for both y and x1. Suddenly, x2 no longer approximates its true distribution. It starts moving away from normal because we do not have as many coinciding data points.
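
To make “not as many coinciding data points” concrete, here is a quick count I’ve added (not one of the original chunks) of how many of the 100 rows survive each additional one-standard-deviation filter; the exact counts will vary with the seed.

nrow(df)                                    # all rows
df |> filter(-100 < y & y < 100) |> nrow()  # rows within 1 SD of y
df |>
  filter(-100 < y & y < 100,
         -100 < x1 & x1 < 100) |>
  nrow()                                    # rows within 1 SD of both y and x1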

So how would we fix this? Perhaps, since we added 100 new values with the x2 vector, we could just add another 100 to our sample size?

df |>
  filter( (-100 < y & y < 100) &
          (-100 < x1 & x1 < 100)) |>
  select(x2) |>
  pull() |>
  hist(main = "Distribution of x2 Within 1SD of y and x1", xlab = "x2")

We do this and… the data still doesn’t look normal. Simply adding a fixed amount to the sample size for each new feature does not allow the distributions within a subset of the data to approximate the true distribution of each feature.

samp <- 200

y <- as.integer(rnorm(samp, u, std))
x1 <- as.integer(rnorm(samp, u, std))
x2 <- as.integer(rnorm(samp, u, std))

df <- tibble(y, x1, x2)
#df$x1 <- x1
#df$x2 <- x2

df |>
  filter( (-100 < y & y < 100) &
          (-100 < x1 & x1 < 100)) |>
  select(x2) |>
  pull() |>
  hist(main = "Distribution of x2 Within 1SD of y and x1", xlab = "x2")

Instead, to actually capture the true population distribution within a subset, we need to increase the sample size exponentially as features are added! Below we see the subset return to an approximately normal distribution, but not until after increasing our sample size from 100 to 10,000.

samp <- 100*100

y <- as.integer(rnorm(samp, u, std))
x1 <- as.integer(rnorm(samp, u, std))
x2 <- as.integer(rnorm(samp, u, std))

df <- tibble(y, x1, x2)
#df$x1 <- x1
#df$x2 <- x2

df |>
  filter( (-100 < y & y < 100) &
          (-100 < x1 & x1 < 100)) |>
  select(x2) |>
  pull() |>
  hist(main = "Distribution of x2 Within 1SD of y and x1", xlab = "x2")

If we were to keep adding features, the sample sizes needed to approximate the true distributions within every subset of variables would very quickly get out of hand. What we have here is an exaggeration, since we don’t need perfect distributions at every subset, but it is very close to how the curse of dimensionality works.
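
As a rough illustration of how quickly it gets out of hand, here is a sketch I’ve added with six hypothetical independent features: hold the sample size at 10,000 and count the rows that sit within one standard deviation of the mean on every one of the first d features. Roughly 68% of the remaining rows survive each added condition, so the count decays like 0.68^d.

samp <- 100 * 100
feats <- replicate(6, as.integer(rnorm(samp, u, std)))  # 6 hypothetical independent features
within_1sd <- abs(feats) < 100                          # TRUE where a value is within 1 SD of the mean
rows_left <- sapply(1:6, \(d) sum(rowSums(within_1sd[, 1:d, drop = FALSE]) == d))
rows_left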

So, how do we reconcile the curse of dimensionality with datasets that have more features than samples? In this case we can look towards non-parametric models. With parametric models we are held to strict assumptions about the distributions of our variables; with non-parametric models we loosen those distributional assumptions and can still put the data to good use. Support vector machines are an example of this: they strictly maximize separation between classes rather than assuming a stochastic distribution of the variables the way a method like logistic regression does.
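
As a minimal sketch of that idea (assuming the e1071 package, which isn’t used elsewhere in this post), we can fit a linear-kernel SVM to simulated data with more features than samples. The point is only that the fit is well-defined in this setting, not that it generalizes well.

library(e1071)

n <- 20
p <- 50
X <- matrix(rnorm(n * p), nrow = n)              # 20 rows, 50 features: p > n
class <- factor(rep(c("a", "b"), each = n / 2))  # hypothetical two-class labels

fit <- svm(x = X, y = class, kernel = "linear")
table(predicted = predict(fit, X), actual = class)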