I am writing this because I have found this mistake twice already in what I would consider important documents. Before explaining the issue, let me settle some definitions and go from there so that everybody knows what I am talking about.

Histograms

Most of us first hear about histograms very early in school, when talking about data, and as kids we picture the vertical bars in the plot. In Statistics, histograms are key: for discrete variables, the histogram of relative frequencies is in fact the empirical probability mass function, that is, an estimate of \(\mathbb P[X = x]\), and it basically looks like this:

For continuous variables, the histogram is a more complex object. One first needs to break the support of the observed data into bins (which may or may not be of equal length); each bin then gets a height equal to the relative frequency of the data points lying in it divided by the width of the bin. The end result is a set of rectangles whose areas add up to exactly 1.
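As a quick check of that normalization, here is a minimal sketch in R. The exponential sample and the bin edges are illustrative assumptions, not data from this post. It shows that the areas add up to 1 only when each height is the relative frequency divided by the bin width, which is also what `hist()` stores in its `density` component.

```r
# Illustrative data (assumption): 5000 draws from an exponential distribution
set.seed(1)
x <- rexp(5000, rate = 1)

# Bins of unequal width; the last edge is placed beyond the largest observation
breaks <- c(0, 0.25, 0.5, 1, 2, 4, max(x) + 1)
h <- hist(x, breaks = breaks, plot = FALSE)

widths   <- diff(h$breaks)             # width of each bin
rel_freq <- h$counts / sum(h$counts)   # share of the data in each bin
heights  <- rel_freq / widths          # density-scale heights

# Same heights as the ones hist() computes
print(all.equal(heights, h$density))

# Total area of the rectangles
print(sum(heights * widths))

# Using the raw relative frequencies as heights does not give total area 1 here
print(sum(rel_freq * widths))
```

With equal-width bins the two scalings differ only by a constant factor, which is why the distinction is easy to miss.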

The main idea is that the histogram should give us an idea of what the probability density function of the variable looks like. That is, a non-negative function \(f_X\colon\mathbb R\to\mathbb R\) such that \[\mathbb P[X\leq x] = \int_{-\infty}^x f_X(\zeta)\,d\zeta.\]

The idea of parametric statistics, simply explained, is that if the histogram of the data looks very much like the histogram of a known random variable then knowing anything we want from our data reduces to finding the parameters of the random variable we claim the data comes from.

To help our imagination and pattern-matching mind, we may plot an empirical density rather than a histogram, which is basically a smoothed version of the latter:

I’m going to get rid of the histogram just to have a clearer image of the two curves:

Sufficient and necessary conditions

The point of all the previous introduction is to basically have these two ideas in mind:

  1. The histogram is basically the empirical density.
  2. If the data is produced from a known random variable then their densities are the same.

Note the phrasing of point 2! If the data is produced from a known random variable then their densities are the same.

The reverse happens not to be true.

Here are a couple of examples:

In this first example, the densities are practically the same, but the variables are inversely related.

Our second example deals with data where the densities are the same and the variables are identical, but only on the middle third! The tails are independent.

The third example works the other way around: it is the same data at the tails, but independent in the middle while of course having the same density.

Our fourth and final example shows the simplest of all cases: no relation whatsoever between the variables, and yet they have the same distribution.
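Two of these cases are easy to reproduce. The sketch below is only an illustration with simulated standard normal data (an assumption; it is not the data behind the plots): one variable is the mirror image of `x`, the other is an independent draw. Both share the distribution of `x`, which we check by comparing quantiles, yet neither is equal to `x`.

```r
# Simulated data (assumption): standard normal draws
set.seed(42)
n <- 10000
x <- rnorm(n)

y_inverse <- -x        # same N(0, 1) distribution by symmetry, inversely related to x
y_indep   <- rnorm(n)  # same distribution, independent of x

# The marginal distributions agree: compare a grid of empirical quantiles
probs <- seq(0.05, 0.95, by = 0.05)
q_x       <- quantile(x, probs, names = FALSE)
q_inverse <- quantile(y_inverse, probs, names = FALSE)
q_indep   <- quantile(y_indep, probs, names = FALSE)
max_q_gap <- max(abs(q_x - q_inverse), abs(q_x - q_indep))

# The joint behaviour does not: compare correlations with x
cor_inverse <- cor(x, y_inverse)
cor_indep   <- cor(x, y_indep)

print(c(max_quantile_gap = max_q_gap,
        cor_inverse = cor_inverse,
        cor_indep = cor_indep))
```

Plotting `x` against either variable (for instance `plot(x, y_inverse)`) shows a decreasing line in one case and a shapeless cloud in the other, never the identity line.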

The moral of the story is that if we want to show that two variables are the same, we need to see that their scatter plot forms the identity line, rather than look at identical densities.

The main issue

Recently, I have seen reports, by people claiming to be machine learning experts, that present regression-type results approximating \(Y\) by \(\hat Y = f(X)\) and support them by showing that the empirical densities of \(Y\) and \(\hat Y\) are the same (or very similar). This is what I call the histogram trap: the plot is easy to understand, and if the point one wants to make is true (that is, \(\hat Y\) is close to \(Y\)), the plot will indeed show matching densities. But matching densities are only a necessary condition, not a sufficient one.

So the plot is either a mistake by the modeller, because it does not actually support the point, or, arguably worse, it is a trap! The modeller is luring the audience into a result that may not be true.
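The most extreme version of the trap fits in a few lines. In the sketch below (hypothetical simulated values, not taken from any of the reports mentioned), the "predictions" are just the observed values in random order. The histograms are therefore exactly identical, while the pointwise error is worse than that of always predicting the mean.

```r
# Hypothetical observed values (assumption)
set.seed(7)
y <- rnorm(1000, mean = 10, sd = 2)

# A useless "model": the same values, randomly shuffled
y_hat_shuffled <- sample(y)

# Identical multiset of values, hence identical histogram and empirical density
same_histogram <- identical(sort(y), sort(y_hat_shuffled))

# Pointwise comparison tells a different story
rmse_shuffled  <- sqrt(mean((y - y_hat_shuffled)^2))
rmse_mean_only <- sqrt(mean((y - mean(y))^2))
pearson        <- cor(y, y_hat_shuffled)

print(same_histogram)
print(c(rmse_shuffled = rmse_shuffled,
        rmse_mean_only = rmse_mean_only,
        correlation = pearson))
```

Any report that shows only the density comparison cannot distinguish this shuffled output from a perfect model.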

In fact, here is code for a model that always gives the same density as the observed data, but is not necessarily a good model:

# Training set
set.seed(666)
x_train <- rnorm(10000,3,2/3)
y_train <- sin(2*pi*x_train) + rnorm(10000,0,0.05) 


# Test set
x_test <- rnorm(1000,3,2/3)
y_test <- sin(2*pi*x_test) + rnorm(1000,0,0.05)

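# Empirical distribution function of the sample in `data`, evaluated at x,
# with linear interpolation between consecutive sorted observations: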
F_X <- function(x, data = x_train){
  sort_x <- sort(data)
  distr <- cumsum(rep(1,length(sort_x))) / length(sort_x)
  n <- length(x)
  res <- 0 * x
  for(k in 1:n){
    pos <- cumsum(x[k] >= sort_x)[length(sort_x)]
    if(pos == 0){
      res[k] <- 0
    }else{
      if(pos == length(sort_x)){
        res[k] <- 1
      }else{
        alpha <- (x[k] - sort_x[pos]) / (sort_x[pos + 1] - sort_x[pos])
        res[k] <-  (1- alpha) * distr[pos] + alpha * distr[pos + 1]
      }
    }
  }
  return(res)
}

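# The inverse function: for each p, find by bisection on [m, M] the value
# val such that fun(val) is within eps of p (fun must be non-decreasing)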
F_inv <- function(p,fun,m, M, eps = 1e-8){
  n <- length(p)
  val <- 0 * p + (m + M)/2
  for(k in 1:n){
    if(p[k] <= 0){
      val[k] <- m
    }else{
      if(p[k] >= 1){
        val[k] <- M
      }else{
        test <- fun(val[k])
        low <- m
        high <- M
        while(abs(test - p[k]) > eps){
          aux <- val[k]
          if(test < p[k]){
            val[k] <- (val[k] + high)/ 2
            low <- aux
          }else{
            val[k] <- (val[k] + low)/ 2
            high <- aux
          }
          test <- fun(val[k])
        } # end while
      } # end else
    } # end else
  } # end for 
  return(val)
}

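# Training the model: map each x to the value of y that sits at the same
# empirical quantile, i.e. y_hat = F_Y^{-1}(F_X(x))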
model_f <- function(x,x_train,y_train){
  y_vals <- F_X(x,x_train)
  F_Y <- function(y){
    return(F_X(y,y_train))
  }
  y_hat <- F_inv(p = y_vals,
                 fun = F_Y,
                 m = min(y_train),
                 M = max(y_train))
  return(y_hat)
}

# Evaluate model
y_hat <- model_f(x_test,x_train,y_train)

# Compare the models on the test data
par(mfrow = c(1,2))
plot(density(y_test),col = 'red',
     xlab = 'Values', ylab = 'density', main = 'Comparison of densities')
lines(density(y_hat),col = 'blue')

plot(y_test,y_hat,main = 'Actual relationship of test data\nand predicted values',
     xlab = 'Test data',
     ylab = 'Predicted values')

I think we can argue that the densities look quite similar. However, the model sucks: it’s completely discontinuous, it does not resemble the identity line at all. The model is as useless as the plot showing the similarity of the densities.
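To put a number on how poor the fit is, one can compute pointwise errors instead of comparing densities. The sketch below regenerates the same simulated data and uses a compact stand-in for the model built from base R's `ecdf()` and `quantile()`. The stand-in, here called `quantile_map`, is an assumption: it maps quantiles of x to quantiles of y in the same spirit as `model_f`, but without the linear interpolation used in `F_X`, so its output will not match `model_f` digit for digit.

```r
# Same simulated data as above
set.seed(666)
x_train <- rnorm(10000, 3, 2/3)
y_train <- sin(2 * pi * x_train) + rnorm(10000, 0, 0.05)
x_test  <- rnorm(1000, 3, 2/3)
y_test  <- sin(2 * pi * x_test) + rnorm(1000, 0, 0.05)

# Stand-in for model_f (assumption): take the empirical CDF level of x in the
# training inputs and return the training-output quantile at that level
quantile_map <- function(x, x_train, y_train) {
  p <- ecdf(x_train)(x)
  quantile(y_train, probs = p, names = FALSE)
}
y_hat <- quantile_map(x_test, x_train, y_train)

# Pointwise error measures
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_model    <- rmse(y_test, y_hat)
rmse_baseline <- rmse(y_test, rep(mean(y_train), length(y_test)))
pearson       <- cor(y_test, y_hat)

print(c(rmse_model = rmse_model,
        rmse_mean_baseline = rmse_baseline,
        correlation = pearson))
```

Since the predictions share the marginal distribution of the targets but are nearly uncorrelated with them, one would expect the RMSE of the model to come out larger than that of the constant mean prediction, which is a far more honest summary than the density plot.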

Final comments

This, like many other errors, could happen to any of us. I am only raising a flag so that we pay attention to this specific trap.

Please do not be that guy showing the densities… and whatever you do, don’t fall for the histogram trap.