I teach a class on machine learning and predictive modeling. In one section, we build a random forest model, demonstrating its ability to perform automated feature selection, and examining the variable importance scores after the model is created. These importance measures are generated by permuting each input variable in turn and assessing how much worse the model performs. Permuting an important variable would decrease the performance of the model, while permuting an unrelated variable should have little effect.
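The idea behind these permutation importance scores can be sketched in a few lines of R. This is only a conceptual illustration, not the exact algorithm the randomForest package uses internally (which permutes out-of-bag samples within each tree); fit, x, y, and var here stand for any fitted classifier with a predict() method, its training data, and the name of one predictor.
# Conceptual sketch of permutation importance (illustration only; the
# randomForest package permutes out-of-bag data within each tree).
permutation_importance <- function(fit, x, y, var) {
  baseline <- mean(predict(fit, x) == y)          # accuracy with the data as-is
  x_permuted <- x
  x_permuted[[var]] <- sample(x_permuted[[var]])  # shuffle one predictor
  permuted <- mean(predict(fit, x_permuted) == y) # accuracy after shuffling
  baseline - permuted  # big drop = important variable; near zero = unimportant
}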
At this point in the class I’m always asked whether you could use a procedure like this to figure out which variables are important, and then carry only those forward into a more traditional statistical model, such as logistic regression. My answer is an emphatic no. It’s fine to take the selected variables forward and fit that model on an independent validation dataset, but it is not okay to fit it on the same data that a machine learning procedure, such as random forest, used to select the variables in the first place.
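If you do want to carry selected variables into a traditional model, the selection and the inference have to happen on different data. Here is a minimal sketch of that split, assuming a hypothetical data frame dat with a binary outcome column y and a placeholder chosen_vars for whatever the selection step returns; the 50/50 split is arbitrary.
# Sketch: select features on one half of the data, fit and interpret the
# logistic regression on the other half.
set.seed(1)
idx <- sample(nrow(dat), nrow(dat)/2)
dat_select <- dat[idx, ]    # used only for random forest feature selection
dat_infer  <- dat[-idx, ]   # used only for the follow-up logistic regression
# ...run random forest / varImp() on dat_select to get chosen_vars, then...
# fit <- glm(reformulate(chosen_vars, response="y"),
#            data=dat_infer, family="binomial")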
The demonstration here shows what happens when you try this. I generate completely random data on hundreds of potential input \(X\) variables with no association to a randomly generated binary outcome variable. I select variables with random forest and show that the p-values on a follow-up logistic regression are biased toward zero, wildly inflating your type I error rate.
Load relevant libraries.
library(tidyverse)
library(caret)
Now, let’s simulate \(n=50\) samples with \(k=200\) normally distributed random \(X\) variables, and a binary outcome variable \(Y\) randomly sampled as 0 or 1. Importantly, all the \(X\) and \(Y\) variables are randomly generated, thus the null hypothesis is true at every \(X\) input variable.
n <- 50
k <- 200
set.seed(96815)
x <- as.data.frame(round(matrix(rnorm(n*k), ncol=k), 2))
y <- factor(sample(0:1, n, replace=TRUE))
Let’s look at the first 5 values of the outcome variable \(Y\) and the first 10 rows of the first 5 input \(X\) variables.
head(y, 5)
## [1] 0 0 1 1 1
## Levels: 0 1
x[1:10, 1:5]
## V1 V2 V3 V4 V5
## 1 -0.16 1.39 -0.05 -1.99 1.41
## 2 -0.58 0.16 0.96 0.19 0.40
## 3 2.40 -0.19 -0.68 -0.87 -2.18
## 4 0.13 -0.73 0.21 -0.45 -0.07
## 5 -1.01 -1.19 -3.05 0.39 0.71
## 6 0.95 -0.02 -0.34 -1.12 0.78
## 7 1.38 -0.96 -0.73 0.58 1.77
## 8 0.45 0.51 1.18 1.24 -0.56
## 9 -0.09 1.18 2.26 -0.64 -0.82
## 10 -0.21 0.53 0.80 -1.36 0.37
Let’s write a function that takes a data frame of input variables \(X\), a vector of binary outcomes \(Y\), and, optionally, indexes, a vector giving the indexes of the \(X\) variables to test for association with \(Y\). The function builds a data frame from \(Y\) and the chosen subset of \(X\) variables, fits a single-variable logistic regression for each of those \(X\) variables, extracts the p-value for that variable’s coefficient, and returns the vector of p-values. If we don’t supply indexes, the function runs logistic regression on every input variable, one at a time.
run_logreg_selected_vars <- function(x, y, indexes=NULL) {
  stopifnot(length(y)==nrow(x))
  if (is.null(indexes)) indexes <- seq_along(x)
  stopifnot(length(indexes)<=ncol(x))
  data.frame(y, x[,indexes]) %>%
    as_tibble() %>%
    gather(var, x, -y) %>%
    group_by(var) %>%
    nest() %>%
    mutate(mod=map(data, ~glm(y~x, data=., family="binomial"))) %>%
    mutate(p=map_dbl(mod,
                     ~broom::tidy(.) %>%
                       filter(term=="x") %>%
                       pull(p.value))) %>%
    pull(p)
}
If we pick 10 \(X\) variables at random and test each for association with \(Y\), the resulting p-values should be uniformly distributed. Here’s what that looks like over a few runs.
set.seed(119)
run_logreg_selected_vars(x, y, sample(ncol(x), 10))
## [1] 0.39205832 0.38996898 0.15392488 0.04449994 0.93335342 0.35264030
## [7] 0.79398163 0.58648285 0.64119610 0.34082935
run_logreg_selected_vars(x, y, sample(ncol(x), 10))
## [1] 0.78769502 0.06686609 0.12856530 0.70436042 0.75285248 0.11428466
## [7] 0.53166297 0.24294458 0.34917105 0.55036297
run_logreg_selected_vars(x, y, sample(ncol(x), 10))
## [1] 0.44040551 0.10968163 0.82595049 0.07983172 0.31414499 0.64119610
## [7] 0.35464497 0.68595883 0.27659737 0.36864575
In fact, we can look at the distribution of p-values across all 200 input variables, which should be approximately uniform.
hist(run_logreg_selected_vars(x, y), col=8)
plot(density(run_logreg_selected_vars(x, y)), xlim=c(0,1))
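If you want something more formal than eyeballing the histogram or density plot, one rough option is a Kolmogorov–Smirnov test of all 200 p-values against the uniform distribution (rough because the tests share the same \(Y\), so the p-values aren’t strictly independent):
# Rough check of uniformity for all 200 single-variable p-values.
pvals_all <- run_logreg_selected_vars(x, y)
ks.test(pvals_all, "punif")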
Now, let’s train a Random Forest model to predict \(Y\) given \(X\).
set.seed(119)
rf <- train(x, y, method="rf")
Remember, we’ve randomly generated the \(X\) and \(Y\) variables, so the null hypothesis is, by definition, true: there is no real predictive relationship between any \(X\) variable and \(Y\). But random forest will overfit to a degree, and we can see here that the resampled accuracy for the selected tuning parameter is about 67%, better than a coin flip, so the procedure has picked up on noise variables that are associated with \(Y\) by chance alone.
rf
## Random Forest
##
## 50 samples
## 200 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6662011 0.02244032
## 101 0.5651703 -0.09628348
## 200 0.5337067 -0.14418679
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Let’s look at the variable importance.
imp <- varImp(rf)
plot(imp, 10)
Let’s get the indexes of the top 10 most important features as selected by the random forest, and run logistic regression on each of these variables.
top10 <- order(imp$importance$Overall, decreasing=TRUE)[1:10]
pvals <- run_logreg_selected_vars(x, y, indexes=top10)
pvals
## [1] 0.02820190 0.01110958 0.59902716 0.13220405 0.50792903 0.82194001
## [7] 0.11015087 0.05002388 0.11710376 0.03709156
We’re only looking at 10 logistic regression models, but their p-values are heavily skewed toward zero. Three of the ten fall below \(p=0.05\), a fourth sits right at 0.05, and several more hover around \(p=0.1\), indicating a vastly inflated type I error rate.
plot(density(pvals), xlim=c(0,1))
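A single simulated dataset is suggestive, but you could quantify the inflation by wrapping the whole simulate–select–test pipeline in a loop and tracking how often the selected variables reach \(p<0.05\). Here’s a rough sketch along those lines; it’s slow because each iteration refits the random forest, nsim and the other settings are arbitrary, and with honest inference the proportion below 0.05 should be about 5%.
# Sketch: repeat simulate -> random forest -> take top 10 -> logistic
# regression, and record the fraction of selected variables with p < 0.05.
simulate_selection_bias <- function(n=50, k=200, ntop=10) {
  x <- as.data.frame(matrix(rnorm(n*k), ncol=k))
  y <- factor(sample(0:1, n, replace=TRUE))
  rf <- train(x, y, method="rf")
  top <- order(varImp(rf)$importance$Overall, decreasing=TRUE)[1:ntop]
  run_logreg_selected_vars(x, y, indexes=top)
}
nsim <- 20                   # keep small; each run refits a random forest
sim_pvals <- replicate(nsim, simulate_selection_bias())
mean(sim_pvals < 0.05)       # empirical type I error rate at alpha = 0.05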