Partial Dependence Plots to Show Feature Influence

Here I use Partial Dependence Plots (PDP) to show how each feature influences the outcome. In this case, I use the Framingham Heart Study to predict time to hypertension.

First, we build a black box model: a gradient boosted model fit with gbm.

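library(plyr)     # ldply(); load before dplyr so it does not mask dplyr verbs
library(dplyr)    # filter(), select(), %>%
library(gbm)      # gbm(), gbm.perf()
library(ggplot2)  # plotting

# Limit to folks that experienced hypertension
data <- framingham %>%
  filter(outcome == 1) %>%
  select(-outcome)

# Fit a gradient boosted model of time to hypertension on all remaining columns
model <- gbm(time_outcome ~ .
             ,data = data
             ,n.trees = 200
             ,interaction.depth = 2
             ,shrinkage = 0.05
             ,cv.folds = 5
             ,distribution = 'gaussian'
             )

# Number of trees that minimizes the cross-validated error
optimal_number_of_trees <- gbm.perf(model, method = "cv", plot.it = FALSE)

# Relative influence of each feature
summary(model, plotit = FALSE)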

##               var    rel.inf
## BPVar       BPVar 42.7283735
## bmi           bmi 15.5901238
## age           age 12.9454171
## totchol   totchol  9.6166384
## glucose   glucose  7.9428679
## heartrte heartrte  7.1630393
## cigpday   cigpday  2.7325736
## female     female  0.7676150
## cursmoke cursmoke  0.5133514

Not surprisingly, blood pressure (BPVar) is the most influential feature. But what are the shapes of these relationships?

To find out, for each feature we compute the model's average prediction if everyone had that feature set to a given value, and repeat this across a range of values. For example, what is the average prediction if everyone had their blood pressure changed to normal (0 on this scale)?
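In symbols, the quantity described above can be sketched as follows. The notation is my own shorthand rather than anything from the original analysis: \hat{f} is the fitted model, n is the number of people, p is the number of features, x_{i,k} is person i's observed value of feature k, and v is the value imposed on feature j.

```latex
\[
  \widehat{\mathrm{PD}}_j(v)
    = \frac{1}{n} \sum_{i=1}^{n}
      \hat{f}\bigl(x_{i,1}, \dots, x_{i,j-1},\, v,\, x_{i,j+1}, \dots, x_{i,p}\bigr)
\]
```

Plotting \widehat{\mathrm{PD}}_j(v) against v, one feature at a time, gives the partial dependence curve for that feature.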

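# Evaluate each feature at its 1st through 99th percentiles
probs <- (1:99)/100
pdp <- ldply(model$var.names, function(variable) {
  values <- quantile(data[[variable]], probs = probs)
  ldply(values, function(value) {
    # Set this feature to a single value for everyone; other features stay as observed
    imagined_data <- data
    imagined_data[[variable]] <- value
    # Average the model's predictions over the imagined data
    data.frame(variable = variable,
               value = value,
               outcome = mean(predict(model, newdata = imagined_data, n.trees = optimal_number_of_trees)))
  })
})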

ggplot(pdp, aes(value, outcome)) +
  geom_path() +
  geom_rug(alpha=0.2) +
  facet_wrap(~variable, scales="free_x") +
  theme_minimal()
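As a sanity check on the hand-rolled loop above, gbm can also return its own partial dependence grid through `plot()` with `return.grid = TRUE`. The sketch below runs on simulated stand-in data, not the Framingham data, and `sim`, `sim_model`, `builtin_pdp`, and `manual_pdp` are hypothetical names. As I understand it, gbm computes its grid by a weighted tree traversal rather than by brute-force averaging, so the two curves should have a similar shape without matching exactly.

```r
library(gbm)

# Simulated stand-in data (an assumption for illustration only)
set.seed(42)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = runif(n))
sim$y <- 3 * sim$x1 + rnorm(n, sd = 0.5)

sim_model <- gbm(y ~ ., data = sim, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 2, shrinkage = 0.05)

# gbm's built-in partial dependence for x1, returned as a data frame
# with one column named after the variable and one column named y
builtin_pdp <- plot(sim_model, i.var = "x1", n.trees = 100,
                    continuous.resolution = 20, return.grid = TRUE)

# Brute-force version on the same grid: overwrite x1 for every row,
# then average the predictions
manual_pdp <- sapply(builtin_pdp$x1, function(value) {
  imagined <- sim
  imagined$x1 <- value
  mean(predict(sim_model, newdata = imagined, n.trees = 100))
})

# The two curves should move together closely
cor(builtin_pdp$y, manual_pdp)
```

If the two versions disagreed badly on the real model, that would be a hint to look for a bug in the loop before reading anything into the plots.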

Now we get insight into what our black box model learned.

No surprise that as blood pressure rises, time to hypertension goes down. But now we can see how. The decline starts once blood pressure rises past 5 points below normal. This suggests ‘normal’ blood pressure might not be a good enough target to de-risk hypertension. Maybe we should encourage folks to strive for below-normal blood pressure?

We also see some odd trends. The blood pressure curve drops suddenly near 0 and even goes up a bit near 10. This doesn’t make a lot of sense. Maybe the model is overfitting? Maybe there’s an issue with the data?
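One rough way to start on the overfitting question is to compare training error with cross-validated error across boosting iterations, both of which a gbm fit with `cv.folds` stores on the model object. The sketch below again uses simulated stand-in data, and `sim`, `sim_model`, `best_iter`, and `error_gap` are hypothetical names. A large gap between the two errors at the chosen number of trees would be one hint of overfitting, though it would not explain a specific bump in a curve.

```r
library(gbm)

# Simulated stand-in data (an assumption for illustration only)
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = runif(n), x3 = rnorm(n))
sim$y <- 3 * sim$x1 + rnorm(n, sd = 0.5)

sim_model <- gbm(y ~ ., data = sim, distribution = "gaussian",
                 n.trees = 200, interaction.depth = 2, shrinkage = 0.05,
                 cv.folds = 5, n.cores = 1)

# Iteration with the lowest cross-validated error
best_iter <- gbm.perf(sim_model, method = "cv", plot.it = FALSE)

# Training and cross-validated error at every boosting iteration
error_gap <- data.frame(
  trees = seq_len(sim_model$n.trees),
  train = sim_model$train.error,
  cv    = sim_model$cv.error
)

# Errors at the chosen iteration
error_gap[best_iter, ]
```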

With this visualization we learn where we have room to explore.