Here I use partial dependence plots (PDPs) to show how each feature influences a model's predictions. The example uses the Framingham Heart Study to predict time to hypertension.
First we build a black box model.
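# Packages used below; load plyr before dplyr so dplyr's verbs are not masked
library(plyr)     # ldply()
library(dplyr)    # %>%, filter(), select()
library(gbm)
library(ggplot2)

# Limit to folks that experienced hypertension
data <- framingham %>%
  filter(outcome == 1) %>%
  select(-outcome)

model <- gbm(time_outcome ~ .
             ,data = data
             ,n.trees = 200
             ,interaction.depth = 2
             ,shrinkage = 0.05
             ,cv.folds = 5
             ,distribution = 'gaussian'
             )

# Number of trees that minimizes the cross-validated error
optimal_number_of_trees <- gbm.perf(model, method = "cv", plot.it = FALSE)
summary(model, plotit = FALSE)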
##               var    rel.inf
## BPVar       BPVar 42.7283735
## bmi           bmi 15.5901238
## age           age 12.9454171
## totchol   totchol  9.6166384
## glucose   glucose  7.9428679
## heartrte heartrte  7.1630393
## cigpday   cigpday  2.7325736
## female     female  0.7676150
## cursmoke cursmoke  0.5133514
Not surprisingly, blood pressure (BPVar) is the most influential feature. But what are the shapes of these relationships?
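To find out, take one feature at a time, set it to the same value for everyone while leaving the other features as observed, and average the model's predictions. Repeating this across a range of values traces out a curve for that feature. For example, what is the average prediction if everyone had their blood pressure changed to normal (0 on this scale)?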
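Before running this on the gbm fit, here is a minimal sketch of the same recipe on a model where we know the answer in advance. The `mtcars` data and the `lm()` fit are stand-ins I picked for illustration; they are not part of the Framingham analysis. For a linear model with no interactions, the partial dependence curve should be a straight line whose slope is the fitted coefficient.

```r
# Sketch: partial dependence by hand on a model with a known answer.
# mtcars and lm() are illustrative stand-ins, not part of the analysis.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Values of wt at which to evaluate the curve
grid <- quantile(mtcars$wt, probs = c(0.25, 0.50, 0.75))

pd <- sapply(grid, function(value) {
  imagined <- mtcars
  imagined$wt <- value                     # every car gets the same wt
  mean(predict(fit, newdata = imagined))   # hp stays as observed
})

# For an additive linear model the curve is a straight line
# whose slope equals the fitted coefficient for wt.
slope <- unname(diff(pd) / diff(grid))
all.equal(slope, rep(unname(coef(fit)["wt"]), 2))
```

If the last line does not evaluate to `TRUE`, something is off in the averaging step. The nice thing about this recipe is that it only needs a `predict()` method, so it carries over unchanged to a black box model.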
# Evaluate each feature at its 1st through 99th percentiles
probs <- (1:99) / 100

pdp <- ldply(model$var.names, function(variable) {
  values <- quantile(data[[variable]], probs = probs)
  ldply(values, function(value) {
    # Set this feature to one value for everyone; leave the rest as observed
    imagined_data <- data
    imagined_data[[variable]] <- value
    data.frame(variable = variable,
               value = value,
               outcome = mean(predict(model, newdata = imagined_data,
                                      n.trees = optimal_number_of_trees)))
  })
})

# One panel per feature; rug ticks mark the evaluated points
ggplot(pdp, aes(value, outcome)) +
  geom_path() +
  geom_rug(alpha = 0.2) +
  facet_wrap(~variable, scales = "free_x") +
  theme_minimal()
Now we get insight into what our black box model learned.
It's no surprise that predicted time to hypertension goes down as blood pressure rises. But now we can see how. The decline starts once blood pressure rises above about 5 points below normal (around -5 on this scale). This suggests ‘normal’ blood pressure might not be a good enough target to de-risk hypertension. Maybe we should encourage folks to strive for below-average blood pressure?
We also see some odd trends. The blood pressure curve drops suddenly near 0 and even goes up a bit near 10. This doesn’t make a lot of sense. Maybe the model is overfitting? Maybe there’s an issue with the data?
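One way to probe the overfitting question is to refit the model on bootstrap resamples and see whether the bumps show up every time. The sketch below is an assumption-heavy illustration, not the analysis above: the data are synthetic, `pd_curve` is a hypothetical helper, and a flexible `lm()` stands in for the gbm fit so the example runs with base R alone.

```r
# Sketch: bootstrap a partial dependence curve to see which wiggles are stable.
# Synthetic data and the lm() stand-in are assumptions; swap in the gbm() fit.
set.seed(42)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n)

grid <- quantile(toy$x1, probs = (1:9) / 10)

# Partial dependence of `variable` for any model with a predict() method
pd_curve <- function(fit, data, variable, grid) {
  sapply(grid, function(value) {
    imagined <- data
    imagined[[variable]] <- value
    mean(predict(fit, newdata = imagined))
  })
}

# Refit on resampled rows, always evaluating on the original data and grid
boot_curves <- replicate(50, {
  resampled <- toy[sample(n, replace = TRUE), ]
  fit <- lm(y ~ poly(x1, 5) + x2, data = resampled)
  pd_curve(fit, toy, "x1", grid)
})

# Pointwise 5%-95% band across refits (rows = bounds, columns = grid points)
band <- apply(boot_curves, 1, quantile, probs = c(0.05, 0.95))
```

Where the band is wide, the shape of the curve depends on which rows happened to be sampled, so a bump there is more likely noise than signal. A narrow band around an odd feature points back toward the data instead.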
With this visualization we learn where we have room to explore.