Do decision trees capture interaction effects automatically, or do we have to include interaction terms as explicit inputs to the model? This toy example shows that decision trees model interaction effects appropriately simply because of the way they are built. We therefore do not need to (and in fact should not) include interaction terms as inputs to a decision tree model.
First, we simulate data from the model Y = aX + bW + cXW + ε, where W is 1 if X is between 1 and 2 and W is 0 for all other values of X. In the code below a = 0.1, b = c = 1, and ε is Gaussian noise.
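Because the data are simulated with random noise, the exact numbers shown below will vary from run to run; if you want reproducible output you can fix the random seed first (the seed value here is arbitrary).
set.seed(1) # arbitrary seed; only needed if you want the same noise draw every run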
x = seq(0,3,0.01)
w = (x>1 & x<2)*1
epsilon = rnorm(length(x),mean=0,sd=0.1)
y = 0.1*x + w + w*x + epsilon
We plot x against y to visualize the relationship.
plot(x,y)
Let's try regressing y on x alone, ignoring w and its interaction with x.
linear.model = lm(y~x)
If we plot predicted vs. actual values, we see how poorly our model is performing. The red line marks a perfect predictor.
plot(predict(linear.model),y,xlab="Predicted",ylab="Actual")
abline(0,1,col='red')
If we look at the residual diagnostics, they give us another view of the model's poor performance.
par(mfrow=c(2,2))
plot(linear.model)
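As a rough numerical check (the exact value depends on the simulated noise), we can also look at how much of the variance the misspecified model explains:
summary(linear.model)$r.squared # in-sample R-squared of the model without w or the interaction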
Obviously, if we include the moderating variable w and its interaction with x in our model, the model performs much better.
interaction.model = lm(y ~ x + w + w:x)
summary(interaction.model)
##
## Call:
## lm(formula = y ~ x + w + w:x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.25713 -0.06688 -0.00511  0.06178  0.28360
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.008352   0.012522  -0.667    0.505
## x            0.101320   0.006857  14.777   <2e-16 ***
## w            1.082382   0.055932  19.352   <2e-16 ***
## x:w          0.968520   0.036352  26.643   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1015 on 297 degrees of freedom
## Multiple R-squared:  0.9931, Adjusted R-squared:  0.993
## F-statistic: 1.419e+04 on 3 and 297 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(interaction.model)
par(mfrow=c(1,1))
plot(predict(interaction.model),y,xlab="Predicted",ylab="Actual")
abline(0,1,col='red')
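As a sanity check, the residual standard error of the interaction model should be close to the noise standard deviation of 0.1 that we used in rnorm:
sigma(interaction.model) # residual standard error; roughly 0.1, the sd of the injected noise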
Now let's try fitting a decision tree to the same data. Does the decision tree capture the interaction effect?
library(tree)
tree.model2 = tree(y~x+w)
plot(tree.model2)
text(tree.model2,pretty=0)
Note that the decision tree captured the interaction effect of W with X. When W is less than 0.5 (which in this case means it is equal to zero), the tree simply predicts Y as the average value of Y among those observations (refined by any further splits on X). When W is greater than 0.5 (i.e. when W is equal to 1), the prediction of the outcome variable Y is adjusted upward because the tree effectively models the segment 1 < X < 2 separately. This is exactly the interaction that was baked into the simulation. Way to go, decision trees!
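To see the splits numerically rather than graphically, we can print the fitted tree; each row shows a split rule and the mean of y predicted in that node (your exact splits may differ slightly because of the random noise).
tree.model2 # print the splits and the fitted value of y in each node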
How well did we do on our predictions?
plot(predict(tree.model2),y,xlab="Predicted",ylab="Actual")
abline(0,1,col="red")
What if we don't even include the moderating variable in the construction of our decision tree? Can we still predict Y reasonably well?
tree.model = tree(y~x)
plot(tree.model)
text(tree.model,pretty=0)
Indeed, we see that even without including the moderating variable W in our model, we were able to predict Y with reasonable accuracy. This is because the decision tree's recursive binary splits allow it to treat observations with 1 < X < 2 entirely differently from observations with X < 1 or X > 2.
plot(predict(tree.model),y,xlab="Predicted",ylab="Actual")
abline(0,1,col="red")