Tree Simulation

The simulated data sets are as follows.

  1. y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e

where x1 through x5 are uniformly distributed on the interval [0, 1]. (The mlbench data set also carries five additional uniform predictors, x6 through x10, that do not enter the formula; a by-hand sketch of this first equation appears after the list.)

  2. y = (x1^2 + (x2 x3 - 1/(x2 x4))^2)^0.5 + e

subject to i) 0 ≤ x1 ≤ 100, ii) 40π ≤ x2 ≤ 560π, iii) 0 ≤ x3 ≤ 1, iv) 1 ≤ x4 ≤ 11

  3. y = atan((x2 x3 - 1/(x2 x4)) / x1) + e, subject to i) 0 ≤ x1 ≤ 100, ii) 40π ≤ x2 ≤ 560π, iii) 0 ≤ x3 ≤ 1, iv) 1 ≤ x4 ≤ 11
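
To make the first equation concrete, here is a minimal by-hand sketch of the same simulation; mlbench.friedman1 does the equivalent internally (the draws will not match mlbench's exactly, even with the same seed).

set.seed(200)
n <- 200
x.sim <- matrix(runif(n * 10), ncol = 10)          # ten uniform [0, 1] predictors
y.sim <- 10 * sin(pi * x.sim[, 1] * x.sim[, 2]) +
         20 * (x.sim[, 3] - 0.5)^2 +
         10 * x.sim[, 4] + 5 * x.sim[, 5] +
         rnorm(n, sd = 1)                          # Gaussian error term e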

Using 200 observations from each simulated data set, I test four different tree-based models with 10-fold cross-validation repeated 3 times, and then examine how much bias each model has. For convenience, I use caret's tuneLength argument for each tree model.
library(mlbench)   # mlbench.friedman* simulation functions
library(caret)     # featurePlot() and train()

set.seed(200)
sim.1 <- mlbench.friedman1(200, sd = 1)
sim.1 <- cbind(sim.1$x, sim.1$y)
sim.1 <- as.data.frame(sim.1)
colnames(sim.1)[ncol(sim.1)] <- "y"

The feature plot of the first simulated data set against y is as follows.

featurePlot(sim.1[,1:10], sim.1[,11])

set.seed(200)
sim.2 <- mlbench.friedman2(200, sd = 1)
sim.2 <- cbind(sim.2$x, sim.2$y)
sim.2 <- as.data.frame(sim.2)
colnames(sim.2)[ncol(sim.2)] <- "y"

The feature plot of the second simulated data set against y is as follows.

featurePlot(sim.2[,1:4], sim.2[,5])

The feature plot of the third simulated data set against y is as follows.

set.seed(200)
sim.3 <- mlbench.friedman3(200, sd = 1)
sim.3 <- cbind(sim.3$x, sim.3$y)
sim.3 <- as.data.frame(sim.3)
colnames(sim.3)[ncol(sim.3)] <- "y"
featurePlot(sim.3[,1:4], sim.3[,5])

Single Tree

For the single-tree methods, I use CART and the Conditional Inference Tree. The CART model penalizes each additional split to prevent overfitting and/or controls the maximum depth of the tree. I ran the two methods separately to isolate the effect of each parameter.
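
As a minimal sketch of these two controls, a CART tree can also be grown directly with rpart; the train() calls below tune the same parameters through caret ("rpart" tunes cp, "rpart2" tunes maxdepth), and the values here are only illustrative.

library(rpart)
cart.fit <- rpart(y ~ ., data = sim.1,
                  control = rpart.control(cp = 0.01,      # complexity penalty a split must beat
                                          maxdepth = 5))  # cap on tree depth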

ctrl <- trainControl(method = "repeatedcv", repeats = 3)
r.tune.cp.1 <- train(y ~.,data = sim.1, trControl = ctrl, method = "rpart", tuneLength = 10)
## Loading required package: rpart
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
r.tune.dpth.1  <- train(y ~.,data = sim.1, trControl = ctrl, method = "rpart2", tuneLength = 10)

r.tune.cp.2 <- train(y ~.,data = sim.2, trControl = ctrl, method = "rpart", tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
r.tune.dpth.2  <- train(y ~.,data = sim.2, trControl = ctrl, method = "rpart2", tuneLength = 10)

r.tune.cp.3 <- train(y ~.,data = sim.3, trControl = ctrl, method = "rpart", tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
r.tune.dpth.3  <- train(y ~.,data = sim.3, trControl = ctrl, method = "rpart2", tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.

The Conditional Inference Tree model is similar to CART but uses a different mechanism to prevent overfitting. It uses a p-value to judge whether the regression means coming out of the nodes of a split are significantly different; if not, the split is not made. As with CART, a maximum tree depth can also be assigned. Note that Conditional Inference Trees can also be used in a random forest model, the so-called cforest.
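
As a minimal sketch of this mechanism, a conditional inference tree can be grown directly with party; caret's "ctree2" tunes maxdepth together with mincriterion, which is one minus the p-value threshold a split must beat. The parameter values below are only illustrative.

library(party)
ct.fit <- ctree(y ~ ., data = sim.1,
                controls = ctree_control(mincriterion = 0.95,  # split only if p < 0.05
                                         maxdepth = 5))        # cap on tree depth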

ctrl <- trainControl(method = "repeatedcv", repeats = 3)

c.tune.dpth.1  <- train(y ~.,data = sim.1, trControl = ctrl, method = "ctree2", tuneLength = 10)
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
c.tune.dpth.2  <- train(y ~.,data = sim.2, trControl = ctrl, method = "ctree2", tuneLength = 10)

c.tune.dpth.3  <- train(y ~.,data = sim.3, trControl = ctrl, method = "ctree2", tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.

The rule-based model prevents excessive splitting on the same rule. Since a tree tends to favor continuous variables with many unique values, it is likely to split the data excessively, creating redundancy and adding unnecessary complexity to the model.
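
For reference, the grid that tuneLength expands for caret's "M5Rules" method is built from two on/off factors, pruning and smoothing, which is what the plots below compare. A sketch of the equivalent explicit grid (assuming RWeka is installed):

m5.grid <- expand.grid(pruned = c("Yes", "No"), smoothed = c("Yes", "No"))
# train(y ~ ., data = sim.1, trControl = ctrl, method = "M5Rules", tuneGrid = m5.grid)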

ru.tune.cp.1 <- train(y ~.,data = sim.1, trControl = ctrl, method = "M5Rules", tuneLength = 10)
## Loading required package: RWeka
ru.tune.cp.2 <- train(y ~.,data = sim.2, trControl = ctrl, method = "M5Rules", tuneLength = 10)

ru.tune.cp.3 <- train(y ~.,data = sim.3, trControl = ctrl, method = "M5Rules", tuneLength = 10)

As the plots show, except for the rule-based model on the second simulated data set, the smoothed model performs well and has lower bias than the unsmoothed model. Overall, the pruned model also performs better than the unpruned one.

plot(ru.tune.cp.1)

plot(ru.tune.cp.2)

plot(ru.tune.cp.3)

For the rpart model on the first simulated data set, the bias increases as the complexity penalty increases and as the maximum depth decreases. This is no surprise, since the penalty parameter discourages overfitting while a deeper maximum depth encourages it.

cp.1 <- r.tune.cp.1$results
dpth.1 <- r.tune.dpth.1$results
par(mfrow = c(1,2), mar = c(5.1,4.1,2,1))
plot(cp.1[,1], cp.1[,2], type = "b", main = "Complexity Penalty",
     xlab = "Penalty", ylab = "RMSE")
plot(dpth.1[,1], dpth.1[,2], type = "b", main = "Maximum Depth",
     xlab = "Depth", ylab = "RMSE")

The second simulated data set shows the same pattern as the first.

cp.2 <- r.tune.cp.2$results
dpth.2 <- r.tune.dpth.2$results
plot(cp.2[,1], cp.2[,2], type = "b", main = "Complexity Penalty",
     xlab = "Penalty", ylab = "RMSE")

plot(dpth.2[,1], dpth.2[,2], type = "b", main = "Maximum Depth",
     xlab = "Depth", ylab = "RMSE")

But the third data set shows the opposite pattern. My guess is that hardly any of the predictors show a clear relationship with y.

cp.3 <- r.tune.cp.3$results
dpth.3 <- r.tune.dpth.3$results
plot(cp.3[,1], cp.3[,2], type = "b", main = "Complexity Penalty",
     xlab = "Penalty", ylab = "RMSE")

plot(dpth.3[,1], dpth.3[,2], type = "b", main = "Maximum Depth",
     xlab = "Depth", ylab = "RMSE")

For the Conditional Inference Tree, the deeper the tree, the lower the bias. For this simulated data set, however, the p-value threshold shows at best a marginal effect. As the tree gets deeper, a higher p-value threshold does begin to affect the bias of the model, but the overall effect still seems marginal compared to that of the maximum tree depth.

c.1 <- c.tune.dpth.1$results
num <- list(c(1:10), c(11:20),c(21:30),c(31:40),
            c(41:50), c(51:60), c(61:70),
            c(71:80), c(81:90), c(91:100))

num_1 <- rep(1,10)
num_1[1] <- paste("Depth",1, sep = ".")
par(mar=c(5.1, 4.1, 4.1, 8.1), mfrow = c(1,1), xpd=TRUE)

plot(c.1[num[[1]], 2], c.1[num[[1]], 3],type = 'l', xlab = "P-Value",ylab = "RMSE",ylim = c(3.4,4.7), main = "RMSE per Tree depth and P-Value")
for(i in 1:9) {
        lines(c.1[num[[i+1]], 2], c.1[num[[i+1]], 3], col = i+1, lty = i +1 )
        num_1[i+1] <- paste("Depth",i+1, sep = ".")
}
par(xpd=TRUE)
legend("topright", inset=c(-0.25,0),num_1, lty = c(1:10), col = c(1:10), cex = 0.5)

For the second simulated data set, only the maximum tree depth affects the bias.

c.2 <- c.tune.dpth.2$results

plot(c.2[num[[1]], 2], c.2[num[[1]], 3],type = 'l', xlab = "P-Value",ylab = "RMSE",ylim = c(100,300), main = "RMSE per Tree depth and P-Value")
for(i in 1:9) {
        lines(c.2[num[[i+1]], 2], c.2[num[[i+1]], 3], col = i+1, lty = i +1 )
        num_1[i+1] <- paste("Depth",i+1, sep = ".")
}
par(xpd=TRUE)
legend("topright", inset=c(-0.25,0),num_1, lty = c(1:10), col = c(1:10), cex = 0.5)

The third simulated data set also shows a pattern similar to the CART model's. Both a lower p-value threshold and a larger maximum tree depth, which are supposed to induce overfitting and thus lower bias, instead encourage underfitting.

c.3 <- c.tune.dpth.3$results
plot(c.3[num[[1]], 2], c.3[num[[1]], 3],type = 'l', xlab = "P-Value",ylab = "RMSE",ylim = c(1,1.4), main = "RMSE per Tree depth and P-Value")
for(i in 1:9) {
        lines(c.3[num[[i+1]], 2], c.3[num[[i+1]], 3], col = i+1, lty = i +1 )
        num_1[i+1] <- paste("Depth",i+1, sep = ".")
}
par(xpd=TRUE)
legend("topright", inset=c(-0.25,0),num_1, lty = c(1:10), col = c(1:10), cex = 0.5)