8.2. Use a simulation to show tree bias with different granularities.
suppressMessages(suppressWarnings(library(partykit)))
suppressMessages(suppressWarnings(library(rpart)))
suppressMessages(suppressWarnings(library(caret)))
# Predictors with decreasing granularity (number of distinct values); none is related to Y.
sim2 <- as.data.frame(cbind(runif(100, 1, 3000), floor(runif(100, 1, 100)), floor(runif(100, 25, 75)), floor(runif(100, 50, 60))))
colnames(sim2) <- c("Y", "high", "middle", "low")
rpartTree <- rpart(Y ~ ., sim2)
#rpart.plot(rpartTree)
imp4 <- varImp(rpartTree)
imp4
## Overall
## high 0.2862183
## low 0.1846420
## middle 0.1668561
plot(as.party(rpartTree))
Even though none of the predictors is related to Y, the most granular predictor ("high") receives the largest importance, illustrating the selection bias of tree-based models toward predictors with more distinct values.

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
a. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
There are two reasons (a sketch reproducing this comparison follows the list):
1. Bagging fraction: with a low bagging fraction, each tree is built on a small random subsample of the data, so the most explanatory predictors are not always dominant and other predictors get selected, spreading importance across more variables (and possibly giving some of them undue credit). With a high bagging fraction, nearly every tree sees essentially the same data, so the strongest predictors are chosen repeatedly and accumulate most of the importance.
2. Learning rate: a learning rate close to 1 applies almost no shrinkage, so each added tree makes a large correction and the model converges quickly using only the few strongest predictors, concentrating importance on them; a small learning rate makes many small corrections and lets more predictors contribute.
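To see this effect directly, here is a minimal sketch (not the code that produced Fig. 8.24): it fits boosted trees on the solubility data with both parameters set to 0.1 and to 0.9 and plots the variable importance. It assumes the solubility objects from AppliedPredictiveModeling and the gbm engine via caret.
suppressMessages(suppressWarnings(library(caret)))
suppressMessages(suppressWarnings(library(gbm)))
suppressMessages(suppressWarnings(library(AppliedPredictiveModeling)))
data(solubility)  # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY
# Helper: fit one boosted model with a given bagging fraction and learning rate.
fitBoost <- function(bagFrac, shrink) {
  train(x = solTrainXtrans, y = solTrainY, method = "gbm",
        tuneGrid = expand.grid(n.trees = 100, interaction.depth = 1,
                               shrinkage = shrink, n.minobsinnode = 10),
        bag.fraction = bagFrac, verbose = FALSE,
        trControl = trainControl(method = "none"))
}
gbmLow  <- fitBoost(0.1, 0.1)    # both parameters at 0.1
gbmHigh <- fitBoost(0.9, 0.9)    # both parameters at 0.9
plot(varImp(gbmLow),  top = 25)  # importance spread across many predictors
plot(varImp(gbmHigh), top = 25)  # importance concentrated on a few predictors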
b. Which model do you think would be more predictive of other samples?
The 0.1 model would likely be more predictive of new samples: the 0.9 model puts almost all of its weight on the top 3-4 predictors and largely ignores the rest, including HydrophilicFactor, which is the second most important predictor in the 0.1 model.
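As a rough check of this claim, the two models from the sketch above could be scored on the solubility test set (solTestXtrans and solTestY come from the same data(solubility) call):
postResample(predict(gbmLow, solTestXtrans), solTestY)   # 0.1/0.1 model
postResample(predict(gbmHigh, solTestXtrans), solTestY)  # 0.9/0.9 model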
c. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would make the importance slope less steep for both models (see the sketch below). Interaction depth controls the depth of each tree and therefore the number of node splits; as the trees get deeper and more splits occur, variable importance is spread across more predictors. In both models the importance of the top variables would decrease while that of less important variables would increase, and if there are highly correlated predictors we may even see importance swap between them.
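A minimal sketch of this experiment, assuming the objects from the earlier sketch (solTrainXtrans, solTrainY) are still in the workspace: refit the 0.9/0.9 model with a much larger interaction depth and compare its importance profile.
gbmHighDeep <- train(x = solTrainXtrans, y = solTrainY, method = "gbm",
                     tuneGrid = expand.grid(n.trees = 100, interaction.depth = 10,
                                            shrinkage = 0.9, n.minobsinnode = 10),
                     bag.fraction = 0.9, verbose = FALSE,
                     trControl = trainControl(method = "none"))
plot(varImp(gbmHighDeep), top = 25)  # importance should now be spread more evenly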
8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
a. Which tree-based regression model gives the optimal resampling and test set performance?
suppressMessages(suppressWarnings(library(AppliedPredictiveModeling)))
suppressMessages(suppressWarnings(library(missForest)))
suppressMessages(suppressWarnings(library(caret)))
data(ChemicalManufacturingProcess)
df = ChemicalManufacturingProcess
df_imp1 = missForest(df)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
df_imp = df_imp1$ximp
data = df_imp[,2:58]
target = df_imp[,1]
training = createDataPartition( target, p=0.75 )
predictor_training = data[training$Resample1,]
target_training = target[training$Resample1]
predictor_testing = data[-training$Resample1,]
target_testing = target[-training$Resample1]
ctrl <- trainControl(method = "cv", number = 10)
Regression tree
First we will fit a single regression tree and evaluate the results.
rt_grid <- expand.grid(maxdepth= seq(1,10,by=1))
rt_Tune <- train(x = predictor_training, y = target_training, method = "rpart2", metric = "Rsquared", tuneGrid = rt_grid, trControl = ctrl)
Predict
rt_pred = predict(rt_Tune, predictor_testing)
postResample(pred = rt_pred, obs = target_testing)
## RMSE Rsquared MAE
## 1.7466632 0.2455436 1.3260212
Next we will fit a random forest model and evaluate it.
rf_grid <- expand.grid(mtry=seq(2,38,by=3))
rf_Tune <- train(x = predictor_training, y = target_training, method = "rf", tuneGrid = rf_grid, metric = "Rsquared", importance = TRUE, trControl = ctrl)
Predict
rf_pred = predict(rf_Tune, predictor_testing)
postResample(pred = rf_pred, obs = target_testing)
Finally we fit a Cubist model.
cube_grid <- expand.grid(committees = c(1, 5, 10, 20, 50), neighbors = c(0, 1, 3, 5))
cube_Tune <- train(x = predictor_training, y = target_training, method = "cubist", metric = "Rsquared", tuneGrid = cube_grid, trControl = ctrl)
Predict
cube_pred = predict(cube_Tune, predictor_testing)
postResample(pred = cube_pred, obs = target_testing)
## RMSE Rsquared MAE
## 0.8906420 0.7768936 0.6831904
We can see that the Cubist model gives the best test-set RMSE (and R-squared), making it the optimal tree-based model here.
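Part (a) also asks about resampling performance. A sketch of how the cross-validated results of the three models could be compared with caret's resamples() (for a strict comparison the three train() calls should use the same seed so the folds match):
res <- resamples(list(SingleTree = rt_Tune, RandomForest = rf_Tune, Cubist = cube_Tune))
summary(res)
dotplot(res, metric = "Rsquared")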
b. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Predictor importance of the Cubist model:
plot(varImp(cube_Tune), top=10, scales = list(y = list(cex = 0.8)))

ManufacturingProcess32 tops the list, followed by ManufacturingProcess13. The top two are process variables, with the first biological variable appearing after them, and the top 10 consists mostly of process variables. Like the optimal linear (PLS) and nonlinear (MARS) models, the Cubist model ranks process variables highest, but it relies much more heavily on its top two predictors than either of those models.
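One way to line the top-10 lists up side by side is sketched below; pls_Tune and mars_Tune are hypothetical names for the optimal fits from Exercises 6.3 and 7.5 and would need to be in the workspace.
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}
top10(cube_Tune)
#top10(pls_Tune)   # hypothetical object: optimal linear model from Exercise 6.3
#top10(mars_Tune)  # hypothetical object: optimal nonlinear model from Exercise 7.5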
c. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Plot the optimal single tree:
plot(as.party(rt_Tune$finalModel),gp=gpar(fontsize=11))

We can see that the top predictors are process variables: ManufacturingProcess32 sits at the top of the tree, and only a few biological predictors affect the target.