Exercises from Chapter 8 of the textbook Applied Predictive Modeling by Kuhn & Johnson
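All of the code below assumes the following packages are attached. This is a minimal setup sketch, with package names inferred from the functions used rather than listed in the original.

library(mlbench)                     # mlbench.friedman1()
library(partykit)                    # as.party() for plotting the rpart tree
library(party)                       # cforest(), varimp(); loaded after partykit so its versions are used
library(randomForest)                # randomForest()
library(caret)                       # varImp(), preProcess(), postResample()
library(gbm)                         # gbm(), gbm.fit(), summary.gbm()
library(Cubist)                      # cubist()
library(rpart)                       # rpart()
library(AppliedPredictiveModeling)   # ChemicalManufacturingProcess data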

8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?

From the output below it appears V6-V10 had minimal importance in the model.

model1 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

(b1) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

(b2) Fit another random forest model to these data. Did the importance score for V1 change?

Yes, the importance of V1 decreased in this second model. The drop is not exactly equal to the importance picked up by duplicate1, but the two correlated predictors effectively split the importance between them.

model2 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

(b3) What happens when you add another predictor that is also highly correlated with V1?

V1 drops again, but not by as much this time.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

They do appear to show the same general pattern, with V6-V10 receiving essentially no importance and V1 again sharing its importance with the two duplicates.

model4 <- cforest(y ~ ., data = simulated)
varimp(model4)
##          V1          V2          V3          V4          V5          V6 
##  5.42152778  5.79075433 -0.01425384  5.96294311  1.87652947 -0.06822257 
##          V7          V8          V9         V10  duplicate1  duplicate2 
##  0.07696942 -0.17067028  0.02155362 -0.13376134  5.27935126  2.92224516
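The conditional importance described in Strobl et al. (2007) can be requested through the conditional argument of varimp; it is considerably slower to compute, so only the call is sketched here (output not shown).

varimp(model4, conditional = TRUE)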

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

boost_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(boost_model)

##                   var    rel.inf
## V4                 V4 30.8498158
## V2                 V2 20.9639117
## V1                 V1 16.1293953
## duplicate1 duplicate1 11.5441703
## V5                 V5 11.0492470
## V3                 V3  8.5487055
## duplicate2 duplicate2  0.7137245
## V6                 V6  0.2010299
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000
cub_sim <- cubist(simulated[,-11], simulated$y)
cub_sim$usage
##    Conditions Model   Variable
## 1           0   100         V1
## 2           0   100         V2
## 3           0   100         V4
## 4           0   100         V5
## 5           0   100 duplicate1
## 6           0     0         V3
## 7           0     0         V6
## 8           0     0         V7
## 9           0     0         V8
## 10          0     0         V9
## 11          0     0        V10
## 12          0     0 duplicate2
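If a single score per predictor is preferred, caret's varImp method for Cubist models combines the condition and model usage columns above into one ranking; only the call is sketched here (output not shown).

varImp(cub_sim)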

The additional models above also largely avoided giving the uninformative predictors (V6-V10) any meaningful role.

8.2

Use a simulation to show tree bias with different granularities.

Below we see that more importance is placed on the noise variable x2 than on x, even though y was generated from x alone. This happens simply because x2, being continuous, offers many more candidate split points (about 200 distinct values versus 2 for x), illustrating the granularity bias of regression trees.

x <- rep(1:2, each=100) 
x2 <- rnorm(200, mean=0, sd=4)
y <- x + rnorm(200, mean= 0 , sd=1)
data <- data.frame(y,x,x2)

tree_simulation <- rpart(y~., data=data)
varImp(tree_simulation)
##      Overall
## x  0.2157706
## x2 0.6133085
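The bias comes from the number of candidate split points each predictor offers. A quick check (not part of the original answer) makes the difference in granularity explicit:

# x offers only 2 distinct values to split on, while x2 offers roughly 200
sapply(data[, c("x", "x2")], function(v) length(unique(v)))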

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

[Figure 8.24: variable importance plots for the solubility data, with bagging fraction and learning rate both 0.1 (left) and both 0.9 (right)]

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

I believe it's accurate to say that the right-hand model is essentially more aggressive: with a learning rate of 0.9 each tree contributes heavily to the final model, and with a bagging fraction of 0.9 every tree sees nearly the same data, so the same few strong predictors are selected over and over - it behaves like a greedy, rapid learner. The left-hand model could be characterized as more methodical: each tree sees only 10% of the data and contributes only a small step, so it has more opportunity to find relationships among a wider set of predictors, spreading the importance out.
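As a sketch of what the two extremes in Fig. 8.24 correspond to in code - assuming the solubility data objects solTrainXtrans and solTrainY from the AppliedPredictiveModeling package and an interaction depth of 1, neither of which is specified in the original writeup:

data(solubility)   # loads solTrainXtrans, solTrainY, etc.

# Left-hand plot: bagging fraction and learning rate both 0.1
gbm_left <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)

# Right-hand plot: both parameters set to 0.9
gbm_right <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     n.trees = 100, interaction.depth = 1,
                     shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# Relative influence should be concentrated in far fewer predictors on the right
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)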

(b) Which model do you think would be more predictive of other samples?

While these are both extreme settings, I think the left-hand model would be better at predicting new samples: its smaller learning rate and bagging fraction expose it to more diverse subsets of the data and build the model up slowly, which should limit the overfitting the right-hand model would likely suffer from.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

As the interaction depth is increased, each tree can split on more predictors, so importance would be spread across more of them and the slope of the importance plot would flatten for both models. As the maximum number of nodes per tree grows, there is simply more room to assign importance to additional predictors.
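Building on the hypothetical gbm_right fit sketched above, the effect of a deeper interaction depth could be checked directly (again only a sketch):

# Same extreme settings as the right-hand model, but with much deeper trees
gbm_deep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 10,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# Deeper trees use more predictors per iteration, so the relative influence
# should spread out and the importance slope should flatten
head(summary(gbm_deep, plotit = FALSE), 15)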

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

First I copy in the code from my previous work.

data(ChemicalManufacturingProcess)
chem <- as.data.frame(ChemicalManufacturingProcess)

knn_imp <- preProcess(chem, "knnImpute")
chem_imp <- predict(knn_imp, chem)

#split train/test
set.seed(3190)
sample_set <- sample(nrow(chem_imp), round(nrow(chem_imp)*0.75), replace = FALSE)

chem_train <-chem_imp[sample_set, ]
chem_train_x <- chem_train[, -1]
chem_train_y <- chem_train[, 1]


chem_test <-chem_imp[-sample_set, ]
chem_test_x <- chem_test[, -1]
chem_test_y <- chem_test[, 1]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

This model doesn't look spectacular: it explains about 57% of the variance in the test set (R-squared of 0.568), with a somewhat high RMSE of 0.684.

randf_model <- randomForest(chem_train_x, chem_train_y, importance = TRUE, ntrees = 1000)
randf_model
## 
## Call:
##  randomForest(x = chem_train_x, y = chem_train_y, importance = TRUE,      ntrees = 1000) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 19
## 
##           Mean of squared residuals: 0.3666167
##                     % Var explained: 63.09
randf_pred <- predict(randf_model, newdata = chem_test_x)
postResample(pred = randf_pred, obs = chem_test_y)
##      RMSE  Rsquared       MAE 
## 0.6839209 0.5676939 0.4982828
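The question also asks about resampling performance; a cross-validated estimate could be obtained with caret::train, for example (a sketch only - not run here, and the resampled numbers would differ from the OOB figure above):

set.seed(3190)
rf_cv <- train(x = chem_train_x, y = chem_train_y,
               method = "rf", ntree = 1000, importance = TRUE,
               trControl = trainControl(method = "cv", number = 10))
rf_cv$results   # resampled RMSE and R-squared across the mtry grid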

Boosted

This boosted model appears to be worse than the random forest above: it has a lower R-squared and higher RMSE and MAE.

boosted_model <- gbm.fit(chem_train_x, chem_train_y, distribution = "gaussian") 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.9925             nan     0.0010    0.0008
##      2        0.9918             nan     0.0010    0.0007
##      3        0.9910             nan     0.0010    0.0008
##      4        0.9902             nan     0.0010    0.0008
##      5        0.9895             nan     0.0010    0.0007
##      6        0.9889             nan     0.0010    0.0002
##      7        0.9881             nan     0.0010    0.0008
##      8        0.9874             nan     0.0010    0.0008
##      9        0.9867             nan     0.0010    0.0005
##     10        0.9858             nan     0.0010    0.0008
##     20        0.9783             nan     0.0010    0.0006
##     40        0.9626             nan     0.0010    0.0008
##     60        0.9485             nan     0.0010    0.0008
##     80        0.9343             nan     0.0010    0.0007
##    100        0.9212             nan     0.0010    0.0006
boosted_model
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 57 predictors of which 7 had non-zero influence.
boosted_pred <- predict(boosted_model, newdata = chem_test_x)
postResample(pred = boosted_pred, obs = chem_test_y)
##      RMSE  Rsquared       MAE 
## 0.9679884 0.4364299 0.7602427

Cubist

The Cubist model is comparable to the first model, the random forest, although the random forest does technically outperform it on every test-set metric. Further tuning might find slight advantages for Cubist, but at its base/default settings the random forest performs best on this data.

cub_model <- cubist(chem_train_x, chem_train_y)
cub_model
## 
## Call:
## cubist.default(x = chem_train_x, y = chem_train_y)
## 
## Number of samples: 132 
## Number of predictors: 57 
## 
## Number of committees: 1 
## Number of rules: 1
cub_pred <- predict(cub_model, newdata = chem_test_x)
postResample(pred = cub_pred, obs = chem_test_y)
##      RMSE  Rsquared       MAE 
## 0.6926415 0.5227459 0.5258773

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

For the random forest model, the top 10 important variables are shown below. Again over half are manufacturing process measures, although there is one more biological measure in this model compared to my last two (the optimal linear and nonlinear models), and the biological measures appear a bit higher in the list.

randf_imp <- varImp(randf_model)


head(rownames(randf_imp)[order(randf_imp$Overall, decreasing=TRUE)], 10)
##  [1] "ManufacturingProcess32" "BiologicalMaterial12"   "ManufacturingProcess31"
##  [4] "ManufacturingProcess36" "ManufacturingProcess13" "BiologicalMaterial06"  
##  [7] "ManufacturingProcess17" "ManufacturingProcess39" "BiologicalMaterial03"  
## [10] "BiologicalMaterial02"
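A quick tally of that list (not in the original writeup) confirms the process/biological split:

top10 <- head(rownames(randf_imp)[order(randf_imp$Overall, decreasing = TRUE)], 10)
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))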

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

With this visualization I see that the biological predictors are only useful when the first node, ManufacturingProcess32, is less than 0.192; that is the branch where the four biological predictors come into play.

rpart_tree <- rpart(Yield ~., data = chem_train)

tree_plot <- as.party(rpart_tree)
plot(tree_plot)

Further, a quick look at how ManufacturingProcess32 is distributed shows that the node's split point of 0.192 falls between the median and the third quartile - meaning for somewhat more than half of the data no biological predictors are even used.

summary(chem_train$ManufacturingProcess32)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.86552 -0.64216 -0.08632  0.02456  0.65480  2.69287
boxplot(chem_train$ManufacturingProcess32)
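As a rough check of that claim (not in the original writeup), the share of training rows falling below the split point can be computed directly:

# Proportion of the training data with ManufacturingProcess32 below the 0.192 split
mean(chem_train$ManufacturingProcess32 < 0.192)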