Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

part a

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
set.seed(1)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 <- rfImp1 %>%
  mutate(Variable = rownames(rfImp1)) %>%
  rename(Importance_before = Overall)
rfImp1 <- rfImp1[order(-rfImp1$Importance_before),]
rfImp1
##     Importance_before Variable
## V1         8.64598308       V1
## V4         7.62987929       V4
## V2         6.81423882       V2
## V5         2.17385901       V5
## V3         0.72598030       V3
## V6         0.10559967       V6
## V7         0.06427442       V7
## V10       -0.02447215      V10
## V9        -0.08227548       V9
## V8        -0.10071802       V8

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

The informative predictors (V1 – V5) have higher importance values, while the uninformative predictors (V6 – V10) have importance values near zero. The random forest model did not significantly use the V6 – V10 predictors.
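
To make these scores concrete, below is a rough sketch of the idea behind permutation importance: shuffle one predictor and see how much the model's error grows. The helper perm_drop is defined here for illustration only and is not part of the exercise; randomForest computes its version per tree on out-of-bag samples, so the numbers will not match the table above.

# Illustrative only: permute a predictor and measure the increase in training MSE
# (randomForest's reported importance uses out-of-bag samples per tree instead).
perm_drop <- function(var) {
  shuffled <- simulated
  shuffled[[var]] <- sample(shuffled[[var]])
  mean((predict(model1, shuffled) - simulated$y)^2) -
    mean((predict(model1, simulated) - simulated$y)^2)
}
set.seed(100)
sapply(c("V1", "V6"), perm_drop)  # large increase for V1, near zero for V6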

part b

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(2)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9385209

Fit another random forest model to these data. Did the importance score for V1 change?

Before adding duplicate1, V1 was the strongest predictor in the random forest model. After adding duplicate1:

- V1's importance value dropped from about 8.6 to 5.3
- V4 became the strongest predictor
- all other predictors' importance values remained similar, decreasing slightly
- duplicate1 gained an importance value of about 4.4

V1 and duplicate1 are highly correlated, so they carry almost the same information; the random forest splits the importance between them.

set.seed(3)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2 <- rfImp2 %>%
  mutate(Variable = rownames(rfImp2)) %>%
  rename(Importance_after = Overall)

rfImp_combined <- merge(rfImp1, rfImp2, by = "Variable", all= TRUE) 
rfImp_combined <- rfImp_combined[order(-rfImp_combined$Importance_after),]
rfImp_combined
##      Variable Importance_before Importance_after
## 6          V4        7.62987929       6.37409556
## 4          V2        6.81423882       6.22814840
## 2          V1        8.64598308       5.34102073
## 1  duplicate1                NA       4.37373219
## 7          V5        2.17385901       1.92347519
## 5          V3        0.72598030       0.52023473
## 8          V6        0.10559967       0.23659142
## 3         V10       -0.02447215       0.11989685
## 9          V7        0.06427442       0.01111648
## 11         V9       -0.08227548      -0.01672550
## 10         V8       -0.10071802      -0.11847288

What happens when you add another predictor that is also highly correlated with V1?

Add another predictor that is highly correlated with one of the informative predictors.

set.seed(4)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9473968

Before adding duplicate1 and duplicate2, V1 was the strongest predictor (importance value about 8.6) in the random forest model.

After adding duplicate2:

- V4 remains the strongest predictor
- all other predictors' importance values remain similar
- V1's importance value dropped further: 8.6 -> 5.3 -> 4.3
- duplicate1's importance value dropped from 4.4 to 3.8
- duplicate2 gained an importance value of about 2.0

Summing the importance values of V1, duplicate1, and duplicate2 (4.3 + 3.8 + 2.0 = 10.1) gives roughly V1's original importance value; a quick check of this sum follows the combined table below.

set.seed(4)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

rfImp3 <- rfImp3 %>%
  mutate(Variable = rownames(rfImp3)) %>%
  rename(Importance_after2 = Overall)

rfImp_combined2 <- merge(rfImp_combined, rfImp3, by = "Variable", all= TRUE) 
rfImp_combined2 <- rfImp_combined2[order(-rfImp_combined2$Importance_after2),]
rfImp_combined2
##      Variable Importance_before Importance_after Importance_after2
## 7          V4        7.62987929       6.37409556       6.929337982
## 5          V2        6.81423882       6.22814840       6.709044774
## 3          V1        8.64598308       5.34102073       4.281019480
## 1  duplicate1                NA       4.37373219       3.813202454
## 8          V5        2.17385901       1.92347519       2.183434956
## 2  duplicate2                NA               NA       1.978928422
## 6          V3        0.72598030       0.52023473       0.437590286
## 9          V6        0.10559967       0.23659142       0.144365953
## 4         V10       -0.02447215       0.11989685       0.018855843
## 12         V9       -0.08227548      -0.01672550       0.003648182
## 10         V7        0.06427442       0.01111648      -0.031073686
## 11         V8       -0.10071802      -0.11847288      -0.099575772
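
As a quick check of the sum discussed above, the sketch below simply adds up the scores of the three correlated predictors from the combined table:

# Sum the importance shared by V1 and its two noisy copies; compare the result to
# V1's importance before the copies were added (about 8.6).
with(rfImp_combined2,
     sum(Importance_after2[Variable %in% c("V1", "duplicate1", "duplicate2")]))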

part c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The traditional (unconditional) importance behaves similarly to the random forest model: V1's importance is split among V1, duplicate1, and duplicate2, the uninformative predictors (V6 – V10) continue to have near-zero importance values, and the importance of V3 dropped.

library(party)
set.seed(4)
model4 <- cforest(y ~ ., data = simulated)
rfImp4 <- varImp(model4, conditional = FALSE)

rfImp4 %>%
  arrange(desc(Overall))
##                 Overall
## V4          6.583309522
## V2          5.881114391
## V1          4.053541290
## duplicate1  3.798439407
## V5          1.817428374
## duplicate2  1.757678596
## V7          0.056120073
## V3          0.040002274
## V6          0.006464814
## V9         -0.002232059
## V10        -0.034487197
## V8         -0.043319070

Conditional importance adjusts for correlation among predictors.

The summed importance of V1, duplicate1, and duplicate2 (about 1.5 + 1.2 + 0.6 ≈ 3.4) is no longer close to V1's original importance value (8.6).

rfImp5 <- varImp(model4, conditional = TRUE)
rfImp5 %>%
  arrange(desc(Overall))
##                 Overall
## V4          5.208699787
## V2          4.583794176
## V1          1.547152586
## duplicate1  1.229387806
## V5          1.167797154
## duplicate2  0.585745567
## V7          0.030583956
## V9          0.014444673
## V3          0.010708149
## V6          0.009149605
## V8         -0.028654096
## V10        -0.030472259

part d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
set.seed(5)
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

##                   var    rel.inf
## V4                 V4 29.2071475
## V2                 V2 21.6898002
## duplicate1 duplicate1 14.9937496
## V1                 V1 13.2567960
## V5                 V5 11.1785877
## V3                 V3  8.6389593
## duplicate2 duplicate2  0.5023758
## V6                 V6  0.3949306
## V7                 V7  0.1376532
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000
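
The boosted model shows the same behaviour as the random forest: relative influence is split between V1 and duplicate1, while the uninformative predictors get little or none. A quick check of the shared influence (a sketch using the gbmModel object fit above; summary.gbm returns the relative-influence table as a data frame):

# Total relative influence attributed to V1 and its two correlated copies.
gbmInf <- summary(gbmModel, plotit = FALSE)
sum(gbmInf$rel.inf[gbmInf$var %in% c("V1", "duplicate1", "duplicate2")])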
library(Cubist)
set.seed(6)

predictors <- simulated[,setdiff(names(simulated),"y")]

cubistModel <- cubist(x = predictors, y = simulated$y)
varImp(cubistModel)
##            Overall
## duplicate1    79.0
## V2            74.0
## V4            50.0
## V5            50.0
## V1            45.0
## V6            23.5
## V3             0.0
## V7             0.0
## V8             0.0
## V9             0.0
## V10            0.0
## duplicate2     0.0

Exercise 8.2

Use a simulation to show tree bias with different granularities.

Continuous Variables

set.seed(10)

x <- runif(2000, min = 0, max = 1)
y <- sin(x) + rnorm(length(x)) * 0.25

x_noisy <- x + rnorm(length(x)) * 0.1

data <- data.frame(
  x_fine = x,
  x_medium = round(x, 1),
  x_coarse = round(x, 0),
  x_noisy = x_noisy,
  y = y
)

library(rpart)
fit_fine <- rpart(y ~ x_fine + x_noisy, data = data)
fit_med <- rpart(y ~ x_medium + x_noisy, data = data)
fit_coarse <- rpart(y ~ x_coarse + x_noisy, data = data)

plot(fit_fine, uniform = TRUE)
text(fit_fine, use.n = TRUE)

plot(fit_med, uniform = TRUE)
text(fit_med, use.n = TRUE)

plot(fit_coarse, uniform = TRUE)
text(fit_coarse, use.n = TRUE)

We see more splits for the fine-granularity predictor and fewer splits for the coarse one. As the granularity becomes coarser, the tree relies more on the noisy predictor and overfits to its noise.

Include Binary Variable

set.seed(11)

x <- runif(2000, min = 0, max = 1)
y <- sin(x) + rnorm(length(x)) * 0.25

x_noisy <- x + rnorm(length(x)) * 0.1
x_binary <- rbinom(length(x), 1, 0.5)

data <- data.frame(
  x_fine = x,
  x_medium = round(x, 1),
  x_coarse = round(x, 0),
  x_noisy = x_noisy,
  x_binary = x_binary,
  y = y
)

library(rpart)
fit_fine <- rpart(y ~ x_fine + x_noisy + x_binary, data = data)
fit_med <- rpart(y ~ x_medium + x_noisy + x_binary, data = data)
fit_coarse <- rpart(y ~ x_coarse + x_noisy + x_binary, data = data)

plot(fit_fine, uniform = TRUE)
text(fit_fine, use.n = TRUE)

fit_fine$variable.importance
##      x_fine     x_noisy    x_binary 
## 109.8058141  90.2847522   0.4228861
plot(fit_med, uniform = TRUE)
text(fit_med, use.n = TRUE)

fit_med$variable.importance
## x_medium  x_noisy 
## 108.0585  88.1173
plot(fit_coarse, uniform = TRUE)
text(fit_coarse, use.n = TRUE)

fit_coarse$variable.importance
##  x_coarse   x_noisy  x_binary 
## 88.891683 88.314359  2.768394

Decision trees favor high-cardinality variables (x_fine). Even though a binary variable was included, it barely influences the splits.
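
A small follow-up sketch of the selection bias itself: with a pure-noise response, an unbiased procedure would pick each predictor for the root split about equally often, but rpart's exhaustive search favours the predictor with more distinct values (the details below, such as 200 simulations of 200 rows, are arbitrary choices for illustration).

# Count which predictor rpart chooses for the first split when y is unrelated to both.
set.seed(12)
root_var <- replicate(200, {
  d <- data.frame(x_fine   = runif(200),
                  x_coarse = sample(0:1, 200, replace = TRUE),
                  y        = rnorm(200))
  fit <- rpart(y ~ x_fine + x_coarse, data = d,
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
table(root_var)  # heavily favours x_fine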

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

part a

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The left model, with both parameters set low, has more regularization and randomness. A low learning rate means each tree contributes only a small piece of the final model, so many trees are built and the model is more robust. A low bagging fraction trains each tree on a small random subset of the data, which introduces randomness and lets the model draw on a wider range of predictors. The right model, with both parameters high, relies on only the few strongest and most dominant predictors.
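
For reference, the two extreme settings could be reproduced roughly as below. This is a sketch, not run here; it assumes the solubility data objects solTrainXtrans and solTrainY loaded via data(solubility) from AppliedPredictiveModeling, and the exact plots in Fig. 8.24 are not reproduced.

# Fit boosted trees with both tuning parameters at 0.1 (left plot) and 0.9 (right plot),
# then inspect the relative-influence tables.
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

gbmLeft  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbmRight <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbmLeft,  plotit = FALSE), 10)   # importance spread over many predictors
head(summary(gbmRight, plotit = FALSE), 10)   # importance concentrated in a few predictors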

part b

Which model do you think would be more predictive of other samples?

The left model would likely be more predictive of other samples. Its stronger regularization prevents overfitting to the training data.

part c

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth allows the trees to capture more complex interactions. It will make the slope of predictor importance steeper for both models, but it will mostly affect the right model: its top predictor will become even more dominant because of the high learning rate.

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
# Referring to text p. 54
trans <- preProcess(ChemicalManufacturingProcess[,-1], 
                     method = "knnImpute" )

transformed <- predict(trans, ChemicalManufacturingProcess[,-1])

sum(is.na(transformed))
## [1] 0
chem_filtered <- transformed[, -nearZeroVar(transformed)]

set.seed(6)
trainingRows <- createDataPartition(ChemicalManufacturingProcess$Yield,
                                    p = .70,
                                    list= FALSE)

TrainX <- chem_filtered[trainingRows, ]
TrainY <- ChemicalManufacturingProcess$Yield[trainingRows]

TestX <- chem_filtered[-trainingRows, ]
TestY <- ChemicalManufacturingProcess$Yield[-trainingRows]

Weighted KNN

# Provided by the text
set.seed(111)
weightedknnModel2 <- train(x = TrainX,
                  y = TrainY,
                  method = "kknn",
                  preProc = c("center", "scale", "pca"),
                  tuneGrid = expand.grid(kmax = seq(5,25,5), 
                                         distance = 2,
                                         kernel = c("rectangular", "triangular","epanechnikov")))
weightedknnModel2
## k-Nearest Neighbors 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56), principal component
##  signal extraction (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   kmax  kernel        RMSE      Rsquared   MAE     
##    5    rectangular   1.699732  0.2860582  1.306812
##    5    triangular    1.594407  0.3072062  1.230722
##    5    epanechnikov  1.618735  0.2967283  1.242189
##   10    rectangular   1.699732  0.2860582  1.306812
##   10    triangular    1.553165  0.3255548  1.214788
##   10    epanechnikov  1.599011  0.3028114  1.237059
##   15    rectangular   1.699732  0.2860582  1.306812
##   15    triangular    1.549762  0.3273930  1.213288
##   15    epanechnikov  1.599011  0.3028114  1.237059
##   20    rectangular   1.699732  0.2860582  1.306812
##   20    triangular    1.549762  0.3273930  1.213288
##   20    epanechnikov  1.599011  0.3028114  1.237059
##   25    rectangular   1.699732  0.2860582  1.306812
##   25    triangular    1.549762  0.3273930  1.213288
##   25    epanechnikov  1.599011  0.3028114  1.237059
## 
## Tuning parameter 'distance' was held constant at a value of 2
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 25, distance = 2 and kernel
##  = triangular.
weightedknnPred2 <- predict(weightedknnModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = weightedknnPred2, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.4676074 0.4733189 1.1051550

SVM

set.seed(113)
#refer to text p. 166
svmModel2 <- train(x = TrainX,
                  y = TrainY,
                    method = "svmRadial",
                    preProc = c("center", "scale", "pca"),
                    tuneLength = 14,
                    trControl = trainControl(method = "cv"))
svmModel2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56), principal component
##  signal extraction (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 110, 112, 112, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.450085  0.5568588  1.1997266
##      0.50  1.332047  0.5647384  1.0980277
##      1.00  1.238790  0.6006898  1.0179477
##      2.00  1.200795  0.6201847  0.9773347
##      4.00  1.205578  0.6058129  0.9752402
##      8.00  1.222136  0.5942627  0.9882264
##     16.00  1.221243  0.5954775  0.9870167
##     32.00  1.221243  0.5954775  0.9870167
##     64.00  1.221243  0.5954775  0.9870167
##    128.00  1.221243  0.5954775  0.9870167
##    256.00  1.221243  0.5954775  0.9870167
##    512.00  1.221243  0.5954775  0.9870167
##   1024.00  1.221243  0.5954775  0.9870167
##   2048.00  1.221243  0.5954775  0.9870167
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02748059
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02748059 and C = 2.
svmPred2 <- predict(svmModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = svmPred2, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.2695080 0.6191610 0.9728057

MARS

set.seed(114)
#refer to text p. 165
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsModel2 <- train(x = TrainX,
                  y = TrainY,
                  method = "earth",
                  tuneGrid = marsGrid,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv"))
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsModel2
## Multivariate Adaptive Regression Spline 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 112, 111, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.420245  0.4161646  1.1223649
##   1        3      1.192370  0.6063616  0.9763954
##   1        4      1.171674  0.6163737  0.9433324
##   1        5      1.169215  0.6034139  0.9388695
##   1        6      1.223304  0.5669649  0.9703588
##   1        7      1.185700  0.6027056  0.9510157
##   1        8      1.180833  0.5932560  0.9827166
##   1        9      1.180792  0.5976978  0.9827534
##   1       10      1.185384  0.6048656  0.9685761
##   1       11      1.184590  0.6095915  0.9674617
##   1       12      1.373554  0.5963351  1.0829173
##   1       13      1.395727  0.5843234  1.0984551
##   1       14      1.404500  0.5844436  1.1004205
##   1       15      1.395147  0.5864584  1.0949697
##   1       16      1.405719  0.5837766  1.1066869
##   1       17      1.392827  0.5902839  1.0942656
##   1       18      1.392827  0.5902839  1.0942656
##   1       19      1.392827  0.5902839  1.0942656
##   1       20      1.415115  0.5837947  1.1080326
##   1       21      1.421045  0.5836208  1.1194441
##   1       22      1.421045  0.5836208  1.1194441
##   1       23      1.421045  0.5836208  1.1194441
##   1       24      1.421045  0.5836208  1.1194441
##   1       25      1.421045  0.5836208  1.1194441
##   1       26      1.421045  0.5836208  1.1194441
##   1       27      1.421045  0.5836208  1.1194441
##   1       28      1.421045  0.5836208  1.1194441
##   1       29      1.421045  0.5836208  1.1194441
##   1       30      1.421045  0.5836208  1.1194441
##   1       31      1.421045  0.5836208  1.1194441
##   1       32      1.421045  0.5836208  1.1194441
##   1       33      1.421045  0.5836208  1.1194441
##   1       34      1.421045  0.5836208  1.1194441
##   1       35      1.421045  0.5836208  1.1194441
##   1       36      1.421045  0.5836208  1.1194441
##   1       37      1.421045  0.5836208  1.1194441
##   1       38      1.421045  0.5836208  1.1194441
##   2        2      1.420245  0.4161646  1.1223649
##   2        3      1.184745  0.6076965  0.9602160
##   2        4      1.209515  0.5854936  0.9681314
##   2        5      1.254889  0.5406905  0.9757740
##   2        6      1.183539  0.5938054  0.9470884
##   2        7      1.156842  0.6058061  0.9096095
##   2        8      1.159008  0.6048214  0.9155860
##   2        9      1.158606  0.6017263  0.9002778
##   2       10      1.212862  0.5807610  0.9210734
##   2       11      1.398343  0.5475084  0.9810743
##   2       12      1.489289  0.5383656  1.0335915
##   2       13      1.510972  0.5529365  1.0257113
##   2       14      1.533395  0.5414225  1.0553174
##   2       15      1.532111  0.5254442  1.0506624
##   2       16      1.546251  0.5066151  1.0598119
##   2       17      1.516852  0.5173416  1.0311456
##   2       18      1.578750  0.5075698  1.0671888
##   2       19      1.574408  0.5085381  1.0617242
##   2       20      1.571305  0.5131123  1.0582440
##   2       21      1.583670  0.5129312  1.0444304
##   2       22      1.576575  0.5135234  1.0351308
##   2       23      1.585898  0.5104487  1.0460993
##   2       24      1.585898  0.5104487  1.0460993
##   2       25      1.585898  0.5104487  1.0460993
##   2       26      1.585898  0.5104487  1.0460993
##   2       27      1.585898  0.5104487  1.0460993
##   2       28      1.582272  0.5062776  1.0514079
##   2       29      1.594988  0.5027155  1.0586435
##   2       30      1.594988  0.5027155  1.0586435
##   2       31      1.594988  0.5027155  1.0586435
##   2       32      1.594988  0.5027155  1.0586435
##   2       33      1.594988  0.5027155  1.0586435
##   2       34      1.594988  0.5027155  1.0586435
##   2       35      1.594988  0.5027155  1.0586435
##   2       36      1.594988  0.5027155  1.0586435
##   2       37      1.594988  0.5027155  1.0586435
##   2       38      1.594988  0.5027155  1.0586435
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 2.
marsPred2 <- predict(marsModel2, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = marsPred2, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.2278715 0.6159963 0.9913622

Random Forest

set.seed(115)
#refer to text p. 165
rfGrid <- expand.grid(.mtry = c(15, 18, 21, 24, 27))

rfModel <- train(x = TrainX,
                  y = TrainY,
                  method = "rf",
                  tuneGrid = rfGrid,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv"),
                 ntree = 1000)
rfModel
## Random Forest 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 112, 112, 111, 111, 112, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##   15    1.098227  0.6851523  0.8789189
##   18    1.092095  0.6878947  0.8735003
##   21    1.096146  0.6814199  0.8793522
##   24    1.097314  0.6791979  0.8761327
##   27    1.095722  0.6798135  0.8699768
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 18.
rfPred <- predict(rfModel, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = rfPred, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.3841889 0.5202457 1.1082526

GBM

set.seed(117)
#refer to text p. 165
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10)

gbmModel <- train(x = TrainX,
                  y = TrainY,
                  method = "gbm",
                  tuneGrid = gbmGrid,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv"),
                  verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 112, 112, 112, 111, 112, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE      
##   0.01       1                   100     1.423289  0.5413911  1.1667342
##   0.01       1                   150     1.331318  0.5617667  1.0901688
##   0.01       1                   200     1.279718  0.5651657  1.0448565
##   0.01       1                   250     1.247326  0.5695968  1.0101425
##   0.01       1                   300     1.224454  0.5764725  0.9826139
##   0.01       1                   350     1.203294  0.5875462  0.9569889
##   0.01       1                   400     1.188215  0.5943973  0.9386811
##   0.01       1                   450     1.175066  0.5997744  0.9227480
##   0.01       1                   500     1.164687  0.6052734  0.9091128
##   0.01       1                   550     1.158112  0.6083235  0.9018204
##   0.01       1                   600     1.152841  0.6105767  0.8955818
##   0.01       1                   650     1.147077  0.6148857  0.8890254
##   0.01       1                   700     1.141007  0.6187003  0.8819512
##   0.01       1                   750     1.135141  0.6224087  0.8770528
##   0.01       1                   800     1.133440  0.6228413  0.8734346
##   0.01       1                   850     1.132026  0.6247604  0.8697171
##   0.01       1                   900     1.127433  0.6270981  0.8653934
##   0.01       1                   950     1.125513  0.6284345  0.8643677
##   0.01       1                  1000     1.125789  0.6279463  0.8627760
##   0.01       3                   100     1.343599  0.5551676  1.0982510
##   0.01       3                   150     1.267234  0.5725646  1.0266337
##   0.01       3                   200     1.222567  0.5889824  0.9757780
##   0.01       3                   250     1.195491  0.5992512  0.9460645
##   0.01       3                   300     1.176967  0.6101907  0.9214132
##   0.01       3                   350     1.167288  0.6145875  0.9077750
##   0.01       3                   400     1.153302  0.6222767  0.8946053
##   0.01       3                   450     1.149045  0.6253844  0.8887948
##   0.01       3                   500     1.141792  0.6295808  0.8803786
##   0.01       3                   550     1.136828  0.6322995  0.8767161
##   0.01       3                   600     1.129970  0.6357851  0.8725452
##   0.01       3                   650     1.128943  0.6363298  0.8716499
##   0.01       3                   700     1.126405  0.6383188  0.8707491
##   0.01       3                   750     1.123365  0.6404017  0.8692980
##   0.01       3                   800     1.120793  0.6420349  0.8675805
##   0.01       3                   850     1.122359  0.6407888  0.8691054
##   0.01       3                   900     1.121090  0.6412225  0.8705442
##   0.01       3                   950     1.118007  0.6433928  0.8679228
##   0.01       3                  1000     1.117042  0.6429260  0.8686071
##   0.01       5                   100     1.348762  0.5517591  1.1029438
##   0.01       5                   150     1.268411  0.5661896  1.0199601
##   0.01       5                   200     1.226719  0.5766824  0.9720797
##   0.01       5                   250     1.201941  0.5882316  0.9401397
##   0.01       5                   300     1.185710  0.5967537  0.9220089
##   0.01       5                   350     1.167212  0.6072465  0.9028121
##   0.01       5                   400     1.157383  0.6121039  0.8933403
##   0.01       5                   450     1.148480  0.6170459  0.8823184
##   0.01       5                   500     1.143795  0.6199530  0.8745758
##   0.01       5                   550     1.140687  0.6212369  0.8727268
##   0.01       5                   600     1.135444  0.6241541  0.8682690
##   0.01       5                   650     1.136242  0.6234792  0.8694747
##   0.01       5                   700     1.137893  0.6220183  0.8709401
##   0.01       5                   750     1.135129  0.6235764  0.8682022
##   0.01       5                   800     1.134210  0.6247533  0.8663565
##   0.01       5                   850     1.131808  0.6258023  0.8643168
##   0.01       5                   900     1.129189  0.6270516  0.8616205
##   0.01       5                   950     1.127407  0.6279617  0.8606197
##   0.01       5                  1000     1.127206  0.6283794  0.8599961
##   0.01       7                   100     1.326611  0.5786090  1.0847976
##   0.01       7                   150     1.244848  0.5911154  1.0035258
##   0.01       7                   200     1.199004  0.6042228  0.9555810
##   0.01       7                   250     1.170475  0.6143877  0.9223474
##   0.01       7                   300     1.157853  0.6198942  0.9107050
##   0.01       7                   350     1.146733  0.6251460  0.8983793
##   0.01       7                   400     1.132562  0.6324873  0.8847830
##   0.01       7                   450     1.125064  0.6367261  0.8763655
##   0.01       7                   500     1.119617  0.6392291  0.8707603
##   0.01       7                   550     1.117285  0.6411367  0.8671246
##   0.01       7                   600     1.113380  0.6428982  0.8629368
##   0.01       7                   650     1.109695  0.6451248  0.8599001
##   0.01       7                   700     1.104532  0.6472997  0.8560713
##   0.01       7                   750     1.104302  0.6470565  0.8552709
##   0.01       7                   800     1.101732  0.6482944  0.8536485
##   0.01       7                   850     1.100361  0.6497801  0.8530984
##   0.01       7                   900     1.097763  0.6511390  0.8517636
##   0.01       7                   950     1.097189  0.6516431  0.8522323
##   0.01       7                  1000     1.094306  0.6528367  0.8497895
##   0.10       1                   100     1.139782  0.6041604  0.8775208
##   0.10       1                   150     1.132027  0.6142042  0.8691031
##   0.10       1                   200     1.116140  0.6282207  0.8779192
##   0.10       1                   250     1.112499  0.6302337  0.8761978
##   0.10       1                   300     1.111850  0.6280776  0.8760848
##   0.10       1                   350     1.117811  0.6266570  0.8751897
##   0.10       1                   400     1.109572  0.6335824  0.8650768
##   0.10       1                   450     1.110829  0.6309095  0.8626886
##   0.10       1                   500     1.115151  0.6276409  0.8630946
##   0.10       1                   550     1.113257  0.6295937  0.8626008
##   0.10       1                   600     1.113128  0.6293539  0.8614671
##   0.10       1                   650     1.117905  0.6264673  0.8650594
##   0.10       1                   700     1.117931  0.6254488  0.8656066
##   0.10       1                   750     1.123672  0.6219291  0.8676270
##   0.10       1                   800     1.124216  0.6234135  0.8661076
##   0.10       1                   850     1.121010  0.6247008  0.8654976
##   0.10       1                   900     1.128553  0.6208985  0.8706600
##   0.10       1                   950     1.131260  0.6197499  0.8751686
##   0.10       1                  1000     1.134479  0.6175351  0.8788523
##   0.10       3                   100     1.192937  0.5831494  0.9534042
##   0.10       3                   150     1.176441  0.5942503  0.9442726
##   0.10       3                   200     1.160444  0.6059577  0.9383723
##   0.10       3                   250     1.151995  0.6100688  0.9303530
##   0.10       3                   300     1.143422  0.6154580  0.9251949
##   0.10       3                   350     1.137039  0.6190863  0.9183965
##   0.10       3                   400     1.134891  0.6208552  0.9162557
##   0.10       3                   450     1.130939  0.6233878  0.9130177
##   0.10       3                   500     1.129706  0.6240104  0.9122626
##   0.10       3                   550     1.129453  0.6242597  0.9124971
##   0.10       3                   600     1.128395  0.6250840  0.9120618
##   0.10       3                   650     1.127809  0.6255418  0.9114271
##   0.10       3                   700     1.126992  0.6260127  0.9107517
##   0.10       3                   750     1.127035  0.6259410  0.9107677
##   0.10       3                   800     1.126634  0.6261981  0.9106233
##   0.10       3                   850     1.126753  0.6261375  0.9107700
##   0.10       3                   900     1.126545  0.6262471  0.9106498
##   0.10       3                   950     1.126522  0.6262460  0.9105503
##   0.10       3                  1000     1.126416  0.6263158  0.9104782
##   0.10       5                   100     1.182585  0.5868787  0.9060703
##   0.10       5                   150     1.157691  0.6021213  0.8882589
##   0.10       5                   200     1.143956  0.6111160  0.8746710
##   0.10       5                   250     1.133866  0.6170608  0.8663558
##   0.10       5                   300     1.129412  0.6201409  0.8630462
##   0.10       5                   350     1.126043  0.6218628  0.8622037
##   0.10       5                   400     1.126764  0.6213953  0.8609887
##   0.10       5                   450     1.125189  0.6221890  0.8603034
##   0.10       5                   500     1.123235  0.6231935  0.8590695
##   0.10       5                   550     1.121713  0.6242633  0.8580728
##   0.10       5                   600     1.121014  0.6246869  0.8578194
##   0.10       5                   650     1.120303  0.6251466  0.8573793
##   0.10       5                   700     1.119608  0.6254000  0.8571426
##   0.10       5                   750     1.119307  0.6255528  0.8571290
##   0.10       5                   800     1.118725  0.6258675  0.8568384
##   0.10       5                   850     1.118305  0.6261355  0.8568044
##   0.10       5                   900     1.118353  0.6260849  0.8570405
##   0.10       5                   950     1.118173  0.6262247  0.8570595
##   0.10       5                  1000     1.117936  0.6263906  0.8570213
##   0.10       7                   100     1.165168  0.6164119  0.9037594
##   0.10       7                   150     1.147902  0.6262323  0.8964773
##   0.10       7                   200     1.138103  0.6315710  0.8912544
##   0.10       7                   250     1.135841  0.6327367  0.8906029
##   0.10       7                   300     1.133264  0.6335503  0.8876008
##   0.10       7                   350     1.131041  0.6344672  0.8865193
##   0.10       7                   400     1.128790  0.6356978  0.8834832
##   0.10       7                   450     1.126209  0.6372801  0.8821103
##   0.10       7                   500     1.125467  0.6372491  0.8824050
##   0.10       7                   550     1.124105  0.6380272  0.8813631
##   0.10       7                   600     1.123055  0.6387294  0.8803206
##   0.10       7                   650     1.122448  0.6391842  0.8801378
##   0.10       7                   700     1.122021  0.6393152  0.8798143
##   0.10       7                   750     1.121867  0.6393796  0.8799372
##   0.10       7                   800     1.121683  0.6395168  0.8798424
##   0.10       7                   850     1.121603  0.6395411  0.8797303
##   0.10       7                   900     1.121495  0.6395804  0.8797665
##   0.10       7                   950     1.121448  0.6396150  0.8796829
##   0.10       7                  1000     1.121375  0.6396165  0.8796685
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  7, shrinkage = 0.01 and n.minobsinnode = 10.
gbmPred <- predict(gbmModel, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = gbmPred, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.3844945 0.5226213 0.9988662

Cubist

set.seed(117)
#refer to text p. 165
cubistGrid <- expand.grid(committees = c(1, 5, 10, 20),
                          neighbors = c(0, 3, 5))

cubistModel <- train(x = TrainX,
                  y = TrainY,
                  method = "cubist",
                  tuneGrid = cubistGrid,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv"))
cubistModel
## Cubist 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 112, 112, 112, 111, 112, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.307594  0.5003035  1.0431963
##    1          3          1.187560  0.5904275  0.9104327
##    1          5          1.211412  0.5699709  0.9321292
##    5          0          1.117679  0.6233613  0.8723219
##    5          3          1.041239  0.6850491  0.8005877
##    5          5          1.065681  0.6617667  0.8061668
##   10          0          1.103046  0.6554743  0.8612666
##   10          3          1.042496  0.7068475  0.7924773
##   10          5          1.062694  0.6875339  0.7969218
##   20          0          1.082658  0.6669867  0.8372747
##   20          3          1.031802  0.7054281  0.7817013
##   20          5          1.049982  0.6910679  0.7951145
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 3.
cubistPred <- predict(cubistModel, newdata = TestX)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = cubistPred, obs = TestY)
##      RMSE  Rsquared       MAE 
## 1.3129126 0.5603709 0.9849868

part a

Which tree-based regression model gives the optimal resampling and test set performance?

Model           Test \(R^2\)   Test RMSE
SVM             0.6191610      1.2695080
MARS            0.6159963      1.2278715
Cubist          0.5603709      1.3129126
GBM             0.5226213      1.3844945
Random Forest   0.5202457      1.3841889
Weighted KNN    0.4733189      1.4676074

The SVM model outperforms all the other models, as it has the highest \(R^2\) on the test set. However, it is still not an ideal model, since it explains only about 62% of the variance in the test data.

Among the tree-based regression models alone, Cubist is the top performer.
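
The cross-validated models can also be compared on their resampling results directly with caret's resamples(). This is a sketch using the train objects fit above; strictly, a paired comparison needs the same resampling indices for every model (a shared trainControl), so treat the output as approximate. The weighted KNN fit is left out because it used bootstrap resampling rather than 10-fold CV.

# Side-by-side resampling summaries for the 10-fold CV models fit above.
resamp <- resamples(list(SVM = svmModel2, MARS = marsModel2, RF = rfModel,
                         GBM = gbmModel, Cubist = cubistModel))
summary(resamp)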

part b

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(cubistModel)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   62.03
## BiologicalMaterial12     59.49
## ManufacturingProcess09   50.63
## BiologicalMaterial03     50.63
## BiologicalMaterial02     43.04
## ManufacturingProcess30   39.24
## ManufacturingProcess04   25.32
## ManufacturingProcess29   25.32
## BiologicalMaterial06     25.32
## ManufacturingProcess26   24.05
## ManufacturingProcess25   24.05
## ManufacturingProcess17   22.78
## ManufacturingProcess39   21.52
## ManufacturingProcess28   18.99
## BiologicalMaterial09     15.19
## ManufacturingProcess42   13.92
## ManufacturingProcess27   11.39
## ManufacturingProcess06   11.39
## ManufacturingProcess33   11.39

part c

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

cubistModel$finalModel
## 
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
## 
## Number of samples: 124 
## Number of predictors: 56 
## 
## Number of committees: 20 
## Number of rules per committee: 1, 5, 1, 4, 1, 2, 1, 1, 3, 2, 1, 2, 1, 2, 1, 3, 2, 3, 1, 4
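
The tuned Cubist model is a committee of rules rather than a single tree, so it cannot be drawn directly. One way to get the requested view is to fit a single CART tree to the same training data and plot it with the yield distribution in each terminal node; the sketch below assumes the rpart and partykit packages are available and reuses the TrainX/TrainY objects from above.

# Fit one regression tree on the chemical manufacturing training set and plot it;
# converting to a party object draws boxplots of yield in the terminal nodes.
library(rpart)
library(partykit)
singleTree <- rpart(Yield ~ ., data = data.frame(TrainX, Yield = TrainY))
plot(as.party(singleTree))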