Data 624 HW 9

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
library(vip)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)

rfImp1

##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

varImpPlot(model1)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

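No. The importance scores for V6 to V10 are all close to zero (between about -0.17 and 0.17), far below those of the informative predictors V1, V2, V4, and V5, so the random forest did not rely on the uninformative predictors to any meaningful degree.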

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)
rfImp2

##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

varImpPlot(model2)

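Yes. V1's importance drops from 8.73 to 5.69, while duplicate1 receives an importance of 4.28. The importance that V1 carried alone in the first model is now split between the two correlated predictors.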
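The exercise also asks what happens when a second predictor highly correlated with V1 is added. A minimal sketch of that step follows. The names duplicate2, model3, and rfImp3 are illustrative and not part of the original analysis, and the scores come from randomForest's own importance() function rather than caret's varImp(). No output is shown because the values depend on the random draws.

```r
library(mlbench)
library(randomForest)

# Recreate the simulated data from Exercise 7.2
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(cbind(simulated$x, simulated$y))
colnames(simulated)[ncol(simulated)] <- "y"

# Two noisy copies of V1 (duplicate2 is a hypothetical name)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1

model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

# Unscaled permutation importance (%IncMSE), one row per predictor
rfImp3 <- importance(model3, type = 1, scale = FALSE)
rfImp3[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
```

If the pattern seen with duplicate1 continues, the importance attributed to V1 would be expected to be shared among all three correlated columns, but that should be checked against the actual output.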

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
library(tidyverse)
cformod <- cforest(y ~ ., data = simulated)
varimp(cformod) %>%
  sort(decreasing = TRUE)

##            V4            V2    duplicate1            V1            V5 
##  7.6223892727  6.0579730772  5.0941897280  4.6171158805  1.7161194047 
##            V7            V9            V3            V6           V10 
##  0.0465374951  0.0046062409  0.0003116115 -0.0289427183 -0.0310326410 
##            V8 
## -0.0380965511

barplot(varimp(cformod) %>%
  sort(decreasing = TRUE))

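The pattern is similar but not identical. V4 and V2 remain the two most important predictors, but duplicate1 now ranks third (5.09), ahead of V1 in fourth (4.62), whereas the traditional random forest ranked V1 third and duplicate1 fourth. V3 also falls to essentially zero. What stays the same is that V6 to V10 are still not important. Note that these are the traditional (unconditional) importances, because varimp was called with its default of conditional = FALSE.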
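The question also refers to the conditional argument of varimp, which was left at its default above. The following sketch compares the two measures under stated assumptions: the forest is reduced to 100 trees (an arbitrary choice to keep the conditional computation quick), and the object names imp_traditional, imp_conditional, and comparison are illustrative. No output is shown because it was not part of the original run.

```r
library(mlbench)
library(party)

# Recreate the simulated data with one correlated copy of V1
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(cbind(simulated$x, simulated$y))
colnames(simulated)[ncol(simulated)] <- "y"
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1

# ntree = 100 is an assumption made for speed, not the setting used above
cformod <- cforest(y ~ ., data = simulated,
                   controls = cforest_unbiased(ntree = 100))

imp_traditional <- varimp(cformod, conditional = FALSE)
imp_conditional <- varimp(cformod, conditional = TRUE)

# Side-by-side comparison, ordered by the traditional measure
comparison <- cbind(traditional = imp_traditional,
                    conditional = imp_conditional)
comparison[order(-comparison[, "traditional"]), ]
```

The conditional measure permutes each predictor within groups defined by its correlated predictors, so comparing the V1 and duplicate1 rows in the two columns is the part most relevant to this exercise.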

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosting

library(gbm)

gbmmod <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmmod)

##                   var    rel.inf
## V4                 V4 28.4060721
## V2                 V2 24.7780725
## V1                 V1 17.0887380
## V5                 V5 10.1864440
## duplicate1 duplicate1  9.8217115
## V3                 V3  8.8594164
## V7                 V7  0.4766545
## V6                 V6  0.3828911
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

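For the boosted model, V4 (28.4) and V2 (24.8) are the most important predictors and V1 (17.1) is third. duplicate1 ranks fifth (9.8), so importance is again shared between V1 and its correlated copy, though less evenly than in the random forest. V6 to V10 are still unimportant, each with a relative influence below 0.5.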

Cubist

cubmod <- train(y ~ ., data = simulated, method = "cubist")
summary(cubmod)

## 
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sun May 02 23:00:22 2021
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (12 attributes) from undefined.data
## 
## Model 1:
## 
##   Rule 1/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.936506]
## 
##  outcome = 0.269253 + 8.9 V4 + 7.1 V2 + 5.1 V5 + 4.8 V1 + 3.2 duplicate1
## 
## Model 2:
## 
##   Rule 2/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.990785]
## 
##  outcome = 0.826137 + 9 V4 + 8.3 V1 + 7.3 V2 + 5.2 V5 - 3 V6
## 
## Model 3:
## 
##   Rule 3/1: [105 cases, mean 13.381248, range 3.55596 to 23.3956, est err 2.029922]
## 
##     if
##  V1 <= 0.7340099
##  V3 <= 0.654213
##     then
##  outcome = 2.658355 - 12.6 V3 + 11.6 duplicate1 + 10.2 V4 + 7.8 V2
##            + 2.4 V6 + 1.5 V1 + 0.5 V5
## 
##   Rule 3/2: [20 cases, mean 14.639552, range 8.442596 to 21.62877, est err 2.450924]
## 
##     if
##  V1 > 0.7340099
##  V2 <= 0.5403168
##     then
##  outcome = 2.108552 + 35 V2 + 10.4 V4 - 6 V3 + 1.3 duplicate1 + 0.8 V5
## 
##   Rule 3/3: [57 cases, mean 14.914219, range 4.888355 to 28.38167, est err 2.814725]
## 
##     if
##  V1 <= 0.7340099
##  V3 > 0.654213
##     then
##  outcome = -21.377814 + 25.2 V3 + 11.3 V4 + 11 V1 + 8.1 V2 + 7.1 V5
## 
##   Rule 3/4: [18 cases, mean 18.628002, range 13.07191 to 23.57269, est err 2.682001]
## 
##     if
##  V1 > 0.7340099
##  V2 > 0.5403168
##     then
##  outcome = 43.992161 - 34.9 V2 + 0.2 V4
## 
## Model 4:
## 
##   Rule 4/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 2.058539]
## 
##  outcome = 0.1879 + 9.1 V4 + 7.9 V1 + 7 V5 + 7.2 V2 - 3.1 V6
## 
## Model 5:
## 
##   Rule 5/1: [106 cases, mean 12.285650, range 3.55596 to 23.3956, est err 3.237101]
## 
##     if
##  V2 <= 0.5403168
##     then
##  outcome = -7.104052 + 28.4 V2 + 12.9 duplicate1 + 7.9 V4 + 0.3 V5
## 
##   Rule 5/2: [105 cases, mean 13.381248, range 3.55596 to 23.3956, est err 2.238507]
## 
##     if
##  V1 <= 0.7340099
##  V3 <= 0.654213
##     then
##  outcome = 3.509951 - 13.8 V3 + 12.8 duplicate1 + 9.9 V4 + 7.7 V2
##            + 2.6 V6 + 0.4 V1
## 
##   Rule 5/3: [57 cases, mean 14.914219, range 4.888355 to 28.38167, est err 2.820595]
## 
##     if
##  V1 <= 0.7340099
##  V3 > 0.654213
##     then
##  outcome = -20.764964 + 25.3 V3 + 11.4 V1 + 11.2 V4 + 8.2 V2 + 5.3 V5
## 
##   Rule 5/4: [18 cases, mean 18.628002, range 13.07191 to 23.57269, est err 2.651464]
## 
##     if
##  V1 > 0.7340099
##  V2 > 0.5403168
##     then
##  outcome = 43.756229 - 34.5 V2 + 0.2 V4
## 
## Model 6:
## 
##   Rule 6/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 2.317699]
## 
##  outcome = 1.218663 + 12.6 V1 + 10 V4 + 8.5 V5 - 5.4 duplicate1 + 4.1 V2
##            - 2.6 V6
## 
## Model 7:
## 
##   Rule 7/1: [14 cases, mean 11.101607, range 5.325261 to 17.15359, est err 3.645042]
## 
##     if
##  V2 <= 0.9183624
##  duplicate1 <= 0.07016665
##     then
##  outcome = -2.763226 + 43.1 duplicate1 + 12.8 V2 + 10.6 V4
## 
##   Rule 7/2: [12 cases, mean 14.461304, range 7.444598 to 19.79759, est err 3.295416]
## 
##     if
##  V2 > 0.9183624
##     then
##  outcome = 5.665264 + 11.8 duplicate1 + 2.1 V2 + 1.5 V4
## 
##   Rule 7/3: [100 cases, mean 14.512158, range 3.55596 to 28.38167, est err 2.334867]
## 
##     if
##  V3 > 0.4459752
##  duplicate1 > 0.07016665
##     then
##  outcome = -14.769383 + 18.5 V3 + 13.7 V2 + 9.7 V4 + 7 duplicate1
##            + 2.6 V5
## 
##   Rule 7/4: [83 cases, mean 15.070425, range 5.784235 to 23.57269, est err 3.290021]
## 
##     if
##  V3 <= 0.4459752
##  duplicate1 > 0.07016665
##     then
##  outcome = 6.19234 - 21.2 V3 + 14.9 duplicate1 + 13.1 V2 - 8.1 V1
##            + 7.7 V4
## 
## Model 8:
## 
##   Rule 8/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 2.479266]
## 
##  outcome = 0.203753 + 13 V1 + 10 V4 + 9.3 V5 - 5.7 duplicate1 + 2.7 V2
## 
## Model 9:
## 
##   Rule 9/1: [17 cases, mean 10.657375, range 5.325261 to 17.15359, est err 4.792915]
## 
##     if
##  duplicate1 <= 0.07016665
##     then
##  outcome = -2.702351 + 43.8 duplicate1 + 15.2 V2 + 10.1 V4
## 
##   Rule 9/2: [78 cases, mean 13.686872, range 3.55596 to 28.38167, est err 2.377873]
## 
##     if
##  V2 <= 0.7803221
##  V3 > 0.4459752
##  duplicate1 > 0.07016665
##     then
##  outcome = -15.617722 + 20.1 V3 + 16.6 V2 + 10.1 V4 + 6.1 duplicate1
## 
##   Rule 9/3: [68 cases, mean 14.288445, range 5.784235 to 23.57269, est err 3.086318]
## 
##     if
##  V2 <= 0.7803221
##  V3 <= 0.4459752
##  duplicate1 > 0.07016665
##     then
##  outcome = 4.558539 - 18.8 V3 + 16.2 V2 + 13.9 duplicate1 + 8.1 V4
##            - 6.6 V1 - 0.2 V6
## 
##   Rule 9/4: [17 cases, mean 16.810188, range 7.444598 to 25.01616, est err 4.524103]
## 
##     if
##  V1 <= 0.671787
##  V2 > 0.7803221
##  V3 > 0.6060145
##     then
##  outcome = 5.227253 + 36.2 V3 - 28.3 V2 + 15.8 duplicate1 + 6.3 V4
## 
##   Rule 9/5: [52 cases, mean 16.931290, range 8.442596 to 26.94567, est err 4.433290]
## 
##     if
##  V1 > 0.671787
##     then
##  outcome = 29.735355 - 24.5 V1 + 7.4 V4 + 0.9 V2 + 0.7 duplicate1
##            - 0.2 V6
## 
##   Rule 9/6: [16 cases, mean 17.710804, range 9.597466 to 22.05247, est err 4.757797]
## 
##     if
##  V1 <= 0.671787
##  V2 > 0.7803221
##  V3 <= 0.6060145
##     then
##  outcome = 39.25704 - 27 V2 + 14.4 V1 - 11.7 V3 + 1.8 duplicate1
## 
## Model 10:
## 
##   Rule 10/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 2.432866]
## 
##  outcome = -1.808888 + 13.8 V1 + 10.1 V4 + 9.8 V5 - 3.7 duplicate1
##            + 3.4 V2
## 
## Model 11:
## 
##   Rule 11/1: [110 cases, mean 14.087054, range 3.55596 to 28.38167, est err 2.779835]
## 
##     if
##  V3 > 0.4459752
##     then
##  outcome = -12.561468 + 17.6 V3 + 14.5 V2 + 9.5 V4 + 5.4 duplicate1
## 
##   Rule 11/2: [90 cases, mean 14.818451, range 5.325261 to 23.57269, est err 3.266586]
## 
##     if
##  V3 <= 0.4459752
##     then
##  outcome = 6.060055 - 18.2 V3 + 16.1 duplicate1 + 14 V2 - 10.5 V1 + 8 V4
##            - 0.2 V6
## 
##   Rule 11/3: [41 cases, mean 17.110712, range 7.444598 to 25.01616, est err 6.489881]
## 
##     if
##  V2 > 0.7803221
##     then
##  outcome = 8.034255 + 2.3 V2 + 1.6 V4 + 1.4 duplicate1 - 0.6 V6
## 
##   Rule 11/4: [35 cases, mean 17.380442, range 7.444598 to 25.01616, est err 2.897774]
## 
##     if
##  V1 <= 0.7421085
##  V2 > 0.7803221
##     then
##  outcome = 40.756111 - 31.7 V2 + 11.1 V1 + 1 V4 + 0.9 duplicate1 - 0.4 V6
## 
## Model 12:
## 
##   Rule 12/1: [164 cases, mean 13.949840, range 3.55596 to 28.38167, est err 2.349439]
## 
##     if
##  V1 <= 0.7514832
##     then
##  outcome = -4.684432 + 15.1 V1 + 11.1 V4 + 9.8 V5 + 5.1 V2
## 
##   Rule 12/2: [36 cases, mean 16.540636, range 8.442596 to 23.57269, est err 2.349583]
## 
##     if
##  V1 > 0.7514832
##     then
##  outcome = 0.145388 + 10.2 V4 + 7.4 V5 + 7 V1 + 2.9 V2
## 
## Model 13:
## 
##   Rule 13/1: [110 cases, mean 14.087054, range 3.55596 to 28.38167, est err 2.850824]
## 
##     if
##  V3 > 0.4459752
##     then
##  outcome = -11.696705 + 18.1 V3 + 14 V2 + 8.9 V4 + 3.9 V1
## 
##   Rule 13/2: [90 cases, mean 14.818451, range 5.325261 to 23.57269, est err 3.362806]
## 
##     if
##  V3 <= 0.4459752
##     then
##  outcome = 6.742455 - 20.9 V3 + 12.8 duplicate1 + 13.4 V2 + 8.3 V4
##            - 6.6 V1
## 
##   Rule 13/3: [41 cases, mean 17.110712, range 7.444598 to 25.01616, est err 2.771123]
## 
##     if
##  V2 > 0.7803221
##     then
##  outcome = 44.851998 - 33.1 V2 + 1.5 V4 + 1.2 duplicate1
## 
## Model 14:
## 
##   Rule 14/1: [164 cases, mean 13.949840, range 3.55596 to 28.38167, est err 2.508220]
## 
##     if
##  V1 <= 0.7514832
##     then
##  outcome = -4.057296 + 15.4 V1 + 11 V4 + 10.1 V5 + 3.1 V2 - 0.4 V6
##            + 0.2 V3
## 
##   Rule 14/2: [36 cases, mean 16.540636, range 8.442596 to 23.57269, est err 2.458088]
## 
##     if
##  V1 > 0.7514832
##     then
##  outcome = -1.343076 + 10.7 V4 + 8.6 V1 + 8.4 V5 - 2.1 V6 + 1.9 V2
##            + 1.2 V3
## 
## Model 15:
## 
##   Rule 15/1: [85 cases, mean 13.413920, range 3.55596 to 28.38167, est err 2.607504]
## 
##     if
##  V2 <= 0.7803221
##  V3 > 0.4459752
##     then
##  outcome = -11.744776 + 17.3 V3 + 15.7 V2 + 8.7 V4 + 3.7 V1
##            + 0.2 duplicate1
## 
##   Rule 15/2: [74 cases, mean 14.074516, range 5.325261 to 23.57269, est err 3.106939]
## 
##     if
##  V2 <= 0.7803221
##  V3 <= 0.4459752
##     then
##  outcome = 6.285717 - 21.4 V3 + 15.1 V2 + 8.2 V4 + 6.9 duplicate1
## 
##   Rule 15/3: [41 cases, mean 17.110712, range 7.444598 to 25.01616, est err 2.792759]
## 
##     if
##  V2 > 0.7803221
##     then
##  outcome = 43.111191 - 30.3 V2 + 1.6 V4 + 1.3 duplicate1
## 
## Model 16:
## 
##   Rule 16/1: [164 cases, mean 13.949840, range 3.55596 to 28.38167, est err 2.496711]
## 
##     if
##  V1 <= 0.7514832
##     then
##  outcome = -4.747534 + 14.9 V1 + 11.7 V4 + 9.9 V5 + 3.6 V2
##            + 0.5 duplicate1 + 0.2 V3
## 
##   Rule 16/2: [36 cases, mean 16.540636, range 8.442596 to 23.57269, est err 2.689233]
## 
##     if
##  V1 > 0.7514832
##     then
##  outcome = -2.046392 + 10.8 V4 + 7.5 V5 + 4.3 V1 + 3.1 duplicate1
##            + 2.1 V2 + 1.4 V3
## 
## Model 17:
## 
##   Rule 17/1: [85 cases, mean 13.413920, range 3.55596 to 28.38167, est err 2.580971]
## 
##     if
##  V2 <= 0.7803221
##  V3 > 0.4459752
##     then
##  outcome = -11.940892 + 18.3 V3 + 15.5 V2 + 8.3 V4 + 4 duplicate1
## 
##   Rule 17/2: [74 cases, mean 14.074516, range 5.325261 to 23.57269, est err 3.121916]
## 
##     if
##  V2 <= 0.7803221
##  V3 <= 0.4459752
##     then
##  outcome = 6.54601 - 21.9 V3 + 14.6 V2 + 7.7 duplicate1 + 7.9 V4
## 
##   Rule 17/3: [41 cases, mean 17.110712, range 7.444598 to 25.01616, est err 2.776035]
## 
##     if
##  V2 > 0.7803221
##     then
##  outcome = 43.613615 - 31 V2 + 1.4 duplicate1 + 1.4 V4
## 
## Model 18:
## 
##   Rule 18/1: [164 cases, mean 13.949840, range 3.55596 to 28.38167, est err 2.499548]
## 
##     if
##  V1 <= 0.7514832
##     then
##  outcome = -5.392214 + 15.7 V1 + 12.2 V4 + 9.7 V5 + 4.4 V2
## 
##   Rule 18/2: [36 cases, mean 16.540636, range 8.442596 to 23.57269, est err 2.627912]
## 
##     if
##  V1 > 0.7514832
##     then
##  outcome = -1.4426 + 11.2 V4 + 7.6 V5 + 7.1 V1 + 2.6 V2
## 
## Model 19:
## 
##   Rule 19/1: [84 cases, mean 13.374577, range 3.55596 to 28.38167, est err 2.586975]
## 
##     if
##  V2 <= 0.770291
##  V3 > 0.4459752
##     then
##  outcome = -11.572484 + 18.8 V3 + 15 V2 + 7.7 V4 + 4 duplicate1
## 
##   Rule 19/2: [73 cases, mean 13.980781, range 5.325261 to 23.57269, est err 3.218686]
## 
##     if
##  V2 <= 0.770291
##  V3 <= 0.4459752
##     then
##  outcome = 6.635832 - 20.7 V3 + 14.2 V2 + 13.1 duplicate1 + 7.7 V4
##            - 6.1 V1
## 
##   Rule 19/3: [43 cases, mean 17.190119, range 7.444598 to 25.01616, est err 2.792291]
## 
##     if
##  V2 > 0.770291
##     then
##  outcome = 44.540363 - 31.6 V2 + 0.9 duplicate1 + 0.9 V4
## 
## Model 20:
## 
##   Rule 20/1: [164 cases, mean 13.949840, range 3.55596 to 28.38167, est err 2.555268]
## 
##     if
##  V1 <= 0.7514832
##     then
##  outcome = -5.959895 + 16 V1 + 12.6 V4 + 9.7 V5 + 4.8 V2
## 
##   Rule 20/2: [36 cases, mean 16.540636, range 8.442596 to 23.57269, est err 3.267745]
## 
##     if
##  V1 > 0.7514832
##     then
##  outcome = 6.763101 + 10.5 V5 + 10.6 V4 - 3.4 V6 + 0.4 V1
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           1.563071
##     Relative |error|               0.39
##     Correlation coefficient        0.92
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##     37%    47%    V3
##     35%    76%    V1
##     25%    99%    V2
##      8%    66%    duplicate1
##           100%    V4
##            62%    V5
##            31%    V6
## 
## 
## Time: 0.1 secs

varImp(cubmod)

## cubist variable importance
## 
##            Overall
## V2          100.00
## V1           89.52
## V4           80.65
## V3           67.74
## duplicate1   59.68
## V5           50.00
## V6           25.00
## V7            0.00
## V8            0.00
## V9            0.00
## V10           0.00

plot(varImp(cubmod))

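In the Cubist model, V2 is the most important predictor (100), followed by V1 (89.52) and V4 (80.65), with duplicate1 fifth (59.68). These scores are rescaled by caret so that the top predictor equals 100, so their magnitudes cannot be compared directly with the random forest or boosting scores. Unlike the other models, Cubist still ranks V1 well above duplicate1. V7 to V10 receive zero importance, although the uninformative V6 receives 25.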

8.2. Use a simulation to show tree bias with different granularities.

library(rpart)
Y1 <- runif(1000, 2,800)
Y2 <- rnorm(1000, 2,30)
Y3 <- rnorm(1000, 1,1000)
y <- Y2 - Y1 

df <- data.frame(Y1, Y2, Y3, y)

rpartTree <- rpart(y ~ ., data=df)
varImp(rpartTree)

##       Overall
## Y1 3.41469691
## Y2 0.43375419
## Y3 0.04630341

library(partykit)
plot(as.party(rpartTree))

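In this simulation, Y1 dominates the tree: its importance (3.41) is far higher than that of Y2 (0.43) or Y3 (0.05), so nearly all of the splits are made on Y1 even though other predictors are available. Because y is defined as Y2 - Y1 and Y1 varies over a much wider range than Y2, Y1 also accounts for most of the variance in y, so this result reflects both its real signal and its granularity.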
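To isolate the effect of granularity alone, one option is to make every predictor unrelated to the response and vary only the number of distinct values. The sketch below is an illustration under those assumptions and not part of the original submission; the names low, mid, high, first_split, and split_counts are hypothetical. It fits a one-split tree many times and counts which predictor is chosen at the root.

```r
library(rpart)

set.seed(624)
n_obs  <- 100
n_sims <- 200

# Record which predictor each tree chooses for its first split
first_split <- replicate(n_sims, {
  sim <- data.frame(
    y    = rnorm(n_obs),                         # response, independent of all predictors
    low  = sample(0:1, n_obs, replace = TRUE),   # 2 distinct values
    mid  = sample(1:10, n_obs, replace = TRUE),  # up to 10 distinct values
    high = rnorm(n_obs)                          # about n_obs distinct values
  )
  fit <- rpart(y ~ ., data = sim,
               control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  # The first row of the frame is the root; "<leaf>" means no split was made
  as.character(fit$frame$var[1])
})

split_counts <- table(factor(first_split,
                             levels = c("low", "mid", "high", "<leaf>")))
split_counts
```

Since none of the predictors carries any signal, an unbiased procedure would pick each about equally often. A count that leans toward high would indicate selection bias in favor of predictors with more possible split points.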

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

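The model on the right uses a high learning rate and a high bagging fraction (both 0.9), whereas the model on the left uses low values (both 0.1). With a bagging fraction of 0.9, each tree is built on nearly the same data, so the same few strong predictors are chosen again and again. With a learning rate of 0.9, each of those early trees contributes almost its full prediction, leaving little for later trees to explain. Importance therefore concentrates in the first few predictors. On the left, each tree sees only 10% of the data and contributes only a small fraction of its prediction, so the model is built up from many trees fit to more varied samples, and more predictors get a chance to be used.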

Which model do you think would be more predictive of other samples?

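The one on the left would likely be more predictive of other samples. Its small learning rate and small bagging fraction make the model learn slowly from varied subsets of the data, which reduces the risk of overfitting, and its importance is spread across more predictors instead of resting on only a few.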

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

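Increasing interaction depth would flatten the slope of the importance profile for both models. Deeper trees make more splits per tree, so more predictors are used and importance is spread across more of them.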
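The three answers above can be explored with gbm directly. The sketch below is a rough illustration and not the code behind Figure 8.24: the number of trees (100) and the helper fit_gbm are assumptions, and only the two extreme settings from the exercise are fit. It measures how much of the total relative influence the five top-ranked predictors hold in each model.

```r
library(gbm)
library(AppliedPredictiveModeling)

data(solubility)  # provides solTrainXtrans and solTrainY
train_df <- cbind(solTrainXtrans, Solubility = solTrainY)

# Hypothetical helper: fit a boosted model for a given learning rate,
# bagging fraction, and interaction depth
fit_gbm <- function(rate, fraction, depth = 1) {
  gbm(Solubility ~ ., data = train_df,
      distribution = "gaussian",
      n.trees = 100,               # arbitrary, kept small for speed
      shrinkage = rate,            # learning rate
      bag.fraction = fraction,     # bagging fraction
      interaction.depth = depth)
}

set.seed(624)
gbm_low  <- fit_gbm(0.1, 0.1)
gbm_high <- fit_gbm(0.9, 0.9)

# summary() returns predictors sorted by relative influence (sums to 100)
imp_low  <- summary(gbm_low,  plotit = FALSE)
imp_high <- summary(gbm_high, plotit = FALSE)

# Share of total relative influence held by the five top predictors
top5_share <- c(low  = sum(head(imp_low$rel.inf, 5)),
                high = sum(head(imp_high$rel.inf, 5)))
top5_share
```

Calling fit_gbm again with a larger depth argument gives a way to check the interaction depth answer on the same data.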

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
library(tidyverse)

preP <- preProcess(ChemicalManufacturingProcess, 
                   method = c( "knnImpute", "center", "scale"))
df <- predict(preP, ChemicalManufacturingProcess)
## Restore the response variable values to original
df$Yield <- ChemicalManufacturingProcess$Yield

## Split the data into a training and a test set
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
df.train <- df[trainRows, ]
df.test <- df[-trainRows, ]


colYield <- which(colnames(df) == "Yield")
trainingX <- df.train[, -colYield]
trainingY <- df.train$Yield
testingX <- df.test[, -colYield]
testingY <- df.test$Yield

Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

rfmod <- randomForest(trainingY ~ ., data = trainingX,
                      importance = TRUE,
                      ntree = 1000)

rfpredict <- predict(rfmod, newdata = testingX)

Bagging (conditional inference forest via cforest)

cfmod <- cforest(trainingY ~ ., data = trainingX)
cfpredict <- predict(cfmod, newdata = testingX)

Boosted

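gbmod <- gbm(trainingY ~ ., data = trainingX, distribution = "gaussian")
## State the number of trees explicitly instead of relying on the default
gbpredict <- predict(gbmod, newdata = testingX, n.trees = gbmod$n.trees)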

Cubist

cbmod <- train(trainingX, trainingY, method = "cubist")
cbpredict <- predict(cbmod, newdata = testingX)

library(kableExtra)

resample<-rbind(
  "Random Forest" = postResample(pred = predict(rfmod), obs = trainingY),
  "Bagging" = postResample(pred = predict(cfmod), obs = trainingY),
  "Boosting" = postResample(pred = predict(gbmod), obs = trainingY),
  "Cubist" = postResample(pred = predict(cbmod), obs = trainingY)
)

## Using 100 trees...

resample %>%
  kable() %>%
  kable_styling()

Model          RMSE       Rsquared   MAE
Random Forest  1.1725937  0.6395192  0.8881461
Bagging        1.0986633  0.7404666  0.8432799
Boosting       0.8297382  0.8155654  0.6206663
Cubist         0.1892204  0.9921570  0.1458568

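On the training set, Cubist is the best, with the lowest RMSE (0.19) and the highest R-squared (0.99). These figures are mostly apparent (resubstitution) errors, obtained by predicting the same rows used for fitting, so they are optimistic and are not true resampling estimates. The random forest row is the exception: predict() without newdata returns out-of-bag predictions for a randomForest object, so that row is not directly comparable with the others.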
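A resampling estimate can be obtained from caret by passing a trainControl object to train. The sketch below shows the idea on stand-in simulated data with a single rpart tree so that it runs quickly; the data, the 10-fold setting, and the names ctrl, cv_fit, cv_perf, and apparent_perf are assumptions for illustration. In the exercise the same pattern would be applied to trainingX and trainingY with each tree-based method.

```r
library(caret)
library(mlbench)

# Stand-in data; in the exercise this would be trainingX and trainingY
set.seed(200)
sim  <- mlbench.friedman1(200, sd = 1)
simX <- as.data.frame(sim$x)
simY <- sim$y

# 10-fold cross-validation (an assumed choice of resampling scheme)
ctrl <- trainControl(method = "cv", number = 10)

set.seed(624)
cv_fit <- train(x = simX, y = simY,
                method = "rpart",
                trControl = ctrl)

# Cross-validated performance at the selected tuning value
cv_perf <- getTrainPerf(cv_fit)

# Apparent performance from predicting the same rows used for fitting
apparent_perf <- postResample(pred = predict(cv_fit, simX), obs = simY)

cv_perf
apparent_perf
```

Comparing cv_perf with apparent_perf shows how far the resubstitution figures can differ from a resampled estimate for the same model.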

predictcmp<-rbind(
  "Random Forest" = postResample(pred = rfpredict, obs = testingY),
  "Bagging" = postResample(pred = cfpredict, obs = testingY),
  "Boosting" = postResample(pred = gbpredict, obs = testingY),
  "Cubist" = postResample(pred = cbpredict, obs = testingY)
)
predictcmp %>%
  kable() %>%
  kable_styling()

Model          RMSE      Rsquared   MAE
Random Forest  0.956162  0.6409486  0.7509213
Bagging        1.119648  0.5107594  0.9052674
Boosting       1.198602  0.4794783  0.8598609
Cubist         0.901844  0.6769447  0.7176895

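On the test set, Cubist is again the best, with the lowest RMSE (0.90) and the highest R-squared (0.68), followed by random forest (RMSE 0.96).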

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(cbmod)

## cubist variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   72.92
## ManufacturingProcess17   54.17
## ManufacturingProcess13   47.92
## BiologicalMaterial06     46.88
## BiologicalMaterial02     26.04
## BiologicalMaterial03     26.04
## ManufacturingProcess33   26.04
## ManufacturingProcess04   25.00
## ManufacturingProcess29   20.83
## ManufacturingProcess01   18.75
## ManufacturingProcess19   17.71
## ManufacturingProcess36   17.71
## ManufacturingProcess26   16.67
## ManufacturingProcess25   16.67
## ManufacturingProcess37   15.62
## BiologicalMaterial04     14.58
## ManufacturingProcess27   14.58
## BiologicalMaterial10     12.50
## ManufacturingProcess14   12.50

plot(varImp(cbmod), top = 10)

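The manufacturing process variables dominate the list. ManufacturingProcess32 is the most important predictor, and 7 of the top 10 are process variables; the only biological predictors in the top 10 are BiologicalMaterial06, BiologicalMaterial02, and BiologicalMaterial03. The manufacturing process predictors were also important in the nonlinear models.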

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart.plot)

optimaltree <- rpart(trainingY ~ ., data = trainingX)
rpart.plot(optimaltree)

Yes. The tree reinforces that the manufacturing process predictors have the strongest relationship with yield.
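The question asks for the distribution of yield in the terminal nodes. One way to show that, following the as.party approach used in Exercise 8.2, is to convert the rpart tree to a party object before plotting. The sketch below uses stand-in simulated data, and the names sim_df, sim_tree, and sim_party are illustrative; in the exercise the tree would be optimaltree.

```r
library(rpart)
library(partykit)
library(mlbench)

# Stand-in data; in the exercise the tree would be optimaltree
set.seed(200)
sim    <- mlbench.friedman1(200, sd = 1)
sim_df <- data.frame(sim$x, y = sim$y)

sim_tree <- rpart(y ~ ., data = sim_df)

# Converting to a party object lets plot() draw a boxplot of the response
# in each terminal node instead of only the node mean
sim_party <- as.party(sim_tree)
plot(sim_party)
```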

Maryluz Cruz

2021-05-02
