Do problems 8.1, 8.2, 8.3, and 8.7

8.1. Recreate the simulated data from Exercise 7.2:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(mlbench)  
set.seed(200)  
simulated <- mlbench.friedman1(200, sd = 1)  
simulated <- cbind(simulated$x, simulated$y)  
simulated <- as.data.frame(simulated)  
colnames(simulated)[ncol(simulated)] <- "y" 
library(randomForest)  
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)  
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
model1 <- randomForest(y ~., data = simulated,   importance = TRUE,  ntree = 1000)  
rfImp1 <- varImp(model1, scale = FALSE) 
rfImp1
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218

Predictors V6-V10 have very small (near-zero or negative) importance scores, so the random forest model did not make significant use of them.
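
As a hedged check (not part of the original output), one could refit the forest using only V1-V5 and confirm that the importance pattern for the informative predictors is essentially unchanged; the object name model1b and the column selection below are illustrative only.

model1b <- randomForest(y ~ ., data = simulated[, c(paste0("V", 1:5), "y")],
                        importance = TRUE, ntree = 1000)
varImp(model1b, scale = FALSE)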

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1  
cor(simulated$duplicate1, simulated$V1)  
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~., data = simulated,   importance = TRUE,  ntree = 1000)  
rfImp2 <- varImp(model2, scale = FALSE) 
rfImp2
##                Overall
## V1          6.29780744
## V2          6.08038134
## V3          0.58410718
## V4          6.93924427
## V5          2.03104094
## V6          0.07947642
## V7         -0.02566414
## V8         -0.11007435
## V9         -0.08839463
## V10        -0.00715093
## duplicate1  3.56411581

The importance score of V1 has decreased (from about 8.8 to 6.3). When another predictor that is highly correlated with V1 is added, the splits that would otherwise go to V1 are shared between V1 and the new predictor, which dilutes the importance across the two.
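
To see the dilution more directly, a hedged sketch (not run in the original) adds a second copy of V1 to a separate copy of the data so that the later code is unaffected; sim2, duplicate2, and model2b are illustrative names.

sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1   # second predictor highly correlated with V1
model2b <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)   # V1's importance should be diluted even further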

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
model3<- cforest(y ~., data = simulated)
varimp(model3, conditional = TRUE)
##            V1            V2            V3            V4            V5 
##  3.173621e+00  4.954327e+00 -2.487929e-03  6.122763e+00  1.157286e+00 
##            V6            V7            V8            V9           V10 
##  6.534901e-05 -2.353746e-02  6.846242e-03  1.737579e-02  1.154302e-02 
##    duplicate1 
##  9.159232e-01

The cforest model shows a somewhat different pattern from the random forest model, with different importance values (under conditional importance, V1 and duplicate1 are penalized further for their correlation), yet a similar ordering, with V4 remaining the most important predictor.
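
For comparison, the unconditional cforest importance (the varimp default) could be computed as well; this is a hedged sketch whose output is not shown in the original, but it would be expected to resemble the traditional randomForest scores more closely, since conditional = TRUE adjusts each predictor's importance for its correlation with the others (here V1 and duplicate1).

varimp(model3, conditional = FALSE)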

** Boosted Trees

library(gbm)
## Loaded gbm 2.1.8
model4 <- gbm(y ~., data = simulated, distribution = "gaussian")

summary(model4)

##                   var    rel.inf
## V4                 V4 30.1882249
## V2                 V2 23.3402488
## V1                 V1 20.2136333
## V5                 V5 10.9949556
## duplicate1 duplicate1  7.6076567
## V3                 V3  7.3687812
## V7                 V7  0.1678235
## V8                 V8  0.1186761
## V6                 V6  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

As with the previous models, the boosted tree model identifies V4 as the most important predictor, while the uninformative predictors V6-V10 receive essentially no influence.
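
As a hedged follow-up (not in the original run), dropping duplicate1 and refitting the boosted model would show how much of V1's influence was being shared with its correlated copy; model4b is an illustrative name.

model4b <- gbm(y ~ ., data = subset(simulated, select = -duplicate1),
               distribution = "gaussian")
summary(model4b, plotit = FALSE)   # print relative influence without the plot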

** Cubist Model

library(Cubist)
# Drop the response column ("y", the next-to-last column) from the predictors
Model5 <- cubist(x = simulated[, -(ncol(simulated) - 1)], y = simulated$y, committees = 100)
varImp(Model5)
##            Overall
## V1            64.5
## V3            41.0
## V2            60.0
## V4            48.0
## V5            31.0
## V6             9.0
## duplicate1     6.0
## V8             2.0
## V10            0.5
## V7             0.0
## V9             0.0

The Cubist model ranks the variables differently from the earlier models, identifying V1 as the most important predictor rather than V4.
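
The scores reported by varImp() for a Cubist model are derived from how often each predictor appears in rule conditions and in the committee linear models; a hedged way to inspect those raw usage statistics (assuming the usage component of the fitted cubist object) is:

Model5$usage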

8.2 Use a simulation to show tree bias with different granularities.

set.seed(1)
# Four predictors on [0, 1] with decreasing granularity:
# x1 has 10,001 possible values, x2 has 1,001, x3 has 101, and x4 only 11.
x1 <- sample(0:10000 / 10000, 200, replace = TRUE)
x2 <- sample(0:1000 / 1000, 200, replace = TRUE)
x3 <- sample(0:100 / 100, 200, replace = TRUE)
x4 <- sample(0:10 / 10, 200, replace = TRUE)

# The response depends only on x1 and x4 (plus noise)
y <- x1 + x4 + rnorm(200)

df <- data.frame(x1, x2, x3, x4, y)
library(rpart)

rpartTree <- rpart(y ~ ., data = df)
varImp(rpartTree)
##      Overall
## x1 0.7443663
## x2 0.7563594
## x3 0.5903005
## x4 0.4806487

Although y was simulated from x1 and x4 only, the single tree assigns x4 the lowest importance, while the fine-grained predictors x1 and x2 (the latter completely unrelated to y) score highest. This illustrates the selection bias of regression trees toward predictors with more distinct values, since those offer more candidate split points.
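
To make the granularity differences concrete, one could count the distinct values of each predictor (a hedged sketch; n_distinct() comes from dplyr, which is already loaded):

# x1 is the finest-grained predictor and x4 the coarsest
sapply(df[, c("x1", "x2", "x3", "x4")], dplyr::n_distinct)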

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Boosting means that each tree depends on the trees before it: the algorithm learns by fitting each new tree to the residuals of the ensemble built so far. The learning rate controls how large a step the learner takes toward the optimal solution; if the step size is too large the learner can overshoot, and if it is too small training takes longer to converge. With a high learning rate and a high bagging fraction, as in the right-hand plot with both parameters at 0.9, each tree contributes a large fraction of its prediction and sees nearly the same data, so the same few strong predictors are selected repeatedly and importance is concentrated on them; with both parameters at 0.1, importance is spread across many more predictors.
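
A hedged sketch of how the two panels of Figure 8.24 might be reproduced with gbm on the solubility data follows; the number of trees and other settings here are assumptions, not the values used in the book.

library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# Left-hand panel: small learning rate and bagging fraction
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand panel: large learning rate and bagging fraction
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

summary(gbmLow, plotit = FALSE)    # importance spread across many predictors
summary(gbmHigh, plotit = FALSE)   # importance concentrated on a few predictors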

The model with both parameters set to 0.1 would be expected to predict new samples better, since a smaller bagging fraction and learning rate make the model less likely to overfit and allow it to generalize better.

Increasing the interaction depth would decrease the RMSE for a given number of trees and, by letting more predictors enter each tree, would spread the importance across more of them.

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

set.seed(56)

# Impute missing values with k-nearest neighbors
knnmodel2 <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df <- predict(knnmodel2, ChemicalManufacturingProcess)

# Drop near-zero-variance predictors
df <- df %>%
  select_at(vars(-one_of(nearZeroVar(., names = TRUE))))

# 80/20 train/test split; Yield is the first column
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
df.train.x <- train_df[, -1]
df.train.y <- train_df[, 1]
df.test.x <- test_df[, -1]
df.test.y <- test_df[, 1]

** Random Forest

library(randomForest)  
library(caret)  
set.seed(10)
rfModel <- train(x = df.train.x,
                 y = df.train.y,
                 method = 'rf',
                 tuneLength = 10)
rfModel
## Random Forest 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.6887506  0.6034891  0.5369235
##    8    0.6458874  0.6237886  0.4966979
##   14    0.6399251  0.6223807  0.4892793
##   20    0.6384215  0.6179594  0.4864648
##   26    0.6427075  0.6083136  0.4867058
##   32    0.6421827  0.6067880  0.4856572
##   38    0.6445505  0.6017645  0.4862263
##   44    0.6524093  0.5890632  0.4918243
##   50    0.6547797  0.5848733  0.4929361
##   56    0.6587105  0.5785681  0.4949506
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.

** Boosted

set.seed(10)
grid <- expand.grid(n.trees=c(50, 100, 150, 200), 
                    interaction.depth=c(1, 5, 10, 15), 
                    shrinkage=c(0.01, 0.1, 0.5), 
                    n.minobsinnode=c(5, 10, 15))
gbmModel <- train(x = df.train.x,
                  y = df.train.y,
                  method = 'gbm', 
                  tuneGrid = grid, 
                  verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  RMSE       Rsquared 
##   0.01        1                  5               50      0.8861818  0.4429902
##   0.01        1                  5              100      0.8122191  0.4898878
##   0.01        1                  5              150      0.7641514  0.5171798
##   0.01        1                  5              200      0.7311479  0.5349375
##   0.01        1                 10               50      0.8865594  0.4541180
##   0.01        1                 10              100      0.8106351  0.4943815
##   0.01        1                 10              150      0.7621684  0.5158366
##   0.01        1                 10              200      0.7305949  0.5299783
##   0.01        1                 15               50      0.8875225  0.4488035
##   0.01        1                 15              100      0.8124663  0.4930673
##   0.01        1                 15              150      0.7621555  0.5187584
##   0.01        1                 15              200      0.7298443  0.5332401
##   0.01        5                  5               50      0.8298870  0.5342329
##   0.01        5                  5              100      0.7418307  0.5549246
##   0.01        5                  5              150      0.6962730  0.5721927
##   0.01        5                  5              200      0.6702899  0.5863603
##   0.01        5                 10               50      0.8344043  0.5176358
##   0.01        5                 10              100      0.7478868  0.5390828
##   0.01        5                 10              150      0.7050841  0.5542119
##   0.01        5                 10              200      0.6830626  0.5643717
##   0.01        5                 15               50      0.8465142  0.4980061
##   0.01        5                 15              100      0.7628770  0.5258196
##   0.01        5                 15              150      0.7183813  0.5437013
##   0.01        5                 15              200      0.6946348  0.5550203
##   0.01       10                  5               50      0.8222481  0.5492881
##   0.01       10                  5              100      0.7337827  0.5631995
##   0.01       10                  5              150      0.6871138  0.5810400
##   0.01       10                  5              200      0.6628534  0.5928022
##   0.01       10                 10               50      0.8335054  0.5215425
##   0.01       10                 10              100      0.7478048  0.5396830
##   0.01       10                 10              150      0.7040710  0.5557826
##   0.01       10                 10              200      0.6818857  0.5656583
##   0.01       10                 15               50      0.8458350  0.5031433
##   0.01       10                 15              100      0.7629048  0.5259137
##   0.01       10                 15              150      0.7204376  0.5398506
##   0.01       10                 15              200      0.6968279  0.5511118
##   0.01       15                  5               50      0.8265157  0.5393524
##   0.01       15                  5              100      0.7345052  0.5640981
##   0.01       15                  5              150      0.6892562  0.5785601
##   0.01       15                  5              200      0.6649028  0.5910551
##   0.01       15                 10               50      0.8335155  0.5227051
##   0.01       15                 10              100      0.7468066  0.5437311
##   0.01       15                 10              150      0.7023107  0.5589773
##   0.01       15                 10              200      0.6791003  0.5701137
##   0.01       15                 15               50      0.8467634  0.5013152
##   0.01       15                 15              100      0.7631723  0.5291374
##   0.01       15                 15              150      0.7193982  0.5427500
##   0.01       15                 15              200      0.6953333  0.5537806
##   0.10        1                  5               50      0.6847272  0.5476474
##   0.10        1                  5              100      0.6778655  0.5535859
##   0.10        1                  5              150      0.6768149  0.5574006
##   0.10        1                  5              200      0.6758626  0.5601805
##   0.10        1                 10               50      0.6901427  0.5412144
##   0.10        1                 10              100      0.6822384  0.5487883
##   0.10        1                 10              150      0.6777967  0.5563153
##   0.10        1                 10              200      0.6767969  0.5594859
##   0.10        1                 15               50      0.6940884  0.5315457
##   0.10        1                 15              100      0.6852072  0.5432212
##   0.10        1                 15              150      0.6796391  0.5504543
##   0.10        1                 15              200      0.6790532  0.5536966
##   0.10        5                  5               50      0.6529136  0.5839441
##   0.10        5                  5              100      0.6397736  0.6016521
##   0.10        5                  5              150      0.6367649  0.6063580
##   0.10        5                  5              200      0.6356574  0.6078057
##   0.10        5                 10               50      0.6522776  0.5845012
##   0.10        5                 10              100      0.6409728  0.5996146
##   0.10        5                 10              150      0.6329560  0.6103532
##   0.10        5                 10              200      0.6297537  0.6146869
##   0.10        5                 15               50      0.6710660  0.5604933
##   0.10        5                 15              100      0.6550701  0.5806070
##   0.10        5                 15              150      0.6456897  0.5931307
##   0.10        5                 15              200      0.6401443  0.5996200
##   0.10       10                  5               50      0.6518655  0.5873948
##   0.10       10                  5              100      0.6430852  0.5999454
##   0.10       10                  5              150      0.6411299  0.6027942
##   0.10       10                  5              200      0.6404337  0.6039800
##   0.10       10                 10               50      0.6659471  0.5708972
##   0.10       10                 10              100      0.6484634  0.5937127
##   0.10       10                 10              150      0.6424486  0.6026656
##   0.10       10                 10              200      0.6422782  0.6035665
##   0.10       10                 15               50      0.6692092  0.5635372
##   0.10       10                 15              100      0.6511635  0.5844770
##   0.10       10                 15              150      0.6434424  0.5950505
##   0.10       10                 15              200      0.6370439  0.6029454
##   0.10       15                  5               50      0.6445123  0.5958846
##   0.10       15                  5              100      0.6350294  0.6080568
##   0.10       15                  5              150      0.6334227  0.6105791
##   0.10       15                  5              200      0.6332966  0.6111995
##   0.10       15                 10               50      0.6556006  0.5808349
##   0.10       15                 10              100      0.6413477  0.5990053
##   0.10       15                 10              150      0.6356894  0.6058592
##   0.10       15                 10              200      0.6328858  0.6097917
##   0.10       15                 15               50      0.6624939  0.5701062
##   0.10       15                 15              100      0.6482972  0.5888305
##   0.10       15                 15              150      0.6419439  0.5964804
##   0.10       15                 15              200      0.6383940  0.6009170
##   0.50        1                  5               50      0.7649890  0.4746799
##   0.50        1                  5              100      0.7684013  0.4822725
##   0.50        1                  5              150      0.7672336  0.4872189
##   0.50        1                  5              200      0.7651752  0.4896447
##   0.50        1                 10               50      0.7393639  0.5010414
##   0.50        1                 10              100      0.7305378  0.5189482
##   0.50        1                 10              150      0.7290002  0.5241207
##   0.50        1                 10              200      0.7309560  0.5230796
##   0.50        1                 15               50      0.7308240  0.5030164
##   0.50        1                 15              100      0.7404889  0.4990986
##   0.50        1                 15              150      0.7399604  0.5011822
##   0.50        1                 15              200      0.7442833  0.4982731
##   0.50        5                  5               50      0.7652374  0.4823328
##   0.50        5                  5              100      0.7641241  0.4836292
##   0.50        5                  5              150      0.7639709  0.4838715
##   0.50        5                  5              200      0.7639500  0.4839010
##   0.50        5                 10               50      0.7606870  0.4761583
##   0.50        5                 10              100      0.7596902  0.4783854
##   0.50        5                 10              150      0.7599554  0.4784211
##   0.50        5                 10              200      0.7598924  0.4786795
##   0.50        5                 15               50      0.7380017  0.4975783
##   0.50        5                 15              100      0.7384295  0.5004710
##   0.50        5                 15              150      0.7383149  0.5010109
##   0.50        5                 15              200      0.7390560  0.5005151
##   0.50       10                  5               50      0.7704154  0.4734729
##   0.50       10                  5              100      0.7692222  0.4747773
##   0.50       10                  5              150      0.7692219  0.4748132
##   0.50       10                  5              200      0.7692004  0.4748450
##   0.50       10                 10               50      0.7692894  0.4684564
##   0.50       10                 10              100      0.7657319  0.4735310
##   0.50       10                 10              150      0.7655652  0.4747527
##   0.50       10                 10              200      0.7654292  0.4750363
##   0.50       10                 15               50      0.7464144  0.4916312
##   0.50       10                 15              100      0.7456142  0.4961866
##   0.50       10                 15              150      0.7435738  0.4989335
##   0.50       10                 15              200      0.7434776  0.4993274
##   0.50       15                  5               50      0.7970965  0.4377319
##   0.50       15                  5              100      0.7956610  0.4396156
##   0.50       15                  5              150      0.7953666  0.4398930
##   0.50       15                  5              200      0.7955319  0.4397517
##   0.50       15                 10               50      0.7345734  0.5013059
##   0.50       15                 10              100      0.7331767  0.5038218
##   0.50       15                 10              150      0.7338389  0.5036720
##   0.50       15                 10              200      0.7337766  0.5036788
##   0.50       15                 15               50      0.7488314  0.4859507
##   0.50       15                 15              100      0.7457297  0.4918584
##   0.50       15                 15              150      0.7440140  0.4948376
##   0.50       15                 15              200      0.7440738  0.4950966
##   MAE      
##   0.7084569
##   0.6423555
##   0.5999429
##   0.5703862
##   0.7093705
##   0.6409595
##   0.5969848
##   0.5680730
##   0.7092613
##   0.6418178
##   0.5971861
##   0.5683856
##   0.6611786
##   0.5821977
##   0.5405946
##   0.5150489
##   0.6649608
##   0.5854547
##   0.5449346
##   0.5226105
##   0.6729642
##   0.5974332
##   0.5577710
##   0.5363203
##   0.6549598
##   0.5745619
##   0.5303373
##   0.5068978
##   0.6635435
##   0.5850749
##   0.5444182
##   0.5218759
##   0.6727529
##   0.5978940
##   0.5598675
##   0.5376782
##   0.6593590
##   0.5752664
##   0.5312511
##   0.5077834
##   0.6636590
##   0.5857195
##   0.5443379
##   0.5217813
##   0.6729232
##   0.5970193
##   0.5584931
##   0.5354213
##   0.5269139
##   0.5225728
##   0.5247743
##   0.5260974
##   0.5252831
##   0.5192826
##   0.5162082
##   0.5177700
##   0.5313570
##   0.5229483
##   0.5190232
##   0.5198918
##   0.5018773
##   0.4929476
##   0.4900240
##   0.4891554
##   0.5005073
##   0.4915170
##   0.4844472
##   0.4824698
##   0.5123633
##   0.5017698
##   0.4952874
##   0.4910297
##   0.4915354
##   0.4866941
##   0.4858184
##   0.4858829
##   0.5069258
##   0.4958276
##   0.4924167
##   0.4936399
##   0.5156449
##   0.5022434
##   0.4964176
##   0.4924929
##   0.4878967
##   0.4813249
##   0.4806941
##   0.4809243
##   0.4963140
##   0.4901471
##   0.4876518
##   0.4864212
##   0.5055348
##   0.4941241
##   0.4891849
##   0.4872332
##   0.6016989
##   0.6022836
##   0.6014021
##   0.5989816
##   0.5780994
##   0.5740305
##   0.5727691
##   0.5744938
##   0.5721260
##   0.5816357
##   0.5781600
##   0.5812817
##   0.5790967
##   0.5784724
##   0.5783322
##   0.5783135
##   0.5974886
##   0.5970656
##   0.5971459
##   0.5971986
##   0.5815954
##   0.5815951
##   0.5808630
##   0.5820511
##   0.6054156
##   0.6038618
##   0.6038410
##   0.6038177
##   0.5974770
##   0.5956630
##   0.5958917
##   0.5957671
##   0.5880511
##   0.5891991
##   0.5873446
##   0.5875142
##   0.6125050
##   0.6122914
##   0.6119624
##   0.6121229
##   0.5792508
##   0.5795749
##   0.5803924
##   0.5803430
##   0.5897518
##   0.5875889
##   0.5865808
##   0.5864940
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 200, interaction.depth =
##  5, shrinkage = 0.1 and n.minobsinnode = 10.
gbmModel$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 68     200                 5       0.1             10
gbmModel$finalModel
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 56 predictors of which 54 had non-zero influence.

** Cubist

set.seed(1)
cubModel <- train(x = df.train.x,
                     y = df.train.y,
                     method = 'cubist')
cubModel
## Cubist 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##    1          0          0.9855217  0.3720333  0.7021369
##    1          5          0.9678239  0.3922185  0.6775515
##    1          9          0.9714719  0.3869898  0.6851331
##   10          0          0.6985204  0.5798378  0.5258657
##   10          5          0.6813682  0.5998483  0.5088235
##   10          9          0.6869407  0.5935009  0.5148524
##   20          0          0.6560534  0.6266983  0.4945791
##   20          5          0.6378305  0.6468926  0.4769200
##   20          9          0.6441261  0.6398704  0.4832740
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.

Among the three models, the boosted tree model gives the lowest resampled RMSE (about 0.63, versus about 0.64 for both the Cubist and random forest models), so its variable importance is examined below; a test-set comparison is sketched after this paragraph.
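
A hedged sketch of how the held-out test-set performance of the three tuned models could be compared (these results were not part of the original output):

postResample(predict(rfModel, df.test.x), df.test.y)
postResample(predict(gbmModel, df.test.x), df.test.y)
postResample(predict(cubModel, df.test.x), df.test.y)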

varImp(gbmModel)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09  22.925
## BiologicalMaterial12    21.277
## ManufacturingProcess31  17.960
## BiologicalMaterial03    17.644
## ManufacturingProcess17  15.682
## BiologicalMaterial11    14.622
## ManufacturingProcess13  12.238
## BiologicalMaterial09     9.518
## ManufacturingProcess06   8.842
## ManufacturingProcess01   5.852
## ManufacturingProcess18   5.764
## ManufacturingProcess29   5.658
## BiologicalMaterial02     5.579
## ManufacturingProcess10   4.691
## BiologicalMaterial06     4.644
## BiologicalMaterial05     4.633
## ManufacturingProcess02   4.434
## ManufacturingProcess14   4.241
## ManufacturingProcess33   4.164

The manufacturing process variables dominate the list of most important predictors, led by ManufacturingProcess32. This mirrors what we saw with the linear and nonlinear models from the previous chapters, where ManufacturingProcess32 was also the most important predictor and the process variables outweighed the biological ones.
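
One way to visualize this dominance (a hedged sketch; output not shown in the original) is to plot the top predictors from the boosted model's importance object:

plot(varImp(gbmModel), top = 20)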