Do problems 8.1, 8.2, 8.3, and 8.7
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
model1 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
Predictors V6 through V10 have very small importance scores, so the random forest model made essentially no use of them.
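As a quick aside (a sketch, not part of the exercise; rfReduced is a name introduced here), one way to check that V6 through V10 contribute little is to refit using only V1 through V5 and confirm the remaining importance scores stay in the same ballpark.
# Refit with only the informative predictors and compare against rfImp1.
rfReduced <- randomForest(y ~ V1 + V2 + V3 + V4 + V5, data = simulated,
                          importance = TRUE, ntree = 1000)
varImp(rfReduced, scale = FALSE)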
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.29780744
## V2 6.08038134
## V3 0.58410718
## V4 6.93924427
## V5 2.03104094
## V6 0.07947642
## V7 -0.02566414
## V8 -0.11007435
## V9 -0.08839463
## V10 -0.00715093
## duplicate1 3.56411581
The importance score of V1 has decreased. When another predictor that is highly correlated with V1 is added, the splits that would have gone to V1 are shared between V1 and the new predictor, so V1's importance score is diluted.
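To push this further (a sketch; duplicate2 and model2b are names introduced here), a second highly correlated copy of V1 can be added and the model refit, which should spread V1's importance across three columns. The extra column is dropped afterwards so the later models see the same predictors as above.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
model2b <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)
# Drop the extra copy so the cforest, gbm, and Cubist fits below are unchanged.
simulated$duplicate2 <- NULL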
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
model3 <- cforest(y ~., data = simulated)
varimp(model3, conditional = TRUE)
## V1 V2 V3 V4 V5
## 3.173621e+00 4.954327e+00 -2.487929e-03 6.122763e+00 1.157286e+00
## V6 V7 V8 V9 V10
## 6.534901e-05 -2.353746e-02 6.846242e-03 1.737579e-02 1.154302e-02
## duplicate1
## 9.159232e-01
The cforest model shows a slightly different pattern from the traditional random forest, with different importance values but a broadly similar ordering; V4 is again the most important predictor.
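For comparison (a sketch), the unconditional importances can be pulled from the same fit; conditional = TRUE adjusts for correlated predictors, which typically shrinks the scores of V1 and duplicate1 relative to the default.
# Default (unconditional) permutation importance from the same cforest fit.
varimp(model3, conditional = FALSE)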
**Boosted Trees**
library(gbm)
## Loaded gbm 2.1.8
model4 <- gbm(y ~., data = simulated, distribution = "gaussian")
summary(model4)
## var rel.inf
## V4 V4 30.1882249
## V2 V2 23.3402488
## V1 V1 20.2136333
## V5 V5 10.9949556
## duplicate1 duplicate1 7.6076567
## V3 V3 7.3687812
## V7 V7 0.1678235
## V8 V8 0.1186761
## V6 V6 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
As with the previous models, the boosted tree model ranks V4 as the most important predictor.
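The fit above uses gbm's default settings; as a robustness check (a sketch; model4b is a name introduced here), the ranking can be re-checked with a slower learning rate and more trees.
model4b <- gbm(y ~., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.01)
head(summary(model4b, plotit = FALSE), 5)  # top predictors under slower learning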
**Cubist Model**
library(Cubist)
# Drop the y column (second to last, since duplicate1 was appended after y).
Model5 <- cubist(x = simulated[, -(ncol(simulated) - 1)], y = simulated$y, committees = 100)
varImp(Model5)
## Overall
## V1 64.5
## V3 41.0
## V2 60.0
## V4 48.0
## V5 31.0
## V6 9.0
## duplicate1 6.0
## V8 2.0
## V10 0.5
## V7 0.0
## V9 0.0
The Cubist model ranks the variables differently: V1 is the most important predictor, rather than V4 as in the earlier models.
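Sorting the Cubist scores (a sketch; cubImp is a name introduced here) makes the contrast with rfImp2 and the cforest ranking easier to see.
cubImp <- varImp(Model5)
cubImp[order(-cubImp$Overall), , drop = FALSE]  # highest-ranked predictors first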
set.seed(1)
x1 <- sample(0:10000 / 10000, 200, replace = T)
x2 <- sample(0:1000 / 1000, 200, replace = T)
x3 <- sample(0:100 / 100, 200, replace = T)
x4 <- sample(0:10 / 10, 200, replace = T)
y <- x1 + x4 + rnorm(200)
df <- data.frame(x1, x2, x3, x4, y)
library(rpart)
rpartTree <- rpart(y ~ ., data=df)
varImp(rpartTree)
## Overall
## x1 0.7443663
## x2 0.7563594
## x3 0.5903005
## x4 0.4806487
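Only x1 and x4 enter y, yet x2 ranks above x1 and x3 ranks above x4: the single tree is biased toward predictors with more distinct values. Counting the distinct values in the simulated data (a quick sketch) makes the granularity gap explicit.
# x1 is drawn from 10,001 possible values, x4 from only 11.
sapply(df[, c("x1", "x2", "x3", "x4")], function(col) length(unique(col)))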
Boosting fits trees sequentially: each tree is fit to the residuals of the trees that preceded it, so the trees are not independent. The learning rate controls how quickly the learner moves toward the optimal solution. If the step size is too large, it can overshoot the optimum; if it is too small, training takes longer to converge. With a higher learning rate each tree corrects a larger share of the residual, so importance concentrates on fewer predictors, as in the second plot with the 0.9 learning rate.
The model with the 0.1 learning rate should be more predictive of new samples, since a smaller bagging fraction and learning rate make the model less prone to overfitting and better able to generalize.
Increasing the interaction depth decreased the RMSE across the numbers of trees.
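The exercise's figure uses the solubility data, which is not loaded here, but the same concentration effect can be sketched on the simulated data from 8.1 (gbmSlow and gbmFast are names introduced here; the 0.1/0.1 versus 0.9/0.9 settings mirror the two panels in the text).
# n.minobsinnode is lowered because bag.fraction = 0.1 leaves only 20 rows per tree.
gbmSlow <- gbm(y ~., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbmFast <- gbm(y ~., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
# With the aggressive settings, relative influence tends to pile up on the
# first few predictors; with the conservative settings it is spread more evenly.
summary(gbmSlow, plotit = FALSE)
summary(gbmFast, plotit = FALSE)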
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
set.seed(56)
knnmodel2 <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df <- predict(knnmodel2, ChemicalManufacturingProcess)
df <- df %>%
  select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
df.train.x = train_df[,-1]
df.train.y = train_df[,1]
df.test.x = test_df[,-1]
df.test.y = test_df[,1]
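A quick sanity check on the preprocessing (a sketch): the imputation should leave no missing values, and the difference in column counts shows how many near-zero-variance predictors were dropped.
sum(is.na(df))                                 # expect 0 after knnImpute
ncol(ChemicalManufacturingProcess) - ncol(df)  # near-zero-variance columns removed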
**Random Forest**
library(randomForest)
library(caret)
set.seed(10)
rfModel <- train(x = df.train.x,
                 y = df.train.y,
                 method = 'rf',
                 tuneLength = 10)
rfModel
## Random Forest
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6887506 0.6034891 0.5369235
## 8 0.6458874 0.6237886 0.4966979
## 14 0.6399251 0.6223807 0.4892793
## 20 0.6384215 0.6179594 0.4864648
## 26 0.6427075 0.6083136 0.4867058
## 32 0.6421827 0.6067880 0.4856572
## 38 0.6445505 0.6017645 0.4862263
## 44 0.6524093 0.5890632 0.4918243
## 50 0.6547797 0.5848733 0.4929361
## 56 0.6587105 0.5785681 0.4949506
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
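The mtry profile can also be inspected graphically straight from the train object (a sketch).
plot(rfModel, metric = "RMSE")  # resampled RMSE across the mtry values tried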
**Boosted**
set.seed(10)
grid <- expand.grid(n.trees = c(50, 100, 150, 200),
                    interaction.depth = c(1, 5, 10, 15),
                    shrinkage = c(0.01, 0.1, 0.5),
                    n.minobsinnode = c(5, 10, 15))
gbmModel <- train(x = df.train.x,
                  y = df.train.y,
                  method = 'gbm',
                  tuneGrid = grid,
                  verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 1 5 50 0.8861818 0.4429902
## 0.01 1 5 100 0.8122191 0.4898878
## 0.01 1 5 150 0.7641514 0.5171798
## 0.01 1 5 200 0.7311479 0.5349375
## 0.01 1 10 50 0.8865594 0.4541180
## 0.01 1 10 100 0.8106351 0.4943815
## 0.01 1 10 150 0.7621684 0.5158366
## 0.01 1 10 200 0.7305949 0.5299783
## 0.01 1 15 50 0.8875225 0.4488035
## 0.01 1 15 100 0.8124663 0.4930673
## 0.01 1 15 150 0.7621555 0.5187584
## 0.01 1 15 200 0.7298443 0.5332401
## 0.01 5 5 50 0.8298870 0.5342329
## 0.01 5 5 100 0.7418307 0.5549246
## 0.01 5 5 150 0.6962730 0.5721927
## 0.01 5 5 200 0.6702899 0.5863603
## 0.01 5 10 50 0.8344043 0.5176358
## 0.01 5 10 100 0.7478868 0.5390828
## 0.01 5 10 150 0.7050841 0.5542119
## 0.01 5 10 200 0.6830626 0.5643717
## 0.01 5 15 50 0.8465142 0.4980061
## 0.01 5 15 100 0.7628770 0.5258196
## 0.01 5 15 150 0.7183813 0.5437013
## 0.01 5 15 200 0.6946348 0.5550203
## 0.01 10 5 50 0.8222481 0.5492881
## 0.01 10 5 100 0.7337827 0.5631995
## 0.01 10 5 150 0.6871138 0.5810400
## 0.01 10 5 200 0.6628534 0.5928022
## 0.01 10 10 50 0.8335054 0.5215425
## 0.01 10 10 100 0.7478048 0.5396830
## 0.01 10 10 150 0.7040710 0.5557826
## 0.01 10 10 200 0.6818857 0.5656583
## 0.01 10 15 50 0.8458350 0.5031433
## 0.01 10 15 100 0.7629048 0.5259137
## 0.01 10 15 150 0.7204376 0.5398506
## 0.01 10 15 200 0.6968279 0.5511118
## 0.01 15 5 50 0.8265157 0.5393524
## 0.01 15 5 100 0.7345052 0.5640981
## 0.01 15 5 150 0.6892562 0.5785601
## 0.01 15 5 200 0.6649028 0.5910551
## 0.01 15 10 50 0.8335155 0.5227051
## 0.01 15 10 100 0.7468066 0.5437311
## 0.01 15 10 150 0.7023107 0.5589773
## 0.01 15 10 200 0.6791003 0.5701137
## 0.01 15 15 50 0.8467634 0.5013152
## 0.01 15 15 100 0.7631723 0.5291374
## 0.01 15 15 150 0.7193982 0.5427500
## 0.01 15 15 200 0.6953333 0.5537806
## 0.10 1 5 50 0.6847272 0.5476474
## 0.10 1 5 100 0.6778655 0.5535859
## 0.10 1 5 150 0.6768149 0.5574006
## 0.10 1 5 200 0.6758626 0.5601805
## 0.10 1 10 50 0.6901427 0.5412144
## 0.10 1 10 100 0.6822384 0.5487883
## 0.10 1 10 150 0.6777967 0.5563153
## 0.10 1 10 200 0.6767969 0.5594859
## 0.10 1 15 50 0.6940884 0.5315457
## 0.10 1 15 100 0.6852072 0.5432212
## 0.10 1 15 150 0.6796391 0.5504543
## 0.10 1 15 200 0.6790532 0.5536966
## 0.10 5 5 50 0.6529136 0.5839441
## 0.10 5 5 100 0.6397736 0.6016521
## 0.10 5 5 150 0.6367649 0.6063580
## 0.10 5 5 200 0.6356574 0.6078057
## 0.10 5 10 50 0.6522776 0.5845012
## 0.10 5 10 100 0.6409728 0.5996146
## 0.10 5 10 150 0.6329560 0.6103532
## 0.10 5 10 200 0.6297537 0.6146869
## 0.10 5 15 50 0.6710660 0.5604933
## 0.10 5 15 100 0.6550701 0.5806070
## 0.10 5 15 150 0.6456897 0.5931307
## 0.10 5 15 200 0.6401443 0.5996200
## 0.10 10 5 50 0.6518655 0.5873948
## 0.10 10 5 100 0.6430852 0.5999454
## 0.10 10 5 150 0.6411299 0.6027942
## 0.10 10 5 200 0.6404337 0.6039800
## 0.10 10 10 50 0.6659471 0.5708972
## 0.10 10 10 100 0.6484634 0.5937127
## 0.10 10 10 150 0.6424486 0.6026656
## 0.10 10 10 200 0.6422782 0.6035665
## 0.10 10 15 50 0.6692092 0.5635372
## 0.10 10 15 100 0.6511635 0.5844770
## 0.10 10 15 150 0.6434424 0.5950505
## 0.10 10 15 200 0.6370439 0.6029454
## 0.10 15 5 50 0.6445123 0.5958846
## 0.10 15 5 100 0.6350294 0.6080568
## 0.10 15 5 150 0.6334227 0.6105791
## 0.10 15 5 200 0.6332966 0.6111995
## 0.10 15 10 50 0.6556006 0.5808349
## 0.10 15 10 100 0.6413477 0.5990053
## 0.10 15 10 150 0.6356894 0.6058592
## 0.10 15 10 200 0.6328858 0.6097917
## 0.10 15 15 50 0.6624939 0.5701062
## 0.10 15 15 100 0.6482972 0.5888305
## 0.10 15 15 150 0.6419439 0.5964804
## 0.10 15 15 200 0.6383940 0.6009170
## 0.50 1 5 50 0.7649890 0.4746799
## 0.50 1 5 100 0.7684013 0.4822725
## 0.50 1 5 150 0.7672336 0.4872189
## 0.50 1 5 200 0.7651752 0.4896447
## 0.50 1 10 50 0.7393639 0.5010414
## 0.50 1 10 100 0.7305378 0.5189482
## 0.50 1 10 150 0.7290002 0.5241207
## 0.50 1 10 200 0.7309560 0.5230796
## 0.50 1 15 50 0.7308240 0.5030164
## 0.50 1 15 100 0.7404889 0.4990986
## 0.50 1 15 150 0.7399604 0.5011822
## 0.50 1 15 200 0.7442833 0.4982731
## 0.50 5 5 50 0.7652374 0.4823328
## 0.50 5 5 100 0.7641241 0.4836292
## 0.50 5 5 150 0.7639709 0.4838715
## 0.50 5 5 200 0.7639500 0.4839010
## 0.50 5 10 50 0.7606870 0.4761583
## 0.50 5 10 100 0.7596902 0.4783854
## 0.50 5 10 150 0.7599554 0.4784211
## 0.50 5 10 200 0.7598924 0.4786795
## 0.50 5 15 50 0.7380017 0.4975783
## 0.50 5 15 100 0.7384295 0.5004710
## 0.50 5 15 150 0.7383149 0.5010109
## 0.50 5 15 200 0.7390560 0.5005151
## 0.50 10 5 50 0.7704154 0.4734729
## 0.50 10 5 100 0.7692222 0.4747773
## 0.50 10 5 150 0.7692219 0.4748132
## 0.50 10 5 200 0.7692004 0.4748450
## 0.50 10 10 50 0.7692894 0.4684564
## 0.50 10 10 100 0.7657319 0.4735310
## 0.50 10 10 150 0.7655652 0.4747527
## 0.50 10 10 200 0.7654292 0.4750363
## 0.50 10 15 50 0.7464144 0.4916312
## 0.50 10 15 100 0.7456142 0.4961866
## 0.50 10 15 150 0.7435738 0.4989335
## 0.50 10 15 200 0.7434776 0.4993274
## 0.50 15 5 50 0.7970965 0.4377319
## 0.50 15 5 100 0.7956610 0.4396156
## 0.50 15 5 150 0.7953666 0.4398930
## 0.50 15 5 200 0.7955319 0.4397517
## 0.50 15 10 50 0.7345734 0.5013059
## 0.50 15 10 100 0.7331767 0.5038218
## 0.50 15 10 150 0.7338389 0.5036720
## 0.50 15 10 200 0.7337766 0.5036788
## 0.50 15 15 50 0.7488314 0.4859507
## 0.50 15 15 100 0.7457297 0.4918584
## 0.50 15 15 150 0.7440140 0.4948376
## 0.50 15 15 200 0.7440738 0.4950966
## MAE
## 0.7084569
## 0.6423555
## 0.5999429
## 0.5703862
## 0.7093705
## 0.6409595
## 0.5969848
## 0.5680730
## 0.7092613
## 0.6418178
## 0.5971861
## 0.5683856
## 0.6611786
## 0.5821977
## 0.5405946
## 0.5150489
## 0.6649608
## 0.5854547
## 0.5449346
## 0.5226105
## 0.6729642
## 0.5974332
## 0.5577710
## 0.5363203
## 0.6549598
## 0.5745619
## 0.5303373
## 0.5068978
## 0.6635435
## 0.5850749
## 0.5444182
## 0.5218759
## 0.6727529
## 0.5978940
## 0.5598675
## 0.5376782
## 0.6593590
## 0.5752664
## 0.5312511
## 0.5077834
## 0.6636590
## 0.5857195
## 0.5443379
## 0.5217813
## 0.6729232
## 0.5970193
## 0.5584931
## 0.5354213
## 0.5269139
## 0.5225728
## 0.5247743
## 0.5260974
## 0.5252831
## 0.5192826
## 0.5162082
## 0.5177700
## 0.5313570
## 0.5229483
## 0.5190232
## 0.5198918
## 0.5018773
## 0.4929476
## 0.4900240
## 0.4891554
## 0.5005073
## 0.4915170
## 0.4844472
## 0.4824698
## 0.5123633
## 0.5017698
## 0.4952874
## 0.4910297
## 0.4915354
## 0.4866941
## 0.4858184
## 0.4858829
## 0.5069258
## 0.4958276
## 0.4924167
## 0.4936399
## 0.5156449
## 0.5022434
## 0.4964176
## 0.4924929
## 0.4878967
## 0.4813249
## 0.4806941
## 0.4809243
## 0.4963140
## 0.4901471
## 0.4876518
## 0.4864212
## 0.5055348
## 0.4941241
## 0.4891849
## 0.4872332
## 0.6016989
## 0.6022836
## 0.6014021
## 0.5989816
## 0.5780994
## 0.5740305
## 0.5727691
## 0.5744938
## 0.5721260
## 0.5816357
## 0.5781600
## 0.5812817
## 0.5790967
## 0.5784724
## 0.5783322
## 0.5783135
## 0.5974886
## 0.5970656
## 0.5971459
## 0.5971986
## 0.5815954
## 0.5815951
## 0.5808630
## 0.5820511
## 0.6054156
## 0.6038618
## 0.6038410
## 0.6038177
## 0.5974770
## 0.5956630
## 0.5958917
## 0.5957671
## 0.5880511
## 0.5891991
## 0.5873446
## 0.5875142
## 0.6125050
## 0.6122914
## 0.6119624
## 0.6121229
## 0.5792508
## 0.5795749
## 0.5803924
## 0.5803430
## 0.5897518
## 0.5875889
## 0.5865808
## 0.5864940
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
gbmModel$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 68 200 5 0.1 10
gbmModel$finalModel
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 56 predictors of which 54 had non-zero influence.
**Cubist**
set.seed(1)
cubModel <- train(x = df.train.x,
                  y = df.train.y,
                  method = 'cubist')
cubModel
## Cubist
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.9855217 0.3720333 0.7021369
## 1 5 0.9678239 0.3922185 0.6775515
## 1 9 0.9714719 0.3869898 0.6851331
## 10 0 0.6985204 0.5798378 0.5258657
## 10 5 0.6813682 0.5998483 0.5088235
## 10 9 0.6869407 0.5935009 0.5148524
## 20 0 0.6560534 0.6266983 0.4945791
## 20 5 0.6378305 0.6468926 0.4769200
## 20 9 0.6441261 0.6398704 0.4832740
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
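The exercise also asks how the models do on held-out data; a sketch of how the 20% test split could be used to compare the three fits (no results shown, since they depend on the fitted objects and the seed).
# caret::postResample reports RMSE, R-squared, and MAE on the test split.
postResample(predict(rfModel, newdata = df.test.x), df.test.y)
postResample(predict(gbmModel, newdata = df.test.x), df.test.y)
postResample(predict(cubModel, newdata = df.test.x), df.test.y)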
Comparing resampled RMSE, the boosted tree model comes out best (about 0.630), slightly ahead of the Cubist (0.638) and random forest (0.638) models, so its variable importance is examined below.
varImp(gbmModel)
## gbm variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09 22.925
## BiologicalMaterial12 21.277
## ManufacturingProcess31 17.960
## BiologicalMaterial03 17.644
## ManufacturingProcess17 15.682
## BiologicalMaterial11 14.622
## ManufacturingProcess13 12.238
## BiologicalMaterial09 9.518
## ManufacturingProcess06 8.842
## ManufacturingProcess01 5.852
## ManufacturingProcess18 5.764
## ManufacturingProcess29 5.658
## BiologicalMaterial02 5.579
## ManufacturingProcess10 4.691
## BiologicalMaterial06 4.644
## BiologicalMaterial05 4.633
## ManufacturingProcess02 4.434
## ManufacturingProcess14 4.241
## ManufacturingProcess33 4.164
The manufacturing process variables dominate the list of most important predictors. This mirrors the linear and nonlinear models from previous chapters, in which ManufacturingProcess32 was the most important predictor and the process variables outweighed the biological ones.
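To make that claim concrete (a sketch; gbmImp and top20 are names introduced here), the top 20 predictors from the boosted model can be tallied by type.
gbmImp <- varImp(gbmModel)$importance
top20 <- rownames(gbmImp)[order(-gbmImp$Overall)][1:20]
# Count how many of the top 20 are process versus biological predictors.
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))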