library(tidyverse)
library(caret)
library(party)
library(randomForest)
library(gbm)

Exercise 8.1

First, let’s recreate the data as they do in K&J

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1) 
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Now we can fit a random forest model to this dataset

model1 <- randomForest(y ~ ., data = simulated, importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)

rfImp1
##         Overall
## V1   8.84289608
## V2   6.74508245
## V3   0.67830653
## V4   7.75934674
## V5   2.23628276
## V6   0.11429887
## V7   0.03724747
## V8  -0.05349642
## V9  -0.04495617
## V10  0.03863205

In terms of variable importance, the uninformative predictors (V6 - V10) are essentially ignored by this model: their importance scores are all near zero, well below the scores of the informative predictors.

# Add correlated predictor
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1

# Fit a second random forest model
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)

rfImp2
##                 Overall
## V1          6.008319352
## V2          6.308908170
## V3          0.571604465
## V4          7.187015958
## V5          2.131040245
## V6          0.211304611
## V7          0.025100355
## V8         -0.116980037
## V9         -0.003679481
## V10         0.024878337
## duplicate1  3.618101735

We see this duplicate variable receives a fairly high importance score of its own, while V1's importance drops. If we add a second highly correlated variable (also derived from V1), we can see how the importance scores shift again.

# Add a second correlated predictor (also derived from V1)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1

# Fit a third random forest model
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)

rfImp3
##                Overall
## V1          4.80558793
## V2          6.24446517
## V3          0.44795506
## V4          7.36291029
## V5          2.28470921
## V6          0.25517978
## V7         -0.01177118
## V8         -0.06210267
## V9         -0.03187802
## V10         0.01454121
## duplicate1  2.49523514
## duplicate2  2.96244300

Here we see the importance for V1 drops again with a second highly-correlated variable added.
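
As a quick check, we can collect V1's score from the three fits above to see the trend directly:

# V1's importance across the three random forest fits
data.frame(fit = c("no duplicates", "one duplicate", "two duplicates"),
           V1_importance = c(rfImp1["V1", "Overall"],
                             rfImp2["V1", "Overall"],
                             rfImp3["V1", "Overall"]))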

# Train a conditional inference forest (mtry = all predictors, i.e. bagging)
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
baggedTree <- cforest(y ~ ., data = simulated, controls = bagCtrl)

varimp(baggedTree, conditional=TRUE)
##            V1            V2            V3            V4            V5 
##  0.8201613769  4.3762948432  0.0002629888  5.9107383417  0.7994246735 
##            V6            V7            V8            V9           V10 
## -0.0090871345  0.0350111052 -0.0061488850 -0.0057271660  0.0026289503 
##    duplicate1    duplicate2 
##  0.0750520312  0.5457170569

We see a similar pattern here: the uninformative predictors (V6 - V10) do not show strong importance in the conditional inference forest.
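
For comparison, the traditional (unconditional) permutation importance from the same cforest fit can be computed by setting conditional = FALSE:

# Unconditional importance from the same conditional inference forest
varimp(baggedTree, conditional = FALSE)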

Let’s train other tree-based and rule-based models on the same dataset

library(Cubist)
# Cubist model
predictors <- simulated %>% select(-c("y"))
cubistTuned <- cubist(predictors, simulated$y)

varImp(cubistTuned)
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0
## duplicate1       0
## duplicate2       0
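
Cubist’s importance scores here reflect how often each predictor is used in the rule conditions and in the rule-level linear models; if desired, the underlying rules and attribute usage can be inspected directly:

# Inspect the Cubist rules and attribute usage behind the importance scores
summary(cubistTuned)
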
# Boosted model
gbmGrid <- expand.grid(interaction.depth = seq(1, 5, by = 2), 
                       n.trees = seq(100, 1000, by = 200), 
                       shrinkage=c(0.1, 1.0),
                       n.minobsinnode=seq(1, 5, 2))
gbmTuned <- train(predictors, simulated$y, method = "gbm",
                  tuneGrid = gbmGrid,verbose = FALSE)


varImp(gbmTuned)
## gbm variable importance
## 
##             Overall
## V4         100.0000
## V2          80.0712
## V1          53.9485
## V5          41.4160
## V3          31.9544
## duplicate1  25.8739
## duplicate2  24.6748
## V6           3.3719
## V7           2.5856
## V10          0.3880
## V9           0.2258
## V8           0.0000

In both cases the uninformative variables (V6 - V10) again rank at or near the bottom. In general, strongly correlated copies tend to be picked up to some degree by most of these models, though the extent varies by algorithm: the boosted model splits importance between V1 and its duplicates, while Cubist ignores the duplicates entirely.
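
As a quick sanity check on how strong that correlation actually is:

# Correlation between V1 and the two noisy copies added above
cor(simulated[, c("V1", "duplicate1", "duplicate2")])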

Exercise 8.2

Use a simulation to show tree bias with different granularities.

library(rpart)

# Re-simulate data from the Friedman benchmark
set.seed(12345)
simulated2 <- mlbench.friedman1(150, sd = 1) 
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"
trainX <- simulated2 %>% dplyr::select(-c("y"))
trainY <- simulated2$y
rpartTune <- train(trainX, trainY, method = "rpart2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))


varImp(rpartTune)
## rpart2 variable importance
## 
##     Overall
## V1  100.000
## V2   83.378
## V5   82.766
## V3   70.448
## V4   40.410
## V7   27.813
## V10  21.770
## V8   15.978
## V9    8.287
## V6    0.000
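
The Friedman predictors above all share the same (continuous) granularity, so as a supplementary sketch we can simulate two uninformative predictors with very different numbers of distinct values and count which one the first split selects over repeated runs (the variable names x_coarse and x_fine are purely illustrative):

# Granularity-bias sketch: neither predictor is related to y, but x_fine
# (continuous) offers far more candidate split points than x_coarse (binary)
set.seed(321)
firstSplit <- replicate(200, {
  dat <- data.frame(
    x_coarse = sample(0:1, 100, replace = TRUE),  # 2 distinct values
    x_fine   = rnorm(100),                        # ~100 distinct values
    y        = rnorm(100)                         # pure-noise response
  )
  fit <- rpart(y ~ x_coarse + x_fine, data = dat,
               control = rpart.control(cp = 0, maxdepth = 1, minsplit = 20))
  as.character(fit$frame$var[1])
})
table(firstSplit)

Since neither predictor is informative, an unbiased procedure would pick each about half the time; CART’s preference for x_fine reflects the larger number of candidate split points it offers.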

Exercise 8.3

The learning rate controls the contributions of weak learners, which could be dependent on factors that aren’t that important in predicting the outcome in the underlying data. In this case, the lower learning rate is less prone to overfitting to training data, since it will have contributions from a wider base of input variables, and not be as dependent on one fortunate training split.

Increasing the interaction depth of each tree would likely result in a higher variance in individual learners, since additional tree splits would be introduced. All things being equal, this would result in a more gradual slope in the Variable importance plots, akin to what we’d see in the left-hand diagram
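
As an illustrative sketch (not the exact models behind the figure in K&J), we can fit gbm directly on the simulated data from Exercise 8.1 with the two parameter combinations from the exercise and compare how concentrated the relative influence is; n.minobsinnode is lowered because a 0.1 bagging fraction of 200 rows leaves only about 20 observations per tree:

# Contrast low vs. high learning rate (shrinkage) and bagging fraction
gbmLow  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 3, n.minobsinnode = 5,
               shrinkage = 0.1, bag.fraction = 0.1)
gbmHigh <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 3, n.minobsinnode = 5,
               shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence tables (plotit = FALSE suppresses the bar plot)
summary(gbmLow, plotit = FALSE)
summary(gbmHigh, plotit = FALSE)

The expectation, per the discussion above, is that the aggressive settings concentrate the relative influence on fewer predictors.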

Exercise 8.7

Let’s use the same bit of code as we did before to set up our dataset:

# Load chemical data
library(AppliedPredictiveModeling)
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:modeltools':
## 
##     prior
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
library(nnet)
library(ModelMetrics)
## 
## Attaching package: 'ModelMetrics'
## The following objects are masked from 'package:caret':
## 
##     confusionMatrix, precision, recall, sensitivity, specificity
## The following object is masked from 'package:base':
## 
##     kappa
data("ChemicalManufacturingProcess")

chemical <- ChemicalManufacturingProcess
chemical_features <- chemical %>% dplyr::select(-c("Yield"))

We’ll impute this data the same way we did in HW7

# Impute chemical yield data
imputed <- preProcess(chemical,
                       method = c("knnImpute"))

trans <- predict(imputed, chemical)

We’ll also set up a train-test split as well

# Split into train and test sets
# use roughly 80% of the dataset for training and 20% for testing
sample <- sample(c(TRUE, FALSE), nrow(trans), replace=TRUE, prob=c(0.8,0.2))
train  <- trans[sample, ]
train_yield <- train$Yield
train <- train %>% 
  dplyr::select(-c("Yield"))
test <- trans[!sample, ]
test_yield <- test$Yield
test <- test  %>% 
  dplyr::select(-c("Yield"))

First, we’ll train a gradient-boosting model via gbm

gbmReg <- train(train, train_yield, method = "gbm",
       tuneGrid = gbmGrid, verbose = FALSE)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## (this warning is repeated for every resample)
# Calculate RMSE of gradient boosted tree
gbmPred <- predict(gbmReg, test)
ModelMetrics::rmse(test_yield, gbmPred)
## [1] 0.5177699

Next, we can train a decision tree model using rpart

rpartReg <- train(train, train_yield, method = "rpart2",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))

# Calculate RMSE of decision tree
rpartPred <- predict(rpartReg, test)
ModelMetrics::rmse(test_yield, rpartPred)
## [1] 0.8569067

Alternatively, we can train a random forest model as well, using the method employed in K&J

library(randomForest)
rfModel <- randomForest(train, train_yield)



rfPred <- predict(rfModel, test)
ModelMetrics::rmse(test_yield, rfPred)
## [1] 0.6332397
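
For a quick side-by-side view, we can collect the three test-set RMSEs computed above into one small table:

# Side-by-side comparison of test-set RMSE for the three models
data.frame(model = c("gbm", "rpart", "random forest"),
           RMSE  = c(ModelMetrics::rmse(test_yield, gbmPred),
                     ModelMetrics::rmse(test_yield, rpartPred),
                     ModelMetrics::rmse(test_yield, rfPred)))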

Of our three models, the gradient-boosted and random forest models perform best in terms of test-set RMSE. For this exercise, let’s go with the random forest model

varImp(rfModel)
##                            Overall
## BiologicalMaterial01    1.67902382
## BiologicalMaterial02    3.40147646
## BiologicalMaterial03    5.24196093
## BiologicalMaterial04    2.36760632
## BiologicalMaterial05    1.47799119
## BiologicalMaterial06    7.48916035
## BiologicalMaterial07    0.01027183
## BiologicalMaterial08    2.07018614
## BiologicalMaterial09    1.55990457
## BiologicalMaterial10    0.74444586
## BiologicalMaterial11    3.31591748
## BiologicalMaterial12    3.93808689
## ManufacturingProcess01  1.39752441
## ManufacturingProcess02  0.88357900
## ManufacturingProcess03  0.57250982
## ManufacturingProcess04  1.27432140
## ManufacturingProcess05  1.54684552
## ManufacturingProcess06  2.39323873
## ManufacturingProcess07  0.06210052
## ManufacturingProcess08  0.09496793
## ManufacturingProcess09  3.71617432
## ManufacturingProcess10  1.04094596
## ManufacturingProcess11  1.26345005
## ManufacturingProcess12  0.64178056
## ManufacturingProcess13 10.90914605
## ManufacturingProcess14  0.71845775
## ManufacturingProcess15  0.88028614
## ManufacturingProcess16  0.77796395
## ManufacturingProcess17  4.92705146
## ManufacturingProcess18  1.20971720
## ManufacturingProcess19  0.81927941
## ManufacturingProcess20  1.67848016
## ManufacturingProcess21  0.89207950
## ManufacturingProcess22  0.95389797
## ManufacturingProcess23  0.60696971
## ManufacturingProcess24  0.86889551
## ManufacturingProcess25  0.79445574
## ManufacturingProcess26  0.95388372
## ManufacturingProcess27  1.64539807
## ManufacturingProcess28  3.94965411
## ManufacturingProcess29  1.74582360
## ManufacturingProcess30  1.33879634
## ManufacturingProcess31  4.86413758
## ManufacturingProcess32 29.79971956
## ManufacturingProcess33  2.60481728
## ManufacturingProcess34  1.00719264
## ManufacturingProcess35  0.79415002
## ManufacturingProcess36  4.06508284
## ManufacturingProcess37  0.98999732
## ManufacturingProcess38  0.20530543
## ManufacturingProcess39  0.82221283
## ManufacturingProcess40  0.08066279
## ManufacturingProcess41  0.12686738
## ManufacturingProcess42  0.73164040
## ManufacturingProcess43  1.47097926
## ManufacturingProcess44  0.42830306
## ManufacturingProcess45  0.42059994

We see that ManufacturingProcess32 dominates the random forest's importance ranking by a wide margin, followed by ManufacturingProcess13, with biological variables such as BiologicalMaterial06 and BiologicalMaterial03 also ranking near the top. The biological materials feature somewhat more prominently here than in the previous exercises, where the manufacturing processes more thoroughly dominated the variable importance plots
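
To focus on the strongest predictors, we can also sort the importance scores and look at just the top 10:

# Top 10 predictors by random forest importance
rfImp <- varImp(rfModel)
head(rfImp[order(-rfImp$Overall), , drop = FALSE], 10)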

We can plot the tree trained via rpart using the rpart.plot package

library(rpart.plot)

rpartTree <- rpart(train_yield ~ ., data = cbind(train, train_yield))
rpart.plot(rpartTree)