library(tidyverse)
library(caret)
library(party)
library(randomForest)
library(gbm)
First, let’s recreate the data as they do in K&J
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Now we can fit a random forest model to this dataset
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.84289608
## V2 6.74508245
## V3 0.67830653
## V4 7.75934674
## V5 2.23628276
## V6 0.11429887
## V7 0.03724747
## V8 -0.05349642
## V9 -0.04495617
## V10 0.03863205
In terms of variable importance, the uninformative predictors (V6–V10) are effectively ignored by this model: their importance scores are near zero, and a few are slightly negative, while V1, V2, V4, and V5 score much higher.
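As a quick sanity check (not part of the original output), we can sort these scores so the informative predictors stand out:
# Sort the importance scores from model1 in decreasing order
rfImp1[order(-rfImp1$Overall), , drop = FALSE]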
# Add correlated predictor
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
# Fit a second random forest model
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.008319352
## V2 6.308908170
## V3 0.571604465
## V4 7.187015958
## V5 2.131040245
## V6 0.211304611
## V7 0.025100355
## V8 -0.116980037
## V9 -0.003679481
## V10 0.024878337
## duplicate1 3.618101735
We see that this duplicate variable has a relatively high importance score, while V1’s importance is now lower. If we add a second variable that is highly correlated with V1, we can see how the importance scores shift further.
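As a quick check (my own addition, not in the original output), we can confirm how strongly the duplicate tracks V1:
# Correlation between V1 and its noisy copy; with noise sd = 0.1 this is typically above 0.9
cor(simulated$V1, simulated$duplicate1)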
# Add a second predictor correlated with V1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
# Fit a third random forest model
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.80558793
## V2 6.24446517
## V3 0.44795506
## V4 7.36291029
## V5 2.28470921
## V6 0.25517978
## V7 -0.01177118
## V8 -0.06210267
## V9 -0.03187802
## V10 0.01454121
## duplicate1 2.49523514
## duplicate2 2.96244300
Here we see the importance of V1 drop again when a second highly correlated variable is added.
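To make the dilution easier to see (this summary is my own addition), we can line up V1’s importance across the three fits:
# Compare V1's importance before and after adding the correlated copies
data.frame(model = c("no duplicates", "one duplicate", "two duplicates"),
           V1_importance = c(rfImp1["V1", "Overall"],
                             rfImp2["V1", "Overall"],
                             rfImp3["V1", "Overall"]))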
# Train a conditional inference forest (bagged conditional inference trees)
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
baggedTree <- cforest(y ~ ., data = simulated, controls = bagCtrl)
varimp(baggedTree, conditional=TRUE)
## V1 V2 V3 V4 V5
## 0.8201613769 4.3762948432 0.0002629888 5.9107383417 0.7994246735
## V6 V7 V8 V9 V10
## -0.0090871345 0.0350111052 -0.0061488850 -0.0057271660 0.0026289503
## duplicate1 duplicate2
## 0.0750520312 0.5457170569
We see a similar pattern here: the uninformative predictors (V6–V10) carry little importance in the conditional inference forest either.
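For comparison (not shown in the original output), the unconditional permutation importance from the same cforest fit can be computed as well; conditioning adjusts for correlated predictors such as the two duplicates, so we would expect their conditional scores to be lower than their unconditional ones.
# Unconditional permutation importance from the same conditional inference forest
varimp(baggedTree, conditional = FALSE)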
Let’s train other types of tree-based models on the same dataset
library(Cubist)
# Cubist model
predictors <- simulated %>% select(-c("y"))
cubistTuned <- cubist(predictors, simulated$y)
varImp(cubistTuned)
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate1 0
## duplicate2 0
# Boosted model
gbmGrid <- expand.grid(interaction.depth = seq(1, 5, by = 2),
                       n.trees = seq(100, 1000, by = 200),
                       shrinkage = c(0.1, 1.0),
                       n.minobsinnode = seq(1, 5, 2))
gbmTuned <- train(predictors, simulated$y, method = "gbm",
                  tuneGrid = gbmGrid, verbose = FALSE)
varImp(gbmTuned)
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 80.0712
## V1 53.9485
## V5 41.4160
## V3 31.9544
## duplicate1 25.8739
## duplicate2 24.6748
## V6 3.3719
## V7 2.5856
## V10 0.3880
## V9 0.2258
## V8 0.0000
In both cases the uninformative variables are again among the least important predictors. In general, the effect of strongly correlated predictors carries through in most tree-based models, regardless of the algorithm used.
Use a simulation to show tree bias with different granularities.
library(rpart)
# Re-simulate data from Friedman
set.seed(12345)
simulated2 <- mlbench.friedman1(150, sd = 1)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"
trainX <- simulated2 %>% dplyr::select(-c("y"))
trainY <- simulated2$y
rpartTune <- train(trainX, trainY, method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
varImp(rpartTune)
## rpart2 variable importance
##
## Overall
## V1 100.000
## V2 83.378
## V5 82.766
## V3 70.448
## V4 40.410
## V7 27.813
## V10 21.770
## V8 15.978
## V9 8.287
## V6 0.000
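The ranking above comes from data where every predictor is continuous, so all predictors offer a similar number of candidate split points. To isolate the granularity effect itself, a minimal sketch (my own illustration, not from K&J) gives a tree two equally uninformative predictors, one continuous and one binary, and tallies which one is chosen at the root split over many replications; the fine-grained predictor tends to be chosen far more often.
# My own sketch: two predictors unrelated to y, one fine-grained (continuous)
# and one coarse (binary); tally which one rpart picks at the root split.
library(rpart)
set.seed(624)
rootVar <- replicate(200, {
  dat <- data.frame(X1 = rnorm(100),                       # many candidate split points
                    X2 = sample(0:1, 100, replace = TRUE), # a single candidate split point
                    y  = rnorm(100))                       # unrelated to both predictors
  fit <- rpart(y ~ X1 + X2, data = dat,
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])   # variable used at the root split
})
table(rootVar)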
The learning rate controls how much each weak learner contributes to the ensemble, and individual learners can latch onto predictors that aren’t truly important in the underlying data. The lower learning rate is less prone to overfitting the training data: importance accumulates from a wider base of input variables rather than depending heavily on a few fortunate early splits, so its importance is spread more evenly across predictors.
Increasing the interaction depth of each tree would likely increase the variance of the individual learners, since additional splits (and therefore additional predictors) are introduced in each tree. All else being equal, this would spread importance across more predictors, producing a more gradual slope in the variable importance plots, akin to what we’d see in the left-hand diagram.
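To see the learning-rate effect directly, a rough sketch (my own addition, using gbm defaults rather than the exact settings from the text) fits the re-simulated data at a low and a high shrinkage value and compares how concentrated the relative influence is:
# My own illustration: relative influence at two learning rates on simulated2
library(gbm)
set.seed(100)
gbmLow  <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, interaction.depth = 3)
gbmHigh <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.9, interaction.depth = 3)
# summary.gbm() reports each predictor's relative influence; the higher
# shrinkage typically concentrates influence on fewer predictors
summary(gbmLow, plotit = FALSE)
summary(gbmHigh, plotit = FALSE)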
Let’s use the same bit of code as we did before to set up our dataset:
# Load chemical data
library(AppliedPredictiveModeling)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:modeltools':
##
## prior
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
library(nnet)
library(ModelMetrics)
##
## Attaching package: 'ModelMetrics'
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, recall, sensitivity, specificity
## The following object is masked from 'package:base':
##
## kappa
data("ChemicalManufacturingProcess")
chemical <- ChemicalManufacturingProcess
chemical_features <- chemical %>% dplyr::select(-c("Yield"))
We’ll impute this data the same way we did in HW7. Note that caret’s knnImpute method also centers and scales the data as part of its distance calculation, so Yield (and therefore the test RMSEs below) ends up on a standardized scale.
# Impute chemical yield data
imputed <- preProcess(chemical,
method = c("knnImpute"))
trans <- predict(imputed, chemical)
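A quick check (my addition) that the imputation left no missing values behind:
# Count remaining NAs after kNN imputation (should be zero)
sum(is.na(trans))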
We’ll also set up a train-test split
# Split into training and test sets
# use roughly 80% of the dataset for training and 20% for testing
sample <- sample(c(TRUE, FALSE), nrow(trans), replace=TRUE, prob=c(0.8,0.2))
train <- trans[sample, ]
train_yield <- train$Yield
train <- train %>%
dplyr::select(-c("Yield"))
test <- trans[!sample, ]
test_yield <- test$Yield
test <- test %>%
dplyr::select(-c("Yield"))
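Since the split is random, it’s worth confirming the resulting sizes (this check is my own addition):
# Number of rows in the training and test sets
c(train = nrow(train), test = nrow(test))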
First, we’ll train a gradient-boosting model via gbm
gbmReg <- train(train, train_yield, method = "gbm",
tuneGrid = gbmGrid, verbose = FALSE)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## (this warning is repeated many times during resampling)
# Calculate RMSE of gradient boosted tree
gbmPred <- predict(gbmReg, test)
ModelMetrics::rmse(test_yield, gbmPred)
## [1] 0.5177699
Next, we can train a decision tree model using rpart
rpartReg <- train(train, train_yield, method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
# Calculate RMSE of the decision tree
rpartPred <- predict(rpartReg, test)
ModelMetrics::rmse(test_yield, rpartPred)
## [1] 0.8569067
Alternatively, we can train a random forest model as well, using the method employed in K&J
library(randomForest)
rfModel <- randomForest(train, train_yield)
rfPred <- predict(rfModel, test)
ModelMetrics::rmse(test_yield, rfPred)
## [1] 0.6332397
Of our three models, the gradient-boosted tree and the random forest perform best in terms of test RMSE. For this exercise, let’s go with the random forest model.
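For convenience (this summary is not part of the original output), the three test-set RMSEs can be collected side by side:
# Gather the test-set RMSEs computed above
data.frame(model = c("gbm", "rpart", "random forest"),
           RMSE = c(ModelMetrics::rmse(test_yield, gbmPred),
                    ModelMetrics::rmse(test_yield, rpartPred),
                    ModelMetrics::rmse(test_yield, rfPred)))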
varImp(rfModel)
## Overall
## BiologicalMaterial01 1.67902382
## BiologicalMaterial02 3.40147646
## BiologicalMaterial03 5.24196093
## BiologicalMaterial04 2.36760632
## BiologicalMaterial05 1.47799119
## BiologicalMaterial06 7.48916035
## BiologicalMaterial07 0.01027183
## BiologicalMaterial08 2.07018614
## BiologicalMaterial09 1.55990457
## BiologicalMaterial10 0.74444586
## BiologicalMaterial11 3.31591748
## BiologicalMaterial12 3.93808689
## ManufacturingProcess01 1.39752441
## ManufacturingProcess02 0.88357900
## ManufacturingProcess03 0.57250982
## ManufacturingProcess04 1.27432140
## ManufacturingProcess05 1.54684552
## ManufacturingProcess06 2.39323873
## ManufacturingProcess07 0.06210052
## ManufacturingProcess08 0.09496793
## ManufacturingProcess09 3.71617432
## ManufacturingProcess10 1.04094596
## ManufacturingProcess11 1.26345005
## ManufacturingProcess12 0.64178056
## ManufacturingProcess13 10.90914605
## ManufacturingProcess14 0.71845775
## ManufacturingProcess15 0.88028614
## ManufacturingProcess16 0.77796395
## ManufacturingProcess17 4.92705146
## ManufacturingProcess18 1.20971720
## ManufacturingProcess19 0.81927941
## ManufacturingProcess20 1.67848016
## ManufacturingProcess21 0.89207950
## ManufacturingProcess22 0.95389797
## ManufacturingProcess23 0.60696971
## ManufacturingProcess24 0.86889551
## ManufacturingProcess25 0.79445574
## ManufacturingProcess26 0.95388372
## ManufacturingProcess27 1.64539807
## ManufacturingProcess28 3.94965411
## ManufacturingProcess29 1.74582360
## ManufacturingProcess30 1.33879634
## ManufacturingProcess31 4.86413758
## ManufacturingProcess32 29.79971956
## ManufacturingProcess33 2.60481728
## ManufacturingProcess34 1.00719264
## ManufacturingProcess35 0.79415002
## ManufacturingProcess36 4.06508284
## ManufacturingProcess37 0.98999732
## ManufacturingProcess38 0.20530543
## ManufacturingProcess39 0.82221283
## ManufacturingProcess40 0.08066279
## ManufacturingProcess41 0.12686738
## ManufacturingProcess42 0.73164040
## ManufacturingProcess43 1.47097926
## ManufacturingProcess44 0.42830306
## ManufacturingProcess45 0.42059994
The biological variables feature more prominently in our random forest model than in previous exercises, where the manufacturing process variables dominated the variable importance plots. ManufacturingProcess32 is still by far the single most important predictor, but several BiologicalMaterial variables now appear near the top of the ranking.
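A sorted view (my own addition) makes the top of the ranking easier to read:
# Show the ten predictors with the largest random forest importance scores
rfImp <- varImp(rfModel)
head(rfImp[order(-rfImp$Overall), , drop = FALSE], 10)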
We can plot the tree trained via rpart using the rpart.plot package:
library(rpart.plot)
rpartTree <- rpart(train_yield ~ ., data = cbind(train, train_yield))
rpart.plot(rpartTree)