In Kuhn and Johnson do problems 8.1, 8.2, 8.3, and 8.7. Please submit a link to your Rpubs and submit the .rmd file as well.
library(AppliedPredictiveModeling)
library(dplyr)
library(forecast)
library(ggplot2)
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(earth)
library(RANN)
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Variable Importance:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
rfImp1 %>%
  mutate(var = rownames(rfImp1)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'blue') +
  labs(title = 'Variable Importance', y = 'Variable')
Based on rfImp1, the uninformative predictors V6 – V10 were not used to any meaningful degree by the random forest model; their importance scores are near zero. The informative predictors, in decreasing order of importance, are V1, V4, V2, V5, and V3.
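For reference, sorting the importance scores makes the gap explicit (a quick check using rfImp1 from above):
rfImp1 %>%
  mutate(var = rownames(rfImp1)) %>%
  arrange(desc(Overall))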
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
#creating copy of original data
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Variable Importance:
model2 <- randomForest(y ~ ., data = simulated2, importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
order(-rfImp2)
## [1] 4 2 1 11 5 3 6 10 9 7 8
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
rfImp2 %>%
  mutate(var = rownames(rfImp2)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'blue') +
  labs(title = 'Variable Importance', y = 'Variable')
Adding a highly correlated predictor shifted the importance scores of all predictors. V1 was demoted from most important to third most important, alongside its highly correlated duplicate in fourth place.
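The question also asks what happens when yet another predictor correlated with V1 is added. A minimal sketch of that extension (duplicate2 is a hypothetical name; exact scores will vary between runs):
#add a second noisy copy of V1 and refit
simulated3 <- simulated2
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1
model2b <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)
With two correlated copies in the model, the importance that originally belonged to V1 is expected to be diluted further, split among V1, duplicate1, and duplicate2.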
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Variable Importance Orders:
library(party)
model3 <- cforest(y ~., data = simulated)
order(-varimp(model3, conditional = FALSE)) #default conditional: FALSE
## [1] 1 4 2 5 7 3 9 6 10 8
order(-varimp(model3, conditional = TRUE))
## [1] 4 1 2 5 3 6 9 8 10 7
No, the varimp importances for the cforest model of the simulated data are not in the same order as the importances from the randomForest model. The broad pattern is the same, though: the informative predictors V1 – V5 occupy the top five positions in every ranking, while the uninformative predictors V6 – V10 stay at the bottom of each list (the exact ordering among those near-zero scores varies from run to run).
randomForest & caret varImp: 1, 4, 2, 5, 3, 6, 7, 10, 9, 8
cforest & party varimp, conditional = FALSE: 1, 4, 2, 5, 3, 6, 9, 7, 8, 10
cforest & party varimp, conditional = TRUE: 4, 1, 2, 5, 3, 7, 9, 6, 8, 10
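To see the scores themselves rather than just index positions, the importances can be sorted with names attached (a quick check; the values are stochastic and vary between runs):
sort(varimp(model3, conditional = FALSE), decreasing = TRUE)
sort(varimp(model3, conditional = TRUE), decreasing = TRUE)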
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Variable Importance:
library(Cubist)
simulated_x <- subset(simulated, select = -c(y))
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
order(-varImp(cubist_model))
## [1] 1 3 4 2 5 6 7 8 9 10
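The same ranking with predictor names attached (a quick check on the Cubist importances):
cubistImp <- varImp(cubist_model)
cubistImp[order(-cubistImp$Overall), , drop = FALSE]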
Variable Importance:
library(gbm)
set.seed(624)
gbm_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_model)
## var rel.inf
## V4 V4 30.5402762
## V1 V1 26.9937619
## V2 V2 22.6797905
## V5 V5 11.6078356
## V3 V3 7.4358856
## V9 V9 0.4541585
## V6 V6 0.2882918
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
The Cubist model ranks V1 as the most important variable, while the boosted tree (gbm) model ranks V4 as the most important predictor. In both models the uninformative predictors V6 – V10 receive little to no importance, so the same overall pattern holds.
Use a simulation to show tree bias with different granularities.
Variance of each predictor:
set.seed(200)
#samples for predictors
low <- sample(0:50, 500, replace = T)
medium <- sample(0:500, 500, replace = T)
high <- sample(0:5000, 500, replace = T)
#response (noise length matches the 500 samples)
y <- low + medium + high + rnorm(500)
#check variance of predictors
var(low)
## [1] 213.2169
var(medium)
## [1] 21668.07
var(high)
## [1] 2065662
Variable importance:
sim_data <- data.frame(low, medium, high, y)
diff_gran_model <- randomForest(y ~., data = sim_data, importance = TRUE, ntree = 1000)
varImp(diff_gran_model, scale=FALSE)
## Overall
## low 6282.771
## medium 28643.290
## high 3663332.591
First the sample data are created with sample(). The low variable can take only 51 distinct values, so it is the most coarsely grained predictor and has the smallest variance. The medium and high variables can take far more distinct values, and their variances are correspondingly larger; high is the most finely grained predictor.
The resulting random forest confirms the selection bias of tree-based models: the predictors with more distinct values (higher variance, finer granularity) receive the highest importance, with high ranked first, followed by medium and then low.
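A single CART tree shows the same preference. This is a minimal sketch that simply counts how often each predictor is chosen as a splitting variable:
library(rpart)
single_tree <- rpart(y ~ ., data = sim_data)
#tabulate the splitting variables (leaves excluded)
table(droplevels(single_tree$frame$var[single_tree$frame$var != "<leaf>"]))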
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The model on the right (shrinkage = 0.9, bag.fraction = 0.9) focuses its importance on just the first few predictors. With a bagging fraction of 0.9, nearly the entire training set is used to build every tree, so the trees are very similar to one another and repeatedly select the same dominant predictors. The learning rate (shrinkage) of 0.9 is also far above the range that "usually works" according to the gbm documentation; a higher learning rate means a larger fraction of each tree's predictions is added to the running prediction, so the model commits quickly to the few strongest predictors.
The model on the left (shrinkage = 0.1, bag.fraction = 0.1) spreads importance across more predictors. Each tree sees only 10% of the training observations, which introduces randomness into which predictors get selected, and the small learning rate means the fit is built up from many small contributions, so importance is distributed over a larger set of predictors.
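As a sketch of how this contrast could be reproduced on the solubility data with gbm.fit (the number of trees and other settings here are illustrative, not the tuning behind Fig. 8.24):
library(gbm)
library(AppliedPredictiveModeling)
data(solubility) #provides solTrainXtrans and solTrainY
gbm_left <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    shrinkage = 0.1, bag.fraction = 0.1, n.trees = 100, verbose = FALSE)
gbm_right <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     shrinkage = 0.9, bag.fraction = 0.9, n.trees = 100, verbose = FALSE)
#compare the top of each relative-influence ranking
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)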
(b) Which model do you think would be more predictive of other samples?
Of the two models, the model on the left (shrinkage = 0.1, bag.fraction = 0.1) would be more predictive of other samples. Its lower learning rate and smaller bagging fraction make it learn slowly and draw on a broader set of predictors, so it is less likely to overfit the training data than the model on the right (shrinkage = 0.9, bag.fraction = 0.9).
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Interaction depth, also known as tree size or tree complexity,
controls the maximum size of each tree. As noted above, shrinkage is
inversely related to interaction depth. The left (shrinkage
= 0.1, bag.fraction
= 0.1) might benefit from an increase
in interaction depth, while the right might not since the shrinkage
parameter is fairly high.
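A sketch of the corresponding experiment, reusing the solubility objects loaded in the sketch above (settings are purely illustrative):
gbm_left_deep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                         shrinkage = 0.1, bag.fraction = 0.1, n.trees = 100,
                         interaction.depth = 10, verbose = FALSE)
head(summary(gbm_left_deep, plotit = FALSE), 10)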
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
#load data
data(ChemicalManufacturingProcess)
set.seed(624)
#impute, remove near zero values, and remove correlated values using caret
imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)
index_chem <- createDataPartition(full_data$Yield , p=.8, list=F)
train_chem <- full_data[index_chem,]
test_chem <- full_data[-index_chem,]
train_predictors <- train_chem[-c(1)]
test_predictors <- test_chem[-c(1)]
(a) Which tree-based regression model gives the optimal resampling and test set performance?
The random forest model gives the best test set performance, with the lowest RMSE (0.67) and the highest \(R^2\) (0.73). Cubist is a close second (RMSE 0.67, \(R^2\) 0.69), while the boosted model lags behind (RMSE 1.11, \(R^2\) 0.56).
set.seed(624)
#fit the model (note: randomForest's argument is spelled `ntree`; the misspelled
#`ntrees` is silently ignored, so the default of 500 trees is grown, as the output below shows)
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, ntrees = 1000)
rf_model
##
## Call:
## randomForest(x = train_predictors, y = train_chem$Yield, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 0.3711346
## % Var explained: 59.85
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.6715320 0.7323224 0.4828902
set.seed(624)
gbm_model <- gbm.fit(train_predictors, train_chem$Yield, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9239 nan 0.0010 0.0004
## 2 0.9231 nan 0.0010 0.0006
## 3 0.9224 nan 0.0010 0.0007
## 4 0.9217 nan 0.0010 0.0006
## 5 0.9211 nan 0.0010 0.0006
## 6 0.9203 nan 0.0010 0.0006
## 7 0.9196 nan 0.0010 0.0007
## 8 0.9190 nan 0.0010 0.0006
## 9 0.9183 nan 0.0010 0.0007
## 10 0.9175 nan 0.0010 0.0005
## 20 0.9111 nan 0.0010 0.0007
## 40 0.8980 nan 0.0010 0.0006
## 60 0.8865 nan 0.0010 0.0004
## 80 0.8743 nan 0.0010 0.0005
## 100 0.8628 nan 0.0010 0.0002
gbm_model
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 46 predictors of which 7 had non-zero influence.
gbmPred <- predict(gbm_model, newdata = test_predictors)
postResample(pred = gbmPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 1.1097582 0.5627084 0.8634997
set.seed(624)
cube_model <- cubist(train_predictors, train_chem$Yield)
cube_model
##
## Call:
## cubist.default(x = train_predictors, y = train_chem$Yield)
##
## Number of samples: 144
## Number of predictors: 46
##
## Number of committees: 1
## Number of rules: 2
cubePred <- predict(cube_model, newdata = test_predictors)
postResample(pred = cubePred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.6724398 0.6903309 0.5159054
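The question also asks about resampling performance. A sketch of how a cross-validated comparison could be run with caret (default tuning grids, so the numbers will not match the single-split results above exactly):
set.seed(624)
ctrl <- trainControl(method = "cv", number = 10)
rf_cv <- train(train_predictors, train_chem$Yield, method = "rf", trControl = ctrl)
gbm_cv <- train(train_predictors, train_chem$Yield, method = "gbm", trControl = ctrl, verbose = FALSE)
cube_cv <- train(train_predictors, train_chem$Yield, method = "cubist", trControl = ctrl)
summary(resamples(list(rf = rf_cv, gbm = gbm_cv, cubist = cube_cv)))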
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
ManufacturingProcess variables dominate the list, accounting for six of the top ten important predictors; the remaining four are BiologicalMaterial variables. This mix is fairly consistent with the important variables from the optimal nonlinear model, but less so with the optimal linear model.
Top 10:
ManufacturingProcess32, BiologicalMaterial06, BiologicalMaterial03,
ManufacturingProcess13, ManufacturingProcess36, BiologicalMaterial11,
ManufacturingProcess09, BiologicalMaterial08, ManufacturingProcess28,
ManufacturingProcess11
head(varImp(rf_model),10)
## Overall
## BiologicalMaterial01 2.166860
## BiologicalMaterial03 11.767253
## BiologicalMaterial05 3.656294
## BiologicalMaterial06 12.943335
## BiologicalMaterial08 8.546350
## BiologicalMaterial09 7.019902
## BiologicalMaterial10 2.524260
## BiologicalMaterial11 9.396847
## ManufacturingProcess01 6.146426
## ManufacturingProcess02 4.739992
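The rows above are simply the first ten predictors in column order; sorting by the Overall score recovers the top ten listed earlier (a quick check):
rf_imp <- varImp(rf_model)
head(rf_imp[order(-rf_imp$Overall), , drop = FALSE], 10)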
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Yes, this view of the data provides additional knowledge about the biological and process predictors and their relationship with yield. The root node splits on ManufacturingProcess32, and the next level splits on two BiologicalMaterial predictors. Seven of the ten terminal nodes are reached through splits on ManufacturingProcess variables and the remaining three through BiologicalMaterial variables, which again suggests that the manufacturing process predictors are the stronger drivers of Yield.
library(rpart)
library(rpart.plot)
#train single tree model
rpart_tree <- rpart(Yield ~., data = train_chem)
#produce tree plot
rpart.plot(rpart_tree)