Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
library(AppliedPredictiveModeling)
library(dplyr)
library(forecast)
library(ggplot2)
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(earth)
library(RANN)
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Variance Importance:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
order(-rfImp1)
## [1] 1 4 2 5 3 6 10 7 9 8
rfImp1
## Overall
## V1 8.84289608
## V2 6.74508245
## V3 0.67830653
## V4 7.75934674
## V5 2.23628276
## V6 0.11429887
## V7 0.03724747
## V8 -0.05349642
## V9 -0.04495617
## V10 0.03863205
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
Based on the rfImp1, uninformative predictors V6 – V10
were not significantly used in the random forest model. Predictors (in
order) 1, 4, 2, 5, 3 were significantly used in the random forest
model.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
#copy of original data
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Variance Importance:
model2 <- randomForest(y ~ ., data = simulated2, importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
order(-rfImp2)
## [1] 4 2 1 11 5 3 6 7 10 9 8
rfImp2
## Overall
## V1 6.008319352
## V2 6.308908170
## V3 0.571604465
## V4 7.187015958
## V5 2.131040245
## V6 0.211304611
## V7 0.025100355
## V8 -0.116980037
## V9 -0.003679481
## V10 0.024878337
## duplicate1 3.618101735
Adding a highly correlated shifted the importance score for all predictors. Predictor V1 was demoted from being the most important to third most important along side it’s highly correlated duplicate at fourth place.
(c) Use the cforest function in the
party package to fit a random forest model using
conditional inference trees. The party package function
varimp can calculate predictor importance. The
conditional argument of that function toggles between the
traditional importance measure and the modified version described in
Strobl et al. (2007). Do these importances show the same pattern as the
traditional random forest model?
Variance Importance Orders:
library(party)
model3 <- cforest(y ~., data = simulated)
order(-varimp(model3, conditional = FALSE))
## [1] 1 4 2 5 7 8 3 9 6 10
order(-varimp(model3, conditional = TRUE))
## [1] 4 1 2 5 6 3 7 9 8 10
No, the varimp importance for the cforest
model of the simulated data are not in the same order as the importance
of the randomForest model.
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Variable Importance:
library(Cubist)
simulated_x <- subset(simulated, select = -c(y))
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
order(-varImp(cubist_model))
## [1] 1 3 4 2 5 6 7 8 9 10
Variable Importance:
library(gbm)
set.seed(624)
gbm_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_model)
## var rel.inf
## V4 V4 30.5402762
## V1 V1 26.9937619
## V2 V2 22.6797905
## V5 V5 11.6078356
## V3 V3 7.4358856
## V9 V9 0.4541585
## V6 V6 0.2882918
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
It looks like the Cubist model has V1 as the most important variable, and the Bagged Trees has V4 has the most important predictor. The uninformative predictors V6 – V10 were not significantly used in both models.
Use a simulation to show tree bias with different granularities.
Variance of each predictor:
set.seed(624)
#samples for predictors
low <- sample(0:50, 500, replace = T)
medium <- sample(0:500, 500, replace = T)
high <- sample(0:5000, 500, replace = T)
#response
y <- low + medium + high + rnorm(250)
#check variance of predictors
var(low)
## [1] 221.9536
var(medium)
## [1] 21752.82
var(high)
## [1] 2016913
Variable importance:
sim_data <- data.frame(low, medium, high, y)
diff_gran_model <- randomForest(y ~., data = sim_data, importance = TRUE, ntree = 1000)
varImp(diff_gran_model, scale=FALSE)
## Overall
## low -19996.77
## medium 15806.52
## high 3605673.96
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
#load data
data(ChemicalManufacturingProcess)
set.seed(624)
imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)
index_chem <- createDataPartition(full_data$Yield , p=.8, list=F)
train_chem <- full_data[index_chem,]
test_chem <- full_data[-index_chem,]
train_predictors <- train_chem[-c(1)]
test_predictors <- test_chem[-c(1)]
(a) Which tree-based regression model gives the optimal re sampling and test set performance?
set.seed(624)
#fit the model
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, ntrees = 1000)
rf_model
##
## Call:
## randomForest(x = train_predictors, y = train_chem$Yield, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 0.3711346
## % Var explained: 59.85
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.6715320 0.7323224 0.4828902
set.seed(624)
gbm_model <- gbm.fit(train_predictors, train_chem$Yield, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9239 nan 0.0010 0.0004
## 2 0.9231 nan 0.0010 0.0006
## 3 0.9224 nan 0.0010 0.0007
## 4 0.9217 nan 0.0010 0.0006
## 5 0.9211 nan 0.0010 0.0006
## 6 0.9203 nan 0.0010 0.0006
## 7 0.9196 nan 0.0010 0.0007
## 8 0.9190 nan 0.0010 0.0006
## 9 0.9183 nan 0.0010 0.0007
## 10 0.9175 nan 0.0010 0.0005
## 20 0.9111 nan 0.0010 0.0007
## 40 0.8980 nan 0.0010 0.0006
## 60 0.8865 nan 0.0010 0.0004
## 80 0.8743 nan 0.0010 0.0005
## 100 0.8628 nan 0.0010 0.0002
gbm_model
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 46 predictors of which 7 had non-zero influence.
gbmPred <- predict(gbm_model, newdata = test_predictors)
postResample(pred = gbmPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 1.1097582 0.5627084 0.8634997
set.seed(624)
cube_model <- cubist(train_predictors, train_chem$Yield)
cube_model
##
## Call:
## cubist.default(x = train_predictors, y = train_chem$Yield)
##
## Number of samples: 144
## Number of predictors: 46
##
## Number of committees: 1
## Number of rules: 2
cubePred <- predict(cube_model, newdata = test_predictors)
postResample(pred = cubePred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.6724398 0.6903309 0.5159054
The Random Forest model provides optimal metrics, with the lowest RMSE of 0.67, and an \(R^2\) value of 0.73.
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Top 10:
ManufacturingProcess32, BiologicalMaterial06, BiologicalMaterial03,
ManufacturingProcess13, ManufacturingProcess36, BiologicalMaterial11,
ManufacturingProcess09, BiologicalMaterial08, ManufacturingProcess28,
ManufacturingProcess11
head(varImp(rf_model),10)
## Overall
## BiologicalMaterial01 2.166860
## BiologicalMaterial03 11.767253
## BiologicalMaterial05 3.656294
## BiologicalMaterial06 12.943335
## BiologicalMaterial08 8.546350
## BiologicalMaterial09 7.019902
## BiologicalMaterial10 2.524260
## BiologicalMaterial11 9.396847
## ManufacturingProcess01 6.146426
## ManufacturingProcess02 4.739992
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart)
library(rpart.plot)
rpart_tree <- rpart(Yield ~., data = train_chem)
rpart.plot(rpart_tree)
Yes, this view of the data provides additional knowledge.
Using the plot we can see the root node is the ManufacturingProcess32 predictor and is then split between two Biological Material predictors. Seven of the ten terminal nodes are predicted by Manufacturing Process variables and remaining three by Biological Material.