Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
#load libraries
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(mlbench)
library(caret)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(dplyr)
library(forecast)
library(ggplot2)
library(VIM)
library(rpart)
library(partykit)
#8.1 - Recreate the simulated data from Exercise 7.2:
set.seed(333)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# a. Fit a random forest model to all of the predictors, then estimate the variable importance scores:
rf_1 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rf_1_imp <- varImp(rf_1, scale = FALSE)
print(rf_1_imp)
## Overall
## V1 7.953508251
## V2 7.686331812
## V3 1.367135564
## V4 13.814340695
## V5 2.930750358
## V6 0.119768734
## V7 0.002923786
## V8 -0.072054632
## V9 -0.056535920
## V10 0.097763537
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
The random forest model did not rely on the uninformative predictors to any meaningful degree. V4, V1, and V2 receive the largest importance scores, while V6 through V10 all sit near zero (a few are slightly negative).
# b. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9406464
# regenerate the noisy duplicate in a scratch copy to confirm the correlation is consistently high
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9502844
# Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
rf_2 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rf_2_imp <- varImp(rf_2, scale = FALSE)
print(rf_2_imp)
## Overall
## V1 5.31690476
## V2 7.15194214
## V3 1.39504150
## V4 13.04942000
## V5 2.49398857
## V6 0.22126549
## V7 0.13165793
## V8 0.04335948
## V9 -0.14917840
## V10 0.10852052
## duplicate1 4.17741292
V1's importance score dropped from roughly 7.95 to 5.32, while the correlated duplicate1 picked up an importance of about 4.18: the two copies now split credit for the same underlying signal. Adding further predictors correlated with V1 would be expected to dilute its importance even more.
# refit on the same data to check how stable the drop in V1's importance is
rf_3 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rf_3_imp <- varImp(rf_3, scale = FALSE)
print(rf_3_imp)
## Overall
## V1 5.46624990
## V2 7.08452120
## V3 1.16552410
## V4 13.14092205
## V5 2.37814431
## V6 0.15409637
## V7 0.09585130
## V8 -0.01920279
## V9 -0.07383881
## V10 0.09274903
## duplicate1 4.10428714
I expected a larger reduction, but the refit shows the same moderate drop: V1's importance stays around 5.5 and duplicate1 again absorbs part of it.
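To address the second part of the question directly, a second predictor highly correlated with V1 could be added and the forest refit. The sketch below is not run here; sim_dup2 and rf_dup2 are names introduced for illustration, and working on a copy keeps the later sections (which reuse simulated) unaffected.
# sketch: add a second noisy copy of V1 to a copy of the data and refit
sim_dup2 <- simulated
sim_dup2$duplicate2 <- sim_dup2$V1 + rnorm(200) * .1
rf_dup2 <- randomForest(y ~ ., data = sim_dup2, importance = TRUE, ntree = 1000)
varImp(rf_dup2, scale = FALSE)  # V1, duplicate1, and duplicate2 would be expected to share the importance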
# c Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
cf_1 <- cforest(y ~., data = simulated)
varimp(cf_1)
## V1 V2 V3 V4 V5 V6
## 6.43100945 6.75401287 0.77174897 13.16959867 2.40175412 0.10780937
## V7 V8 V9 V10 duplicate1
## 0.02996153 -0.13174632 -0.12692662 0.01024388 4.48054450
The unconditional cforest importances show essentially the same pattern as the traditional random forest: V4, V2, and V1 dominate, V6-V10 sit near zero, and duplicate1 again draws importance away from V1.
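The question also mentions the conditional importance measure of Strobl et al. (2007), which adjusts for correlated predictors such as duplicate1 and would be expected to shrink the duplicate's share. A sketch, not run here because the conditional permutation scheme is computationally expensive (cf_imp_cond is a name introduced for illustration):
cf_imp_cond <- varimp(cf_1, conditional = TRUE)  # conditional permutation importance
sort(cf_imp_cond, decreasing = TRUE)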
#d. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
#Boosted
set.seed(432)
gbm_1<- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_1)
## var rel.inf
## V4 V4 37.5637698
## V2 V2 18.2557319
## V1 V1 17.2356104
## V5 V5 10.5171041
## duplicate1 duplicate1 8.2298976
## V3 V3 7.0216397
## V9 V9 0.4337265
## V8 V8 0.3901455
## V7 V7 0.3523744
## V6 V6 0.0000000
## V10 V10 0.0000000
#Cubist
cub_simulated_x <- subset(simulated, select = -c(y))
cubist_1 <- cubist(x = cub_simulated_x, y = simulated$y, committees = 100)
varImp(cubist_1)
## Overall
## V3 48.5
## V2 60.0
## V1 61.0
## V4 56.5
## duplicate1 17.0
## V5 30.0
## V9 8.5
## V6 5.0
## V8 4.5
## V7 4.0
## V10 1.5
For boosted trees the pattern persists: V4 is again the most influential predictor and V6-V10 contribute little or nothing. Cubist scores importance differently (its values reflect how often each variable is used in rule conditions and models), so duplicate1 ranks much lower, but the informative predictors V1-V5 once more dominate the uninformative ones.
#8.2 - Use a simulation to show tree bias with different granularities.
set.seed(432)
# predictors sampled over increasingly fine ranges (more distinct values and larger variance)
low <- sample(0:50, 500, replace = T)
med <- sample(0:500, 500, replace = T)
high <- sample(0:5000, 500, replace = T)
# response depends on all three predictors plus noise
y <- low + med + high + rnorm(500)
# var(low); var(med); var(high)  # the variances differ by orders of magnitude
sim_var1 <- data.frame(low, med, high, y)
gran_1 <- randomForest(y ~., data = sim_var1, importance = TRUE, ntree = 1000)
varImp(gran_1, scale=FALSE)
## Overall
## low -7556.368
## med 28541.539
## high 3843340.856
The forest assigns nearly all of the importance to high, the predictor with by far the most distinct values (and the largest variance), illustrating the selection bias of tree-based models toward predictors with finer granularity. Note that in this simulation high also contributes the most to y, so granularity and signal strength are partly confounded; the sketch below isolates the granularity effect.
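A cleaner way to isolate the granularity effect is to make the response pure noise and simply count which predictor a single-split tree chooses at the root; the fine-grained predictor should be picked far more than half the time. A sketch, not run here (root_vars, coarse, fine, and y2 are names introduced for illustration):
# repeat a small simulation: y2 is pure noise, yet the root split is usually
# made on the fine-grained predictor rather than the coarse one
set.seed(432)
root_vars <- replicate(200, {
  coarse <- sample(0:1, 100, replace = TRUE)    # only 2 distinct values
  fine <- rnorm(100)                            # ~100 distinct values
  y2 <- rnorm(100)                              # pure noise, unrelated to either predictor
  fit <- rpart(y2 ~ coarse + fine, data = data.frame(coarse, fine, y2),
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
table(root_vars)                                # "fine" should dominate the counts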
#8.3 - In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(Figure 8.24 from the text is omitted here for knitting purposes.)
In the right-hand plot (shrinkage 0.9, bagging fraction 0.9), each tree is fit to 90% of the training data and each tree's contribution is weighted heavily, so the first few trees, which split on the strongest predictors, dominate the ensemble and importance is concentrated on a handful of variables. In the left-hand plot (both parameters set to 0.1), each tree sees a different 10% sample and takes only a small step toward the gradient; many more trees are needed, more predictors get a chance to enter the model, and importance is spread across a much larger set of variables. The 0.9/0.9 model learns quickly but is more vulnerable to overfitting.
The left model, with its slower learning rate and greater sampling randomness, would likely be more predictive of other samples: spreading the contribution across many trees and predictors acts as regularization, although the final choice should be confirmed by resampling.
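Before turning to interaction depth, the two extreme settings could be refit on the solubility data to verify how the importance profile changes. A sketch, not run here; it assumes the solubility objects from the AppliedPredictiveModeling package (loaded in the next section), sol_train, gbm_low, and gbm_high are names introduced for illustration, and the tree count is illustrative rather than tuned.
library(AppliedPredictiveModeling)
data(solubility)                                   # provides solTrainXtrans and solTrainY
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_low <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_low, plotit = FALSE), 10)         # importance spread over many predictors
head(summary(gbm_high, plotit = FALSE), 10)        # importance concentrated on a few predictors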
Increasing the interaction depth lets each tree make more splits, so more predictors are used in every tree. This spreads importance across a larger number of variables and flattens the slope of the importance profile in both plots.
# 8.7 - Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# load
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
set.seed(432)
data(ChemicalManufacturingProcess)
imp_chem_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
imp_chem <- predict(imp_chem_data, ChemicalManufacturingProcess)
imp_chem_split <- createDataPartition(imp_chem$Yield, p=.75, list=F)
imp_chem_train <- imp_chem[imp_chem_split,]
imp_chem_test <- imp_chem[-imp_chem_split,]
# drop the Yield column (column 1) so these objects contain predictors only
chem_train <- imp_chem_train[-c(1)]
chem_test <- imp_chem_test[-c(1)]
# a - Which tree-based regression model gives the optimal resampling and test set performance?
# Random Forest
set.seed(432)
rf_chem1 <- randomForest(chem_train, imp_chem_train$Yield, importance = TRUE, ntree = 1000)
chem_rf_pred <- predict(rf_chem1, newdata = chem_test)
postResample(pred = chem_rf_pred, obs = imp_chem_test$Yield)
## RMSE Rsquared MAE
## 0.6624754 0.6488302 0.5226948
# Boosted trees
set.seed(432)
bt_1 <- gbm.fit(chem_train, imp_chem_train$Yield, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9733 nan 0.0010 0.0007
## 2 0.9722 nan 0.0010 0.0009
## 3 0.9714 nan 0.0010 0.0006
## 4 0.9706 nan 0.0010 0.0009
## 5 0.9698 nan 0.0010 0.0009
## 6 0.9690 nan 0.0010 0.0008
## 7 0.9682 nan 0.0010 0.0007
## 8 0.9674 nan 0.0010 0.0008
## 9 0.9664 nan 0.0010 0.0009
## 10 0.9656 nan 0.0010 0.0009
## 20 0.9572 nan 0.0010 0.0007
## 40 0.9410 nan 0.0010 0.0008
## 60 0.9259 nan 0.0010 0.0008
## 80 0.9112 nan 0.0010 0.0008
## 100 0.8964 nan 0.0010 0.0003
bt_1_pred <- predict(bt_1, newdata = chem_test)
## Using 100 trees...
postResample(pred = bt_1_pred, obs = imp_chem_test$Yield)
## RMSE Rsquared MAE
## 0.9980328 0.3005775 0.8049624
#Cubist
set.seed(432)
cube_1 <- cubist(chem_train, imp_chem_train$Yield)
cube_1_pred <- predict(cube_1, newdata = chem_test)
postResample(pred = cube_1_pred, obs = imp_chem_test$Yield)
## RMSE Rsquared MAE
## 0.6740630 0.5957371 0.5457711
Random forest gives the best test set performance, with the lowest RMSE (0.662) and the highest R-squared (0.649), narrowly ahead of Cubist. The boosted model was fit with gbm.fit defaults (100 trees and, as the trace above shows, a step size of 0.001), which almost certainly understates what a tuned boosted model could do; resampling-based tuning, sketched below, would make the comparison fairer.
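The exercise also asks about resampling performance, which the fits above do not measure. A sketch of how the three models could be compared under 10-fold cross-validation with caret::train follows; it is not run here, tuning is left to caret's default grids via tuneLength, and ctrl, rf_cv, gbm_cv, and cubist_cv are names introduced for illustration.
ctrl <- trainControl(method = "cv", number = 10)
rf_cv <- train(x = chem_train, y = imp_chem_train$Yield,
               method = "rf", tuneLength = 5, trControl = ctrl)
gbm_cv <- train(x = chem_train, y = imp_chem_train$Yield,
                method = "gbm", tuneLength = 5, trControl = ctrl, verbose = FALSE)
cubist_cv <- train(x = chem_train, y = imp_chem_train$Yield,
                   method = "cubist", tuneLength = 5, trControl = ctrl)
summary(resamples(list(RF = rf_cv, GBM = gbm_cv, Cubist = cubist_cv)))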
# b. - Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImpPlot(rf_chem1, cex = 0.7)
ManufacturingProcess32 is again the most important variable, and manufacturing process variables make up roughly 70% of the top ten, so the process variables dominate the list. The top of the ranking is broadly consistent with the optimal linear and nonlinear models from Exercises 6.3 and 7.5, although the random forest spreads a small amount of importance across every predictor, so the lowest-ranked variables contribute very little.
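To read the top ten off the plot more precisely and count the two variable types, the importance matrix could be tabulated directly; a sketch, not run here (rf_imp and top10 are names introduced for illustration):
rf_imp <- importance(rf_chem1)                                  # matrix with %IncMSE and IncNodePurity
top10 <- head(sort(rf_imp[, "%IncMSE"], decreasing = TRUE), 10)
top10
table(ifelse(grepl("^Manufacturing", names(top10)), "Process", "Biological"))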
# c- Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield? Kuhn, Max; Johnson, Kjell. Applied Predictive Modeling (p. 220). Springer New York. Kindle Edition.
# single tree plot
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
chem_rpart_tree <- rpart(Yield ~., data = imp_chem_train)
rpart.plot(chem_rpart_tree)
Beyond confirming which variables matter most, the single tree shows the specific cutpoints at which those predictors change the predicted yield and the yield level associated with each terminal node, which sharpens our understanding of how the biological and process inputs relate to the output. A terminal-node view showing the full distribution of Yield, as the exercise requests, is sketched below.
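rpart.plot prints only the mean yield in each leaf; converting the tree to a partykit object would display boxplots of Yield in the terminal nodes instead. A sketch, not run here (chem_party_tree is a name introduced for illustration):
chem_party_tree <- as.party(chem_rpart_tree)              # partykit representation of the rpart tree
plot(chem_party_tree, gp = grid::gpar(fontsize = 7))      # terminal nodes show boxplots of Yield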