In this homework assignment I will be submitting exercises 8.1, 8.2, 8.3, and 8.7 from Kuhn and Johnson's Applied Predictive Modeling.
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Did the random forest model significantly use the uninformative predictors (V6–V10)?
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ ggplot2::margin() masks randomForest::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rfImp1 %>% arrange(desc(Overall))
## Overall
## V1 8.732235404
## V4 7.615118809
## V2 6.415369387
## V5 2.023524577
## V3 0.763591825
## V6 0.165111172
## V7 -0.005961659
## V10 -0.074944788
## V9 -0.095292651
## V8 -0.166362581
Our recreated simulated data contain 10 predictors. After fitting a random forest model, the varImp function shows that the uninformative predictors (V6–V10) occupy the bottom five positions, with importance scores near zero. This indicates that the random forest did not make significant use of the uninformative predictors.
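As a quick sanity check, here is a minimal sketch (reusing the rfImp1 object above) that contrasts the average importance of the two groups defined in the exercise.
# Sketch: average importance of the informative (V1-V5) versus the
# uninformative (V6-V10) predictors from rfImp1 above
informative   <- rfImp1[paste0("V", 1:5),  "Overall"]
uninformative <- rfImp1[paste0("V", 6:10), "Overall"]
c(mean_informative = mean(informative), mean_uninformative = mean(uninformative))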
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 %>% arrange(desc(Overall))
## Overall
## V4 7.04752238
## V2 6.06896061
## V1 5.69119973
## duplicate1 4.28331581
## V5 1.87238438
## V3 0.62970218
## V6 0.13569065
## V10 0.02894814
## V9 0.00840438
## V7 -0.01345645
## V8 -0.04370565
Adding a variable that is highly correlated with V1 (r = 0.946) reduced the importance of V1 from 8.73 to 5.69, with the duplicate absorbing much of that importance. It is also interesting to note that the other informative predictors' importance values decreased somewhat while keeping their order, and the uninformative predictors stayed near zero.
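A minimal sketch of this point, reusing model2 from above: the importance V1 carried on its own appears to be shared between V1 and its near-duplicate rather than lost outright.
# Sketch: combined importance of V1 and duplicate1 in model2, for comparison
# with the ~8.73 that V1 had by itself in model1
imp2 <- varImp(model2, scale = FALSE)
sum(imp2[c("V1", "duplicate1"), "Overall"])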
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 %>% arrange(desc(Overall))
## Overall
## V4 7.04870917
## V2 6.52816504
## V1 4.91687329
## duplicate1 3.80068234
## V5 2.03115561
## duplicate2 1.87721959
## V3 0.58711552
## V6 0.14213148
## V7 0.10991985
## V10 0.09230576
## V9 -0.01075028
## V8 -0.08405687
Adding a second variable that is highly correlated with V1 decreased the importance of V1 further, from 5.69 to 4.92. The drop is not as large as the one caused by the first highly correlated variable, but it is still a decrease, and the importance of duplicate1 also fell now that the signal is spread across three correlated columns. The remaining predictors are ranked much as before, with a few small increases in importance and slightly more decreases.
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
model4 <- cforest(y ~ ., data = simulated[, -c(12,13)])
cfImp1 <- varimp(model4, conditional = TRUE)
cfImp1
## V1 V2 V3 V4 V5
## 5.699856e+00 5.190114e+00 6.219633e-03 6.741766e+00 1.182547e+00
## V6 V7 V8 V9 V10
## -1.104739e-02 1.491985e-05 -8.383608e-03 -7.355064e-03 -2.342891e-02
Comparing the traditional random forest with the conditional inference forest, the importance scores follow broadly the same pattern: the informative predictors (V1–V5) sit at the top and the uninformative predictors (V6–V10) are essentially zero. There are some notable shifts, however. V4 overtakes V1 as the top predictor under the conditional measure, V3's importance drops to nearly zero, and the conditional scores are smaller in magnitude overall than the traditional permutation importances.
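For a side-by-side look at what the conditional argument changes, here is a minimal sketch using the same cforest fit (the scores above were computed with conditional = TRUE).
# Sketch: traditional (unconditional) permutation importance from the same
# cforest fit, lined up against the conditional version computed above
cfImp1_uncond <- varimp(model4, conditional = FALSE)
round(cbind(unconditional = cfImp1_uncond, conditional = cfImp1), 3)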
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
#install.packages('gbm')
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
gbm_tuning_grid <- expand.grid(
interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10
)
set.seed(123)
gbm_model <- train(
y ~ .,
data = simulated[, -c(12,13)],
method = "gbm",
tuneGrid = gbm_tuning_grid,
verbose = FALSE
)
rfImp5 <- varImp(gbm_model$finalModel, scale = FALSE)
rfImp5 %>% arrange(desc(Overall))
## Overall
## V4 45297.7048
## V1 39668.4129
## V2 34199.2495
## V5 18115.4820
## V3 11953.5268
## V7 1607.1278
## V6 1461.3972
## V9 1007.2205
## V8 898.7862
## V10 670.2965
set.seed(123)
cubist <- train(y ~ .,
data = simulated[, -c(12,13)],
method = "cubist")
rfImp6 <- varImp(cubist$finalModel, scale = FALSE)
rfImp6 %>% arrange(desc(Overall))
## Overall
## V1 72.0
## V2 54.5
## V4 49.0
## V3 42.0
## V5 40.0
## V6 11.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
The pattern from the boosted tree and Cubist models is generally the same as the traditional random forest: the informative predictors (V1–V5) dominate and the uninformative predictors contribute little. One difference is that V10 ranks lowest in the boosted tree model and ties at zero with V7–V9 in the Cubist model, similar to the conditional inference results, whereas the traditional random forest ranked V8 lowest.
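To make the comparison more concrete, here is a minimal sketch that ranks the ten original predictors under each importance object computed above (1 = most important).
# Sketch: rank the original predictors under each model's importance measure
vars <- paste0("V", 1:10)
data.frame(
  rf      = rank(-rfImp1[vars, "Overall"]),
  cforest = rank(-cfImp1[vars]),
  gbm     = rank(-rfImp5[vars, "Overall"]),
  cubist  = rank(-rfImp6[vars, "Overall"]),
  row.names = vars
)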
Use a simulation to show tree bias with different granularities.
#install.packages('rpart')
#install.packages('partykit')
library(rpart)
library(mlbench)
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
set.seed(123)
simulated2 <- mlbench.friedman1(200, sd = 1)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"
rpart_tree <- rpart(y ~ ., data = simulated2)
plot(as.party(rpart_tree), gp = gpar(fontsize = 7))
rpart_importance <- rpart_tree$variable.importance
rpart_importance
## V4 V2 V1 V8 V6 V5 V9 V3
## 1821.5226 1124.7965 405.5857 373.6558 316.4571 273.4891 195.1682 190.4739
## V7 V10
## 173.4785 166.9086
library(dplyr)
simulated2 %>%
summarise(across(everything(), n_distinct))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 y
## 1 200 200 200 200 200 200 200 200 200 200 200
simulated2 %>%
  pivot_longer(cols = V1:V10, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = predictor, y = value)) +
  geom_boxplot() +
  ggtitle("Boxplot Comparison of Predictor Distributions") +
  theme_minimal()
In this simulation V4 has the greatest importance, but every predictor has the same granularity (200 unique values each) and the boxplots show essentially the same uniform distribution for all of them. The differences in importance therefore reflect how strongly each predictor drives the Friedman response rather than any granularity effect. To see the bias itself, the predictors need to differ in the number of distinct values they take, as in the sketch below.
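This is my own assumed setup, not taken from the text: two pure-noise predictors that differ only in granularity, and a count of which one a single-split CART tree chooses across repeated simulations. An unbiased splitter would pick each about half the time.
# Sketch: neither predictor is related to y, but x_fine is continuous while
# x_coarse is binary; CART tends to split on the higher-granularity predictor
set.seed(624)
first_split <- replicate(200, {
  d <- data.frame(x_fine   = rnorm(100),                            # ~100 distinct values
                  x_coarse = sample(c(0, 1), 100, replace = TRUE),  # 2 distinct values
                  y        = rnorm(100))                            # unrelated to both
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])
})
table(first_split)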
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The bagging fraction controls how much of the training data each tree sees, and the learning rate controls how much of each new tree's prediction is added to the running model. The model on the left (bagging fraction and learning rate both 0.1) spreads importance across more predictors: each tree is built on a different small subsample and its contribution is shrunk heavily, so many small corrections accumulate and a wider set of predictors ends up being used. It also needs more trees and more computation to reach a good fit. The model on the right (both set to 0.9) sees nearly all of the data each iteration and takes large steps, so it commits quickly to the strongest predictors, concentrates its importance on the first few, and is more prone to overfitting.
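Figure 8.24 is built on the solubility data, which is not loaded here, so the following is only a rough sketch of the same effect on the simulated data from above; the parameter values are assumptions chosen to mimic the two extremes (n.minobsinnode is lowered so the 0.1 bagging fraction still leaves enough observations per node).
# Sketch: gbm fits at the two extremes of bagging fraction and learning rate,
# compared by relative influence
set.seed(123)
gbm_low  <- gbm(y ~ ., data = simulated[, -c(12, 13)], distribution = "gaussian",
                n.trees = 500, interaction.depth = 3,
                shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated[, -c(12, 13)], distribution = "gaussian",
                n.trees = 500, interaction.depth = 3,
                shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
summary(gbm_low,  plotit = FALSE)   # influence spread across more predictors
summary(gbm_high, plotit = FALSE)   # influence concentrated in the top few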
Which model do you think would be more predictive of other samples?
The model with the smaller bagging fraction and learning rate would likely be more predictive of other samples: its many small, partially random updates act as a form of regularization, whereas the 0.9/0.9 model is more likely to overfit the training data.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would flatten the slope of predictor importance in both models. Deeper trees use more predictors in each tree, so less important predictors get chosen for splits more often and accumulate more importance than they would at a lower interaction depth.
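A similar minimal sketch for interaction depth, again on the simulated data with assumed settings: comparing the relative influence at depth 1 versus depth 7 should show a flatter drop-off for the deeper trees.
# Sketch: identical settings apart from interaction.depth
set.seed(123)
gbm_shallow <- gbm(y ~ ., data = simulated[, -c(12, 13)], distribution = "gaussian",
                   n.trees = 500, shrinkage = 0.1, interaction.depth = 1)
gbm_deep    <- gbm(y ~ ., data = simulated[, -c(12, 13)], distribution = "gaussian",
                   n.trees = 500, shrinkage = 0.1, interaction.depth = 7)
summary(gbm_shallow, plotit = FALSE)
summary(gbm_deep,    plotit = FALSE)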
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# Processing the chemical maufacturing as done before
library(AppliedPredictiveModeling)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
data("ChemicalManufacturingProcess")
# Imputing with MICE
chemical_imputed <- mice(data = ChemicalManufacturingProcess,m=5,maxit=40,meth='pmm',seed=500,printFlag = FALSE)
## Warning: Number of logged events: 5400
data_imputed <- complete(chemical_imputed,1)
# Removing near zero variance predictors
library(caret)
nzv <- nearZeroVar(data_imputed)
remaining_predictors <- data_imputed[,-nzv]
# Removing highly correlated predictors
high_correlation <- findCorrelation(cor(remaining_predictors[, -1]), cutoff = .9, exact = TRUE)
length(high_correlation)
## [1] 10
# Shift the indices by one because the correlation matrix excluded the Yield column
filtered_remaining <- remaining_predictors[, -(high_correlation + 1)]
# Creating split and test/train
split <- createDataPartition(filtered_remaining$Yield, p = 0.75, list = FALSE)
train.x <- filtered_remaining[split, -1]
train.y <- filtered_remaining[split, 1]
test.x <- filtered_remaining[-split, -1]
test.y <- filtered_remaining[-split, 1]
Which tree-based regression model gives the optimal resampling and test set performance?
CART Model
set.seed(123)
cartTune <- train(train.x,train.y,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartPred <- predict(cartTune, test.x)
postResample(cartPred, test.y)
## RMSE Rsquared MAE
## 1.5453885 0.4297774 1.1307447
Random Forest Model
set.seed(123)
rfTune <- randomForest(train.x, train.y,
importance = TRUE,
ntree = 1000)
rfpred <- predict(rfTune, test.x)
postResample(rfpred, test.y)
## RMSE Rsquared MAE
## 1.4116847 0.5431736 1.1001017
Boosted Tree Model
set.seed(123)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
gbmTune <- train(train.x, train.y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test.x)
postResample(gbmPred, test.y)
## RMSE Rsquared MAE
## 1.4452030 0.4949607 1.1391348
Cubist Model
set.seed(123)
cubistTuned <- train(train.x, train.y,
method = "cubist")
cubistPred <- predict(cubistTuned, test.x)
postResample(cubistPred, test.y)
## RMSE Rsquared MAE
## 1.274279 0.603611 1.049768
Our Cubist model gives the best test set performance of the four: it has the lowest RMSE (1.27) and the largest R-squared (0.60), ahead of the random forest (RMSE 1.41, R-squared 0.54), the boosted tree, and the CART model.
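Since the question also asks about resampling performance, here is a minimal sketch that pulls the best resampled RMSE from each caret-tuned model. The random forest was fit with randomForest() directly, so it has no caret resampling results, and the resampling schemes differ across models (10-fold CV for CART, bootstrap for the others), so this is only a rough comparison.
# Sketch: best resampled RMSE for the caret-tuned models
data.frame(model = c("CART", "GBM", "Cubist"),
           resampled_RMSE = c(min(cartTune$results$RMSE, na.rm = TRUE),
                              min(gbmTune$results$RMSE, na.rm = TRUE),
                              min(cubistTuned$results$RMSE, na.rm = TRUE)))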
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(cubistTuned, scale = TRUE)
## cubist variable importance
##
## only 20 most important variables shown (out of 46)
##
## Overall
## ManufacturingProcess32 100
## ManufacturingProcess09 100
## ManufacturingProcess33 50
## ManufacturingProcess35 50
## ManufacturingProcess01 50
## BiologicalMaterial02 45
## ManufacturingProcess29 0
## ManufacturingProcess34 0
## ManufacturingProcess10 0
## ManufacturingProcess42 0
## ManufacturingProcess40 0
## ManufacturingProcess12 0
## ManufacturingProcess27 0
## BiologicalMaterial09 0
## ManufacturingProcess38 0
## ManufacturingProcess19 0
## ManufacturingProcess15 0
## ManufacturingProcess03 0
## ManufacturingProcess21 0
## ManufacturingProcess43 0
The most important predictors in the optimal tree-based (Cubist) model are ManufacturingProcess32 and ManufacturingProcess09, which share the top score, followed by ManufacturingProcess33, ManufacturingProcess35, and ManufacturingProcess01. Manufacturing process variables dominate the list: 18 of the 20 predictors shown are process variables, with only BiologicalMaterial02 and BiologicalMaterial09 appearing.
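A minimal sketch to back up that tally, counting process versus biological variables among the Cubist model's ten highest-ranked predictors:
# Sketch: process vs. biological variables in the Cubist top 10
cubist_imp <- varImp(cubistTuned)$importance
top10 <- rownames(cubist_imp)[order(cubist_imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))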
# Optimal linear method from homework 7 was found to be the PLS model
# Reuse the training split created above rather than re-partitioning, so that
# train.x / train.y are not overwritten with a mismatched split
p_pro <- c("nzv", "center", "scale")
tr_ctrl <- trainControl(method = "boot", number = 5)
pls_fit <- train(train.x, train.y, method = "pls", tuneLength = 20,
                 preProcess = p_pro, trControl = tr_ctrl)
varImp(pls_fit)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## pls variable importance
##
## only 20 most important variables shown (out of 46)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 89.73
## ManufacturingProcess13 76.69
## ManufacturingProcess09 69.11
## ManufacturingProcess33 66.88
## BiologicalMaterial02 62.18
## BiologicalMaterial06 60.56
## BiologicalMaterial08 54.72
## ManufacturingProcess06 53.61
## BiologicalMaterial12 51.27
## ManufacturingProcess12 49.35
## BiologicalMaterial04 48.79
## ManufacturingProcess04 45.96
## ManufacturingProcess29 44.33
## ManufacturingProcess11 39.78
## ManufacturingProcess34 36.70
## ManufacturingProcess02 34.08
## ManufacturingProcess20 32.76
## BiologicalMaterial10 31.51
## ManufacturingProcess37 31.28
# Optimal non-linear method from homework 8 was found to be the KNN model
knnTune <- train(train.x, train.y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
varImp(knnTune)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 46)
##
## Overall
## BiologicalMaterial06 100.00
## BiologicalMaterial04 88.76
## BiologicalMaterial02 84.07
## ManufacturingProcess04 79.30
## ManufacturingProcess31 69.52
## ManufacturingProcess36 59.37
## ManufacturingProcess02 57.24
## ManufacturingProcess32 57.07
## BiologicalMaterial05 56.01
## ManufacturingProcess29 55.19
## ManufacturingProcess20 54.60
## ManufacturingProcess15 51.10
## BiologicalMaterial12 50.01
## ManufacturingProcess13 48.39
## ManufacturingProcess27 47.03
## ManufacturingProcess18 44.56
## ManufacturingProcess06 37.98
## ManufacturingProcess01 37.67
## ManufacturingProcess09 36.81
## BiologicalMaterial09 36.59
Both the optimal linear (PLS) and nonlinear (KNN) models place several biological material predictors in their top ten, whereas the tree-based Cubist model's top predictors are almost entirely manufacturing process variables. Of the two, the Cubist importance ranking is more similar to the linear PLS model, which also puts ManufacturingProcess32 first and shares ManufacturingProcess09, ManufacturingProcess33, and BiologicalMaterial02 near the top.
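A minimal sketch of that comparison, taking the ten highest-ranked predictors from each model and looking at the overlap with the Cubist list:
# Sketch: overlap between the Cubist top 10 and the PLS / KNN top 10
top10_of <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}
intersect(top10_of(cubistTuned), top10_of(pls_fit))   # shared with the linear model
intersect(top10_of(cubistTuned), top10_of(knnTune))   # shared with the nonlinear model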
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpartTree <- rpart(Yield ~ ., data = filtered_remaining)
plot(as.party(rpartTree),ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))
Viewing the distribution of yield in the terminal nodes makes the relationships easier to grasp than correlation values or importance scores alone. The tree shows which combinations of process settings and material measurements lead to higher or lower yield, which suggests that yield could still be maximized if certain processes or materials were changed. A subject matter expert looking at this plot could use it to support recommendations for alternative methods or materials in a business case, provided the alternative is viable.
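To attach numbers to the terminal nodes in the plot, here is a minimal sketch that summarizes yield by the node each sample falls into, using the node assignments that rpart stores in rpartTree$where.
# Sketch: sample count and yield distribution within each terminal node
filtered_remaining %>%
  mutate(node = factor(rpartTree$where)) %>%
  group_by(node) %>%
  summarise(n = n(), mean_yield = mean(Yield), sd_yield = sd(Yield)) %>%
  arrange(mean_yield)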