Introduction

In this homework assignment I will be submitting exercises 8.1, 8.2, 8.3, and 8.7 from Kuhn and Johnson's Applied Predictive Modeling.

Question 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

A.

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Did the random forest model significantly use the uninformative predictors (V6–V10)?

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine()  masks randomForest::combine()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ purrr::lift()     masks caret::lift()
## ✖ ggplot2::margin() masks randomForest::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rfImp1 %>% arrange(desc(Overall))
##          Overall
## V1   8.732235404
## V4   7.615118809
## V2   6.415369387
## V5   2.023524577
## V3   0.763591825
## V6   0.165111172
## V7  -0.005961659
## V10 -0.074944788
## V9  -0.095292651
## V8  -0.166362581

There are 10 predictors in our recreated simulated data. When we fit a random forest model, the varImp function returns importance scores showing that the uninformative predictors (V6–V10) rank in the bottom five. This means the uninformative predictors were not significantly used by the random forest.
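
As a quick check (a small sketch, not required by the exercise), the average importance of the informative predictors can be contrasted with that of the uninformative ones, using the rfImp1 object computed above:

# Sketch: mean importance of informative (V1-V5) vs uninformative (V6-V10) predictors
rfImp1 %>%
  rownames_to_column("predictor") %>%
  mutate(group = ifelse(predictor %in% paste0("V", 1:5),
                        "informative (V1-V5)", "uninformative (V6-V10)")) %>%
  group_by(group) %>%
  summarise(mean_importance = mean(Overall))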

B.

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model2, scale = FALSE)

rfImp1 %>% arrange(desc(Overall))
##                Overall
## V4          7.04752238
## V2          6.06896061
## V1          5.69119973
## duplicate1  4.28331581
## V5          1.87238438
## V3          0.62970218
## V6          0.13569065
## V10         0.02894814
## V9          0.00840438
## V7         -0.01345645
## V8         -0.04370565

When we added another variable with a high correlation (0.946) to V1, the importance of V1 dropped from 8.73 to 5.69, with the duplicate picking up much of that importance. It is also interesting to note that the other informative predictors (V2, V4, V5) saw slight decreases while keeping their relative order, and the uninformative predictors remained near zero.
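
A small sketch of the same point: the importance that V1 carried on its own in the first model (about 8.7) is now shared with its near-copy, and summing the two gives a figure comparable to (in this run somewhat larger than) the original V1 score.

# Sketch: combined importance of V1 and its correlated copy in model2
varImp(model2, scale = FALSE) %>%
  rownames_to_column("predictor") %>%
  filter(predictor %in% c("V1", "duplicate1")) %>%
  summarise(combined_importance = sum(Overall))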

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model3, scale = FALSE)

rfImp1 %>% arrange(desc(Overall))
##                Overall
## V4          7.04870917
## V2          6.52816504
## V1          4.91687329
## duplicate1  3.80068234
## V5          2.03115561
## duplicate2  1.87721959
## V3          0.58711552
## V6          0.14213148
## V7          0.10991985
## V10         0.09230576
## V9         -0.01075028
## V8         -0.08405687

When we add yet another variable highly correlated with V1, the importance of V1 decreases further, from 5.69 to 4.92. The drop is not as large as the one caused by the first correlated copy, but it is still a decrease, and duplicate1's importance also falls. The remaining predictors shift only slightly, some up and some down, and the informative predictors keep their relative order.

C.

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
model4 <- cforest(y ~ ., data = simulated[, -c(12,13)])

cfImp1 <- varimp(model4, conditional = TRUE)
cfImp1
##            V1            V2            V3            V4            V5 
##  5.699856e+00  5.190114e+00  6.219633e-03  6.741766e+00  1.182547e+00 
##            V6            V7            V8            V9           V10 
## -1.104739e-02  1.491985e-05 -8.383608e-03 -7.355064e-03 -2.342891e-02

Comparing the traditional random forest and the conditional inference forest, the importance scores show broadly the same pattern: the informative predictors V1, V2, V4, and V5 sit at the top and the uninformative predictors (V6–V10) are essentially zero. The notable differences are that V4 overtakes V1 as the top predictor, V3 falls to nearly zero, and the conditional importance scores are smaller in magnitude overall.
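
To make the conditional toggle concrete, here is a small side-by-side sketch that recomputes the traditional (unconditional) permutation importance from the same cforest fit next to the conditional scores already shown; the object names are the ones defined above.

# Sketch: unconditional vs conditional permutation importance from the same fit
cbind(unconditional = varimp(model4, conditional = FALSE),
      conditional   = cfImp1)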

D.

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Trees

#install.packages('gbm')
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
gbm_tuning_grid <- expand.grid(
  interaction.depth = seq(1, 7, by = 2),
  n.trees = seq(100, 1000, by = 50),
  shrinkage = c(0.01, 0.1),
  n.minobsinnode = 10
)

set.seed(123)

gbm_model <- train(
  y ~ ., 
  data = simulated[, -c(12,13)],
  method = "gbm",
  tuneGrid = gbm_tuning_grid,
  verbose = FALSE
)

rfImp5 <- varImp(gbm_model$finalModel, scale = FALSE)

rfImp5 %>% arrange(desc(Overall))
##        Overall
## V4  45297.7048
## V1  39668.4129
## V2  34199.2495
## V5  18115.4820
## V3  11953.5268
## V7   1607.1278
## V6   1461.3972
## V9   1007.2205
## V8    898.7862
## V10   670.2965

Cubist

set.seed(123)

cubist <- train(y ~ .,
                data = simulated[, -c(12,13)], 
                method = "cubist")
rfImp6 <- varImp(cubist$finalModel, scale = FALSE)
rfImp6 %>% arrange(desc(Overall))
##     Overall
## V1     72.0
## V2     54.5
## V4     49.0
## V3     42.0
## V5     40.0
## V6     11.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0

The pattern from the boosted tree and Cubist models is generally the same as the traditional random forest: the informative predictors V1–V5 dominate and V6–V10 contribute little. In both the boosted tree and Cubist models, V10 ranks at or tied for the bottom, similar to the conditional inference model, whereas in the traditional model V8 ranked lowest.
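
To compare the patterns more directly (a small sketch using the importance objects above; rank_of is a helper made up here), the raw scores can be converted to ranks, since the three models report importance on very different scales:

# Sketch: line the three importance rankings up by predictor name
rank_of <- function(imp) {
  imp %>%
    rownames_to_column("predictor") %>%
    mutate(rank = rank(-Overall)) %>%
    select(predictor, rank)
}

rank_of(varImp(model1, scale = FALSE)) %>% rename(rf_rank = rank) %>%
  left_join(rank_of(rfImp5) %>% rename(gbm_rank = rank), by = "predictor") %>%
  left_join(rank_of(rfImp6) %>% rename(cubist_rank = rank), by = "predictor")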

Question 8.2

Use a simulation to show tree bias with different granularities.

#install.packages('rpart')
#install.packages('partykit')
library(rpart)
library(mlbench)
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
set.seed(123)

simulated2 <- mlbench.friedman1(200, sd = 1)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y" 


rpart_tree <- rpart(y ~ ., data = simulated2)

plot(as.party(rpart_tree), gp = gpar(fontsize = 7))

rpart_importance <- rpart_tree$variable.importance  
rpart_importance
##        V4        V2        V1        V8        V6        V5        V9        V3 
## 1821.5226 1124.7965  405.5857  373.6558  316.4571  273.4891  195.1682  190.4739 
##        V7       V10 
##  173.4785  166.9086
library(dplyr)
simulated2 %>%
  summarise(across(everything(), n_distinct))
##    V1  V2  V3  V4  V5  V6  V7  V8  V9 V10   y
## 1 200 200 200 200 200 200 200 200 200 200 200
ggplot(simulated2) +
  geom_boxplot(aes(y = V1, x = "V1")) +
  geom_boxplot(aes(y = V2, x = "V2")) +
  geom_boxplot(aes(y = V3, x = "V3")) +
  geom_boxplot(aes(y = V4, x = "V4")) +
  geom_boxplot(aes(y = V5, x = "V5")) +
  geom_boxplot(aes(y = V6, x = "V6")) +
  geom_boxplot(aes(y = V7, x = "V7")) +
  geom_boxplot(aes(y = V8, x = "V8")) +
  geom_boxplot(aes(y = V9, x = "V9")) +
  geom_boxplot(aes(y = V10, x = "V10")) +
  ggtitle("Boxplot Comparison of Predictor Distributions ") +
  theme_minimal()

In our simulation V4 has the greatest importance, but since every predictor has the same number of unique values (200) and a similar range, the importance differences here reflect each predictor's true relationship with the response rather than a granularity effect. Tree bias toward higher-granularity predictors appears when predictors differ in how many distinct values they offer as candidate split points, as the supplementary sketch below illustrates.
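
To isolate the granularity effect itself, here is a supplementary sketch (separate from the assignment data, with made-up variable names such as x_fine and x_coarse): the response is pure noise, so neither predictor is informative, yet rpart should choose the finer-grained predictor for the first split far more often than the two-level one.

# Sketch: with a noise-only response, the first split favors the
# high-granularity predictor
set.seed(321)

first_split <- replicate(200, {
  dat <- data.frame(
    x_fine   = rnorm(200),                        # ~200 distinct values
    x_coarse = sample(1:2, 200, replace = TRUE),  # only 2 distinct values
    y        = rnorm(200)                         # noise-only response
  )
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])                  # variable used at the root
})

table(first_split)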

Question 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

A.

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The bagging fraction controls the proportion of the training data sampled to build each tree, and the learning rate controls how much of each new tree's prediction is added to the running model. The model on the left (bagging fraction and learning rate of 0.1) sees a different small subset of the data at each iteration and makes only small corrections, so many more trees contribute and importance is spread across more predictors; it also requires more time and computation. The model on the right (both set to 0.9) fits nearly the full data at each step and adds most of each tree's contribution immediately, so the strongest predictors dominate the early trees and keep accumulating importance, leaving little for the rest; it also tends to overfit.
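
As a hedged illustration (this is not the book's solubility data, so the exact numbers will differ from Fig. 8.24), we can refit gbm on the simulated Friedman data already in memory with the two extreme settings and compare the relative influence; gbm_low, gbm_high, and sim_base are names made up for this sketch.

# Sketch: contrast bagging fraction / shrinkage of 0.1/0.1 vs 0.9/0.9
set.seed(123)
sim_base <- simulated[, -c(12, 13)]   # drop the duplicate predictors

gbm_low  <- gbm(y ~ ., data = sim_base, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 1,
                shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)   # lowered: only 20 rows are bagged per tree

gbm_high <- gbm(y ~ ., data = sim_base, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9,
                n.minobsinnode = 5)

# The 0.9/0.9 fit should concentrate relative influence on the top few
# predictors, while the 0.1/0.1 fit spreads it across more of them.
summary(gbm_low,  plotit = FALSE)
summary(gbm_high, plotit = FALSE)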

B.

Which model do you think would be more predictive of other samples?

The model with the smaller bagging fraction and learning rate (the left-hand model) would likely be more predictive of other samples: its many small, incremental updates over different subsamples make it less prone to overfitting the training data.

C.

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would flatten the slope of predictor importance in both models: deeper trees give less important predictors more opportunities to be used in splits, so they would contribute more than they do at a lower interaction depth.
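
A companion sketch along the same lines (again on the simulated data rather than the solubility data, with made-up object names): holding the shrinkage and bagging fraction fixed and varying only the interaction depth should show the relative influence falling off more gradually at depth 7 than at depth 1.

# Sketch: interaction depth 1 vs 7 at fixed shrinkage and bagging fraction
set.seed(123)
gbm_depth1 <- gbm(y ~ ., data = simulated[, -c(12, 13)], distribution = "gaussian",
                  n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.5,
                  interaction.depth = 1)
gbm_depth7 <- gbm(y ~ ., data = simulated[, -c(12, 13)], distribution = "gaussian",
                  n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.5,
                  interaction.depth = 7)

summary(gbm_depth1, plotit = FALSE)  # steeper drop-off after the top predictors
summary(gbm_depth7, plotit = FALSE)  # influence spread across more predictors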

Question 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

# Processing the chemical manufacturing data as done before

library(AppliedPredictiveModeling)
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
data("ChemicalManufacturingProcess")

# Imputing with MICE
chemical_imputed <- mice(data = ChemicalManufacturingProcess,m=5,maxit=40,meth='pmm',seed=500,printFlag = FALSE)
## Warning: Number of logged events: 5400
data_imputed <- complete(chemical_imputed,1)

# Removing near zero variance predictors

library(caret)
nzv <- nearZeroVar(data_imputed)
remaining_predictors <- data_imputed[,-nzv]

# Removing highly correlated predictors
high_correlation <- findCorrelation(cor(remaining_predictors[,2:57]), cutoff = .9, exact = TRUE)
length(high_correlation)
## [1] 10
filtered_remaining <- remaining_predictors[,-(high_correlation + 1)] # shift indices by 1 since Yield (column 1) was excluded from the correlation matrix
# Creating split and test/train
split <- createDataPartition(filtered_remaining$Yield, p = 0.75, list = FALSE)

train.x <- filtered_remaining[split, -1]
train.y <- filtered_remaining[split, 1]

test.x <- filtered_remaining[-split, -1]
test.y <- filtered_remaining[-split, 1]

A.

Which tree-based regression model gives the optimal resampling and test set performance?

CART Model

set.seed(123)

cartTune <- train(train.x,train.y,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartPred <- predict(cartTune, test.x)

postResample(cartPred, test.y)
##      RMSE  Rsquared       MAE 
## 1.5453885 0.4297774 1.1307447

Random Forest Model

set.seed(123)

rfTune <- randomForest(train.x, train.y, 
                        importance = TRUE,
                        ntree = 1000)

rfpred <- predict(rfTune, test.x)

postResample(rfpred, test.y)
##      RMSE  Rsquared       MAE 
## 1.4116847 0.5431736 1.1001017

Boosted Tree Model

set.seed(123)

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)


gbmTune <- train(train.x, train.y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test.x)

postResample(gbmPred, test.y)
##      RMSE  Rsquared       MAE 
## 1.4452030 0.4949607 1.1391348

Cubist Model

set.seed(123)

cubistTuned <- train(train.x, train.y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test.x)

postResample(cubistPred, test.y)
##     RMSE Rsquared      MAE 
## 1.274279 0.603611 1.049768

Our Cubist model is the best of the four: it has the lowest test-set RMSE (1.27), the lowest MAE (1.05), and the largest R-squared (0.60). The test-set metrics for all four models are collected below.
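
For reference, a compact side-by-side sketch of the test-set metrics computed above:

# Sketch: gather the postResample results for all four models
rbind(CART         = postResample(cartPred, test.y),
      RandomForest = postResample(rfpred, test.y),
      BoostedTree  = postResample(gbmPred, test.y),
      Cubist       = postResample(cubistPred, test.y))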

B.

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(cubistTuned, scale = TRUE)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess32     100
## ManufacturingProcess09     100
## ManufacturingProcess33      50
## ManufacturingProcess35      50
## ManufacturingProcess01      50
## BiologicalMaterial02        45
## ManufacturingProcess29       0
## ManufacturingProcess34       0
## ManufacturingProcess10       0
## ManufacturingProcess42       0
## ManufacturingProcess40       0
## ManufacturingProcess12       0
## ManufacturingProcess27       0
## BiologicalMaterial09         0
## ManufacturingProcess38       0
## ManufacturingProcess19       0
## ManufacturingProcess15       0
## ManufacturingProcess03       0
## ManufacturingProcess21       0
## ManufacturingProcess43       0

The most important predictors in the optimal tree-based model (Cubist) are ManufacturingProcess32 and ManufacturingProcess09, both scoring 100, followed by ManufacturingProcess33, ManufacturingProcess35, and ManufacturingProcess01. Manufacturing process variables dominate the list; the only biological predictor with a nonzero importance is BiologicalMaterial02.
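
As a quick tally (a small sketch using the objects above), we can count how many of the Cubist model's ten most important predictors are process variables versus biological variables:

# Sketch: process vs biological count among the Cubist top ten
varImp(cubistTuned)$importance %>%
  rownames_to_column("predictor") %>%
  arrange(desc(Overall)) %>%
  slice_head(n = 10) %>%
  count(type = ifelse(grepl("^Manufacturing", predictor),
                      "process", "biological"))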

# Optimal linear method from homework 7 was found to be the PLS model
split <- createDataPartition(filtered_remaining$Yield, p = 0.75, list = FALSE)

train.p <- filtered_remaining[split, -1]
train.y <- filtered_remaining[split, 1]

test.p <- filtered_remaining[-split, -1]
test.y <- filtered_remaining[-split, 1]

p_pro <- c("nzv", "center","scale")
tr_ctrl =trainControl(method = "boot", number=5)

pls_fit <-train(train.p, train.y, method="pls", tuneLength = 20, preProcess=p_pro,trControl=tr_ctrl)

varImp(pls_fit)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
## pls variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess36   89.73
## ManufacturingProcess13   76.69
## ManufacturingProcess09   69.11
## ManufacturingProcess33   66.88
## BiologicalMaterial02     62.18
## BiologicalMaterial06     60.56
## BiologicalMaterial08     54.72
## ManufacturingProcess06   53.61
## BiologicalMaterial12     51.27
## ManufacturingProcess12   49.35
## BiologicalMaterial04     48.79
## ManufacturingProcess04   45.96
## ManufacturingProcess29   44.33
## ManufacturingProcess11   39.78
## ManufacturingProcess34   36.70
## ManufacturingProcess02   34.08
## ManufacturingProcess20   32.76
## BiologicalMaterial10     31.51
## ManufacturingProcess37   31.28
# Optimal non-linear method from homework 8 was found to be the KNN model
knnTune <- train(train.x, train.y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

varImp(knnTune)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## BiologicalMaterial06    100.00
## BiologicalMaterial04     88.76
## BiologicalMaterial02     84.07
## ManufacturingProcess04   79.30
## ManufacturingProcess31   69.52
## ManufacturingProcess36   59.37
## ManufacturingProcess02   57.24
## ManufacturingProcess32   57.07
## BiologicalMaterial05     56.01
## ManufacturingProcess29   55.19
## ManufacturingProcess20   54.60
## ManufacturingProcess15   51.10
## BiologicalMaterial12     50.01
## ManufacturingProcess13   48.39
## ManufacturingProcess27   47.03
## ManufacturingProcess18   44.56
## ManufacturingProcess06   37.98
## ManufacturingProcess01   37.67
## ManufacturingProcess09   36.81
## BiologicalMaterial09     36.59

Both our optimal linear (PLS) and nonlinear (KNN) models rank several biological material predictors in their top ten, whereas the Cubist model's top ten is almost entirely manufacturing process variables. Of the two, the tree-based importance ranking is more similar to the linear PLS model, which also places ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess33 near the top.
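
A small sketch of that comparison: the overlap between the Cubist top ten and the top tens of the PLS and KNN models, using the fitted train objects above (top10 is a helper made up here).

# Sketch: which predictors the top-10 lists have in common
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(-imp$Overall)][1:10]
}

list(cubist_vs_pls = intersect(top10(cubistTuned), top10(pls_fit)),
     cubist_vs_knn = intersect(top10(cubistTuned), top10(knnTune)))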

C.

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpartTree <- rpart(Yield ~ ., data = filtered_remaining)

plot(as.party(rpartTree),ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))

Viewing the distribution of yield within each terminal node makes the relationships easier to grasp than correlation values or other summaries. The tree view makes it clear that yield can still be maximized when certain processes or materials take particular values. I can imagine that if a subject matter expert were to see this visual, they could use it to reinforce recommendations for alternative methods or materials in a business case, provided the alternative is viable.
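
As a numeric companion to the plot (a small sketch using the fitted rpart tree above), the size and mean yield of each terminal node can be summarized directly:

# Sketch: observation count and yield summary for each terminal node
filtered_remaining %>%
  mutate(node = factor(rpartTree$where)) %>%
  group_by(node) %>%
  summarise(n = n(),
            mean_yield = mean(Yield),
            sd_yield   = sd(Yield))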