8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
set.seed(123)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.630761154
## V2   6.378205752
## V3   0.737853335
## V4   7.545618226
## V5   2.169481327
## V6   0.097829221
## V7   0.100908417
## V8  -0.194478838
## V9  -0.093326598
## V10 -0.006066088

Did the random forest model significantly use the uninformative predictors (V6–V10)?

The random forest model did not significantly use the uninformative predictors: the importance scores for V6–V10 are all close to zero, and several are slightly negative.
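For a quick visual check of the same scores, the importances can also be plotted directly from the fitted forest (a minimal sketch using randomForest's built-in plotting helper):

# Plot permutation (%IncMSE) and node-purity importance for model1;
# the uninformative V6-V10 should cluster near zero.
varImpPlot(model1, sort = TRUE)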

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9434714
set.seed(123)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.60934625
## V2          6.13991751
## V3          0.53066612
## V4          7.04240325
## V5          2.05540886
## V6          0.25353686
## V7          0.05590159
## V8         -0.08706789
## V9         -0.02669762
## V10        -0.05221862
## duplicate1  4.18092896

Fit another random forest model to these data. Did the importance score for V1 change?

The importance score for V1 dropped from about 8.63 to 5.61 after the highly correlated duplicate1 was added, with duplicate1 itself picking up an importance of about 4.18. The scores for the other informative predictors (e.g., V2) changed only slightly.

What happens when you add another predictor that is also highly correlated with V1?

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9467591
set.seed(123)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.19524108
## V2          6.30087909
## V3          0.47538115
## V4          6.74081646
## V5          1.79320815
## V6          0.13052071
## V7          0.02737334
## V8         -0.05312351
## V9          0.03491343
## V10        -0.01877389
## duplicate1  2.62768604
## duplicate2  3.31830537

The variable importance of V1 decreases further with each additional correlated predictor we introduce, because the importance is split among V1 and its near-duplicates.
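As a rough check on this dilution effect, the scores for V1 and its duplicates can be pulled out of the importance table together (a small sketch using the rfImp3 object computed above):

# Importance is shared across the correlated set rather than concentrated in V1
rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall", drop = FALSE]
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])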

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])  # V1-V10 and y only; the duplicate columns are excluded

rfImp4 <- varimp(model4, conditional = TRUE)

rfImp4
##           V1           V2           V3           V4           V5           V6 
##  5.378093181  5.364475885  0.027914682  6.628766556  1.304961634 -0.010635023 
##           V7           V8           V9          V10 
##  0.011137367 -0.025004721 -0.009870118 -0.023312382

The conditional importance scores show a pattern similar to the traditional random forest model: V1, V2, and V4 are the most important variables, while the uninformative predictors V6–V10 receive importance near zero.
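To see how much the conditional adjustment matters, the traditional (unconditional) permutation importance can also be computed from the same cforest fit for comparison (a short sketch; note that varimp can be slow on larger forests):

# Unconditional permutation importance from the same conditional-inference forest
varimp(model4, conditional = FALSE)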

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library("gbm")
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(123)

gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp5 <- varImp(gbmTune, scale = FALSE)

rfImp5
## gbm variable importance
## 
##     Overall
## V4  45297.7
## V1  39668.4
## V2  34199.2
## V5  18115.5
## V3  11953.5
## V7   1607.1
## V6   1461.4
## V9   1007.2
## V8    898.8
## V10   670.3
library(Cubist)
set.seed(123)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned, scale = FALSE)

rfImp6
## cubist variable importance
## 
##     Overall
## V1     72.0
## V2     54.5
## V4     49.0
## V3     42.0
## V5     40.0
## V6     11.0
## V8      0.0
## V10     0.0
## V7      0.0
## V9      0.0

Both models show the same overall pattern: V1, V2, and V4 (along with V3 and V5) dominate the importance rankings, while the uninformative predictors V6–V10 contribute little or nothing.
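The two rankings are easier to compare when plotted; a small sketch using caret's plot method for varImp objects:

# Top-10 importance profiles for the boosted tree and Cubist fits
plot(rfImp5, top = 10)
plot(rfImp6, top = 10)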

8.2. Use a simulation to show tree bias with different granularities.

set.seed(123)
v1 <- runif(500, 1, 250)   # informative, moderate range
v2 <- runif(500, 1, 10)    # uninformative, narrowest range
v3 <- runif(500, 1, 500)   # informative, widest range
y  <- v1 + v3              # response depends only on v1 and v3
df <- data.frame(v1, v2, v3, y)

head(df)
##          v1       v2        v3        y
## 1  72.60680 4.182455 137.53774 210.1445
## 2 197.28798 4.297973 297.33960 494.6276
## 3 102.83525 3.583901  80.93222 183.7675
## 4 220.87133 1.719756 426.86169 647.7330
## 5 235.17635 4.289088 424.02184 659.1982
## 6  12.34357 2.602124 239.46552 251.8091
simMod <- randomForest(y ~ ., data = df, importance = TRUE, ntree = 100)
varImp(simMod, scale = TRUE)
##      Overall
## v1 45.694763
## v2 -2.104113
## v3 83.356770
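Here v3, which spans the widest range of values, receives the highest importance, v1 comes next, and the uninformative v2 receives essentially none. To isolate the granularity effect itself, a minimal sketch (the x_fine/x_coarse names are illustrative) compares two predictors that carry the same information but differ only in their number of distinct values; a single CART tree tends to prefer the finer-grained predictor because it offers more candidate split points:

library(rpart)
set.seed(123)
x_fine   <- runif(500)           # many distinct values
x_coarse <- round(x_fine, 1)     # same signal, only ~11 distinct values
y2 <- x_fine + rnorm(500, sd = 0.5)
granDf <- data.frame(x_fine, x_coarse, y2)

granTree <- rpart(y2 ~ ., data = granDf)
granTree$variable.importance     # importance is expected to favour x_fine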

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Fig. 8.24: A comparison of variable importance magnitudes for differing values of the bagging fraction and shrinkage parameters. Both tuning parameters are set to 0.1 in the left figure and to 0.9 in the right figure.

The left plot has both tuning parameters set to 0.1. A bagging fraction of 0.1 means each tree is trained on only 10% of the training data, which introduces more randomness and spreads importance across more predictors; the small learning rate also means each tree contributes only a small correction, so many trees (and many predictors) share the work. The right plot has both tuning parameters set to 0.9. With each tree trained on 90% of the training data there is far less randomness, and the large learning rate lets the model commit quickly to the strongest predictors, so importance is concentrated on just a few variables.
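This pattern can be reproduced directly; a minimal sketch, assuming the solubility data (solTrainXtrans, solTrainY) that ships with AppliedPredictiveModeling, with illustrative settings rather than the book's exact tuning:

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)                 # provides solTrainXtrans and solTrainY

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence: the 0.9/0.9 fit should concentrate on a few predictors,
# while the 0.1/0.1 fit spreads influence across many more.
head(summary(gbmLow,  plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)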

(b) Which model do you think would be more predictive of other samples?

I believe the model on the left (bagging fraction and learning rate of 0.1) would be more predictive of other samples. The smaller learning rate and bagging fraction add regularizing randomness and let the model learn gradually, which generally generalizes better; the model on the right commits heavily to a few predictors and is more prone to overfitting the training data.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig.8.24?

If the interaction depth were increased, each tree could use more predictors, spreading importance across a larger set of variables. This would flatten the slope of the predictor-importance profile for both models, although the right-hand model would likely remain more concentrated on its top predictors.
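One way to check this empirically (a sketch reusing the illustrative solTrain and gbmHigh objects from the sketch above) is to refit with a deeper interaction depth and compare how quickly the sorted relative influence falls off:

set.seed(100)
gbmDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 10)

# Compare the drop-off in relative influence against gbmHigh
head(summary(gbmDeep, plotit = FALSE), 10)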

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(ipred)
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missing, ChemicalManufacturingProcess)

sum(is.na(Chemical))
## [1] 0
set.seed(123)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]

# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

set.seed(123)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.2500628 0.5652829 0.9556864

Bagged Tree

set.seed(123)

baggedTree <- ipredbagg(train_y, train_x)
 
baggedPred <- predict(baggedTree, test_x)

postResample(baggedPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.3331638 0.4719081 0.9674425

Cubist

set.seed(123)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.0585622 0.6721253 0.9127560
rbind(bagged = postResample(baggedPred, test_y),
      randomForest = postResample(rfPred, test_y),
      cubist = postResample(cubistPred, test_y))
##                  RMSE  Rsquared       MAE
## bagged       1.333164 0.4719081 0.9674425
## randomForest 1.250063 0.5652829 0.9556864
## cubist       1.058562 0.6721253 0.9127560

The Cubist model has the lowest RMSE (and the highest R²) on the test set, so among these tree-based models it gives the best performance. Note that only test-set performance is compared above.
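To also compare resampling performance, one option is to refit all three methods through caret with a common cross-validation scheme and collect the resampled metrics (a sketch using 10-fold CV; it can take a while to run):

# 10-fold cross-validation for each tree-based model, then compare resampled RMSE
ctrl <- trainControl(method = "cv", number = 10)

set.seed(123)
rfCV  <- train(train_x, train_y, method = "rf",      trControl = ctrl)
set.seed(123)
bagCV <- train(train_x, train_y, method = "treebag", trControl = ctrl)
set.seed(123)
cubCV <- train(train_x, train_y, method = "cubist",  trControl = ctrl)

summary(resamples(list(randomForest = rfCV, bagged = bagCV, cubist = cubCV)))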

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(cubistTuned, scale = TRUE)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess17  100.00
## ManufacturingProcess32   95.83
## BiologicalMaterial06     44.79
## ManufacturingProcess13   41.67
## BiologicalMaterial12     36.46
## BiologicalMaterial02     30.21
## ManufacturingProcess09   26.04
## ManufacturingProcess04   21.88
## ManufacturingProcess45   20.83
## ManufacturingProcess37   20.83
## ManufacturingProcess33   17.71
## BiologicalMaterial08     16.67
## ManufacturingProcess39   14.58
## BiologicalMaterial05     13.54
## ManufacturingProcess27   12.50
## ManufacturingProcess11   11.46
## BiologicalMaterial03     11.46
## BiologicalMaterial11     10.42
## ManufacturingProcess15   10.42
## BiologicalMaterial04     10.42

ManufacturingProcess17 and ManufacturingProcess32 are the top contributors. Of the top 10 variables, only three are biological materials; manufacturing process variables dominate the list. This is similar to the top predictors found by the optimal linear and nonlinear models.
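To focus on just the top 10 predictors, the importance object can also be plotted (a small sketch using caret's plot method for varImp):

plot(varImp(cubistTuned, scale = TRUE), top = 10)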

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart)
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])
rpart.plot::rpart.plot(rpartTree)

Yes. The tree view adds to the importance rankings by showing which predictor values actually define the splits and how the mean yield changes from node to node, so the direction and approximate cut points of the dominant process variables become visible rather than just their relative importance.
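The rpart.plot display summarizes each terminal node by its mean; to see the full distribution of yield within each terminal node, as the exercise asks, a short sketch using the fitted tree's node assignments:

# Terminal-node membership for each training sample, then the spread of yield per node
nodeId <- rpartTree$where
boxplot(Chemical[index, "Yield"] ~ nodeId,
        xlab = "Terminal node", ylab = "Yield")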