Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
#load libraries
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(mlbench)
library(caret)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(dplyr)
library(forecast)
library(ggplot2)
library(VIM)
library(rpart)
library(partykit)
#8.1 - Recreate the simulated data from Exercise 7.2:
set.seed(333)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# a. Fit a random forest model to all of the predictors, then estimate the variable importance scores:
rf_1 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rf_1_imp <- varImp(rf_1, scale = FALSE)
print(rf_1_imp)
## Overall
## V1 7.953508251
## V2 7.686331812
## V3 1.367135564
## V4 13.814340695
## V5 2.930750358
## V6 0.119768734
## V7 0.002923786
## V8 -0.072054632
## V9 -0.056535920
## V10 0.097763537
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
The random forest model did not rely on the uninformative predictors to any meaningful degree. V4, V1, and V2 receive the largest importance scores, while V6 through V10 all sit near zero (a few are slightly negative).
# b. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9406464
# regenerate the noisy duplicate in a scratch copy to confirm the correlation is consistently high
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9502844
# Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
rf_2 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rf_2_imp <- varImp(rf_2, scale = FALSE)
print(rf_2_imp)
## Overall
## V1 5.31690476
## V2 7.15194214
## V3 1.39504150
## V4 13.04942000
## V5 2.49398857
## V6 0.22126549
## V7 0.13165793
## V8 0.04335948
## V9 -0.14917840
## V10 0.10852052
## duplicate1 4.17741292
V1's importance score dropped from roughly 7.95 to 5.32, while the correlated duplicate1 picked up an importance of about 4.18: the two copies now split credit for the same underlying signal. Adding further predictors correlated with V1 would be expected to dilute its importance even more.
# refit on the same data to check how stable the drop in V1's importance is
rf_3 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rf_3_imp <- varImp(rf_3, scale = FALSE)
print(rf_3_imp)
## Overall
## V1 5.46624990
## V2 7.08452120
## V3 1.16552410
## V4 13.14092205
## V5 2.37814431
## V6 0.15409637
## V7 0.09585130
## V8 -0.01920279
## V9 -0.07383881
## V10 0.09274903
## duplicate1 4.10428714
I expected a larger reduction, but the refit shows the same moderate drop: V1's importance stays around 5.5 and duplicate1 again absorbs part of it.
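To address the second part of the question directly, a second predictor highly correlated with V1 could be added and the forest refit. The sketch below is not run here; sim_dup2 and rf_dup2 are names introduced for illustration, and working on a copy keeps the later sections (which reuse simulated) unaffected.
# sketch: add a second noisy copy of V1 to a copy of the data and refit
sim_dup2 <- simulated
sim_dup2$duplicate2 <- sim_dup2$V1 + rnorm(200) * .1
rf_dup2 <- randomForest(y ~ ., data = sim_dup2, importance = TRUE, ntree = 1000)
varImp(rf_dup2, scale = FALSE)  # V1, duplicate1, and duplicate2 would be expected to share the importance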
# c Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
cf_1 <- cforest(y ~., data = simulated)
varimp(cf_1)
## V1 V2 V3 V4 V5 V6
## 6.43100945 6.75401287 0.77174897 13.16959867 2.40175412 0.10780937
## V7 V8 V9 V10 duplicate1
## 0.02996153 -0.13174632 -0.12692662 0.01024388 4.48054450
The unconditional cforest importances show essentially the same pattern as the traditional random forest: V4, V2, and V1 dominate, V6-V10 sit near zero, and duplicate1 again draws importance away from V1.
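The question also mentions the conditional importance measure of Strobl et al. (2007), which adjusts for correlated predictors such as duplicate1 and would be expected to shrink the duplicate's share. A sketch, not run here because the conditional permutation scheme is computationally expensive (cf_imp_cond is a name introduced for illustration):
cf_imp_cond <- varimp(cf_1, conditional = TRUE)  # conditional permutation importance
sort(cf_imp_cond, decreasing = TRUE)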
#d. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
#Boosted
set.seed(432)
gbm_1<- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_1)
## var rel.inf
## V4 V4 37.5637698
## V2 V2 18.2557319
## V1 V1 17.2356104
## V5 V5 10.5171041
## duplicate1 duplicate1 8.2298976
## V3 V3 7.0216397
## V9 V9 0.4337265
## V8 V8 0.3901455
## V7 V7 0.3523744
## V6 V6 0.0000000
## V10 V10 0.0000000
#Cubist
cub_simulated_x <- subset(simulated, select = -c(y))
cubist_1 <- cubist(x = cub_simulated_x, y = simulated$y, committees = 100)
varImp(cubist_1)
## Overall
## V3 48.5
## V2 60.0
## V1 61.0
## V4 56.5
## duplicate1 17.0
## V5 30.0
## V9 8.5
## V6 5.0
## V8 4.5
## V7 4.0
## V10 1.5
For boosted trees the pattern persists: V4 is again the most influential predictor and V6-V10 contribute little or nothing. Cubist scores importance differently (its values reflect how often each variable is used in rule conditions and models), so duplicate1 ranks much lower, but the informative predictors V1-V5 once more dominate the uninformative ones.
#8.2 - Use a simulation to show tree bias with different granularities.
set.seed(432)
# predictors sampled over increasingly fine ranges (more distinct values and larger variance)
low <- sample(0:50, 500, replace = T)
med <- sample(0:500, 500, replace = T)
high <- sample(0:5000, 500, replace = T)
# response depends on all three predictors plus noise
y <- low + med + high + rnorm(500)
# var(low); var(med); var(high)  # the variances differ by orders of magnitude
sim_var1 <- data.frame(low, med, high, y)
gran_1 <- randomForest(y ~., data = sim_var1, importance = TRUE, ntree = 1000)
varImp(gran_1, scale=FALSE)
## Overall
## low -7556.368
## med 28541.539
## high 3843340.856
The forest assigns nearly all of the importance to high, the predictor with by far the most distinct values (and the largest variance), illustrating the selection bias of tree-based models toward predictors with finer granularity. Note that in this simulation high also contributes the most to y, so granularity and signal strength are partly confounded; the sketch below isolates the granularity effect.
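A cleaner way to isolate the granularity effect is to make the response pure noise and simply count which predictor a single-split tree chooses at the root; the fine-grained predictor should be picked far more than half the time. A sketch, not run here (root_vars, coarse, fine, and y2 are names introduced for illustration):
# repeat a small simulation: y2 is pure noise, yet the root split is usually
# made on the fine-grained predictor rather than the coarse one
set.seed(432)
root_vars <- replicate(200, {
  coarse <- sample(0:1, 100, replace = TRUE)    # only 2 distinct values
  fine <- rnorm(100)                            # ~100 distinct values
  y2 <- rnorm(100)                              # pure noise, unrelated to either predictor
  fit <- rpart(y2 ~ coarse + fine, data = data.frame(coarse, fine, y2),
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
table(root_vars)                                # "fine" should dominate the counts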
#8.3 - In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(Figure 8.24 from the text is omitted here for knitting purposes.)
In the right-hand plot (shrinkage 0.9, bagging fraction 0.9), each tree is fit to 90% of the training data and each tree's contribution is weighted heavily, so the first few trees, which split on the strongest predictors, dominate the ensemble and importance is concentrated on a handful of variables. In the left-hand plot (both parameters set to 0.1), each tree sees a different 10% sample and takes only a small step toward the gradient; many more trees are needed, more predictors get a chance to enter the model, and importance is spread across a much larger set of variables. The 0.9/0.9 model learns quickly but is more vulnerable to overfitting.
The left model, with its slower learning rate and greater sampling randomness, would likely be more predictive of other samples: spreading the contribution across many trees and predictors acts as regularization, although the final choice should be confirmed by resampling.
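Before turning to interaction depth, the two extreme settings could be refit on the solubility data to verify how the importance profile changes. A sketch, not run here; it assumes the solubility objects from the AppliedPredictiveModeling package (loaded in the next section), sol_train, gbm_low, and gbm_high are names introduced for illustration, and the tree count is illustrative rather than tuned.
library(AppliedPredictiveModeling)
data(solubility)                                   # provides solTrainXtrans and solTrainY
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_low <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_low, plotit = FALSE), 10)         # importance spread over many predictors
head(summary(gbm_high, plotit = FALSE), 10)        # importance concentrated on a few predictors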
Increasing the interaction depth lets each tree make more splits, so more predictors are used in every tree. This spreads importance across a larger number of variables and flattens the slope of the importance profile in both plots.
# 8.7 - Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# load
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
set.seed(432)
data(ChemicalManufacturingProcess)
imp_chem_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
imp_chem <- predict(imp_chem_data, ChemicalManufacturingProcess)
imp_chem_split <- createDataPartition(imp_chem$Yield, p=.75, list=F)
imp_chem_train <- imp_chem[imp_chem_split,]
imp_chem_test <- imp_chem[-imp_chem_split,]
# drop the Yield column (column 1) so these objects contain predictors only
chem_train <- imp_chem_train[-c(1)]
chem_test <- imp_chem_test[-c(1)]
# a - Which tree-based regression model gives the optimal resampling and test set performance?
# Random Forest
set.seed(432)
rf_chem1 <- randomForest(chem_train, imp_chem_train$Yield, importance = TRUE, ntree = 1000)
chem_rf_pred <- predict(rf_chem1, newdata = chem_test)
postResample(pred = chem_rf_pred, obs = imp_chem_test$Yield)
## RMSE Rsquared MAE
## 0.6624754 0.6488302 0.5226948
# Boosted trees
set.seed(432)
bt_1 <- gbm.fit(chem_train, imp_chem_train$Yield, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9733 nan 0.0010 0.0007
## 2 0.9722 nan 0.0010 0.0009
## 3 0.9714 nan 0.0010 0.0006
## 4 0.9706 nan 0.0010 0.0009
## 5 0.9698 nan 0.0010 0.0009
## 6 0.9690 nan 0.0010 0.0008
## 7 0.9682 nan 0.0010 0.0007
## 8 0.9674 nan 0.0010 0.0008
## 9 0.9664 nan 0.0010 0.0009
## 10 0.9656 nan 0.0010 0.0009
## 20 0.9572 nan 0.0010 0.0007
## 40 0.9410 nan 0.0010 0.0008
## 60 0.9259 nan 0.0010 0.0008
## 80 0.9112 nan 0.0010 0.0008
## 100 0.8964 nan 0.0010 0.0003
bt_1_pred <- predict(bt_1, newdata = chem_test)
## Using 100 trees...
postResample(pred = bt_1_pred, obs = imp_chem_test$Yield)
## RMSE Rsquared MAE
## 0.9980328 0.3005775 0.8049624
#Cubist
set.seed(432)
cube_1 <- cubist(chem_train, imp_chem_train$Yield)
cube_1_pred <- predict(cube_1, newdata = chem_test)
postResample(pred = cube_1_pred, obs = imp_chem_test$Yield)
## RMSE Rsquared MAE
## 0.6740630 0.5957371 0.5457711
Random forest gives the best test set performance, with the lowest RMSE (0.662) and the highest R-squared (0.649), narrowly ahead of Cubist. The boosted model was fit with gbm.fit defaults (100 trees and, as the trace above shows, a step size of 0.001), which almost certainly understates what a tuned boosted model could do; resampling-based tuning, sketched below, would make the comparison fairer.
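The exercise also asks about resampling performance, which the fits above do not measure. A sketch of how the three models could be compared under 10-fold cross-validation with caret::train follows; it is not run here, tuning is left to caret's default grids via tuneLength, and ctrl, rf_cv, gbm_cv, and cubist_cv are names introduced for illustration.
ctrl <- trainControl(method = "cv", number = 10)
rf_cv <- train(x = chem_train, y = imp_chem_train$Yield,
               method = "rf", tuneLength = 5, trControl = ctrl)
gbm_cv <- train(x = chem_train, y = imp_chem_train$Yield,
                method = "gbm", tuneLength = 5, trControl = ctrl, verbose = FALSE)
cubist_cv <- train(x = chem_train, y = imp_chem_train$Yield,
                   method = "cubist", tuneLength = 5, trControl = ctrl)
summary(resamples(list(RF = rf_cv, GBM = gbm_cv, Cubist = cubist_cv)))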
# b. - Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImpPlot(rf_chem1, cex = 0.7)
ManufacturingProcess32 is again the most important variable, and manufacturing process variables make up roughly 70% of the top ten, so the process variables dominate the list. The top of the ranking is broadly consistent with the optimal linear and nonlinear models from Exercises 6.3 and 7.5, although the random forest spreads a small amount of importance across every predictor, so the lowest-ranked variables contribute very little.
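To read the top ten off the plot more precisely and count the two variable types, the importance matrix could be tabulated directly; a sketch, not run here (rf_imp and top10 are names introduced for illustration):
rf_imp <- importance(rf_chem1)                                  # matrix with %IncMSE and IncNodePurity
top10 <- head(sort(rf_imp[, "%IncMSE"], decreasing = TRUE), 10)
top10
table(ifelse(grepl("^Manufacturing", names(top10)), "Process", "Biological"))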
# c- Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield? Kuhn, Max; Johnson, Kjell. Applied Predictive Modeling (p. 220). Springer New York. Kindle Edition.
# single tree plot
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
chem_rpart_tree <- rpart(Yield ~., data = imp_chem_train)
rpart.plot(chem_rpart_tree)
Beyond confirming which variables matter most, the single tree shows the specific cutpoints at which those predictors change the predicted yield and the yield level associated with each terminal node, which sharpens our understanding of how the biological and process inputs relate to the output. A terminal-node view showing the full distribution of Yield, as the exercise requests, is sketched below.
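rpart.plot prints only the mean yield in each leaf; converting the tree to a partykit object would display boxplots of Yield in the terminal nodes instead. A sketch, not run here (chem_party_tree is a name introduced for illustration):
chem_party_tree <- as.party(chem_rpart_tree)              # partykit representation of the rpart tree
plot(chem_party_tree, gp = grid::gpar(fontsize = 7))      # terminal nodes show boxplots of Yield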