Homework 9

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(base)
library(mlbench)
library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(earth)

## Warning: package 'earth' was built under R version 4.3.2

## Loading required package: Formula
## Loading required package: plotmo

## Warning: package 'plotmo' was built under R version 4.3.2

## Loading required package: plotrix

library(AppliedPredictiveModeling)

Ask

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

8.1

Recreate the simulated data from Exercise 7.2

library(mlbench)

set.seed(2300)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

A

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)

model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

rfImp1

##         Overall
## V1   6.36572567
## V2   8.25707320
## V3   0.89141019
## V4  11.85545008
## V5   1.19036926
## V6   0.09291041
## V7   0.02788279
## V8  -0.02131354
## V9  -0.02722549
## V10  0.04217345

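No. The importance scores for V6 - V10 are all close to zero (between about -0.03 and 0.09), well below those of V1 - V5, so the random forest did not make significant use of the uninformative predictors.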

B

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1 
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9500839
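
The correlation of about 0.95 can be checked against theory. As a rough sketch, assuming V1 is uniform on [0, 1] (as mlbench.friedman1 generates its predictors) and the added noise is independent with standard deviation 0.1, the expected correlation is sqrt(Var(V1) / (Var(V1) + 0.1^2)). The variable names and the seed below are illustrative and not part of the original analysis.

```r
# Variance of a Uniform(0, 1) predictor and of the added noise
v1_var <- 1 / 12
noise_var <- 0.1^2

# Expected correlation between V1 and V1 + noise
expected_cor <- sqrt(v1_var / (v1_var + noise_var))
expected_cor

# Empirical check on a large simulated sample (illustrative seed)
set.seed(1)
v1 <- runif(1e5)
empirical_cor <- cor(v1, v1 + rnorm(1e5) * 0.1)
empirical_cor
```

The theoretical value is roughly 0.945, in line with the 0.95 observed above.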

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)
rfImp2

##                 Overall
## V1          4.520698011
## V2          7.434887446
## V3          0.865751098
## V4         11.343105263
## V5          1.473259145
## V6          0.080454120
## V7         -0.005710817
## V8         -0.009087140
## V9         -0.072791169
## V10         0.185376693
## duplicate1  2.984178606

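The importance score for V1 did change: it dropped from about 6.37 to 4.52 once duplicate1 was added, because the two correlated predictors now share the splits. V4 remains the most important predictor. duplicate1 scores 2.98, which is lower than V1 but still ranks among the top 5 predictors.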
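
The second half of the question (adding another predictor that is also highly correlated with V1) could be explored along the following lines. This is a sketch only: duplicate2, sim, model_dup2, and imp_dup2 are names introduced here for illustration, the data are regenerated so the block runs on its own, and no results are shown because the model was not run as part of this report. It calls randomForest::importance directly, with type = 1 and scale = FALSE, to get the unscaled permutation importance.

```r
library(mlbench)
library(randomForest)

# Rebuild the simulated data so this block is self-contained
set.seed(2300)
sim <- mlbench.friedman1(200, sd = 1)
sim <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim)[ncol(sim)] <- "y"

# Two noisy copies of V1 (duplicate2 is the hypothetical extra predictor)
sim$duplicate1 <- sim$V1 + rnorm(200) * .1
sim$duplicate2 <- sim$V1 + rnorm(200) * .1

model_dup2 <- randomForest(y ~ ., data = sim, importance = TRUE, ntree = 1000)

# Unscaled permutation importance (%IncMSE) for V1 and its two copies
imp_dup2 <- randomForest::importance(model_dup2, type = 1, scale = FALSE)
imp_dup2[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
```

Comparing the V1 row here with its score in the two earlier models would show whether the importance is diluted further across the correlated copies.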

C

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(partykit)

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

model3 <- cforest(y ~ ., data = simulated)

rfImp3 <- varimp(model3, conditional = TRUE)
rfImp3

##         V1         V2         V3         V4         V5         V6         V7 
##  3.5708634  5.8719628  0.2139685 10.2715544  1.2609252 -0.2726849 -0.1731168 
##         V8         V9        V10 duplicate1 
## -0.3310973 -0.2938416 -0.1053778  1.7839058

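The importances show a broadly similar pattern: V6 to V10 again have importance near zero (here slightly negative), and V4, V2, and V1 remain the top three predictors. The magnitudes differ, however, and the conditional measure gives duplicate1 a lower score relative to V1 than the traditional random forest did.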

D

library(Cubist)

set.seed(2300)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

simulated_x <- subset(simulated, select = -c(y))

cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)

rfImp4<-varImp(cubist_model)

arrange(rfImp4, Overall)

##     Overall
## V10     0.0
## V6      1.0
## V8      2.0
## V9      5.5
## V7      8.0
## V3     34.5
## V5     45.5
## V4     57.0
## V1     60.5
## V2     69.5

V6-V10 show a lower importance, similar to the previous output in Part C. The most important predictor in the Cubist model is V2, while in the previous model it was V4.

8.2

Use a simulation to show tree bias with different granularities.

V1 <- runif(1000, 2,1000)
V2 <- runif(1000, 50,500)
V3 <- rnorm(1000, 500,10)
y <- V2 + V1 

df <- data.frame(V1, V2, V3, y)
model5 <- cforest(y ~ ., data = df, ntree = 10)

#unconditional
rfImp5 <- varimp(model5, conditional = FALSE)

rfImp5

##          V1          V2          V3 
## 115843.0794  25780.0917   -229.9497

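The model is consistent with tree bias: V1, which has the highest variance and therefore the lowest granularity, is ranked with the highest importance, followed by V2, while V3, which is unrelated to y, scores near zero. Because y is the sum of V1 and V2, part of this ranking reflects V1's larger contribution to y rather than bias alone.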
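
A complementary sketch isolates the bias by making the outcome independent of every predictor, so any preference for one predictor can only come from its number of candidate split points. It assumes a single-split rpart tree is an adequate stand-in for the tree learner. The names coarse, fine, and first_split, the sample size of 100, and the 200 repetitions are all illustrative choices.

```r
library(rpart)

set.seed(2300)
n_sims <- 200
first_split <- character(n_sims)

for (i in seq_len(n_sims)) {
  sim <- data.frame(
    coarse = sample(0:1, 100, replace = TRUE),  # only 2 distinct values
    fine   = runif(100),                        # about 100 distinct values
    y      = rnorm(100)                         # independent of both predictors
  )
  # Fit a single-split tree; cp = 0 lets any improvement count as a split
  stump <- rpart(y ~ coarse + fine, data = sim,
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  # The first row of the frame is the root node and names its split variable
  first_split[i] <- as.character(stump$frame$var[1])
}

# How often each predictor was chosen for the root split
table(first_split)
```

If neither predictor carries information, an unbiased learner would pick them about equally often. A large majority for fine indicates selection bias toward the predictor with more distinct values.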

8.3

A

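The model on the right concentrates importance on the first few predictors because the bagging fraction and shrinkage (learning rate) are both set at 0.9. Each tree is built on nearly the same 90% of the training set, so the trees are highly correlated and keep selecting the same strong predictors. The high learning rate also makes the model greedy: each tree contributes most of its prediction, so fewer trees are needed to fit the data.

The model on the left has a more distributed spread of importance because the bagging fraction and shrinkage parameters are low. Each tree sees a different small sample and contributes only a small fraction of its prediction, so the model is less greedy, needs more trees, and is more likely to assign importance to other predictors.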

B

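The left model will likely perform better than the right on new samples because of its lower bagging fraction and learning rate: it learns slowly from varied subsamples and is less prone to overfitting the training set.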

C

The slope will increase as the interaction depth increases. Interaction depth controls how many splits each tree can make, so deeper trees include more of the less important predictors, and their importance scores rise.

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)

## [1] 176  58

set.seed(123)
chem_impute <- preProcess(ChemicalManufacturingProcess, method=c('center','knnImpute'))
df <- predict(chem_impute, ChemicalManufacturingProcess)

dfx <- df |> dplyr::select(-Yield)
dfy <- df |> dplyr::select(Yield)

set.seed(123)
chem_train <- createDataPartition(dfy$Yield, p = .80, list= FALSE)
x_train <- dfx[chem_train,]
x_test <- dfx[-chem_train,]
y_train <- dfy[chem_train,]
y_test <- dfy[-chem_train,]

A

Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

set.seed(123)
rf_model<- train(x_train, y_train, method='rf', tuneLength = 10)
rfPred <- predict(rf_model, x_test)

postResample(rfPred, y_test)

##      RMSE  Rsquared       MAE 
## 0.6637506 0.5834200 0.5052129

Cubist

cube_model <- cubist(x_train, y_train)
cubePred <- predict(cube_model, x_test)

postResample(cubePred, y_test)

##      RMSE  Rsquared       MAE 
## 0.6531324 0.5866293 0.5631796

The Cubist model gives slightly better test set performance of the two methods. Its R^2 is higher (0.587 vs. 0.583) and its RMSE is lower (0.653 vs. 0.664) than the Random Forest, although its MAE is higher (0.563 vs. 0.505). The differences are small.
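
The two models above trade places depending on whether RMSE or MAE is used, which is possible because RMSE penalizes large errors more heavily. The sketch below reproduces the three metrics by hand, assuming caret's default definitions (Rsquared as the squared correlation between predictions and observations). The helper regression_metrics and the toy vectors are hypothetical and are not taken from the chemical data.

```r
# Hypothetical helper mirroring the three metrics printed above
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(pred - obs)))
}

obs <- c(1, 2, 3, 4)
pred_even  <- obs + c(0.6, -0.6, 0.6, -0.6)  # many moderate errors
pred_spike <- obs + c(0, 0, 0, 1.3)          # one large error

metrics_even  <- regression_metrics(pred_even, obs)
metrics_spike <- regression_metrics(pred_spike, obs)
rbind(even = metrics_even, spike = metrics_spike)
```

In this toy case the evenly spread errors give the lower RMSE while the single large error gives the lower MAE, so the ranking of two models can depend on the metric chosen.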

B

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

imp<-varImp(cube_model)
head(imp,10)

##                        Overall
## ManufacturingProcess32    79.0
## ManufacturingProcess17    57.0
## ManufacturingProcess13    43.0
## ManufacturingProcess11    32.5
## BiologicalMaterial02      31.0
## ManufacturingProcess09    21.5
## ManufacturingProcess28    21.5
## ManufacturingProcess37    21.0
## BiologicalMaterial11      10.0
## ManufacturingProcess24    10.0

The most important variables are ManufacturingProcess32, 17, 13, and 11. Unlike in the nonlinear model, BiologicalMaterial02 also made it into the top 5 most important variables. For the linear model, the most important predictors were ‘ManufacturingProcess32’, ‘ManufacturingProcess13’, and ‘BiologicalMaterial06’. In both models, the ManufacturingProcess variables dominate the importance list; here they account for 8 of the top 10 predictors.

C

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.3.2

#train single tree model
rpart_tree <- rpart(Yield ~., data = df)

#produce tree plot
rpart.plot(rpart_tree)

The plot shows that the root node splits on ManufacturingProcess32, with later splits on either BiologicalMaterial or ManufacturingProcess predictors. Following a path from the root to a terminal node shows which combinations of predictor values are associated with higher or lower Yield.
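
The question also asks for the distribution of yield in the terminal nodes, which the default plot does not show. One possible approach, sketched here on the built-in mtcars data as a stand-in so the block runs on its own, is to group the outcome by the terminal node each training row falls into. The names toy_tree, node_id, and node_summary are illustrative.

```r
library(rpart)

# mtcars stands in for the chemical data so the sketch is self-contained
toy_tree <- rpart(mpg ~ ., data = mtcars)

# `where` holds, for each training row, the frame row of its terminal node
node_id <- toy_tree$where

# Summarise the outcome within each terminal node
node_summary <- tapply(mtcars$mpg, node_id, summary)
node_summary

# boxplot(mtcars$mpg ~ node_id) would draw the same distributions
```

Applied to the fitted tree above, the same grouping would use rpart_tree$where and df$Yield.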

Homework 9

Moiya Josephs

2024-04-14
