library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(base)
library(mlbench)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(earth)
## Warning: package 'earth' was built under R version 4.3.2
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.3.2
## Loading required package: plotrix
library(AppliedPredictiveModeling)
Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
Recreate the simulated data from Exercise 7.2
library(mlbench)
set.seed(2300)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
rfImp1
## Overall
## V1 6.36572567
## V2 8.25707320
## V3 0.89141019
## V4 11.85545008
## V5 1.19036926
## V6 0.09291041
## V7 0.02788279
## V8 -0.02131354
## V9 -0.02722549
## V10 0.04217345
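No. The importance scores for V6 through V10 are all close to zero (between about -0.03 and 0.09), well below those of the informative predictors V1 through V5, so the random forest did not make meaningful use of them.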
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9500839
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 4.520698011
## V2 7.434887446
## V3 0.865751098
## V4 11.343105263
## V5 1.473259145
## V6 0.080454120
## V7 -0.005710817
## V8 -0.009087140
## V9 -0.072791169
## V10 0.185376693
## duplicate1 2.984178606
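The importance score for V1 did change: it dropped from about 6.37 to 4.52 once duplicate1 was added, because the two correlated predictors now share the importance. V4 remains the most important predictor, as it was in the first model. The score for duplicate1 (2.98) is lower than that of V1 but is still the fourth highest overall.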
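The second part of the question, adding another predictor that is also highly correlated with V1, is not run above. A minimal sketch of how it could be checked follows. The names `sim`, `duplicate2`, `model_dup2`, and `imp_dup2` are illustrative and do not appear elsewhere in this report, and no output is shown because the scores depend on the random draws.

```r
library(mlbench)
library(randomForest)
library(caret)

# Recreate the simulated data so this block runs on its own
set.seed(2300)
sim <- mlbench.friedman1(200, sd = 1)
sim <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim)[ncol(sim)] <- "y"

# Two noisy copies of V1 (duplicate2 is a hypothetical second copy)
sim$duplicate1 <- sim$V1 + rnorm(200) * 0.1
sim$duplicate2 <- sim$V1 + rnorm(200) * 0.1

# Refit the random forest with both near-copies included
model_dup2 <- randomForest(y ~ ., data = sim, importance = TRUE, ntree = 1000)
imp_dup2 <- varImp(model_dup2, scale = FALSE)

# Compare V1 with its two near-copies
imp_dup2[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
```

If importance is shared among correlated predictors, V1's score would be expected to fall further, but that should be confirmed from the actual output rather than assumed.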
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
model3 <- cforest(y ~ ., data = simulated)
rfImp3 <- varimp(model3, conditional = TRUE)
rfImp3
## V1 V2 V3 V4 V5 V6 V7
## 3.5708634 5.8719628 0.2139685 10.2715544 1.2609252 -0.2726849 -0.1731168
## V8 V9 V10 duplicate1
## -0.3310973 -0.2938416 -0.1053778 1.7839058
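The conditional importances show broadly the same pattern as the traditional random forest: V4, V2, and V1 rank highest, and V6 through V10 have importances near zero (here slightly negative). The magnitudes differ, however. Every score is lower than in the random forest model, and the importance of duplicate1 falls from 2.98 to 1.78.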
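Only the conditional measure is computed above. To see how the `conditional` argument changes the picture, both measures can be computed from the same forest and placed side by side. This is a sketch under stated assumptions: `sim`, `cf_fit`, and `cf_compare` are illustrative names, and `ntree = 100` is chosen only to keep the conditional computation quick, not because the report used it.

```r
library(mlbench)
library(partykit)

# Recreate the simulated data with the correlated predictor
set.seed(2300)
sim <- mlbench.friedman1(200, sd = 1)
sim <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim)[ncol(sim)] <- "y"
sim$duplicate1 <- sim$V1 + rnorm(200) * 0.1

# Smaller forest (assumption) so the conditional importance runs quickly
cf_fit <- cforest(y ~ ., data = sim, ntree = 100)

# Traditional (permutation) and conditional importance from the same fit
imp_traditional <- varimp(cf_fit, conditional = FALSE)
imp_conditional <- varimp(cf_fit, conditional = TRUE)

# Align by predictor name; a predictor missing from either result shows as NA
predictors <- setdiff(names(sim), "y")
cf_compare <- data.frame(
  traditional = imp_traditional[predictors],
  conditional = imp_conditional[predictors],
  row.names = predictors
)
cf_compare[order(-cf_compare$traditional), ]
```

Comparing the two columns for V1 and duplicate1 shows how much of the importance the conditional measure attributes to each once the correlation between them is accounted for.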
library(Cubist)
set.seed(2300)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
simulated_x <- subset(simulated, select = -c(y))
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
rfImp4 <- varImp(cubist_model)
arrange(rfImp4, Overall)
## Overall
## V10 0.0
## V6 1.0
## V8 2.0
## V9 5.5
## V7 8.0
## V3 34.5
## V5 45.5
## V4 57.0
## V1 60.5
## V2 69.5
V6 through V10 again receive the lowest importance scores, as in the conditional inference forest from Part C. The most important predictor in the Cubist model is V2, whereas in the previous models it was V4.
Use a simulation to show tree bias with different granularities.
V1 <- runif(1000, 2,1000)
V2 <- runif(1000, 50,500)
V3 <- rnorm(1000, 500,10)
y <- V2 + V1
df <- data.frame(V1, V2, V3, y)
model5 <- cforest(y ~ ., data = df, ntree = 10)
#unconditional
rfImp5 <- varimp(model5, conditional = FALSE)
rfImp5
## V1 V2 V3
## 115843.0794 25780.0917 -229.9497
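In this simulation V1 receives by far the highest importance, V2 is second, and V3, which is unrelated to y, scores slightly below zero. V1 has the widest range and therefore contributes the most variance to y. Note that all three predictors are continuous, so they differ in spread rather than in granularity (the number of distinct values). The selection bias in question arises when predictors with more distinct values are favored for splits even though they are no more informative.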
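A simulation that varies granularity directly could look like the sketch below. It is not the simulation used above: it swaps in a single-split `rpart` tree, and the predictor names `low`, `mid`, and `high` are hypothetical. None of the predictors is related to y, so an unbiased method would pick each one for the first split about equally often.

```r
library(rpart)

set.seed(2300)
n_sims <- 200   # number of simulated data sets
n_obs <- 100    # observations per data set

first_split <- character(n_sims)
for (i in seq_len(n_sims)) {
  sim_gran <- data.frame(
    low  = sample(0:1, n_obs, replace = TRUE),   # 2 distinct values
    mid  = sample(1:10, n_obs, replace = TRUE),  # up to 10 distinct values
    high = runif(n_obs),                         # essentially all distinct
    y    = rnorm(n_obs)                          # pure noise response
  )
  # Force a single split so the root split variable can be recorded
  fit <- rpart(y ~ ., data = sim_gran,
               control = rpart.control(maxdepth = 1, cp = 0))
  # The first row of fit$frame is the root; "none" marks a tree with no split
  first_split[i] <- if (nrow(fit$frame) > 1) {
    as.character(fit$frame$var[1])
  } else {
    "none"
  }
}

# How often each predictor was chosen for the first split
table(first_split)
```

If the counts lean heavily toward `high`, that is the granularity bias: the predictor with more candidate split points wins more often purely by chance.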
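The model on the right concentrates importance on the first few predictors because its bagging fraction and shrinkage are both set to 0.9. Each tree is built on 90% of the training set and makes a large contribution to the fit, so the trees are highly similar, keep selecting the same dominant predictors, and fewer trees are needed to fit the training data.
The model on the left spreads importance across more predictors because its bagging fraction and shrinkage are both low. Each tree is fit to a small random subsample and contributes only a small update, which makes the model less greedy and gives other predictors a chance to be selected.
The left model will likely perform better on new samples than the right model, because its lower bagging fraction and learning rate make it less prone to overfitting the training set.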
Increasing the interaction depth allows each tree to make more splits and therefore to use more of the less important predictors. Importance would be spread across more predictors, so the importance profile would flatten (a less steep slope) for either model.
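The effect of the bagging fraction, shrinkage, and interaction depth discussed above can be explored with a small experiment. The sketch below is an assumption-laden stand-in: it uses the simulated Friedman data rather than the data behind the exercise's figure, and `fit_gbm`, `sim_df`, `low_fit`, and `high_fit` are illustrative names. `n.minobsinnode = 5` is set so that a bagging fraction of 0.1 leaves enough observations per tree for 200 rows.

```r
library(mlbench)
library(gbm)

set.seed(2300)
sim <- mlbench.friedman1(200, sd = 1)
sim_df <- data.frame(sim$x, y = sim$y)  # predictors are named X1 to X10

# Fit a boosted tree model with the same value for shrinkage and bag.fraction
fit_gbm <- function(rate, depth = 1) {
  gbm(y ~ ., data = sim_df, distribution = "gaussian",
      n.trees = 100, interaction.depth = depth,
      shrinkage = rate, bag.fraction = rate,
      n.minobsinnode = 5)
}

low_fit  <- fit_gbm(0.1)  # analogous to the left-hand model
high_fit <- fit_gbm(0.9)  # analogous to the right-hand model

# Relative influence of each predictor, sorted from most to least important
low_imp  <- summary(low_fit, plotit = FALSE)
high_imp <- summary(high_fit, plotit = FALSE)

head(low_imp, 5)
head(high_imp, 5)
```

Calling `fit_gbm(0.1, depth = 5)` and comparing its relative influence with `low_imp` gives a direct look at how interaction depth changes the shape of the importance profile.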
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176 58
set.seed(123)
chem_impute <- preProcess(ChemicalManufacturingProcess, method=c('center','knnImpute'))
df <- predict(chem_impute, ChemicalManufacturingProcess)
dfx <- df |> dplyr::select(-Yield)
dfy <- df |> dplyr::select(Yield)
set.seed(123)
chem_train <- createDataPartition(dfy$Yield, p = .80, list= FALSE)
x_train <- dfx[chem_train,]
x_test <- dfx[-chem_train,]
y_train <- dfy[chem_train,]
y_test <- dfy[-chem_train,]
Which tree-based regression model gives the optimal resampling and test set performance?
set.seed(123)
rf_model <- train(x_train, y_train, method='rf', tuneLength = 10)
rfPred <- predict(rf_model, x_test)
postResample(rfPred, y_test)
## RMSE Rsquared MAE
## 0.6637506 0.5834200 0.5052129
cube_model <- cubist(x_train, y_train)
cubePred <- predict(cube_model, x_test)
postResample(cubePred, y_test)
## RMSE Rsquared MAE
## 0.6531324 0.5866293 0.5631796
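The Cubist model gives slightly better test set performance than the random forest: its RMSE is lower (0.653 vs. 0.664) and its R^2 is marginally higher (0.587 vs. 0.583), although its MAE is higher (0.563 vs. 0.505). The differences are small, and only test set results are compared here.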
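The question also asks about resampling performance, which the comparison above does not cover. One way to obtain it is to train both models with `caret::train` on the same cross-validation folds and compare them with `resamples`. This is a sketch, not the procedure used in this report: the 5-fold setup, `tuneLength = 3`, and the names `chem`, `folds`, `rf_cv`, `cubist_cv`, and `cv_results` are all assumptions made for illustration.

```r
library(caret)
library(AppliedPredictiveModeling)

data("ChemicalManufacturingProcess")

# Impute missing values with the same preProcess methods as the report
set.seed(123)
imp_model <- preProcess(ChemicalManufacturingProcess,
                        method = c("center", "knnImpute"))
chem <- predict(imp_model, ChemicalManufacturingProcess)

# 80/20 split; only the training portion is resampled
in_train <- createDataPartition(chem$Yield, p = 0.80, list = FALSE)
train_x <- chem[in_train, names(chem) != "Yield"]
train_y <- chem$Yield[in_train]

# Shared folds so both models are evaluated on identical splits
set.seed(123)
folds <- createFolds(train_y, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", number = 5, index = folds)

set.seed(123)
rf_cv <- train(train_x, train_y, method = "rf",
               tuneLength = 3, trControl = ctrl)
set.seed(123)
cubist_cv <- train(train_x, train_y, method = "cubist", trControl = ctrl)

# Side-by-side resampled RMSE, R-squared, and MAE
cv_results <- resamples(list(rf = rf_cv, cubist = cubist_cv))
summary(cv_results)
```

Because both models share the same folds, the resampled metrics are directly comparable and can be set against the test set results.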
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
imp <- varImp(cube_model)
head(imp, 10)
## Overall
## ManufacturingProcess32 79.0
## ManufacturingProcess17 57.0
## ManufacturingProcess13 43.0
## ManufacturingProcess11 32.5
## BiologicalMaterial02 31.0
## ManufacturingProcess09 21.5
## ManufacturingProcess28 21.5
## ManufacturingProcess37 21.0
## BiologicalMaterial11 10.0
## ManufacturingProcess24 10.0
The most important variables are ManufacturingProcess32, ManufacturingProcess17, ManufacturingProcess13, and ManufacturingProcess11. Unlike in the nonlinear model, BiologicalMaterial02 also made it into the top 5 most important variables. For the linear model, the most important predictors were ManufacturingProcess32, ManufacturingProcess13, and BiologicalMaterial06. In both the Cubist and linear models the manufacturing process variables dominate the importance list; in the Cubist model they account for eight of the top 10.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
#train single tree model
rpart_tree <- rpart(Yield ~., data = df)
#produce tree plot
rpart.plot(rpart_tree)
The tree splits first on ManufacturingProcess32 at the root node, and the branches below it split on either biological material or manufacturing process predictors. The plot does not show correlations directly. Instead, each path from the root shows the sequence of splits that leads to a terminal node, together with the predicted yield for the samples that fall in that node.
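The question asks for the distribution of yield in the terminal nodes, which `rpart.plot` summarizes only as a mean per node. One option is to convert the tree to a `party` object with `partykit::as.party`, whose plot method draws a boxplot of the response in each terminal node. This is a sketch of that alternative, not the plot used in this report; `chem`, `single_tree`, and `party_tree` are illustrative names.

```r
library(caret)
library(AppliedPredictiveModeling)
library(rpart)
library(partykit)

data("ChemicalManufacturingProcess")

# Same imputation approach as the report, repeated so the block is self-contained
set.seed(123)
imp_model <- preProcess(ChemicalManufacturingProcess,
                        method = c("center", "knnImpute"))
chem <- predict(imp_model, ChemicalManufacturingProcess)

# Fit a single regression tree on the imputed data
single_tree <- rpart(Yield ~ ., data = chem)

# Convert to a party object; for a numeric response the terminal
# panels of the plot show boxplots of Yield
party_tree <- as.party(single_tree)
plot(party_tree)
```

Reading the boxplots alongside the split variables shows which combinations of process and biological conditions are associated with higher or lower yield, and how much the yield varies within each node.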