library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.2.3
library(ipred)
## Warning: package 'ipred' was built under R version 4.2.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.2.3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
model1 <- randomForest(y ~ ., data = simulated,importance = TRUE,ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The model places more importance on 1,2, 4 and somewhat 5 (8.7, 6.4, 7.6, and 2.0, respectively) while 6-10 appear to have little (0.16 and below).
library(party)
## Warning: package 'party' was built under R version 4.2.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.2.3
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.2.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.2.3
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
model3 <- cforest(y ~ ., data = simulated)
rfImp3 <- varimp(model3, conditional = TRUE) %>%
as.data.frame()
rfImp3
## .
## V1 1.8986239733
## V2 4.8021626697
## V3 0.0229993405
## V4 6.0471706526
## V5 0.9850544288
## V6 -0.0119652434
## V7 -0.0104327775
## V8 -0.0104862941
## V9 0.0004516316
## V10 -0.0074652543
## duplicate1 1.9703660210
Using the cforest function yields the same imporance patterns as the traditional random forest model, where V1, V2, V4 and 5 show more importance than the others.
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.2.3
model4 <- cubist(simulated[, colnames(simulated)[colnames(simulated) != 'y']],
simulated$y)
rfImp4 <- varImp(model4, scale = FALSE)
rfImp4
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
Yes, the same pattern occurs.
set.seed(1234)
#simulate 100 samples, predictor x, nonlinear relationship
n <- 100
x <- runif(n, -5, 5)
y <- ifelse(x^2 + rnorm(n) > 1, 1, 0)
df <- data.frame(x = x, y = as.factor(y))
ctree_granular <- ctree(y ~ x, df, controls = ctree_control(mincriterion = 0.1))
plot(ctree_granular)
The model on the right has a higher learning ratio and as the ratio increases towards 1, models will have less predictors needed for the response variable. It also has a higher bagging fraction (0.9 per the figure in the text), meaning less predictors will be important.
In terms of the model on the right as the bagging fraction and learning ratio increase, the model’s performance will be inverse. Therefore, with more predictors being identified with more of a distributed importance, I think the model on the left would be more predictive of other samples.
As interaction depth increases, the importance of predictors should spread over more predictors and the lines in the figure would spread further right.
set.seed(789)
data(ChemicalManufacturingProcess)
# imputation
imp <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(imp, ChemicalManufacturingProcess)
# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(135)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]
# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
Random Forest
set.seed(246)
rfModel <- randomForest(train_x, train_y, importance = TRUE, ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
## RMSE Rsquared MAE
## 0.9327773 0.8182165 0.7298731
Cubist
set.seed(246)
cubist_train <- train(train_x, train_y, method = "cubist")
cubistPred <- predict(cubist_train, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 0.9571867 0.7379501 0.7042463
Single Tree
set.seed(246)
cart_train <- train(train_x, train_y,method = "rpart",tuneLength = 10,trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartPred <- predict(cart_train, test_x)
postResample(cartPred, test_y)
## RMSE Rsquared MAE
## 1.3913659 0.4252951 1.0420744
Bagged Trees
set.seed(246)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)
postResample(baggedPred, test_y)
## RMSE Rsquared MAE
## 1.0442505 0.7046695 0.7522740
Cubist has the lower RMSE of 0.9316272 giving it the best test set performance.
varImp(cubist_train, scale = TRUE)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 67.327
## ManufacturingProcess09 62.376
## ManufacturingProcess04 28.713
## ManufacturingProcess29 26.733
## BiologicalMaterial02 24.752
## BiologicalMaterial03 19.802
## ManufacturingProcess39 17.822
## ManufacturingProcess24 17.822
## ManufacturingProcess13 17.822
## ManufacturingProcess33 16.832
## ManufacturingProcess01 16.832
## BiologicalMaterial06 16.832
## ManufacturingProcess14 14.851
## ManufacturingProcess26 12.871
## ManufacturingProcess37 11.881
## ManufacturingProcess07 11.881
## BiologicalMaterial12 10.891
## ManufacturingProcess10 9.901
## ManufacturingProcess21 8.911
All of the models seem to show manufacturing processess atop, though in this model, biological materials 6,3,2 do show significance compared to the optimal linear and nonlinear models. Manufacturing 32 shows dominance as it did in the others, meaning it is seemingly important in the process.
library(rpart)
## Warning: package 'rpart' was built under R version 4.2.3
plot(varImp(cubist_train), top = 10)
tree <- rpart(Yield ~ ., data = Chemical)
plot(tree)