library(AppliedPredictiveModeling)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6–V10)?
print(rfImp1)
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
With scores this low, the uninformative predictors were essentially unused: every one of V6–V10 has an importance score below 0.2, and several are negative, while the informative predictors score far higher. V1 and V4 were the most informative predictors in this model.
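As a quick check (a sketch, not part of the original output), the importance scores can be sorted to confirm that V6–V10 cluster near zero:

# Sketch: sort the importance scores; V6-V10 should sit at the bottom
rfImp1 |>
  rownames_to_column("predictor") |>
  arrange(desc(Overall))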
# Re-simulate the Friedman 1 data (no seed is set here, so this draw
# differs from the one used for the random forest above)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
# Conditional inference forest; conditional importance adjusts for
# correlations among the predictors
c_model <- cforest(y ~ ., data = simulated)
imp_con <- varimp(c_model, conditional = TRUE)
print(imp_con)
## V1 V2 V3 V4 V5 V6
## 3.87666529 2.32554116 0.02041033 7.85292146 1.00530847 0.06644453
## V7 V8 V9 V10
## -0.02061033 0.01306147 -0.01333696 -0.02609467
The importance patterns of the two models are very similar, with V4, V1, and V2 ranking highest. The difference is that the cforest model puts even more emphasis on V4 specifically. Conditional importance also lets us account for correlation (multicollinearity) among the predictors.
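To see how much the conditioning matters, the unconditional scores can be computed from the same fit and laid side by side (a sketch; these numbers are not shown above):

# Sketch: compare unconditional vs. conditional importance from the same forest
imp_uncon <- varimp(c_model, conditional = FALSE)
round(cbind(unconditional = imp_uncon, conditional = imp_con), 3)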
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Tune a boosted tree over interaction depth and number of trees,
# holding shrinkage and the minimum node size fixed
gbmGrid <- expand.grid(interaction.depth = seq(1, 5, by = 2),
                       n.trees = seq(100, 1000, by = 100),
                       shrinkage = 0.1,
                       n.minobsinnode = 5)
model_gbm <- train(y ~ ., data = simulated,
                   tuneGrid = gbmGrid, verbose = FALSE,
                   method = 'gbm')
gbm_imp <- varImp(model_gbm)
print(gbm_imp)
## gbm variable importance
##
## Overall
## V4 100.000
## V1 60.412
## V2 43.192
## V3 30.157
## V5 27.386
## V9 3.726
## V6 2.717
## V10 2.012
## V8 1.112
## V7 0.000
V6–V10 still play a minimal role in the model, if any. V4 is clearly the most important variable, with V1 and V2 behind it.
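The boosting parameters that the grid search settled on can be read from the train object (a sketch, output not shown):

# Sketch: the selected tuning parameters and the resampling profile
model_gbm$bestTune
plot(model_gbm)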
library(Cubist)
# Fit a Cubist rule-based model (named cubist_fit to avoid shadowing cubist())
cubist_fit <- cubist(x = simulated[, -11], y = simulated[, 11])
cubist_imp <- varImp(cubist_fit)
print(cubist_imp)
## Overall
## V1 91.0
## V2 91.0
## V3 65.5
## V4 67.5
## V5 50.0
## V8 13.5
## V10 3.0
## V6 0.0
## V7 0.0
## V9 0.0
Unlike the other models, the Cubist model ranks V1 and V2 as the most important variables (tied at 91.0), with V4 and V3 close behind. The uninformative predictors again score at or near zero, so the pattern of which variables matter least is consistent across all of the models tested.
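Cubist derives importance from how often each predictor appears in rule conditions and in the per-rule linear models; the fitted rules themselves can be printed for inspection (a sketch, output not shown):

# Sketch: print the rules and their associated linear models
summary(cubist_fit)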
# Simulate three predictors with very different means and variances; y
# depends equally on all three, with multiplicative noise
n <- 100
train <- data.frame(
  a = rnorm(n, mean = 0, sd = 1),
  b = rnorm(n, mean = 5, sd = 5),
  c = rnorm(n, mean = 100, sd = 10)
) |>
  mutate(
    y = (a + b + c) * rnorm(n, 1, sd = 0.2)
  )
trainIndex <- createDataPartition(train$y, p = 0.8, list = FALSE)
trainData <- train[trainIndex, ]
testData <- train[-trainIndex, ]
rpartTune <- train(
y ~ .,
data = trainData,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv")
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(rpartTune)
Here we can see that different levels of granularity in the tree produce different RMSE values. A complexity parameter of about 0.05 appears best at minimizing RMSE without raising concerns about overfitting, since the trees grown at even more granular settings produce larger error estimates.
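The chosen complexity parameter and the structure of the winning tree can be confirmed directly from the train object (a sketch, output not shown):

# Sketch: the selected cp and the corresponding final tree
rpartTune$bestTune
rpartTune$finalModel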
The right-hand model uses 0.9 for both the bagging fraction and the learning rate, so it trains quickly, with each tree seeing the majority of the data at once. The left-hand model, with its bagging fraction of 0.1, fits each tree on only 10% of the total data.
The two models represent two extremes. I feel the left-hand model sees too small a window into the data with each tree to be useful for prediction, while the right-hand model is overtrained on the data and may not generalize well to new points. If I had to choose one, I would expect the left-hand model to be more predictive of other samples, due specifically to its lower learning rate: small steps leave room for later trees to correct earlier mistakes rather than locking them in.
Increasing the interaction depth would likely have less impact on the right-hand model: since each tree already sees 90% of the data, there is less left to be revealed by deeper interactions. A greater impact would be seen on the left-hand model, where looking at interactions more deeply may expose patterns that would otherwise be missed because of the smaller size of each sample.
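As a rough illustration (a sketch using hypothetical settings on the Friedman data from above, not the textbook's exact models), the two extremes could be reproduced by varying only the shrinkage and bagging fraction:

# Sketch: two extreme boosting configurations (hypothetical settings)
gbm_left  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                 n.minobsinnode = 5)
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                 n.minobsinnode = 5)
# Expect the relative influence to concentrate on fewer predictors on the right
summary(gbm_right, plotit = FALSE)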
data("ChemicalManufacturingProcess")
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(RANN)
# Center, scale, and kNN-impute the missing values in the chemical data
knn <- preProcess(ChemicalManufacturingProcess,
                  method = c("center", "scale", "knnImpute"))
knnpre <- predict(knn, ChemicalManufacturingProcess)
missmap(knnpre, col = c("yellow", "navy"))
# Drop near-zero-variance predictors
near_zero_chem <- nearZeroVar(knnpre)
filtered_chem <- knnpre[, -near_zero_chem]
ncol(filtered_chem)
## [1] 57
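A quick sanity check (sketch) confirms the imputation filled every missing value before modeling:

# Sketch: count missing values before and after kNN imputation
sum(is.na(ChemicalManufacturingProcess))  # before imputation
sum(is.na(knnpre))                        # after: expect 0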
set.seed(79)
index_chem <- createDataPartition(filtered_chem$Yield, p = .8, list = FALSE)
# We need to exclude the 'Yield' column since it is the predicted variable
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]
# Only using the 'Yield' column
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem <- filtered_chem[-index_chem, "Yield", drop = TRUE]
set.seed(200)
cartTune <- train(X_train_chem, y_train_chem,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartTune
## CART
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01512226 0.7157069 0.5228782 0.5650310
## 0.01615008 0.7187977 0.5165319 0.5661045
## 0.01752915 0.7243224 0.5112672 0.5668065
## 0.01867819 0.7155849 0.5220335 0.5618331
## 0.03190252 0.7187842 0.5077316 0.5778305
## 0.03656443 0.7351140 0.4887174 0.5981998
## 0.04929112 0.7197995 0.4959550 0.5803143
## 0.05938974 0.7186304 0.5012085 0.5768006
## 0.10224046 0.7343931 0.4787643 0.5838474
## 0.40438968 0.9059819 0.2991268 0.7188346
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01867819.
rpartPred <- predict(cartTune, newdata = X_test_chem)
postResample(pred = rpartPred, obs = y_test_chem)
## RMSE Rsquared MAE
## 0.7801803 0.4903291 0.6155187
For the single tree model, the optimal complexity parameter is cp ≈ 0.0187. Using the model on our test data, we can see that it accounts for about 49% of the variance.
set.seed(200)
library(ipred)
baggedTree <- ipredbagg(y_train_chem, X_train_chem)
baggPred <- predict(baggedTree, newdata = X_test_chem)
postResample(pred = baggPred, obs = y_test_chem)
## RMSE Rsquared MAE
## 0.6519623 0.6496543 0.5133644
Going by the R-squared produced by this model, the bagged trees are a clear improvement over the single tree, accounting for nearly 65% of the variance in the test data compared with about 49% before.
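ipredbagg defaults to 25 bootstrap replicates; a larger ensemble (a sketch with a hypothetical nbagg value) can be tried to see whether performance stabilizes further:

# Sketch: a larger bagged ensemble (nbagg = 50 is a hypothetical choice)
baggedTree50 <- ipredbagg(y_train_chem, X_train_chem, nbagg = 50)
postResample(pred = predict(baggedTree50, newdata = X_test_chem),
             obs = y_test_chem)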
set.seed(200)
rforest <- train(X_train_chem, y_train_chem,
method = "rf",
tuneLength = 5,
trControl = trainControl(method = "cv"))
rfPred <- predict(rforest, newdata = X_test_chem)
postResample(pred = rfPred, obs = y_test_chem)
## RMSE Rsquared MAE
## 0.5903720 0.7166334 0.4464794
The random forest model performs better still than the bagged trees, accounting for about 71.7% of the test-set variance at its selected value of mtry = 15.
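The selected mtry can be read straight off the train object (a sketch, output not shown):

# Sketch: confirm the chosen mtry and view the resampling profile
rforest$bestTune
plot(rforest)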
set.seed(200)
boosted <- train(X_train_chem, y_train_chem,
method = "gbm",
verbose = FALSE,
tuneLength = 5,
trControl = trainControl(method = "cv"))
boostPred <- predict(boosted, newdata = X_test_chem)
postResample(pred = boostPred, obs = y_test_chem)
## RMSE Rsquared MAE
## 0.5419035 0.7511395 0.4426527
Once again, this model captures more variance than the last. For the boosted trees, over 75% of the variance is captured by the model, making it the best predictor yet.
set.seed(200)
cubist_model <- train(
x = X_train_chem,
y = y_train_chem,
method = "cubist"
)
cubPred <- predict(cubist_model, newdata = X_test_chem)
postResample(pred = cubPred, obs = y_test_chem)
## RMSE Rsquared MAE
## 0.4707244 0.8037697 0.3424563
The pattern of improvement continues: the Cubist model captures over 80% of the variance.
The best resampling and test-set performance came from the Cubist regression model. Not only did it give the highest R-squared (0.804), it also yielded the lowest RMSE (0.471).
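Collecting the test-set metrics computed above into a single table makes the comparison explicit (a sketch using the predictions already generated):

# Sketch: test-set performance of all five tree-based models, side by side
rbind(
  CART         = postResample(rpartPred, y_test_chem),
  Bagged       = postResample(baggPred,  y_test_chem),
  RandomForest = postResample(rfPred,    y_test_chem),
  Boosted      = postResample(boostPred, y_test_chem),
  Cubist       = postResample(cubPred,   y_test_chem)
)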
cubistImp <- varImp(cubist_model)
print(cubistImp)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 78.125
## BiologicalMaterial03 55.208
## ManufacturingProcess13 54.167
## BiologicalMaterial02 44.792
## ManufacturingProcess33 37.500
## ManufacturingProcess04 31.250
## ManufacturingProcess09 31.250
## ManufacturingProcess19 27.083
## ManufacturingProcess11 26.042
## ManufacturingProcess29 23.958
## ManufacturingProcess31 22.917
## ManufacturingProcess27 16.667
## ManufacturingProcess39 15.625
## ManufacturingProcess24 15.625
## ManufacturingProcess25 14.583
## BiologicalMaterial12 12.500
## BiologicalMaterial01 10.417
## BiologicalMaterial04 10.417
## ManufacturingProcess22 9.375
This list is dominated by manufacturing processes, which make up 8 of the top 10 predictors. That is consistent with what I saw in the optimal linear model and the optimal non-linear model. The exact order and weights of the processes vary across models, but ManufacturingProcess32 and ManufacturingProcess17 are consistently the most important.
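A plot of the importance object makes the dominance of the manufacturing processes easy to see (sketch):

# Sketch: visualize the top ten predictors from the Cubist model
plot(cubistImp, top = 10)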
library(rpart)
cartTune$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 138.983800 -0.02054704
## 2) ManufacturingProcess32< 0.191596 82 39.340670 -0.56378470
## 4) BiologicalMaterial12< -0.6009825 33 12.970970 -0.95039380
## 8) BiologicalMaterial05>=0.4725039 7 1.926762 -1.65683400 *
## 9) BiologicalMaterial05< 0.4725039 26 6.610272 -0.76019830 *
## 5) BiologicalMaterial12>=-0.6009825 49 18.115480 -0.30341530
## 10) ManufacturingProcess25>=-0.0004697389 37 7.425615 -0.51635580 *
## 11) ManufacturingProcess25< -0.0004697389 12 3.839197 0.35315120 *
## 3) ManufacturingProcess32>=0.191596 62 43.439550 0.69792860
## 6) ManufacturingProcess17>=-1.31689 53 25.512250 0.50064950
## 12) ManufacturingProcess09< -0.4721252 12 5.729688 -0.07171796 *
## 13) ManufacturingProcess09>=-0.4721252 41 14.700700 0.66817170 *
## 7) ManufacturingProcess17< -1.31689 9 3.717524 1.85968300 *
rpart.plot::rpart.plot(cartTune$finalModel)
The tree view of the data is helpful in visualizing the relationship between the variables mentioned and their effect on the overall yield. We can see more clearly where the biological and manufacturing elements interact to affect the final yield. For instance, BiologicalMaterial12 was not within the top 10 important variables in the Cubist model, but here it is a major linchpin in the tree. The split on that variable does little on its own; its value comes from the combination with the subsequent splits, which bring in additional variables.
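As a cross-check (sketch), the CART model's own importance ranking shows whether BiologicalMaterial12's prominence in the tree is also reflected in an importance score:

# Sketch: importance ranking from the single-tree model, for comparison
plot(varImp(cartTune), top = 10)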