library(mlbench)
library(randomForest)
library(caret)
library(tidyverse)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(rJava)
library(RWeka)
library(rpart.plot)
library(doParallel)
cores <- detectCores()
cl <- makeCluster(2)  # two workers; cores - 1 would use most of the machine
registerDoParallel(cl)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
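For reference, mlbench.friedman1 generates the response as \(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon\), so only \(V1\)-\(V5\) are informative and \(V6\)-\(V10\) are pure noise.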
model1 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000,
mtry = 3
)
print(model1)
##
## Call:
## randomForest(formula = y ~ ., data = simulated, importance = TRUE, ntree = 1000, mtry = 3)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 6.754258
## % Var explained: 72.3
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |>
arrange(desc(Overall))
As we can see, the random forest model did not make meaningful use of the noise predictors V6-V10: their importance scores are very close to zero. Since the noise variables have no real relationship with the outcome, shuffling them during the permutation step has only a tiny effect on the model's performance. But because a random forest builds many deep trees on bootstrap samples with random subsets of predictors, a noise variable occasionally ends up being used in a split by chance. When permutation importance is then calculated, these small random effects produce small fluctuations in the error, so the noise variables end up with small non-zero importance values.
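To make the mechanism concrete, here is a minimal in-sample sketch of the permutation idea (not the out-of-bag version randomForest computes internally): shuffle one predictor and measure how much the error grows.
# in-sample illustration only; randomForest's own measure uses OOB samples
perm_effect <- function(rf, data, var) {
  base_mse <- mean((predict(rf, data) - data$y)^2)
  shuffled <- data
  shuffled[[var]] <- sample(shuffled[[var]])  # break the link between var and y
  mean((predict(rf, shuffled) - shuffled$y)^2) - base_mse
}
sapply(c("V1", "V6"), function(v) perm_effect(model1, simulated, v))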
Also, even though V3 is an informative variable in the generating formula, the random forest struggles to identify its symmetric quadratic term \(20(x_3 - 0.5)^2\) with just 200 observations. The model also spreads V3 across many different small splits in different trees, none of which is individually strong. As a result, the permutation step does not detect a large drop in accuracy when V3 is shuffled, so its importance appears low in this model.
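The comparison below references model3 and rfImp3, which come from an earlier step. Assuming the standard setup for this exercise (two near-copies of \(V1\) added before refitting), that step would look roughly like this:
# assumed reconstruction of the elided step: add two predictors highly
# correlated with V1 and refit the random forest
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000, mtry = 3)
rfImp3 <- varImp(model3, scale = FALSE)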
set.seed(200)
model_cforest <- cforest(y ~ ., data = simulated,
control = cforest_unbiased(ntree = 1000, mtry = 3))
imp_trad <- varimp(model_cforest, conditional = FALSE)
imp_cond <- varimp(model_cforest, conditional = TRUE)
comparisons <- data.frame(
Standard_RF = rfImp3$Overall,
#CF_Unconditional = imp_trad,
CF_Conditional = imp_cond
)
rownames(comparisons) <- rownames(rfImp3)
comparisons[order(-comparisons$CF_Conditional), ]
The Conditional Inference Forest (cforest) does not
follow the same pattern as the traditional Random Forest model when
dealing with highly correlated predictors.
In the standard random forest model, we saw the importance of the \(V1\) signal being split across its duplicates, allowing \(V4\) and \(V2\) to falsely appear more dominant. The conditional importance measure, however, partials out these correlations and strictly evaluates the unique contribution of each variable.
As a result, in the CF_Conditional column the importance scores for \(V1\) and its duplicates all dropped significantly. The algorithm recognizes that \(V1\), duplicate1, and duplicate2 carry very similar information, so once the model has one of them, the others provide very little new information. While this leads to lower absolute importance values for the \(V1\) group, it gives a much better assessment of each variable's unique predictive power than the traditional method.
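A quick check of the redundancy the conditional measure accounts for:
# near-1 correlations are exactly why the conditional importances drop
cor(simulated[, c("V1", "duplicate1", "duplicate2")])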
ctrl <- trainControl(method = "cv", number = 10)
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 500, by = 25),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = 10)
set.seed(200)
gbmTune <- train(y ~ ., data = simulated,
method = "gbm",
tuneGrid = gbmGrid,
trControl = ctrl,
verbose = FALSE)
gbm_varImp <- varImp(gbmTune, scale = TRUE)
gbm_varImp
## gbm variable importance
##
## Overall
## V4 100.000
## V2 75.082
## V1 68.799
## V5 41.947
## V3 32.800
## duplicate1 23.194
## V6 5.230
## V10 3.619
## V8 3.467
## V7 2.872
## V9 2.794
## duplicate2 0.000
plot(gbm_varImp)
cubistGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
neighbors = c(0, 1, 3, 6, 9))
set.seed(200)
cubistTune <- train(y ~ .,
data = simulated,
method = "cubist",
tuneGrid = cubistGrid,
trControl = ctrl)
cubistTune$bestTune
feature_names <- colnames(simulated)[-ncol(simulated)]
final_comparison <- data.frame(Variable = feature_names, stringsAsFactors = FALSE)
get_imp <- function(mod_imp) {
  # varImp() on a caret train object returns a list with $importance;
  # varImp() on a plain randomForest object returns the data frame directly
  imp_df <- if (!is.null(mod_imp$importance)) as.data.frame(mod_imp$importance) else as.data.frame(mod_imp)
  if ("Overall" %in% names(imp_df)) return(imp_df[feature_names, "Overall"])
  imp_df[feature_names, 1]
}
imp_rf_scaled <- varImp(model3, scale = TRUE)
imp_gbm_scaled <- varImp(gbmTune, scale = TRUE)
imp_cubist_scaled <- varImp(cubistTune, scale = TRUE)
final_comparison$RF <- get_imp(imp_rf_scaled)
final_comparison$GBM <- get_imp(imp_gbm_scaled)
final_comparison$Cubist <- get_imp(imp_cubist_scaled)
final_comparison[is.na(final_comparison)] <- 0
final_comparison %>% arrange(desc(Cubist))
The results confirm that while the random forest (model3) splits importance between correlated variables, GBM and Cubist are much better at isolating the actual signal.
Random Forest shares importance across \(V1\) and its clones. This makes \(V1\) look weaker than it really is (dropping below \(V4\) and \(V2\)), creating a misleading importance ranking.
GBM ranks \(V1\) significantly higher than duplicate1. Because it is a sequential learner, it grabs \(V1\)'s signal early, leaving very little residual error for the duplicate to explain.
The Cubist model shows the pattern closest to the true data-generating process: it gives \(V1\) the highest importance, and the duplicates receive none. Since it uses many linear models internally, it treats the redundancy as mathematical instability and simply drops the highly correlated clones.
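One way to verify this is Cubist's own usage table, which reports how often each variable appears in rule conditions and in the embedded linear models (the usage element of the fitted Cubist object):
# variables absent from both columns were effectively dropped by Cubist
cubistTune$finalModel$usage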
For this exercise I simulate a dataset using a latent signal z, then express it at seven different granularities:
x_cont : continuous (200+ unique values)
x_20 : 20‑level ordinal
x_10 : 10‑level ordinal
x_5 : 5‑level ordinal
x_3 : 3‑level ordinal
x_bin : binary
x_noise : pure noise (no relationship to z)
All except x_noise contain the same true information, just expressed at different granularities.
set.seed(200)
n <- 200
z <- rnorm(n)
x_cont <- z
x_20 <- cut(z, breaks = 20, labels = FALSE)
x_10 <- cut(z, breaks = 10, labels = FALSE)
x_5 <- cut(z, breaks = 5, labels = FALSE)
x_3 <- cut(z, breaks = 3, labels = FALSE)
x_bin <- ifelse(z > 0, 1, 0)
# tiny jitter keeps the binned versions numeric and breaks ties
# without changing the information they carry
x_20 <- x_20 + rnorm(n, 0, 0.001)
x_10 <- x_10 + rnorm(n, 0, 0.001)
x_5 <- x_5 + rnorm(n, 0, 0.001)
x_3 <- x_3 + rnorm(n, 0, 0.001)
x_bin <- x_bin + rnorm(n, 0, 0.001)
x_noise <- rnorm(n)
y <- 4*z + rnorm(n, sd = 4)
sim_data <- data.frame(
y,
x_cont,
x_20,
x_10,
x_5,
x_3,
x_bin,
x_noise
)
head(sim_data, 15)
# str(sim_data)
sum(is.na(sim_data$y))
## [1] 0
nearZeroVar(sim_data, saveMetrics = TRUE)
I will use CART (rpart2), CTree (ctree), and Random Forest (rf) models to compare how their variable-importance biases show up on this data.
set.seed(200)
fit_rpart2 <- train(
y ~ .,
data = sim_data,
method = "rpart2",
tuneLength = 10
)
set.seed(200)
fit_ctree <- train(
y ~ .,
data = sim_data,
method = "ctree"
)
set.seed(200)
fit_rf <- train(
y ~ .,
data = sim_data,
method = "rf",
importance = TRUE
)
imp_rpart2 <- varImp(fit_rpart2)$importance
imp_ctree <- varImp(fit_ctree)$importance
imp_rf <- varImp(fit_rf)$importance
comparison <- data.frame(
Variable = rownames(imp_rpart2),
CART_rpart2 = imp_rpart2$Overall,
CTree = imp_ctree$Overall,
RandomForest = imp_rf$Overall
)
comparison_long <- comparison |>
pivot_longer(
cols = -Variable,
names_to = "Model",
values_to = "Importance"
) |>
group_by(Variable)|>
mutate(mean_imp = mean(Importance)) |>
ungroup() |>
arrange(desc(Variable)) |>
mutate(Variable = factor(Variable, levels = unique(Variable)))
ggplot(comparison_long, aes(x = Variable, y = Importance, fill = Model)) +
geom_col(position = position_dodge(width = 0.8), width = 0.75) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Variable Importance Comparison Across Tree Models",
y = "Importance",
x = "Predictor"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
legend.position = "right",
plot.title = element_text(face = "bold", size = 18),
panel.grid.minor = element_line()
)
comparison |>
arrange(desc(CART_rpart2))
Across the three tree models, the variable‑importance patterns show how differently each model reacts to predictors that all carry the same underlying signal.
The CART model (rpart2) leans heavily toward variables with more split points, so x_cont, x_20, x_10, and x_5 end up at the top even though they don't actually contain more information than the other variables. As expected, x_bin got the lowest importance since it offers only a single split point. This is the usual granularity bias: CART naturally prefers predictors with more possible cut points, which is why the continuous variable and the finer-grained bins get the highest importance, and we can see that quite clearly here.
On the other hand, the CTree (ctree) model spreads importance much more evenly across all the signal variables, because its permutation-test-based splitting is not biased toward predictors with more possible cut points. So even x_bin and x_3 get high importance, which makes sense since they reflect the same informative structure.
The Random Forest (rf) model also favors the mid-granularity predictors like x_10, x_bin, and x_3, but the bias is not as extreme as in CART, and importance is more evenly distributed because of the averaging across trees.
In all models the noise variable's importance is essentially zero, which confirms the simulation is behaving correctly. So even though all these predictors carry the same informative signal, the models weight them differently because of how their splitting methods work.
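Since the granularity bias tracks the number of candidate split points, a rough sanity check is to count the distinct values per predictor (rounding away the tiny jitter added above):
# effective number of levels per predictor; x_cont and x_noise have many,
# x_bin only two
sapply(sim_data[, -1], function(x) length(unique(round(x, 1))))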
library(AppliedPredictiveModeling)
data(solubility)
ls(pattern = '^solT')
## [1] "solTestX" "solTestXtrans" "solTestY" "solTrainX"
## [5] "solTrainXtrans" "solTrainY"
trainingData <- solTrainXtrans
trainingData$Solubility <- solTrainY
dim(trainingData)
## [1] 951 229
ctrl <- trainControl(method = "repeatedcv",
repeats = 5)
set.seed(200)
gbm_0101 <- train(
Solubility~.,
data = trainingData,
method = "gbm",
trControl = ctrl,
verbose = FALSE,
bag.fraction = 0.1,
tuneGrid = data.frame(
n.trees = 1000,
interaction.depth = 2,
shrinkage = 0.1,
n.minobsinnode = 10
)
)
gbm_0909 <- train(
Solubility~.,
data = trainingData,
method = "gbm",
trControl = ctrl,
verbose = FALSE,
bag.fraction = 0.9,
tuneGrid = data.frame(
n.trees = 1000,
interaction.depth = 2,
shrinkage = 0.9,
n.minobsinnode = 10
)
)
imp_0101 <- varImp(gbm_0101)$importance |>
rownames_to_column("Variable") |>
rename(Importance = Overall) |>
arrange(desc(Importance)) |>
slice(1:20) |>
mutate(Model = "shrinkage = 0.1, bag = 0.1")
imp_0909 <- varImp(gbm_0909)$importance |>
rownames_to_column("Variable") |>
rename(Importance = Overall) |>
arrange(desc(Importance)) |>
slice(1:20) |>
mutate(Model = "shrinkage = 0.9, bag = 0.9")
imp_both <- bind_rows(imp_0101, imp_0909)
library(tidytext)
ggplot(imp_both,
aes(x = reorder_within(Variable, Importance, Model),
y = Importance)) +
geom_col() +
coord_flip() +
facet_wrap(~ Model, scales = "free_y") +
scale_x_reordered() +
labs(
title = "Top 20 Important Variables for Each GBM Model",
x = "Variable",
y = "Importance"
) +
theme_minimal()
GBM (shrinkage = 0.9, bag = 0.9)
In the model on the right, the bagging fraction is 0.9, so each tree sees 90% of the data. Because of that, the strongest variable (NumCarbon) shows a strong signal in almost every tree, and the model keeps picking it again and again.
At the same time, the high learning rate (0.9) makes this model take very aggressive steps toward correcting the residuals. As a result, most of the error is explained in the first few trees, leaving little signal for later trees. This causes the model to rely heavily on the variables used early, concentrating variable importance.
GBM (shrinkage = 0.1, bag = 0.1)
With the low learning rate (0.1), the model on the left makes very small updates to the predicted values at each iteration. It only partially corrects the residuals, so a lot of error is still left to be explained after each tree.
At the same time, the bagging fraction is 0.1, so each tree sees only a small random part (10%) of the data. Because of that, the strongest variable has a much smaller chance of looking dominant in every tree, and different variables can appear important depending on the sample.
Because learning is slow and each tree's sample is noisier than in the right-hand model, the left model keeps using different variables across many trees. As a result, residuals are reduced more gradually, and importance is spread across multiple predictors.
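One rough way to quantify this spread (an ad-hoc metric, not from the text) is the share of total importance held by the top few predictors in each model:
# share of total (scaled) importance captured by the top k predictors
top_share <- function(fit, k = 3) {
  imp <- varImp(fit)$importance
  sum(sort(imp$Overall, decreasing = TRUE)[1:k]) / sum(imp$Overall)
}
c(bag_0.1 = top_share(gbm_0101), bag_0.9 = top_share(gbm_0909))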
The left model (shrinkage = 0.1, bag = 0.1) has the better chance of producing accurate predictions on new data.
The low learning rate forces the model to learn slowly and not overfit too fast. It makes smaller updates, so the model builds up its prediction gradually and captures more general patterns in the data.
Also, the low bagging fraction introduces randomness, since each tree only sees a small subset of the data. This helps reduce variance and makes the model more robust.
On the other hand, the right model (shrinkage = 0.9, bag = 0.9) learns very aggressively and uses almost all of the data at each step. It quickly fits the strongest predictors and reduces error fast, but this usually leads to overfitting the early splits and relying too much on a few variables.
Increasing interaction depth makes trees more complex and allows them to capture interactions between variables.
Since the GBM (shrinkage = 0.9, bag = 0.9) model already relies heavily on a few strong predictors early, increasing depth will make those variables even more dominant. The model will create more complex splits and combinations, so importance becomes even more concentrated and the importance slope gets steeper.
For GBM (shrinkage = 0.1, bag = 0.1), increasing depth will cause more variables to appear in each tree and contribute to predictions. Since the model already rotates through different predictors because of the small learning rate and low bagging fraction, this will spread importance even more and make the slope flatter.
Increasing interaction depth can also help the model fit the training data better, because it can capture more complex relationships. But at the same time it can start chasing noise rather than real patterns, overfit, and perform worse on unseen data. So it really depends on the data: if there are real interactions, increasing depth is useful; if not, it can push the model toward overfitting on future predictions.
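A quick, hypothetical way to test this would be to refit one of the models with a larger interaction depth and see whether its importance profile concentrates or flattens; for example:
# assumed check: same aggressive settings as gbm_0909 but depth 10
gbm_0909_deep <- train(
  Solubility ~ ., data = trainingData,
  method = "gbm", trControl = ctrl, verbose = FALSE,
  bag.fraction = 0.9,
  tuneGrid = data.frame(n.trees = 1000, interaction.depth = 10,
                        shrinkage = 0.9, n.minobsinnode = 10)
)
imp_deep <- varImp(gbm_0909_deep)$importance
head(imp_deep[order(-imp_deep$Overall), , drop = FALSE], 5)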
library(AppliedPredictiveModeling)
library(caret)
library(glmnet)
library(pls)
library(dplyr)
library(ggplot2)
library(lattice)
library(RANN)
library(purrr)
library(e1071)
library(patchwork)
library(corrplot)
library(writexl)
options(scipen = 999)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5,
allowParallel = TRUE
)
# draw predicted-vs-observed and residual plots side by side (patchwork)
plot_model_diagnostics <- function(obs, pred, model_name = "Model") {
diag_df <- data.frame(
observed = as.numeric(obs),
predicted = as.numeric(pred),
residuals = as.numeric(obs) - as.numeric(pred)
)
p1 <- ggplot(data = diag_df, aes(x = predicted, y = observed)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, linetype = 2) +
labs(
title = paste("Predicted vs Observed:", model_name),
x = "Predicted",
y = "Observed"
)
p2 <- ggplot(data = diag_df, aes(x = predicted, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
labs(
title = paste("Residuals vs Predicted:", model_name),
x = "Predicted",
y = "Residuals"
)
p1 + p2
}
# pull the CV metrics for the row matching the best tuning parameters
get_cv_results <- function(model, model_name) {
best <- model$bestTune
res <- model$results
for (param in names(best)) {
res <- res[res[[param]] == best[[param]], ]
}
data.frame(
Model = model_name,
CV_RMSE = res$RMSE,
CV_R2 = res$Rsquared,
CV_MAE = res$MAE
)
}
# compute test-set metrics with caret::postResample
get_test_results <- function(obs, pred, model_name) {
res <- postResample(pred = pred, obs = obs)
data.frame(
Model = model_name,
Test_RMSE = res["RMSE"],
Test_R2 = res["Rsquared"],
Test_MAE = res["MAE"]
)
}
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176 58
yield <- ChemicalManufacturingProcess[, 1]
processPredictors <- ChemicalManufacturingProcess[, -1]
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8)
sum(is.na(processPredictors))
## [1] 106
The first step was checking the predictor matrix for missing values. There were 106 missing cells, which is a relatively small amount for a dataset with 57 predictors and 176 observations. To impute the NAs I chose bagged-tree imputation (bagImpute) because it better preserves nonlinear relationships between variables.
set.seed(200)
imputation <- preProcess(
processPredictors,
method = 'bagImpute'
)
processPredictors_imputed <- predict(imputation, processPredictors)
sum(is.na(processPredictors_imputed))
## [1] 0
nzv_vars <- nearZeroVar(processPredictors_imputed)
nzv_vars
## [1] 7
processPredictors_imputed <- processPredictors_imputed[, -nzv_vars]
Near-zero-variance filtering removed one predictor (column 7) that contained almost no useful variation.
I split data into training and test sets using a 75/25 split.
set.seed(200)
train_index <- createDataPartition(yield, p= 0.75, list = FALSE)
train_procPred <- processPredictors_imputed[train_index,]
test_procPred <- processPredictors_imputed[-train_index,]
train_yield <- yield[train_index]
test_yield <- yield[-train_index]
skewness(train_yield)
## [1] 0.2865166
hist(train_yield, breaks = 20)
Before fitting the model, I checked the distribution of the response variable. The yield variable only showed mild right skew, with skewness equal to 0.287, and the histogram looked close to symmetric. Because of this, I kept the response on its original scale to make the final predictions easier to interpret.
cor_matrix <- cor(train_procPred, use = "pairwise.complete.obs")
corrplot(
cor_matrix,
order = 'hclust',
tl.cex = 0.2
)
The correlation matrix shows several strong clusters among variables. Since elastic net is specifically designed to handle correlated predictors through shrinkage, this supports using it as the main modeling approach.
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9)
train_procPred_filtered <- train_procPred[, -high_corr]
test_procPred_filtered <- test_procPred[, -high_corr]
Using a 0.9 cutoff, 10 highly correlated predictors were flagged as redundant and removed from both the training and test sets.
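To see exactly which predictors were dropped:
colnames(train_procPred)[high_corr]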
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5,
allowParallel = TRUE
)
Single regression tree: rpart
rpart_grid <- data.frame(
cp = seq(0.001, 0.1, length = 30)
)
start_time <- Sys.time()
set.seed(200)
tree_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = "rpart",
trControl = ctrl,
tuneGrid = rpart_grid
)
end_time <- Sys.time()
rpart_train_time <- end_time - start_time
rpart_train_time
## Time difference of 3.791131 secs
Random forest: rf
For this model I tune mtry, which controls how many randomly selected predictors the model considers at each split in a tree.
rf_grid <- data.frame(
mtry = c(2, 5, 10, 15, 20, 30, 40)
)
start_time <- Sys.time()
set.seed(200)
rf_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = "rf",
trControl = ctrl,
tuneGrid = rf_grid,
ntree = 500,
importance = TRUE
)
end_time <- Sys.time()
rf_train_time <- end_time - start_time
rf_train_time
## Time difference of 38.81446 secs
Gradient boosting: gbm
Here the model tunes the number of trees, tree depth (interaction depth), learning rate, and minimum node size.
gbm_grid <- expand.grid(
n.trees = c(100, 500, 1000),
interaction.depth = c(1, 3, 5),
shrinkage = c(0.01, 0.05, 0.1),
n.minobsinnode = c(5, 10, 25)
)
start_time <- Sys.time()
set.seed(200)
gbm_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = "gbm",
trControl = ctrl,
tuneGrid = gbm_grid,
verbose = FALSE
)
end_time <- Sys.time()
gbm_train_time <- end_time - start_time
gbm_train_time
## Time difference of 1.567774 mins
Cubist: cubist
Tunes committees and neighbors.
committees controls how many models Cubist builds one after another. More committees means the model keeps correcting previous mistakes, so it becomes more complex and fits the data better, but it can also overfit.
neighbors controls how many nearby training points the model uses to adjust its predictions. If neighbors is 0, the prediction comes only from the rule-based model; if it is larger, the model adjusts the prediction toward the average of similar observations.
cubist_grid <- expand.grid(
committees = c(1, 5, 10, 25, 50),
neighbors = c(0, 1, 3, 5, 9)
)
start_time <- Sys.time()
set.seed(200)
cubist_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = "cubist",
trControl = ctrl,
tuneGrid = cubist_grid
)
end_time <- Sys.time()
cubist_train_time <- end_time - start_time
cubist_train_time
## Time difference of 16.20378 secs
Model trees: M5
Caret's M5 model has limited tuning through train(), so this one is usually kept simple.
set.seed(200)
start_time <- Sys.time()
m5_model <- train(
x = train_procPred_filtered,
y = train_yield,
method = "M5",
trControl = ctrl
)
end_time <- Sys.time()
M5_train_time <- end_time - start_time
M5_train_time
## Time difference of 11.79469 secs
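To see which M5 variant caret selected (pruning, smoothing, rules vs. trees):
m5_model$bestTune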
set.seed(200)
cv_results <- bind_rows(
get_cv_results(tree_model, "Single Tree"),
get_cv_results(rf_model, "Random Forest"),
get_cv_results(gbm_model, "GBM"),
get_cv_results(cubist_model, "Cubist"),
get_cv_results(m5_model, "M5")
)
#cv_results |> arrange(CV_RMSE)
set.seed(200)
pred_tree <- predict(tree_model, test_procPred_filtered)
pred_rf <- predict(rf_model, test_procPred_filtered)
pred_gbm <- predict(gbm_model, test_procPred_filtered)
pred_cubist <- predict(cubist_model, test_procPred_filtered)
pred_m5 <- predict(m5_model, test_procPred_filtered)
test_results <- bind_rows(
get_test_results(test_yield, pred_tree, "Single Tree"),
get_test_results(test_yield, pred_rf, "Random Forest"),
get_test_results(test_yield, pred_gbm, "GBM"),
get_test_results(test_yield, pred_cubist, "Cubist"),
get_test_results(test_yield, pred_m5, "M5")
)
#test_results |> arrange(Test_RMSE)
model_train_times <- data.frame(
Model = c("Cubist", "Random Forest", "GBM", "Single Tree", "M5"),
Train_Time = c(cubist_train_time,
rf_train_time,
gbm_train_time,
rpart_train_time,
M5_train_time)
)
final_results <- test_results |>
left_join(model_train_times, by = "Model")
#final_results
cv_results |> arrange(CV_RMSE)
final_results |> arrange(Test_RMSE)
Based on cross-validation, GBM and Cubist have almost identical CV RMSE (about 1.07), while Cubist has a slightly better \(R^2\).
The test-set results show that Cubist clearly outperformed the other models, with the lowest RMSE and highest \(R^2\).
So the optimal tree-based model for this data is Cubist, not GBM.
Cubist model test results:
RMSE = 0.8708
\(R^2\) = 0.79
Training time was about 16 seconds, much faster than Random Forest (about 39 seconds) and GBM (about 1.6 minutes).
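The diagnostic helper defined earlier gives a quick visual check of the winning model on the test set:
plot_model_diagnostics(test_yield, pred_cubist, "Cubist")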
cubist_imp_vars <- varImp(cubist_model)
cubist_imp_vars
## cubist variable importance
##
## only 20 most important variables shown (out of 46)
##
## Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32 81.720
## ManufacturingProcess09 40.860
## ManufacturingProcess39 35.484
## ManufacturingProcess33 31.183
## BiologicalMaterial11 30.108
## ManufacturingProcess13 30.108
## ManufacturingProcess28 23.656
## ManufacturingProcess04 23.656
## ManufacturingProcess06 22.581
## ManufacturingProcess19 20.430
## BiologicalMaterial06 18.280
## ManufacturingProcess10 17.204
## ManufacturingProcess14 17.204
## ManufacturingProcess15 16.129
## ManufacturingProcess37 12.903
## BiologicalMaterial05 11.828
## BiologicalMaterial03 10.753
## ManufacturingProcess11 8.602
## ManufacturingProcess03 7.527
cubist_top20 <- cubist_imp_vars$importance |>
tibble::rownames_to_column("Predictor") |>
arrange(desc(Overall)) |>
slice(1:20)
ggplot(cubist_top20, aes(x = reorder(Predictor, Overall), y = Overall)) +
geom_col() +
coord_flip() +
labs(title = "Top 20 Important Predictors (Cubist)") +
theme_minimal()
MARS model's important predictors:
Predictor               Overall
ManufacturingProcess32  100.00000
ManufacturingProcess09   46.53772
ManufacturingProcess13    0.00000
ENET model's important predictors:
Predictor               Overall
ManufacturingProcess32  100.00000
ManufacturingProcess17   80.31063
BiologicalMaterial06     75.08969
ManufacturingProcess13   74.46564
ManufacturingProcess36   68.75818
BiologicalMaterial03     67.10754
ManufacturingProcess06   66.37746
ManufacturingProcess09   65.28374
ManufacturingProcess33   46.90589
BiologicalMaterial08     44.97856
The most important predictors in the Cubist tree-based model are mostly manufacturing process variables. The top predictors include ManufacturingProcess17, ManufacturingProcess32, ManufacturingProcess09, and several others. Only a few biological variables appear in the top list, such as BiologicalMaterial11 and BiologicalMaterial06.
As we can see, manufacturing process variables clearly dominate the Cubist importance list. Most of the top predictors come from manufacturing process measurements, which suggests that yield is affected more strongly by the manufacturing process than by the biological materials.
Compared to the models from the previous exercise, there are some similarities. For example, ManufacturingProcess32 appears as the most important predictor in the Cubist, MARS, and ENET models alike. ManufacturingProcess09 and ManufacturingProcess13 also appear across multiple models, showing they are consistently important variables.
But there are also some differences. The Cubist model uses a larger range of predictors and spreads importance across many variables, while MARS focuses on only a couple of key predictors. ENET includes a balanced mix of process and biological variables, showing that linear models can pick up different relationships than tree-based models.
Cubist captures nonlinear relationships and interactions in this data, so it uses more predictors than MARS or ENET. This is why its importance is more spread out and includes more process variables.
rpart.plot(
tree_model$finalModel,
type = 1,
extra = 101,
fallen.leaves = TRUE
)
This single tree shows how the outcome changes depending on different splits and gives a simple way to understand the data. The first split is on ManufacturingProcess32, which means this variable is the most important one. When ManufacturingProcess32 is higher (>= 160), the predicted yield values are >= 40.
Just like in the Cubist model, the single tree's splits are based on manufacturing process variables, such as ManufacturingProcess17, ManufacturingProcess06, and others. This shows that manufacturing variables have the stronger effect on yield. Biological variables appear deeper in the tree, so their effect is more conditional and depends on the process conditions.
This shows that yield depends on a combination of variables, not just one, and that the variables interact with each other, since splits happen one after another.
Overall, the tree gives a good explanation and shows the dominance of manufacturing variables, but it is simple and cannot capture all the complex relationships, which is why Cubist performed better.