Exercises 8.1, 8.2, 8.3, and 8.7 from Kuhn and Johnson, Applied Predictive Modeling.

Load libraries
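The packages used throughout these exercises can be loaded up front. This list is inferred from the functions called below, not from an echoed chunk:

# Packages assumed from the function calls in the exercises below
library(mlbench)        # mlbench.friedman1 simulation
library(randomForest)   # randomForest()
library(caret)          # varImp(), preProcess(), createDataPartition()
library(party)          # cforest(), ctree()
library(gbm)            # boosted trees
library(Cubist)         # cubist()
library(rpart)          # CART
library(rpart.plot)     # rpart.plot()
library(ggplot2)        # plotting
library(dplyr)          # mutate(), across()
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data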

8.1.

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
# Convert simulated to a data frame
simulated <- as.data.frame(simulated)

# Rename the last column to "y"
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

# Fit random forest
set.seed(133)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
model1 
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.707207
##                     % Var explained: 72.49
# Variable importance
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.56091744
## V2   6.60331747
## V3   0.79661803
## V4   7.67788763
## V5   2.11045929
## V6   0.16503571
## V7   0.03628860
## V8  -0.17581757
## V9  -0.02016877
## V10 -0.09092179
# Visualization
 
# Convert to data frame for easier manipulation
importance_df <- as.data.frame(rfImp1)
# Plot
ggplot(importance_df, aes(x = rownames(importance_df), y = Overall)) +
  geom_bar(stat = "identity") +
  xlab("Predictors") +
  ylab("Importance") +
  ggtitle("Feature Importance")

The very low or negative importance values for V6 through V10 align with their expected role as uninformative predictors: the random forest did not make meaningful use of them, so they contributed little to the model’s predictive performance.

Fit another random forest model to these data. Did the importance score for V1 change?

# Fit another random forest model with the same data
set.seed(200)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

# Extract variable importance scores
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##        Overall
## V1   8.5406729
## V2   6.4511374
## V3   0.7470911
## V4   7.6128264
## V5   2.2684088
## V6   0.1190456
## V7   0.1021863
## V8  -0.1632359
## V9  -0.1296344
## V10 -0.0564990
# Compare importance of V1 between the two models
comparison <- data.frame(
  Predictor = rownames(rfImp1),
  Importance_Model1 = rfImp1$Overall,
  Importance_Model2 = rfImp2$Overall
)

comparison
##    Predictor Importance_Model1 Importance_Model2
## 1         V1        8.56091744         8.5406729
## 2         V2        6.60331747         6.4511374
## 3         V3        0.79661803         0.7470911
## 4         V4        7.67788763         7.6128264
## 5         V5        2.11045929         2.2684088
## 6         V6        0.16503571         0.1190456
## 7         V7        0.03628860         0.1021863
## 8         V8       -0.17581757        -0.1632359
## 9         V9       -0.02016877        -0.1296344
## 10       V10       -0.09092179        -0.0564990

With the model settings kept constant but a different random seed, the importance score for V1 changes only slightly (8.56 vs. 8.54), reflecting the stochastic nature of random forest modeling (bootstrap sampling and random feature selection at each split).

What happens when you add another predictor that is also highly correlated with V1?

# Create a new predictor highly correlated with V1
set.seed(13)
simulated$V1_new <- simulated$V1 + rnorm(nrow(simulated), mean = 0, sd = 0.1)
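# (Added check) strength of the induced correlation between V1 and V1_new;
# with sd = 0.1 noise on a Uniform(0, 1) predictor this is typically ~0.9-0.95
cor(simulated$V1, simulated$V1_new)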
# Fit a random forest model with the new predictor
set.seed(200)
model_with_correlated <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
# Extract variable importance
importance_with_correlated <- varImp(model_with_correlated, scale = FALSE)
importance_with_correlated
##             Overall
## V1      5.155026629
## V2      6.005110191
## V3      0.513446707
## V4      6.528453202
## V5      1.966877062
## V6      0.172200004
## V7     -0.002699586
## V8     -0.032173854
## V9     -0.061846074
## V10     0.010769920
## V1_new  4.599174995
# Extract variable importance for the original predictors only
original_predictors <- rownames(rfImp1) 
original_predictors
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"
new_importance <- importance_with_correlated[rownames(importance_with_correlated) %in% original_predictors, , drop = FALSE]
# Create a comparison table
comparison <- data.frame(
  Predictor = original_predictors,
  Original_Importance = rfImp1$Overall,
  With_Correlated_Importance = new_importance$Overall
)

comparison
##    Predictor Original_Importance With_Correlated_Importance
## 1         V1          8.56091744                5.155026629
## 2         V2          6.60331747                6.005110191
## 3         V3          0.79661803                0.513446707
## 4         V4          7.67788763                6.528453202
## 5         V5          2.11045929                1.966877062
## 6         V6          0.16503571                0.172200004
## 7         V7          0.03628860               -0.002699586
## 8         V8         -0.17581757               -0.032173854
## 9         V9         -0.02016877               -0.061846074
## 10       V10         -0.09092179                0.010769920

Adding a predictor that is highly correlated with V1 introduces redundancy into the predictor set. Because random forests use bagging and random feature sampling at each split, two highly correlated predictors compete to explain the same variance in the response. As a result, the importance score for V1 drops (from about 8.56 to 5.16) because the model splits the credit between V1 and the new predictor V1_new, which itself receives a substantial score (4.60).

Fit a conditional inference forest (cforest) model using the party package.

library(party)
# Fit a cforest model
set.seed(133)
cforest_model <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(ntree = 1000, mtry = 3))

Calculate variable importance using both traditional and conditional importance measures via varimp.

# Calculate traditional variable importance
traditional_importance <- varimp(cforest_model)
traditional_importance
##            V1            V2            V3            V4            V5 
##  4.5581513597  5.1755314675  0.0521720106  5.7608891485  1.5430552833 
##            V6            V7            V8            V9           V10 
##  0.0133078449  0.0004880347 -0.0607737834  0.0189750261 -0.0230174705 
##        V1_new 
##  4.8450967740
# Calculate conditional variable importance
conditional_importance <- varimp(cforest_model, conditional = TRUE)
conditional_importance 
##          V1          V2          V3          V4          V5          V6 
##  1.72071381  3.70456147  0.06432950  4.08955479  0.91003577  0.04116603 
##          V7          V8          V9         V10      V1_new 
##  0.04433420 -0.01252938  0.02354509 -0.02566334  1.65356169
# Combine results for comparison
importance_cforest <- data.frame(
  Predictor = names(traditional_importance),
  Traditional = traditional_importance,
  Conditional = conditional_importance
)


importance_cforest
##        Predictor   Traditional Conditional
## V1            V1  4.5581513597  1.72071381
## V2            V2  5.1755314675  3.70456147
## V3            V3  0.0521720106  0.06432950
## V4            V4  5.7608891485  4.08955479
## V5            V5  1.5430552833  0.91003577
## V6            V6  0.0133078449  0.04116603
## V7            V7  0.0004880347  0.04433420
## V8            V8 -0.0607737834 -0.01252938
## V9            V9  0.0189750261  0.02354509
## V10          V10 -0.0230174705 -0.02566334
## V1_new    V1_new  4.8450967740  1.65356169

Traditional variable importance scores indicate that V4 (5.76), V2 (5.18), V1_new (4.85), and V1 (4.56) are the most influential predictors, while V6 to V10 show negligible or negative contributions, highlighting their lack of relevance. Conditional importance, which adjusts for correlated predictors, reveals a redistribution: V2 and V4 maintain high scores, whereas V1 drops from 4.56 to 1.72 and V1_new from 4.85 to 1.65, reflecting the predictive information shared between the two correlated variables. The low-importance predictors (V6 to V10) remain consistently uninformative under both scoring methods.

The adjustment in conditional importance highlights its ability to refine the interpretation of predictor contributions. By correcting for correlated predictors, it emphasizes V2 and V4 as stronger independent contributors while reducing V1’s inflated traditional score. Both approaches consistently identify V6 to V10 as irrelevant. Overall, conditional importance provides a clearer, more accurate representation of the predictors’ true influence on the model.

Fit other tree-based models (boosted trees and Cubist)

Fit a boosted tree model

library(gbm)

# Fit a boosted tree model
set.seed(133)
boosted_model <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)

# Extract variable importance
boosted_importance <- summary(boosted_model, plotit = FALSE)


boosted_importance
##           var    rel.inf
## V4         V4 27.1810429
## V2         V2 22.0628756
## V1         V1 17.2005997
## V5         V5 11.5856445
## V1_new V1_new 10.8275865
## V3         V3  7.2263558
## V6         V6  1.0881299
## V7         V7  0.9598513
## V9         V9  0.7010687
## V8         V8  0.6612897
## V10       V10  0.5055554

Fit a Cubist Model

library(Cubist)

# Fit a Cubist model
# (Note: after adding V1_new, y is no longer the last column of `simulated`,
#  so dropping the last column removes V1_new and leaves y among the predictors;
#  see the corrected refit sketched after the output below.)
cubist_model <- cubist(x = simulated[, -ncol(simulated)], y = simulated$y, committees = 10)

# Print the Cubist model summary (rules and attribute usage)
cubist_importance <- summary(cubist_model)


cubist_importance
## 
## Call:
## cubist.default(x = simulated[, -ncol(simulated)], y = simulated$y, committees
##  = 10)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sun Nov 17 17:52:37 2024
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (12 attributes) from undefined.data
## 
## Model 1:
## 
##   Rule 1/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 2:
## 
##   Rule 2/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 3:
## 
##   Rule 3/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 4:
## 
##   Rule 4/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 5:
## 
##   Rule 5/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 6:
## 
##   Rule 6/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 7:
## 
##   Rule 7/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 8:
## 
##   Rule 8/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 9:
## 
##   Rule 9/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 10:
## 
##   Rule 10/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           0.000000
##     Relative |error|               0.00
##     Correlation coefficient        1.00
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    y
## 
## 
## Time: 0.0 secs
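As fit above, Cubist does not give a meaningful importance ranking: because V1_new was appended after the response, y is no longer the last column of simulated, so simulated[, -ncol(simulated)] drops V1_new and leaves y among the predictors, and the model trivially learns outcome = y. A minimal corrected refit, selecting predictors by name (assuming the Cubist and caret packages are loaded), would look like this:

# Refit Cubist with the response excluded by name, so the column-position
# leak described above cannot occur
predictors_only <- simulated[, setdiff(names(simulated), "y")]
cubist_model2 <- cubist(x = predictors_only, y = simulated$y, committees = 10)

# Usage-based variable importance via caret's varImp method for Cubist models
varImp(cubist_model2)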

Compare Patterns Across Models

Across the tree-based models, V4 and V2 emerge as dominant predictors, while the low-importance predictors (V6 to V10) remain largely irrelevant. The conditional random forest scores refine the interpretation of correlated predictors such as V1 and V1_new, and the boosted tree rankings broadly agree with the random forest results. These comparisons affirm the utility of adjusted importance measures in clarifying predictor roles while confirming the key predictors across models.

Random Forest Models (Traditional and Conditional Importance): The random forest models emphasize V4, V2, and V1 as the most influential predictors under both traditional and conditional importance measures. However, the conditional importance adjusts for multicollinearity, significantly reducing V1’s importance (from 4.56 to 1.72), while V2 and V4 maintain their dominance, albeit with slightly reduced scores. Predictors V6 to V10 consistently show minimal contributions, with some having negligible or negative importance values, regardless of the scoring method. This indicates that while traditional random forests may overemphasize correlated predictors, conditional importance provides a refined view of their independent contributions.

Boosted Trees and Cubist: The boosted tree model’s relative influence scores align with the random forest results, identifying V4 (27.2%), V2 (22.1%), and V1 (17.2%) as the top contributors, with the correlated copy V1_new (10.8%) absorbing part of V1’s influence. The remaining predictors (V6 to V10) contribute minimally, supporting the random forest findings. The Cubist summary shown above, by contrast, is not a meaningful importance ranking: as noted after its output, the response y was inadvertently left among the predictors, so the model trivially learned outcome = y with 100% attribute usage of y.

8.2.

Use a simulation to show tree bias with different granularities

Generate data and introduce granularity

# Generate Data
set.seed(13)
n <- 1000
x1 <- runif(n, 0, 10)   # Continuous predictor
x2 <- runif(n, 0, 10)   # Another continuous predictor
noise <- rnorm(n, 0, 0.5)  # Add some noise
y <- 2*x1 - 3*x2 + noise  # Target variable
# Introduce Granularity
granularities <- c(0, 1, 2, 3)  # Number of decimal places
datasets <- lapply(granularities, function(g) {
  data <- data.frame(
    x1 = round(x1, g),
    x2 = round(x2, g),
    y = y
  )
  data
})

Train models, evaluate bias and collect results

# Train Models 
results <- data.frame()
for (i in seq_along(granularities)) {
  data <- datasets[[i]]
  
  # Decision Tree
  tree_model <- rpart(y ~ ., data = data)
  tree_pred <- predict(tree_model, data)
  tree_rmse <- sqrt(mean((data$y - tree_pred)^2))
  
  # Random Forest
  rf_model <- randomForest(y ~ ., data = data, ntree = 100)
  rf_pred <- predict(rf_model, data)
  rf_rmse <- sqrt(mean((data$y - rf_pred)^2))
  
  # Record Results
  results <- rbind(results, data.frame(
    Granularity = paste0("Decimal Places: ", granularities[i]),
    Model = "Decision Tree",
    RMSE = tree_rmse
  ))
  results <- rbind(results, data.frame(
    Granularity = paste0("Decimal Places: ", granularities[i]),
    Model = "Random Forest",
    RMSE = rf_rmse
  ))
}
 results
##         Granularity         Model      RMSE
## 1 Decimal Places: 0 Decision Tree 3.3891654
## 2 Decimal Places: 0 Random Forest 1.4580070
## 3 Decimal Places: 1 Decision Tree 3.1792126
## 4 Decimal Places: 1 Random Forest 0.4885186
## 5 Decimal Places: 2 Decision Tree 3.1555931
## 6 Decimal Places: 2 Random Forest 0.4002423
## 7 Decimal Places: 3 Decision Tree 3.1540609
## 8 Decimal Places: 3 Random Forest 0.3999103

Visualize the results:

# Step 4: Visualize Results
ggplot(results, aes(x = Granularity, y = RMSE, color = Model, group = Model)) +
  geom_line() +
  geom_point() +
  labs(title = "Impact of Granularity on Model Bias",
       x = "Predictor Granularity",
       y = "Root Mean Squared Error (RMSE)") +
  theme_minimal()
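The RMSE table above mainly shows how coarse rounding degrades the fit, especially for the single tree. The bias referred to in the exercise is a selection bias: when candidate predictors are equally uninformative, a CART-style tree tends to split on the one with more distinct values simply because it offers more candidate split points. A minimal sketch of that effect (assumed setup: a pure-noise response, one fine-grained predictor and a coarsened copy); counting which variable wins the root split makes the bias explicit:

# Selection-frequency check for split bias: both predictors are unrelated to
# the response, but x_fine has many more distinct values than x_coarse
library(rpart)

set.seed(42)
first_split <- replicate(200, {
  x_fine   <- runif(100, 0, 10)
  x_coarse <- round(x_fine, 0)          # same variable, coarse granularity
  y_noise  <- rnorm(100)                # response unrelated to either predictor
  fit <- rpart(y_noise ~ x_fine + x_coarse,
               data = data.frame(x_fine, x_coarse, y_noise),
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])        # variable used at the root split
})

table(first_split)  # expect x_fine to be chosen far more often than x_coarse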

8.3. The problem:

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9: Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors? Which model do you think would be more predictive of other samples? How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

a.

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

Bagging Fraction (0.1 vs. 0.9):

The bagging fraction controls the proportion of data that is randomly selected in each iteration (or tree-building process). When the bagging fraction is set to 0.1 (left-hand plot), it means that only a small subset of the training data is used to build each tree. As a result, the model may need to explore more features to fit the data adequately and compensate for the limited data seen in each iteration. This can lead to a spread-out importance across more predictors, as the model explores different aspects of the data.

Conversely, with a bagging fraction of 0.9 (right-hand plot), each tree sees nearly the full training set, so successive trees look very similar and keep rediscovering the same strong predictors. With little sample-to-sample variation between trees, the importance becomes concentrated on the few predictors that capture the key relationships most directly.

Learning Rate (0.1 vs. 0.9):

The learning rate (or shrinkage) determines how much the model adjusts with each new tree. A lower learning rate (0.1) means the model learns more slowly, and it may require more trees to fit the data. This can cause the model to spread its importance over more predictors because each tree adds only a small contribution to the model’s overall prediction. In contrast, a higher learning rate (0.9) means the model makes larger updates with each tree, focusing on fewer predictors to make the necessary adjustments quickly. This can lead to a concentration of importance on just a few predictors that have the most influence on the outcome.
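To make this concrete, the two extreme settings from Fig. 8.24 can be refit on the solubility data and the concentration of relative influence compared directly. This is only a rough sketch, assuming the solubility objects (solTrainXtrans, solTrainY) from AppliedPredictiveModeling and the gbm package; the exact figures depend on the seed and the number of trees:

# Fit the two extreme bagging fraction / learning rate settings and compare
# how much of the total influence the top 10 predictors capture in each
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# Share of total relative influence held by the top 10 predictors in each fit
inf_low  <- summary(gbm_low,  plotit = FALSE)$rel.inf
inf_high <- summary(gbm_high, plotit = FALSE)$rel.inf
c(low = sum(inf_low[1:10]) / sum(inf_low),
  high = sum(inf_high[1:10]) / sum(inf_high))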

b.

Which model do you think would be more predictive of other samples?

The model on the left, with both the bagging fraction and the learning rate set to 0.1, is likely to be more robust and generalizable to other samples. Spreading importance across more predictors suggests the model is less prone to overfitting, since it incorporates more diversity into its learning process. By using smaller updates (due to the lower learning rate) and smaller data subsets (due to the lower bagging fraction), the model learns more subtle patterns and is less likely to fit the noise in the training data.

The model on the right (with the bagging fraction and learning rate both set to 0.9) may, on the other hand, overfit the data because it relies on large data subsets combined with large per-tree updates. This concentrates learning heavily on a few predictors while ignoring others, potentially failing to generalize well to new data.

c.

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

The interaction depth controls the depth of each tree and therefore the number of predictor interactions a tree can model; it governs how complex each individual tree can become. As a result:

  • For the model with a low bagging fraction and learning rate (left plot, 0.1/0.1): Increasing interaction depth would likely spread the predictor importance even more evenly across predictors. Deeper trees can model more complex relationships and interactions between features, allowing the model to explore non-linear relationships among multiple predictors and leading to more balanced importance across a wider set of predictors.

  • For the model with a high bagging fraction and learning rate (right plot, 0.9/0.9): Increasing interaction depth in this case would likely make the predictor importance more concentrated on the few most important predictors. Since the model is already learning faster (due to the high learning rate) and using larger subsets of data (due to the high bagging fraction), increasing interaction depth would enable it to capture even more complex interactions, but probably still only between a small number of influential predictors. This would likely steepen the slope of importance, focusing on the top few predictors even more.

In conclusion, the model on the right focuses on fewer predictors due to a higher bagging fraction and learning rate, leading to faster, more concentrated learning. The left model is more balanced and spreads importance across more predictors due to slower learning and smaller data subsets. The left model is likely to generalize better to other samples. Increasing interaction depth will result in more complex trees; for the left model, this will spread importance more evenly, while for the right model, it will make the importance more concentrated on the strongest predictors.

8.7.

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models: Which tree-based regression model gives the optimal resampling and test set performance?

Load and prepare the data set

# Load the data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
# ChemicalManufacturingProcess is a single data frame containing the response
# (Yield) plus the biological material and manufacturing process predictors,
# and it is used directly below

# Check for total missing values before imputation
total_missing_before <- sum(is.na(ChemicalManufacturingProcess))
total_missing_before
## [1] 106

Applying various imputation techniques

# Impute Missing Values using an imputation loop
for(i in seq_along(ChemicalManufacturingProcess)) {
  if (is.numeric(ChemicalManufacturingProcess[[i]])) {
    ChemicalManufacturingProcess[[i]][is.na(ChemicalManufacturingProcess[[i]])] <- mean(ChemicalManufacturingProcess[[i]], na.rm = TRUE)
  }
}

# Check for total missing values after imputation
total_missing_after <- sum(is.na(ChemicalManufacturingProcess))
total_missing_after
## [1] 0
# Imputation using knn (for comparison; the data frame was already mean-imputed in place above)
imputed.knn <- preProcess(ChemicalManufacturingProcess,
           method = "knnImpute",
           k = sqrt(nrow(ChemicalManufacturingProcess))
           )

imputed_data <- predict(imputed.knn, ChemicalManufacturingProcess)

# Check for total missing values after knn imputation
total_missing_after <- sum(is.na(imputed_data))
total_missing_after
## [1] 0
# Imputation using bagImpute
imputed <- predict(preProcess(ChemicalManufacturingProcess, method = 'bagImpute'), ChemicalManufacturingProcess)
# Check for total missing values after bagged-tree imputation
total_missing_after <- sum(is.na(imputed))
total_missing_after
## [1] 0
# Identify numeric columns
numeric_columns <- sapply(ChemicalManufacturingProcess, is.numeric)

# Apply Box-Cox transformation to each numeric column
transformed_data <- ChemicalManufacturingProcess  # Copy original data for transformation
for (col in names(ChemicalManufacturingProcess)[numeric_columns]) {
  # Apply Box-Cox transformation
  box_cox <- BoxCoxTrans(ChemicalManufacturingProcess[[col]])
  transformed_data[[col]] <- predict(box_cox, ChemicalManufacturingProcess[[col]])
}

# Remove near-zero variance predictors
near_zero <- nearZeroVar(transformed_data)
imputed.data <- transformed_data[, -near_zero]
set.seed(13)

# Apply Box-Cox transformation to numeric columns (dplyr equivalent of the loop above)
transformed_data <- ChemicalManufacturingProcess %>%
  dplyr::mutate(across(where(is.numeric), ~ predict(BoxCoxTrans(.), .)))

# Split the transformed data into training and test sets
trainIndex <- createDataPartition(transformed_data$Yield, p = 0.75, list = FALSE)
training_set <- transformed_data[trainIndex, ]
test_set <- transformed_data[-trainIndex, ]

Train Tree-Based Models:

CART (Classification and Regression Trees)

# Train a CART model
cart_model <- rpart(Yield ~ ., data = training_set)

# Evaluate the model
cart_predictions <- predict(cart_model, test_set)
cart_rmse <- sqrt(mean((cart_predictions - test_set$Yield)^2))
cart_rmse
## [1] 0.0001138218
# Get variable importance in CART model
importance_scores <- cart_model$variable.importance

# Convert the scores to a data frame and sort
importance_df <- data.frame(Variable = names(importance_scores), Importance = importance_scores)
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

# Select the top 15 predictors
top_15_importance <- importance_df[1:15, ]

# Plot the top 15 variable importance scores
ggplot(top_15_importance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 15 Predictors (CART Model)",
       x = "Predictor Variables",
       y = "Importance Score") +
  theme_minimal()

Conditional Inference Trees (party)

set.seed(13)
# Train a Conditional Inference Tree (CTree)
ctree_model <- ctree(Yield ~ ., data = training_set)

# Evaluate the model
ctree_predictions <- predict(ctree_model, test_set)
ctree_rmse <- sqrt(mean((ctree_predictions - test_set$Yield)^2))
ctree_rmse
## [1] 0.0001487846

Random Forest

set.seed(13)
rf_model <- randomForest::randomForest(Yield ~ ., data = training_set, importance = TRUE)

# Evaluate model performance
rf_predictions <- predict(rf_model, test_set)
rf_rmse <- sqrt(mean((rf_predictions - test_set$Yield)^2))
cat("Random Forest RMSE:", rf_rmse, "\n")
## Random Forest RMSE: 9.710031e-05

Which predictors are most important in the optimal tree-based regression model?

# Extract importance scores from the fitted forest and convert to a data frame
rf_importance <- randomForest::importance(rf_model)
rf_importance_df <- data.frame(Variable = rownames(rf_importance), 
                               Importance = rf_importance[, "IncNodePurity"])
rf_importance_df <- rf_importance_df[order(-rf_importance_df$Importance), ]

# Display top 10 important predictors
head(rf_importance_df, 10)
##                                      Variable   Importance
## ManufacturingProcess32 ManufacturingProcess32 7.563724e-07
## BiologicalMaterial12     BiologicalMaterial12 3.843806e-07
## BiologicalMaterial03     BiologicalMaterial03 2.965237e-07
## BiologicalMaterial06     BiologicalMaterial06 1.850535e-07
## ManufacturingProcess13 ManufacturingProcess13 1.836092e-07
## ManufacturingProcess09 ManufacturingProcess09 1.715959e-07
## ManufacturingProcess36 ManufacturingProcess36 1.081174e-07
## BiologicalMaterial08     BiologicalMaterial08 1.009380e-07
## BiologicalMaterial05     BiologicalMaterial05 9.944976e-08
## BiologicalMaterial11     BiologicalMaterial11 9.695051e-08

Both biological and process variables appear in the top 10 list from the optimal tree-based model (six biological materials and four manufacturing process variables), indicating a roughly balanced contribution to the prediction of yield, as the quick tally below confirms. This highlights the interplay between raw-material quality and manufacturing process settings in determining the outcome.
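As a quick check on that balance, the predictor types among the top 10 can be tallied directly from the table above (six biological, four process):

# Tally biological vs. manufacturing process predictors among the top 10
top10 <- head(rf_importance_df$Variable, 10)
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))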

# Visualization
ggplot(rf_importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  xlab("Variables") +
  ylab("Importance (IncNodePurity)") +
  ggtitle("Variable Importance from Random Forest Model") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

# Sort the data frame by importance and select top 15 predictors
top_15_rf_importance_df <- rf_importance_df[order(rf_importance_df$Importance, decreasing = TRUE), ][1:15, ]

# Visualization
ggplot(top_15_rf_importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  xlab("Variables") +
  ylab("Importance (IncNodePurity)") +
  ggtitle("Top 15 predictors from Random Forest Model") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

# For Random Forest model (rf_model)
importance_rf <- varImp(rf_model, scale = FALSE)
# Plot the variable importance
plot(importance_rf, main = "Random Forest Variable Importance")

Boosted Trees (gbm)

set.seed(13)
# Train a Boosted Trees model
gbm_model <- gbm(Yield ~ ., data = training_set, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01)
# Evaluate GBM model
gbm_predictions <- predict(gbm_model, test_set, n.trees = 100)
gbm_rmse <- sqrt(mean((gbm_predictions - test_set$Yield)^2))
cat("Boosted Trees RMSE:", gbm_rmse, "\n")
## Boosted Trees RMSE: 0.0001243436
# Build the top-15 importance data frame from the GBM relative influence
gbm_importance_df <- summary(gbm_model, plotit = FALSE)
top_15_gbm_importance_df <- data.frame(Variable = gbm_importance_df$var[1:15],
                                       Importance = gbm_importance_df$rel.inf[1:15])

# Visualization
ggplot(top_15_gbm_importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  xlab("Variables") +
  ylab("Importance (Relative Influence)") +
  ggtitle("Top 15 Predictors from Boosted Trees (GBM)") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

Comparing the trained models

# Create a performance table
performance_table <- data.frame(
  Model = c("CART", "Conditional Inference Tree", "Random Forest", "Boosted Trees"),
  RMSE = c(cart_rmse, ctree_rmse, rf_rmse, gbm_rmse)
)

performance_table
##                        Model         RMSE
## 1                       CART 1.138218e-04
## 2 Conditional Inference Tree 1.487846e-04
## 3              Random Forest 9.710031e-05
## 4              Boosted Trees 1.243436e-04

Lower RMSE values indicate better model performance, since RMSE measures the average error between predicted and actual values. On the test set, the random forest attains the lowest RMSE (9.71e-05), followed by CART (1.14e-04), boosted trees (1.24e-04), and the conditional inference tree (1.49e-04), so the random forest provides the most accurate predictions among the four trained models. (These RMSEs are on the Box-Cox-transformed scale of Yield, which is why they are orders of magnitude smaller than RMSEs reported on the original yield units.)
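The exercise also asks about resampling performance, which a single train/test split does not measure. Below is a sketch of how cross-validated RMSE could be obtained for the best model (random forest) via caret, assuming the caret package is loaded; the resampled numbers would come from actually running it:

# 10-fold cross-validated RMSE for a random forest on the training set
library(caret)
set.seed(13)
rf_cv <- train(Yield ~ ., data = training_set,
               method = "rf",
               trControl = trainControl(method = "cv", number = 10),
               tuneLength = 3)
rf_cv$results   # resampled RMSE, R-squared, and MAE for each mtry tried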

How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Ridge Model:

For ridge regression (using the glmnet::cv.glmnet fit from Exercise 6.3, referred to here as ridge_model), I extract the coefficients and rank them by absolute magnitude to identify the most important predictors:

# Extract the 10 largest coefficient magnitudes (note: this indexing drops the
# predictor names; see the named version sketched below)
ridge_importance <- coef(ridge_model, s = "lambda.min")[-1]  # Exclude intercept
top_ridge_predictors <- sort(abs(ridge_importance), decreasing = TRUE)[1:10]
top_ridge_predictors
##  [1] 3.002283e+04 1.686491e+04 7.008672e+01 5.886010e+01 5.563372e+01
##  [6] 1.672162e+00 9.048846e-01 6.092309e-01 1.564850e-01 2.080710e-02
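As written, the indexing above drops the coefficient names, so only the magnitudes are visible. A small sketch that keeps the names (assuming ridge_model is the cv.glmnet fit from Exercise 6.3):

# Rank ridge coefficients by absolute value while keeping predictor names
ridge_coefs  <- as.matrix(coef(ridge_model, s = "lambda.min"))[-1, , drop = FALSE]
ridge_ranked <- sort(abs(ridge_coefs[, 1]), decreasing = TRUE)
head(ridge_ranked, 10)   # named vector: predictor name -> |coefficient|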

Nonlinear Model (Boosted Trees - GBM):

For Boosted Trees (GBM), I will use the gbm package to extract feature importance:

# Fit a GBM model
library(gbm)
gbm_model <- gbm(Yield ~ ., data = training_set, distribution = "gaussian", 
                 n.trees = 100, interaction.depth = 3, shrinkage = 0.01)

# Extract importance
gbm_importance <- summary(gbm_model, cBars = 10)  # plots the top 10 bars and returns the full importance table


GBM variable importance table (for comparison with the tree-based and linear rankings):

# Extract variable importance from the GBM model
gbm_importance <- summary(gbm_model, plotit = FALSE)  # Use plotit = FALSE to suppress the plot and get the data
gbm_importance
##                                           var     rel.inf
## ManufacturingProcess32 ManufacturingProcess32 39.17576324
## BiologicalMaterial12     BiologicalMaterial12 10.01866982
## BiologicalMaterial03     BiologicalMaterial03  8.17323731
## ManufacturingProcess09 ManufacturingProcess09  7.29953163
## ManufacturingProcess31 ManufacturingProcess31  3.76118196
## ManufacturingProcess06 ManufacturingProcess06  3.14335540
## BiologicalMaterial11     BiologicalMaterial11  2.82898307
## ManufacturingProcess13 ManufacturingProcess13  2.75804250
## ManufacturingProcess17 ManufacturingProcess17  2.68802629
## BiologicalMaterial06     BiologicalMaterial06  2.37914167
## ManufacturingProcess43 ManufacturingProcess43  1.92790573
## ManufacturingProcess24 ManufacturingProcess24  1.46660218
## BiologicalMaterial05     BiologicalMaterial05  1.32247687
## ManufacturingProcess11 ManufacturingProcess11  1.12149385
## BiologicalMaterial08     BiologicalMaterial08  1.07428890
## ManufacturingProcess20 ManufacturingProcess20  1.03965519
## ManufacturingProcess18 ManufacturingProcess18  0.84057127
## ManufacturingProcess10 ManufacturingProcess10  0.82231522
## ManufacturingProcess39 ManufacturingProcess39  0.68663246
## ManufacturingProcess33 ManufacturingProcess33  0.66318193
## ManufacturingProcess05 ManufacturingProcess05  0.60258141
## BiologicalMaterial09     BiologicalMaterial09  0.55359914
## ManufacturingProcess01 ManufacturingProcess01  0.51070112
## ManufacturingProcess30 ManufacturingProcess30  0.49771899
## ManufacturingProcess04 ManufacturingProcess04  0.49172196
## ManufacturingProcess27 ManufacturingProcess27  0.43366287
## BiologicalMaterial04     BiologicalMaterial04  0.42937642
## BiologicalMaterial01     BiologicalMaterial01  0.40651053
## ManufacturingProcess37 ManufacturingProcess37  0.30556314
## ManufacturingProcess36 ManufacturingProcess36  0.30029768
## ManufacturingProcess02 ManufacturingProcess02  0.29885502
## BiologicalMaterial02     BiologicalMaterial02  0.29668143
## BiologicalMaterial10     BiologicalMaterial10  0.27180585
## ManufacturingProcess23 ManufacturingProcess23  0.26770988
## ManufacturingProcess25 ManufacturingProcess25  0.22959185
## ManufacturingProcess14 ManufacturingProcess14  0.19960455
## ManufacturingProcess26 ManufacturingProcess26  0.19688667
## ManufacturingProcess19 ManufacturingProcess19  0.14958316
## ManufacturingProcess21 ManufacturingProcess21  0.13077494
## ManufacturingProcess03 ManufacturingProcess03  0.09538372
## ManufacturingProcess16 ManufacturingProcess16  0.07759756
## ManufacturingProcess35 ManufacturingProcess35  0.06273560
## BiologicalMaterial07     BiologicalMaterial07  0.00000000
## ManufacturingProcess07 ManufacturingProcess07  0.00000000
## ManufacturingProcess08 ManufacturingProcess08  0.00000000
## ManufacturingProcess12 ManufacturingProcess12  0.00000000
## ManufacturingProcess15 ManufacturingProcess15  0.00000000
## ManufacturingProcess22 ManufacturingProcess22  0.00000000
## ManufacturingProcess28 ManufacturingProcess28  0.00000000
## ManufacturingProcess29 ManufacturingProcess29  0.00000000
## ManufacturingProcess34 ManufacturingProcess34  0.00000000
## ManufacturingProcess38 ManufacturingProcess38  0.00000000
## ManufacturingProcess40 ManufacturingProcess40  0.00000000
## ManufacturingProcess41 ManufacturingProcess41  0.00000000
## ManufacturingProcess42 ManufacturingProcess42  0.00000000
## ManufacturingProcess44 ManufacturingProcess44  0.00000000
## ManufacturingProcess45 ManufacturingProcess45  0.00000000

The GBM and random forest rankings agree closely: both place ManufacturingProcess32 far ahead of the other predictors (about 39% relative influence in the GBM), followed by a mix of biological materials (BiologicalMaterial12, BiologicalMaterial03, BiologicalMaterial06) and process variables (ManufacturingProcess09, ManufacturingProcess13). This is consistent with tree ensembles capturing non-linear relationships and interactions in addition to linear effects. The ridge output above, by contrast, was extracted without retaining coefficient names, so only the sorted magnitudes are visible and a name-by-name comparison with the linear model cannot be made from that output alone; ranking the named coefficients (as sketched earlier) would be needed for a direct comparison. Overall, the tree-based models concentrate importance on a handful of process variables, led by ManufacturingProcess32, while spreading moderate importance across several biological materials.

Plot a single tree with the distribution of yield in the terminal nodes

CART Tree Visualization

library(rpart)
library(rpart.plot)

# Train a CART model
set.seed(13)
cart_model <- rpart(Yield ~ ., data = training_set, method = "anova")

# Plot the CART tree
rpart.plot(cart_model, type = 2, extra = 101, under = TRUE, tweak = 1.2,
           fallen.leaves = TRUE, main = "Optimal CART Tree for Yield Prediction")

# Assign each training row to its CART terminal node (rpart stores this in $where)
training_set$TerminalNode <- cart_model$where
training_set$TerminalNode
##   [1]  6 19 19 19 12 19 20 19 19 19 12 12 12 21 21 20 21 15  5  5 10 10 18 19 21
##  [26] 12 21 21 12 21 21 21 21 15 21 19 21 18 15 11 18 10 12 21 21 15  6  7 15 15
##  [51] 10 10 10 10 18 11 11 18 20 18 18  6  6 18 11  6 11 18 18 11 18  5  5  5  6
##  [76]  6  6 11 11 11  6 18 15 20 18 20  7  6  7  7  7 12 19 12 20 20 20 12 18 11
## [101]  6 11  6  6  6  6  6 11 12 12 12  6  6 11 12  6  6 11  5  5  5  6  6  6  6
## [126]  6  6 18  7  7 21 12

Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Yes. The single tree makes the split hierarchy explicit, highlighting the greater contribution of the manufacturing process variables to yield relative to the biological materials, while the terminal nodes show how the training samples’ yields are distributed across those splits. Interpretability is clearly well served by a single tree model.