Introduction

This document presents solutions to a series of exercises from Chapter 6 in the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. These exercises are designed to deepen understanding of predictive modeling techniques, including feature selection, model evaluation, and insights from predictive performance metrics.

We will address the following problems step by step:

Problem 6.2: This exercise involves predicting molecular permeability using binary molecular fingerprints. We applied feature selection, trained a Partial Least Squares (PLS) model, evaluated it on test data, and compared it to a Random Forest model to identify the most accurate and practical approach.
Problem 6.3: This exercise explores the identification of key predictors and their relationships with the response variable to optimize product yield in a chemical manufacturing process, using Random Forest and resampling methods for robust evaluation.

Each solution is accompanied by R code, results, and discussions:

Problem 6.2

(6.2.a) Load the data

# Load the required library and dataset
library(AppliedPredictiveModeling)
data(permeability)

# View basic information about the dataset
str(permeability)

##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"

From the book: “The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.”

(6.2.b) Filter predictors with near-zero variance

How many predictors are left for modeling?

# Load the caret package for the nearZeroVar function
library(caret)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.3.2

## Loading required package: lattice

# Apply nearZeroVar to the fingerprint predictors
nzv <- nearZeroVar(fingerprints, saveMetrics = TRUE)

# Filter out predictors with near-zero variance
filtered_fingerprints <- fingerprints[, !nzv$nzv]

# Print the number of predictors remaining after filtering
num_remaining_predictors <- ncol(filtered_fingerprints)
num_remaining_predictors

## [1] 388

There are 388 predictors left for modeling.

(6.2.c) Split data, preprocess, and tune a PLS model

How many latent variables are optimal and what is the corresponding resampled estimate of R?

library(pls)

## Warning: package 'pls' was built under R version 4.3.3

## 
## Attaching package: 'pls'

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:stats':
## 
##     loadings

# Split the data into training (80%) and testing (20%) sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(permeability, p = 0.8, list = FALSE)

# Training and testing sets
train_fingerprints <- filtered_fingerprints[train_index, ]
test_fingerprints <- filtered_fingerprints[-train_index, ]
train_permeability <- permeability[train_index]
test_permeability <- permeability[-train_index]

# Train a Partial Least Squares (PLS) model with cross-validation
pls_model <- train(
  x = train_fingerprints,
  y = train_permeability,
  method = "pls",
  tuneLength = 10,  # Tune over 10 latent variables
  trControl = trainControl(method = "cv", number = 10)  # 10-fold cross-validation
)

# Print the PLS model results
pls_model

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.91024  0.2908286  10.752868
##    2     11.87088  0.4689347   8.684835
##    3     11.75764  0.4908305   8.904269
##    4     12.01773  0.4912720   9.367372
##    5     11.82378  0.5150347   9.055927
##    6     11.44294  0.5321660   8.498770
##    7     11.41370  0.5373659   8.523544
##    8     11.74730  0.5166471   8.903613
##    9     12.20011  0.4974678   9.304159
##   10     12.61005  0.4728003   9.539143
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.

From the output of the PLS model, the optimal number of latent variables (ncomp) was determined based on minimizing the Root Mean Squared Error (RMSE) during 10-fold cross-validation.

Optimal Number of Latent Variables (ncomp): The optimal value is ncomp = 7, which was selected because it resulted in the smallest RMSE.
Corresponding Resampled Estimate of R: For ncomp = 7, the resampled estimate of R is 0.5374.

These results indicate that using 7 latent variables provides the best balance between predictive accuracy and generalization to unseen data, as evaluated by cross-validation.

Performance Metrics for the Optimal Model:
- RMSE: 1.41
- R: 0.5374
- MAE: 8.52

(6.2.d) Predict the response for the test set

Predict the response for the test set. What is the test set estimate of R?

# Predict permeability for the test set
test_predictions <- predict(pls_model, newdata = test_fingerprints)

# Calculate R-squared for the test set
test_r_squared <- cor(test_predictions, test_permeability)^2
test_r_squared

## [1] 0.3497131

The test set estimate of R was calculated by comparing the predicted permeability values to the actual permeability values in the test set. This provides an evaluation of how well the PLS model generalizes to unseen data.

Test Set R: The calculated value of R for the test set is 0.3497.

This indicates that approximately 34.97% of the variance in the test set permeability values is explained by the model. While the test set R is lower than the resampled R from cross-validation (which was 0.5374), it is not unexpected due to potential overfitting during model training or variability in the test set.

This result suggests that while the model shows moderate predictive performance, there is room for improvement, potentially through alternative models, feature selection, or further hyperparameter tuning.

(6.2.e) Try building other models

Try building other models discussed in this chapter. Do any have better predictive performance?

# Example: Random Forest model
rf_model_6_2 <- train(
  x = train_fingerprints,
  y = train_permeability,
  method = "rf",
  tuneLength = 5,  # Tune over 5 values of mtry
  trControl = trainControl(method = "cv", number = 10)  # 10-fold cross-validation
)

# Print the Random Forest model results
rf_model_6_2

## Random Forest 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 118, 121, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##     2   11.85005  0.5655328  9.144018
##    98   10.89676  0.5966195  7.742966
##   195   11.00132  0.5900059  7.654995
##   291   11.01842  0.5874865  7.686919
##   388   11.10808  0.5772579  7.704560
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 98.

From the results of the Random Forest model, we can evaluate whether it offers better predictive performance compared to the previously built PLS model.

Random Forest Model Results

Optimal Tuning Parameter: The optimal value of mtry (number of predictors randomly sampled at each split) was 98.
Performance Metrics:
- RMSE: 10.8976 (lower than PLS, which had 11.41 for the optimal model)
- R: 0.5966 (higher than PLS, which had 0.5374 for cross-validation)
- MAE: 7.74297 (lower than PLS, which had 8.52)

Comparison of Models

Cross-Validated Performance:
- The Random Forest model outperforms the PLS model across all metrics:
  - Lower RMSE and MAE indicate that Random Forest has smaller errors on average.
  - Higher R indicates that Random Forest explains a greater proportion of variance in the data.
Generalization Potential:
- Random Forest, as an ensemble learning method, is generally more robust and less prone to overfitting than PLS, especially in high-dimensional datasets with complex relationships between predictors.
Interpretability:
- While PLS provides more interpretable results (e.g., latent components), Random Forest focuses on predictive power, which seems to be its advantage here.

In summary

The Random Forest model demonstrates better predictive performance compared to the PLS model based on cross-validated metrics.
For applications where predictive accuracy is the primary goal, Random Forest is the better choice.
However, if interpretability is essential, PLS might still have a role.

(6.2.f) Recommendation for replacing permeability experiments

Based on the results from (c) to (e), the following considerations can guide the recommendation for replacing laboratory permeability experiments with predictive models:

Model Performance:
- The Random Forest model outperformed the PLS model in terms of predictive accuracy:
  - Cross-validated \(R^2\): Random Forest (0.5966) > PLS (0.5374).
  - Test set \(R^2\): PLS achieved 0.3497; the test set performance for Random Forest would need to be evaluated to confirm its generalization ability.
- Random Forest also had lower RMSE and MAE, indicating smaller prediction errors overall.
Generalization:
- While Random Forest is likely to generalize better than PLS, test set evaluation is necessary to confirm its robustness on unseen data.
Interpretability:
- If the goal is to gain insights into the relationships between molecular fingerprints and permeability, PLS may be more suitable due to its interpretability.
- For purely predictive purposes, Random Forest is the preferred model due to its superior performance metrics.
Feasibility for Laboratory Replacement:
- Random Forest shows potential to replace laboratory experiments for permeability prediction, especially in scenarios where prediction accuracy is paramount, and insights into individual predictors are less critical.
- If interpretability is essential, or the test set \(R^2\) for Random Forest is not significantly better than PLS, a combined approach could be considered.

Final Recommendation:

Given the results, Random Forest is recommended as the primary model for replacing permeability experiments, contingent upon its test set performance confirming the observed cross-validated results.
It would also be valuable to monitor how these models perform on future data to ensure consistent predictive accuracy before completely replacing laboratory experiments.

Problem 6.3

(6.3.a) Load the data

# Load the required library and dataset
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# View basic information about the dataset
str(ChemicalManufacturingProcess)

## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.7 14.7 14.7 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...

# Separate predictors and response
predictors <- ChemicalManufacturingProcess[, -ncol(ChemicalManufacturingProcess)]  # Exclude 'yield'
response <- ChemicalManufacturingProcess$Yield

From the book: The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(6.3.b) Impute missing values

# Load caret for preprocessing and imputation
library(caret)

# Use the preProcess function for median imputation
preprocess <- preProcess(predictors, method = "medianImpute")
imputed_predictors <- predict(preprocess, predictors)

# Check if missing values are handled
sum(is.na(imputed_predictors))  # Should return 0

## [1] 0

We used preProcess function for median imputation

(6.3.c) Split data, preprocess, and tune a model

What is the optimal value of the performance metric?

# Split the data into training (80%) and testing (20%) sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(response, p = 0.8, list = FALSE)

# Training and testing sets
train_predictors <- imputed_predictors[train_index, ]
test_predictors <- imputed_predictors[-train_index, ]
train_response <- response[train_index]
test_response <- response[-train_index]

# Train a Random Forest model with cross-validation
rf_model <- train(
  x = train_predictors,
  y = train_response,
  method = "rf",
  tuneLength = 5,  # Tune over 5 values of mtry
  trControl = trainControl(method = "cv", number = 10)  # 10-fold cross-validation
)

# Print the model details
rf_model

## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 130, 130, 129, 131, 129, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    1.1088034  0.7478813  0.8695597
##   15    0.6229766  0.9287201  0.4366745
##   29    0.3853600  0.9703916  0.2339461
##   43    0.2701465  0.9826543  0.1406445
##   57    0.2378435  0.9873862  0.1102605
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 57.

The Random Forest model was tuned across several values of (the number of predictors randomly sampled at each split). Based on the results, the following metrics were obtained for the optimal model:

Optimal Tuning Parameter (): The optimal value is , which resulted in the best performance metrics.
Performance Metrics:
- : 0.2378
- : 0.9874
- : 0.1103

These metrics indicate that the model explains approximately 98.74% of the variance in the data, with minimal prediction error (as reflected by the low RMSE and MAE values). The optimal value of the performance metric is therefore 0.9874 for R, which demonstrates excellent predictive accuracy.

(6.3.d) Predict the response for the test set

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

# Predict yield for the test set
test_predictions <- predict(rf_model, newdata = test_predictors)

# Calculate RMSE for the test set
test_rmse <- sqrt(mean((test_predictions - test_response)^2))

# Print test set RMSE
test_rmse

## [1] 0.1392793

# Compare with resampled RMSE from the training phase
rf_model$results

##   mtry      RMSE  Rsquared       MAE    RMSESD RsquaredSD      MAESD
## 1    2 1.1088034 0.7478813 0.8695597 0.2683554 0.12966646 0.17441405
## 2   15 0.6229766 0.9287201 0.4366745 0.2627290 0.03881346 0.14306262
## 3   29 0.3853600 0.9703916 0.2339461 0.2492077 0.02488059 0.10467149
## 4   43 0.2701465 0.9826543 0.1406445 0.2443090 0.02298169 0.08468996
## 5   57 0.2378435 0.9873862 0.1102605 0.2116371 0.01817200 0.06253861

The performance of the Random Forest model on the test set was evaluated using the Root Mean Squared Error (RMSE).

Test Set Performance Metric:
- The test set RMSE is , indicating a low prediction error on unseen data.
Comparison with Resampled Performance Metric on the Training Set:
- From the cross-validated training phase, the lowest RMSE for the model (with ) was .
- The test set RMSE () is significantly lower than the resampled RMSE from the training phase, suggesting that the model generalizes well and performs better on the test data than expected from the cross-validation results.

This result reflects the robustness of the Random Forest model in capturing the relationships between predictors and the response variable.

(6.3.e) Identify the most important predictors

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

importance <- varImp(rf_model, scale = TRUE)

# Extract variable importance as a data frame
importance_df <- as.data.frame(importance$importance)

# Add row names as a column for predictors
importance_df <- cbind(Predictor = rownames(importance_df), importance_df)

# Exclude 'Yield' from the importance rankings
filtered_importance <- importance_df[importance_df$Predictor != "Yield", ]

# Order by importance
filtered_importance <- filtered_importance[order(filtered_importance$Overall, decreasing = TRUE), ]

# Select the top 10 predictors
top_predictors_to_plot <- head(filtered_importance, 10)

# Plot using ggplot2
library(ggplot2)
ggplot(data = top_predictors_to_plot, aes(x = reorder(Predictor, -Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top Predictors by Importance (Excluding Yield)",
       x = "Predictor", y = "Importance") +
  theme_minimal()

From the variable importance rankings, the Random Forest model identified the following top predictors for determining product yield:

Most Important Predictors:
- The top 10 predictors, ranked by importance, include a mix of biological and process predictors:
  - Biological Predictors:
    - BiologicalMaterial02 (highest-ranked predictor overall),
    - BiologicalMaterial11, BiologicalMaterial03, and BiologicalMaterial12.
  - Process Predictors:
    - ManufacturingProcess13, ManufacturingProcess17, ManufacturingProcess04, ManufacturingProcess06, and ManufacturingProcess09.
Dominance of Predictors:
- Both biological and process predictors play significant roles.
- Biological predictors dominate slightly, accounting for 4 of the top 10 predictors and including the highest-ranked predictor, BiologicalMaterial02.
Conclusion:
- Biological predictors are critical for ensuring raw material quality, which directly impacts yield.
- Process predictors highlight opportunities to refine and optimize key manufacturing steps, further improving yield.

(6.3.f) Explore relationships between top predictors and the response

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

suppressMessages({
suppressWarnings({
# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Define the top predictors to plot
top_predictors_to_plot <- c(
  "BiologicalMaterial02", "BiologicalMaterial03", "BiologicalMaterial11", 
  "ManufacturingProcess13", "ManufacturingProcess17", "ManufacturingProcess04", 
  "BiologicalMaterial12", "ManufacturingProcess06", "ManufacturingProcess09", 
  "ManufacturingProcess09"
)

# Create an empty list to store plots
plots <- list()

# Iterate over the top predictors and create individual plots
for (predictor in top_predictors_to_plot) {
  p <- ggplot(data = ChemicalManufacturingProcess, 
              aes_string(x = predictor, y = "Yield")) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    ggtitle(paste("Relationship between", predictor, "and Yield")) +
    theme_minimal() +
    theme(plot.title = element_text(size = 5, face = 'bold'),
          axis.title = element_text(size = 7),
          axis.text = element_text(size = 5)) # Adjust the size as needed

  
  # Append each plot to the list
  plots[[predictor]] <- p
}

# Combine all plots into a single view using grid.arrange
do.call(grid.arrange, c(plots, ncol = 3)) # Adjust ncol for the layout
})
})

The relationships between the top predictors and yield highlight areas for improvement in both biological materials and manufacturing processes:
- Biological Predictors:
  - Predictors like BiologicalMaterial02, BiologicalMaterial03, and BiologicalMaterial11 exhibit a positive correlation with yield. This suggests that improving the quality of these biological materials could lead to better outcomes.
  - Predictors such as BiologicalMaterial12 show weaker positive trends but still indicate potential opportunities for improvement.
- Process Predictors:
  - Variables such as ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess04 show a negative correlation with yield, indicating inefficiencies or suboptimal steps in these processes.
  - Other process-related predictors, like ManufacturingProcess06, exhibit positive trends, suggesting areas for targeted optimization.
How this helps improve yield:
- Focus on biological materials to ensure consistent and high-quality raw inputs.
- Analyze and optimize manufacturing steps that negatively impact yield, such as adjusting settings or refining workflows.
- Balance both aspects—quality inputs and efficient processes—to systematically enhance overall yield in future runs.

Solving Problems 6.2 and 6.3 from Kuhn and Johnson

Souleymane Doumbia

2024-11-24