This document presents solutions to a series of exercises from Chapter 6 in the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. These exercises are designed to deepen understanding of predictive modeling techniques, including feature selection, model evaluation, and insights from predictive performance metrics.
We will address the following problems step by step:
Problem 6.2: This exercise involves predicting molecular permeability using binary molecular fingerprints. We applied feature selection, trained a Partial Least Squares (PLS) model, evaluated it on test data, and compared it to a Random Forest model to identify the most accurate and practical approach.
Problem 6.3: This exercise explores the identification of key predictors and their relationships with the response variable to optimize product yield in a chemical manufacturing process, using Random Forest and resampling methods for robust evaluation.
Each solution is accompanied by R code, results, and discussions:
# Load the required library and dataset
library(AppliedPredictiveModeling)
data(permeability)
# View basic information about the dataset
str(permeability)
## num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr "permeability"
The fingerprints matrix contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
How many predictors are left for modeling?
# Load the caret package for the nearZeroVar function
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Apply nearZeroVar to the fingerprint predictors
nzv <- nearZeroVar(fingerprints, saveMetrics = TRUE)
# Filter out predictors with near-zero variance
filtered_fingerprints <- fingerprints[, !nzv$nzv]
# Print the number of predictors remaining after filtering
num_remaining_predictors <- ncol(filtered_fingerprints)
num_remaining_predictors
## [1] 388
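For reference, the saveMetrics output records why each predictor was flagged: freqRatio is the ratio of the most common value's frequency to the second most common, and percentUnique is the percentage of distinct values. By default, nearZeroVar flags predictors with freqRatio above 95/5 and percentUnique below 10. A quick look at the most extreme cases:
# Inspect the predictors with the most lopsided value distributions
head(nzv[order(nzv$freqRatio, decreasing = TRUE),
         c("freqRatio", "percentUnique", "nzv")])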
How many latent variables are optimal and what is the corresponding resampled estimate of R²?
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
# Split the data into training (80%) and testing (20%) sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(permeability, p = 0.8, list = FALSE)
# Training and testing sets
train_fingerprints <- filtered_fingerprints[train_index, ]
test_fingerprints <- filtered_fingerprints[-train_index, ]
train_permeability <- permeability[train_index]
test_permeability <- permeability[-train_index]
# Train a Partial Least Squares (PLS) model with cross-validation
pls_model <- train(
x = train_fingerprints,
y = train_permeability,
method = "pls",
tuneLength = 10, # Tune over 10 latent variables
trControl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
)
# Print the PLS model results
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.91024 0.2908286 10.752868
## 2 11.87088 0.4689347 8.684835
## 3 11.75764 0.4908305 8.904269
## 4 12.01773 0.4912720 9.367372
## 5 11.82378 0.5150347 9.055927
## 6 11.44294 0.5321660 8.498770
## 7 11.41370 0.5373659 8.523544
## 8 11.74730 0.5166471 8.903613
## 9 12.20011 0.4974678 9.304159
## 10 12.61005 0.4728003 9.539143
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
From the output of the PLS model, the optimal number of latent variables (ncomp) was chosen by minimizing the root mean squared error (RMSE) during 10-fold cross-validation.
Optimal number of latent variables: ncomp = 7, selected because it produced the smallest RMSE (11.41).
Corresponding resampled estimate of R²: for ncomp = 7, the resampled R² is 0.5374.
These results indicate that using 7 latent variables provides the best balance between predictive accuracy and generalization to unseen data, as evaluated by cross-validation.
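These values can also be pulled programmatically from the caret object rather than read off the printed table; a minimal sketch:
# Extract the selected number of components and its resampled metrics
best_ncomp <- pls_model$bestTune$ncomp
pls_model$results[pls_model$results$ncomp == best_ncomp,
                  c("ncomp", "RMSE", "Rsquared", "MAE")]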
Predict the response for the test set. What is the test set estimate of R²?
# Predict permeability for the test set
test_predictions <- predict(pls_model, newdata = test_fingerprints)
# Calculate R-squared for the test set
test_r_squared <- cor(test_predictions, test_permeability)^2
test_r_squared
## [1] 0.3497131
The test set estimate of R² was calculated by comparing the predicted permeability values to the actual values in the test set, providing a measure of how well the PLS model generalizes to unseen data.
The result indicates that approximately 34.97% of the variance in the test set permeability values is explained by the model. While the test set R² is lower than the resampled R² from cross-validation (0.5374), such a drop is not unexpected given potential overfitting during model training and the variability inherent in a small test set.
This suggests that while the model shows moderate predictive performance, there is room for improvement, potentially through alternative models, further feature selection, or additional hyperparameter tuning.
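As a cross-check, caret's postResample function computes RMSE, R², and MAE for a set of predictions in one call; its R² should agree with the cor()^2 calculation above:
# Test set RMSE, R-squared, and MAE in one call
postResample(pred = test_predictions, obs = test_permeability)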
Try building other models discussed in this chapter. Do any have better predictive performance?
# Example: Random Forest model
rf_model_6_2 <- train(
x = train_fingerprints,
y = train_permeability,
method = "rf",
tuneLength = 5, # Tune over 5 values of mtry
trControl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
)
# Print the Random Forest model results
rf_model_6_2
## Random Forest
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 121, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 11.85005 0.5655328 9.144018
## 98 10.89676 0.5966195 7.742966
## 195 11.00132 0.5900059 7.654995
## 291 11.01842 0.5874865 7.686919
## 388 11.10808 0.5772579 7.704560
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 98.
From the results of the Random Forest model, we can evaluate whether it offers better predictive performance than the previously built PLS model.
Random Forest Model Results
Optimal tuning parameter: the optimal value of mtry (the number of predictors randomly sampled at each split) was 98.
Performance metrics:
RMSE: 10.897 (lower than the PLS optimum of 11.41)
R²: 0.5966 (higher than the PLS cross-validated R² of 0.5374)
MAE: 7.743 (lower than the PLS value of 8.52)
Comparison of Models
Cross-validated performance: the Random Forest model outperforms the PLS model across all metrics. Lower RMSE and MAE indicate that Random Forest makes smaller errors on average, and higher R² indicates that it explains a greater proportion of the variance.
Generalization potential: the cross-validated advantage still needs to be confirmed on the held-out test set before drawing firm conclusions.
Interpretability: PLS remains easier to interpret, since its latent variables and loadings relate the predictors to the response directly, whereas a Random Forest is largely a black box.
In summary
The Random Forest model demonstrates better predictive performance than the PLS model based on cross-validated metrics.
For applications where predictive accuracy is the primary goal, Random Forest is the better choice.
However, if interpretability is essential, PLS might still have a role.
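Since both models were resampled with 10-fold cross-validation, caret's resamples function gives a fold-by-fold side-by-side summary. (Strictly, a paired comparison requires the two models to use identical folds, e.g. via the index argument of trainControl or a shared seed; this sketch gives an approximate comparison.)
# Collect the resampling results of both models and summarize them together
model_comparison <- resamples(list(PLS = pls_model, RF = rf_model_6_2))
summary(model_comparison)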
Based on the results from (c) to (e), the following considerations can guide the recommendation for replacing laboratory permeability experiments with predictive models:
Final Recommendation:
Given the results, Random Forest is recommended as the primary model for replacing permeability experiments, contingent upon its test set performance confirming the observed cross-validated results.
It would also be valuable to monitor how these models perform on future data to ensure consistent predictive accuracy before completely replacing laboratory experiments.
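A minimal sketch of the test-set confirmation the recommendation calls for, reusing the held-out split from part (b):
# Evaluate the Random Forest model on the held-out test compounds
rf_test_predictions <- predict(rf_model_6_2, newdata = test_fingerprints)
postResample(pred = rf_test_predictions, obs = test_permeability)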
# Load the required library and dataset
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# View basic information about the dataset
str(ChemicalManufacturingProcess)
## 'data.frame': 176 obs. of 58 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.7 14.7 14.7 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
## $ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
## $ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
## $ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num NA 210 207 213 206 ...
## $ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
## $ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
## $ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
## $ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
## $ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
## $ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
## $ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
# Separate predictors and response
predictors <- ChemicalManufacturingProcess[, -1] # Exclude 'Yield' (the first column)
response <- ChemicalManufacturingProcess$Yield
Here processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs, while yield contains the percent yield for each run; in this dataset both are packaged together in the ChemicalManufacturingProcess data frame.
# Load caret for preprocessing and imputation
library(caret)
# Use the preProcess function for median imputation
preprocess <- preProcess(predictors, method = "medianImpute")
imputed_predictors <- predict(preprocess, predictors)
# Check if missing values are handled
sum(is.na(imputed_predictors)) # Should return 0
## [1] 0
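Median imputation is only one option; the chapter also discusses k-nearest-neighbors imputation. A sketch of that alternative (note that caret's knnImpute method additionally centers and scales the predictors):
# Alternative: impute each missing value from its k nearest neighbors
preprocess_knn <- preProcess(predictors, method = "knnImpute")
imputed_predictors_knn <- predict(preprocess_knn, predictors)
sum(is.na(imputed_predictors_knn)) # Should also return 0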
The preProcess function was used for median imputation.
What is the optimal value of the performance metric?
# Split the data into training (80%) and testing (20%) sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(response, p = 0.8, list = FALSE)
# Training and testing sets
train_predictors <- imputed_predictors[train_index, ]
test_predictors <- imputed_predictors[-train_index, ]
train_response <- response[train_index]
test_response <- response[-train_index]
# Train a Random Forest model with cross-validation
rf_model <- train(
x = train_predictors,
y = train_response,
method = "rf",
tuneLength = 5, # Tune over 5 values of mtry
trControl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
)
# Print the model details
rf_model
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 130, 130, 129, 131, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.1088034 0.7478813 0.8695597
## 15 0.6229766 0.9287201 0.4366745
## 29 0.3853600 0.9703916 0.2339461
## 43 0.2701465 0.9826543 0.1406445
## 57 0.2378435 0.9873862 0.1102605
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 57.
The Random Forest model was tuned across several values of mtry (the number of predictors randomly sampled at each split). For the optimal model, mtry = 57, the resampled metrics were RMSE = 0.2378, R² = 0.9874, and MAE = 0.1103.
These metrics indicate that the model explains approximately 98.74% of the variance in the resampled data, with minimal prediction error (as reflected by the low RMSE and MAE values). The optimal value of the performance metric is therefore an R² of 0.9874, which suggests excellent predictive accuracy.
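The same numbers can be extracted programmatically from the fitted object; a minimal sketch:
# Pull the optimal mtry and its resampled metrics
best_mtry <- rf_model$bestTune$mtry
rf_model$results[rf_model$results$mtry == best_mtry,
                 c("mtry", "RMSE", "Rsquared", "MAE")]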
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# Predict yield for the test set
test_predictions <- predict(rf_model, newdata = test_predictors)
# Calculate RMSE for the test set
test_rmse <- sqrt(mean((test_predictions - test_response)^2))
# Print test set RMSE
test_rmse
## [1] 0.1392793
# Compare with resampled RMSE from the training phase
rf_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 1.1088034 0.7478813 0.8695597 0.2683554 0.12966646 0.17441405
## 2 15 0.6229766 0.9287201 0.4366745 0.2627290 0.03881346 0.14306262
## 3 29 0.3853600 0.9703916 0.2339461 0.2492077 0.02488059 0.10467149
## 4 43 0.2701465 0.9826543 0.1406445 0.2443090 0.02298169 0.08468996
## 5 57 0.2378435 0.9873862 0.1102605 0.2116371 0.01817200 0.06253861
The performance of the Random Forest model on the test set was evaluated using the root mean squared error (RMSE). The test set RMSE of 0.1393 is even lower than the resampled training RMSE of 0.2378 for the optimal model (mtry = 57), so the test set performance is consistent with, and here slightly better than, the cross-validated estimate.
This result reflects the robustness of the Random Forest model in capturing the relationships between the predictors and the response variable.
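A quick visual check of the test-set fit; points should fall close to the dashed identity line (a sketch using the objects defined above):
library(ggplot2)
# Plot predicted versus observed yield for the test set
ggplot(data.frame(observed = test_response, predicted = test_predictions),
       aes(x = observed, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Random Forest: Predicted vs. Observed Yield (Test Set)",
       x = "Observed Yield", y = "Predicted Yield") +
  theme_minimal()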
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
importance <- varImp(rf_model, scale = TRUE)
# Extract variable importance as a data frame
importance_df <- as.data.frame(importance$importance)
# Add row names as a column for predictors
importance_df <- cbind(Predictor = rownames(importance_df), importance_df)
# Exclude 'Yield' from the importance rankings
filtered_importance <- importance_df[importance_df$Predictor != "Yield", ]
# Order by importance
filtered_importance <- filtered_importance[order(filtered_importance$Overall, decreasing = TRUE), ]
# Select the top 10 predictors
top_predictors_to_plot <- head(filtered_importance, 10)
# Plot using ggplot2
library(ggplot2)
ggplot(data = top_predictors_to_plot, aes(x = reorder(Predictor, -Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top Predictors by Importance (Excluding Yield)",
x = "Predictor", y = "Importance") +
theme_minimal()
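To make the biological-versus-process question concrete, a small tally sketch that classifies each of the top ten predictors by its name prefix (using the filtered_importance data frame built above):
# Label each predictor as biological or process based on its name,
# then count each group among the top ten
filtered_importance$Type <- ifelse(grepl("^Biological", filtered_importance$Predictor),
                                   "Biological", "Process")
table(head(filtered_importance, 10)$Type)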
From the variable importance rankings, the Random Forest model identified the following top predictors for determining product yield:
Biological predictors: BiologicalMaterial02 (the highest-ranked predictor overall), BiologicalMaterial11, BiologicalMaterial03, and BiologicalMaterial12.
Process predictors: ManufacturingProcess13, ManufacturingProcess17, ManufacturingProcess04, ManufacturingProcess06, and ManufacturingProcess09.
Neither group entirely dominates the top of the list, but the single most important predictor is BiologicalMaterial02.
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
suppressMessages({
suppressWarnings({
# Load necessary libraries
library(ggplot2)
library(gridExtra)
# Define the top predictors to plot
top_predictors_to_plot <- c(
"BiologicalMaterial02", "BiologicalMaterial03", "BiologicalMaterial11",
"ManufacturingProcess13", "ManufacturingProcess17", "ManufacturingProcess04",
"BiologicalMaterial12", "ManufacturingProcess06", "ManufacturingProcess09",
"ManufacturingProcess09"
)
# Create an empty list to store plots
plots <- list()
# Iterate over the top predictors and create individual plots
for (predictor in top_predictors_to_plot) {
p <- ggplot(data = ChemicalManufacturingProcess,
aes(x = .data[[predictor]], y = Yield)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
ggtitle(paste("Relationship between", predictor, "and Yield")) +
theme_minimal() +
theme(plot.title = element_text(size = 5, face = 'bold'),
axis.title = element_text(size = 7),
axis.text = element_text(size = 5)) # Adjust the size as needed
# Append each plot to the list
plots[[predictor]] <- p
}
# Combine all plots into a single view using grid.arrange
do.call(grid.arrange, c(plots, ncol = 3)) # Adjust ncol for the layout
})
})
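As a complementary numeric view of these relationships, the pairwise correlation of each top predictor with Yield summarizes the direction and strength of the linear trends shown in the plots (a sketch using the imputed predictors so that missing values do not drop observations):
# Correlation of each top predictor with the response
cor(imputed_predictors[, top_predictors_to_plot], response)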