library(tidyverse)
library(caret)
library(DataExplorer)
library(RANN)
library(ggplot2)
library(gridExtra)
library(patchwork)
library(corrplot)

Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

data(permeability)
dim(fingerprints)
## [1]  165 1107
dim(permeability)
## [1] 165   1

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the ‘nearZeroVar’ function from the caret package. How many predictors are left for modeling?

nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]

num_predictors_left <- ncol(filtered_fingerprints) # Number of predictors left
num_predictors_left
## [1] 388

Applying the nearZeroVar() function from the caret package filtered out the low-frequency predictors. After removing these predictors, 388 fingerprint variables remained for modeling.
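
To see why individual predictors get flagged, nearZeroVar() can return its underlying diagnostics with saveMetrics = TRUE; a quick sketch (the column names follow caret’s documented output):

nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics)       # freqRatio, percentUnique, zeroVar, nzv for each predictor
table(nzv_metrics$nzv)  # TRUE = flagged as near-zero variance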

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

Split the data into training and test sets

set.seed(100)
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
X_train <- filtered_fingerprints[trainIndex, ]
X_test <- filtered_fingerprints[-trainIndex, ]
y_train <- permeability[trainIndex]
y_test <- permeability[-trainIndex]

ctrl <- trainControl(method = "cv", number = 10) 

set.seed(200)
pls_model <- train(
  x = X_train,
  y = y_train,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, ]
##   ncomp     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 6     6 11.01532 0.5506871 8.399358 3.380872  0.2642839 2.175713

We split the data into a training and a test set using an 80/20 split. The predictors are centered and scaled prior to modeling, and the Partial Least Squares (PLS) model is tuned using 10-fold cross-validation. The optimal number of latent variables is 6, and the corresponding resampled estimate of \(R^2\) is approximately 0.551.
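
The optimal component count and its resampled \(R^2\) can also be pulled programmatically, which avoids misreading the adjacent RsquaredSD column (the 0.264 in the output above is the fold-to-fold standard deviation of \(R^2\), not \(R^2\) itself):

best_ncomp <- pls_model$bestTune$ncomp
best_r2 <- pls_model$results$Rsquared[pls_model$results$ncomp == best_ncomp]
c(ncomp = best_ncomp, Rsquared = best_r2)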

Below is a side-by-side plot of model performance across latent variables (PLS with CV), by RMSE and \(R^2\):

rmse_plot <- ggplot(pls_model$results, aes(x = ncomp, y = RMSE)) +
  geom_line(color = "blue") +
  geom_point(size = 2) +
  geom_vline(xintercept = pls_model$bestTune$ncomp, linetype = "dashed", color = "red") +
  labs(
    title = "Cross-Validated RMSE by Latent Variables",
    x = "Number of Latent Variables (ncomp)",
    y = "RMSE (CV)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 12))

R2_plot <- ggplot(pls_model$results, aes(x = ncomp, y = Rsquared)) +
  geom_line(color = "green") +
  geom_point(size = 2) +
  geom_vline(xintercept = pls_model$bestTune$ncomp, linetype = "dashed", color = "red") +
  labs(
    title = "Cross-Validated R² by Latent Variables",
    x = "Number of Latent Variables (ncomp)",
    y = expression(R^2)
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 12))

(rmse_plot + R2_plot) + 
  plot_annotation(title = "Model Performance Across Latent Variables (PLS with CV)")

The plots above show the Partial Least Squares (PLS) model tuning results using 10-fold cross-validation. The left panel shows the RMSE across different numbers of latent variables, while the right panel shows the corresponding \(R^2\). The vertical dashed line marks the optimal number of components (6), chosen based on minimum RMSE and maximum \(R^2\).

(d) Predict the response for the test set. What is the test set estimate of \(R^2\)?

predicted <- predict(pls_model, newdata = X_test)
postResample(pred = predicted, obs = y_test)
##      RMSE  Rsquared       MAE 
## 11.773628  0.471816  8.672468

The tuned Partial Least Squares (PLS) model was used to predict the permeability values in the test set. Performance on the test data was evaluated using postResample(), which returned RMSE = 11.77, \(R^2\) = 0.472, and MAE = 8.67. The model therefore explains approximately 47.2% of the variability in the test-set permeability values, with a typical prediction error (RMSE) of about 11.77 units.
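
As a sanity check (caret documents postResample()’s Rsquared as the squared correlation between observed and predicted values), the metrics can be recomputed by hand:

r2_manual <- cor(y_test, predicted)^2               # squared correlation, per caret's definition
rmse_manual <- sqrt(mean((y_test - predicted)^2))   # root mean squared error
c(Rsquared = r2_manual, RMSE = rmse_manual)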

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Below, we try Ridge Regression and Principal Components Regression (PCR) to see whether either offers better predictive performance.

Ridge Regression:

set.seed(300) 
ridge_model <- train(
  x = X_train,
  y = y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(
    alpha = 0,                         
    lambda = 10^seq(-4, 1, length = 20) 
  ),
  trControl = trainControl(method = "cv", number = 10)
)

ridge_pred <- predict(ridge_model, newdata = X_test)

ridge_results <- postResample(pred = ridge_pred, obs = y_test)
ridge_results
##      RMSE  Rsquared       MAE 
## 12.027195  0.444071  8.723667

Principal Components Regression (PCR):

set.seed(400)
pcr_model <- train(
  x = X_train,
  y = y_train,
  method = "pcr",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = trainControl(method = "cv", number = 10)
)

pcr_pred <- predict(pcr_model, newdata = X_test)

pcr_results <- postResample(pred = pcr_pred, obs = y_test)
pcr_results
##       RMSE   Rsquared        MAE 
## 12.1993099  0.4352266  9.1541865

In addition to the PLS model, we evaluated two other models discussed in Chapter 6: Ridge Regression and Principal Components Regression (PCR). All models were trained using 10-fold cross-validation and evaluated on the same test set.
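
For a side-by-side view, the test-set metrics computed above can be stacked into one table (a small convenience sketch reusing the objects already created):

comparison <- rbind(
  PLS   = postResample(pred = predicted, obs = y_test),
  Ridge = ridge_results,
  PCR   = pcr_results
)
round(comparison, 3)  # one row per model: RMSE, Rsquared, MAE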

The PLS model achieved the best predictive performance, with the lowest RMSE and the highest \(R^2\), indicating it explained more variability in the test data than the alternatives.

Ridge Regression performed slightly worse, while PCR had the lowest performance. Among the models compared, PLS remains the best choice for predicting permeability in this dataset, given its superior performance on the test set.

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

While the PLS model performed the best among those tested, it still explains less than half of the variation in permeability, and the RMSE suggests predictions are not highly accurate.

Given this, I wouldn’t recommend using the model as a full replacement for lab experiments at this stage. However, it could be useful as a pre-screening tool to help identify promising compounds and reduce the number of costly lab tests.

With more data and further tuning, the model’s accuracy might improve enough to support more confident decisions in the future.
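
To make the pre-screening idea concrete, a hypothetical sketch: rank the held-out compounds by predicted permeability and send only the top candidates for lab confirmation (compound_id here is an illustrative index, not a real identifier):

# Rank test-set compounds by predicted permeability; highest first
screening <- data.frame(compound_id = seq_along(predicted),
                        pred_permeability = as.numeric(predicted))
head(screening[order(-screening$pred_permeability), ], 10)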


Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in ‘Sect. 1.4’. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

dim(ChemicalManufacturingProcess)
## [1] 176  58
ls() 
##  [1] "ChemicalManufacturingProcess" "ctrl"                        
##  [3] "filtered_fingerprints"        "fingerprints"                
##  [5] "num_predictors_left"          "nzv"                         
##  [7] "pcr_model"                    "pcr_pred"                    
##  [9] "pcr_results"                  "permeability"                
## [11] "pls_model"                    "predicted"                   
## [13] "R2_plot"                      "ridge_model"                 
## [15] "ridge_pred"                   "ridge_results"               
## [17] "rmse_plot"                    "trainIndex"                  
## [19] "X_test"                       "X_train"                     
## [21] "y_test"                       "y_train"
# column names
names(ChemicalManufacturingProcess)
##  [1] "Yield"                  "BiologicalMaterial01"   "BiologicalMaterial02"  
##  [4] "BiologicalMaterial03"   "BiologicalMaterial04"   "BiologicalMaterial05"  
##  [7] "BiologicalMaterial06"   "BiologicalMaterial07"   "BiologicalMaterial08"  
## [10] "BiologicalMaterial09"   "BiologicalMaterial10"   "BiologicalMaterial11"  
## [13] "BiologicalMaterial12"   "ManufacturingProcess01" "ManufacturingProcess02"
## [16] "ManufacturingProcess03" "ManufacturingProcess04" "ManufacturingProcess05"
## [19] "ManufacturingProcess06" "ManufacturingProcess07" "ManufacturingProcess08"
## [22] "ManufacturingProcess09" "ManufacturingProcess10" "ManufacturingProcess11"
## [25] "ManufacturingProcess12" "ManufacturingProcess13" "ManufacturingProcess14"
## [28] "ManufacturingProcess15" "ManufacturingProcess16" "ManufacturingProcess17"
## [31] "ManufacturingProcess18" "ManufacturingProcess19" "ManufacturingProcess20"
## [34] "ManufacturingProcess21" "ManufacturingProcess22" "ManufacturingProcess23"
## [37] "ManufacturingProcess24" "ManufacturingProcess25" "ManufacturingProcess26"
## [40] "ManufacturingProcess27" "ManufacturingProcess28" "ManufacturingProcess29"
## [43] "ManufacturingProcess30" "ManufacturingProcess31" "ManufacturingProcess32"
## [46] "ManufacturingProcess33" "ManufacturingProcess34" "ManufacturingProcess35"
## [49] "ManufacturingProcess36" "ManufacturingProcess37" "ManufacturingProcess38"
## [52] "ManufacturingProcess39" "ManufacturingProcess40" "ManufacturingProcess41"
## [55] "ManufacturingProcess42" "ManufacturingProcess43" "ManufacturingProcess44"
## [58] "ManufacturingProcess45"

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

sum(is.na(ChemicalManufacturingProcess))
## [1] 106

The dataset ChemicalManufacturingProcess contains a small number of missing values (106 across the predictor columns). To address this, we will use the "bagImpute" method of caret’s preProcess() function, which fits bagged regression trees to estimate and fill in the missing values. We chose this method because it can model nonlinear relationships and is well suited to the numeric predictors in manufacturing data.
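
Before imputing, a quick per-column tally (an exploratory sketch; output omitted) shows where the missingness is concentrated:

na_counts <- colSums(is.na(ChemicalManufacturingProcess))
sort(na_counts[na_counts > 0], decreasing = TRUE)  # only columns with missing values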

yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[, -1]

The above code separates the response (yield) from the predictors (processPredictors), which is essential before preprocessing or modeling.

Impute missing values using bagged trees:

set.seed(500)
imputer <- preProcess(processPredictors, method = "bagImpute")
imputed_data <- predict(imputer, newdata = processPredictors)
sum(is.na(imputed_data))  
## [1] 0

All missing values have been successfully imputed using preProcess() with method = "bagImpute".
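
As a spot check (using ManufacturingProcess03 as an example; any column flagged in the tally above would do), the imputed column should retain a plausible distribution:

summary(processPredictors$ManufacturingProcess03)  # before: includes an NA count
summary(imputed_data$ManufacturingProcess03)       # after: no NAs, similar quartiles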

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

First we remove near-zero variance predictors, then create a fresh partition for this dataset (the trainIndex from Exercise 6.2 indexes a different dataset, so it cannot be reused here) and split the filtered data into training (80%) and test (20%) sets:

filtered_data <- imputed_data[, -nearZeroVar(imputed_data)]

set.seed(600)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
X_train <- filtered_data[trainIndex, ]
X_test <- filtered_data[-trainIndex, ]
y_train <- yield[trainIndex]
y_test <- yield[-trainIndex]

Set up cross-validation:

ctrl <- trainControl(method = "cv", number = 10)

Train a PLS model with preprocessing and view best performance

set.seed(700)
pls_model <- train(
  x = X_train,
  y = y_train,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, ]
##   ncomp     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 2     2 1.367518 0.5417626 1.062286 0.5303672  0.2157857 0.3165178

Train a Ridge model with preprocessing and view best performance.

set.seed(800)
ridge_model <- train(
  x = X_train,
  y = y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(
    alpha = 0,                         # Alpha = 0 for Ridge
    lambda = 10^seq(-4, 1, length = 20)  # Try a range of lambda values
  ),
  trControl = trainControl(method = "cv", number = 10)
)

ridge_model$results[ridge_model$results$lambda == ridge_model$bestTune$lambda, ]
##    alpha   lambda     RMSE  Rsquared     MAE    RMSESD RsquaredSD     MAESD
## 18     0 2.976351 1.308237 0.5461985 1.04494 0.1584835  0.1023064 0.1159436

After removing near-zero variance predictors, the data was split into training (80%) and test (20%) sets using a fixed seed for reproducibility. Two models were tuned using 10-fold cross-validation: Partial Least Squares (PLS) and Ridge Regression. Preprocessing included centering and scaling. The PLS model achieved optimal performance with 2 latent variables, resulting in a cross-validated RMSE of \(1.37\) and \(R^2 = 0.54\). The Ridge model, tuned over a log-scaled range of lambda values, minimized RMSE at \(\lambda \approx 2.98\) (\(\log_{10}(\lambda) \approx 0.47\)), with an RMSE of \(1.31\) and \(R^2 = 0.55\).
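
The resampling distributions can also be compared directly with caret’s resamples(); note that the paired comparison is only strictly fair when both models share the same CV folds (e.g., via an explicit index in trainControl), which was not done here, so treat it as approximate:

# Collect the 10-fold CV results from both fitted models
resamps <- resamples(list(PLS = pls_model, Ridge = ridge_model))
summary(resamps)  # fold-level RMSE, Rsquared, and MAE for each model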

The plots below show the RMSE curves from tuning both models:

Model Tuning Comparison: PLS vs Ridge

# PLS RMSE plot
pls_plot <- ggplot(pls_model$results, aes(x = ncomp, y = RMSE)) +
  geom_line(color = "gray") +
  geom_point(size = 2) +
  geom_vline(xintercept = pls_model$bestTune$ncomp, linetype = "dashed", color = "red") +
  labs(
    title = "PLS: RMSE by Latent Variables",
    x = "Number of Latent Variables",
    y = "RMSE (CV)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 12))

# Ridge RMSE plot (log10 lambda)
ridge_plot <- ggplot(ridge_model$results, aes(x = log10(lambda), y = RMSE)) +
  geom_line(color = "gray") +
  geom_point(size = 2) +
  geom_vline(xintercept = log10(ridge_model$bestTune$lambda), linetype = "dashed", color = "red") +
  labs(
    title = "Ridge: RMSE by log10(Lambda)",
    x = expression(log[10](lambda)),
    y = "RMSE (CV)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 12))

# Combine plots
pls_plot + ridge_plot + plot_annotation(title = "Model Tuning Comparison: PLS vs Ridge")

The PLS model achieved optimal cross-validation performance at 2 latent variables, while Ridge Regression minimized RMSE at \(\log_{10}(\lambda) \approx 0.47\), showing slightly better generalization and stability across tuning.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Predict on the test set. PLS Model:

pls_pred <- predict(pls_model, newdata = X_test)
pls_perf <- postResample(pred = pls_pred, obs = y_test)
pls_perf
##      RMSE  Rsquared       MAE 
## 4.1250585 0.1058172 1.6148758

Ridge Model:

ridge_pred <- predict(ridge_model, newdata = X_test)
ridge_perf <- postResample(pred = ridge_pred, obs = y_test)
ridge_perf
##     RMSE Rsquared      MAE 
## 2.265045 0.172868 1.313020

When evaluated on the test set, the PLS model’s RMSE increased from 1.37 (CV) to 4.13, and \(R^2\) dropped from 0.54 to 0.11, suggesting overfitting. The Ridge model degraded far less: its RMSE rose from 1.31 (CV) to 2.27, and \(R^2\) fell from 0.55 to 0.17.

This comparison shows that while both models suffered some loss in performance, Ridge Regression generalized better and is more robust for this dataset.
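
An observed-versus-predicted plot makes this gap visible; a quick sketch using the test-set objects created above:

# Points near the dashed 45-degree line indicate accurate predictions
ggplot(data.frame(obs = y_test, pred = as.numeric(ridge_pred)),
       aes(x = pred, y = obs)) +
  geom_point(alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(title = "Ridge: Observed vs. Predicted Yield (Test Set)",
       x = "Predicted Yield", y = "Observed Yield") +
  theme_minimal()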

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Get Predictor Importance.

For PLS:

pls_imp <- varImp(pls_model)
plot(pls_imp, top = 20, main = "Top 20 Most Important Predictors - PLS")

For Ridge:

ridge_imp <- varImp(ridge_model)
plot(ridge_imp, top = 20, main = "Top 20 Most Important Predictors - Ridge")

top_vars_pls <- rownames(pls_imp$importance)[order(pls_imp$importance$Overall, decreasing = TRUE)][1:20]

bio_count <- sum(grepl("^BiologicalMaterial", top_vars_pls))
proc_count <- sum(grepl("^ManufacturingProcess", top_vars_pls))

bio_count
## [1] 9
proc_count
## [1] 11
top_vars_ridge <- rownames(ridge_imp$importance)[order(ridge_imp$importance$Overall, decreasing = TRUE)][1:20]

bio_count_ridge <- sum(grepl("^BiologicalMaterial", top_vars_ridge))
proc_count_ridge <- sum(grepl("^ManufacturingProcess", top_vars_ridge))

bio_count_ridge
## [1] 4
proc_count_ridge
## [1] 16

We used variable importance to determine which predictors contributed most to the performance of both the PLS and Ridge Regression models.

In the PLS model, 9 out of the top 20 most important predictors were related to biological materials, while 11 were associated with manufacturing process variables. This indicates a relatively balanced influence between the two types of predictors, with a slight dominance by process variables.

In contrast, the Ridge Regression model placed a stronger emphasis on manufacturing conditions: 16 of the top 20 predictors were process-related, and only 4 were biological. This suggests that Ridge places greater weight on features that can be controlled during production, which aligns with expectations in a manufacturing context.
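
Summarizing the top-20 counts in one small table, using the count objects computed above:

data.frame(Model = c("PLS", "Ridge"),
           Biological = c(bio_count, bio_count_ridge),
           Process = c(proc_count, proc_count_ridge))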

These results suggest that optimizing manufacturing parameters could lead to greater improvements in yield. However, since biological variables still appear in both models, ensuring the quality of input materials remains important.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

We examined the top five predictors identified by the Ridge Regression model and explored their relationships with the response variable (yield) using a correlation plot.

Top 5 predictors from Ridge model

top5 <- ridge_imp$importance %>%
  tibble::rownames_to_column("Predictor") %>%  # arrange() drops row names, so capture them first
  arrange(desc(Overall)) %>%
  slice_head(n = 5)

ChemicalManufacturingProcess %>%
  select(Yield, all_of(top5$Predictor)) %>%
  cor(use = "complete.obs") %>%
  corrplot::corrplot(method = "color", type = "upper", order = "hclust",
           addCoef.col = "black", tl.col = "black", tl.cex = 0.8)

The strongest positive associations with yield were found in ‘ManufacturingProcess32’ (correlation ≈ 0.60) and ‘ManufacturingProcess09’ (correlation ≈ 0.52), indicating that these process variables may play an important role in determining output.

In contrast, ‘ManufacturingProcess13’ and ‘ManufacturingProcess17’ showed moderate negative correlations with yield, suggesting they might be contributing to lower performance under certain conditions. ‘ManufacturingProcess36’ had minimal correlation, indicating limited direct influence on yield.
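
A scatterplot of the strongest positive driver against yield makes the relationship concrete (a quick sketch; the linear smoother is for visual reference only):

ggplot(ChemicalManufacturingProcess, aes(x = ManufacturingProcess32, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Yield vs. ManufacturingProcess32",
       x = "ManufacturingProcess32", y = "Yield") +
  theme_minimal()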

Since these predictors are part of the manufacturing process, and therefore adjustable, this information can help guide process optimization. Focusing on boosting positively correlated variables while minimizing the impact of negatively correlated ones could lead to better yields in future production runs.