6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

#### (a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)  # From task
library(caret)
library(ggplot2)
library(patchwork)  # For combining plots
data(permeability)  # From task

dim(fingerprints)
## [1]  165 1107
dim(permeability)
## [1] 165   1

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

#### (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]
ncol(filtered_fingerprints)
## [1] 388

#### (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R²?

set.seed(100)
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
X_train <- filtered_fingerprints[trainIndex, ]
X_test <- filtered_fingerprints[-trainIndex, ]
y_train <- permeability[trainIndex]
y_test <- permeability[-trainIndex]

ctrl <- trainControl(method = "cv", number = 10)

set.seed(200)
pls_model <- train(
  x = X_train,
  y = y_train,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, ]
##   ncomp     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 6     6 11.01532 0.5506871 8.399358 3.380872  0.2642839 2.175713

Analysis: We perform an 80/20 train/test split using stratified sampling via createDataPartition(). We then fit a Partial Least Squares (PLS) model, which is well suited to high-dimensional, collinear predictors like molecular fingerprints, using 10-fold cross-validation for tuning and testing up to 20 latent variables (ncomp). The best model used 6 components and achieved a resampled R² of around 0.55, indicating moderate predictive power.

Further graphs for visualization:

rmse_plot <- ggplot(pls_model$results, aes(x = ncomp, y = RMSE)) +
  geom_line(color = "blue") +
  geom_point(size = 2) +
  geom_vline(xintercept = pls_model$bestTune$ncomp, linetype = "dashed", color = "red") +
  labs(title = "Cross-Validated RMSE by Latent Variables", x = "Number of Latent Variables", y = "RMSE") +
  theme_minimal()

r2_plot <- ggplot(pls_model$results, aes(x = ncomp, y = Rsquared)) +
  geom_line(color = "darkgreen") +
  geom_point(size = 2) +
  geom_vline(xintercept = pls_model$bestTune$ncomp, linetype = "dashed", color = "red") +
  labs(title = "Cross-Validated R² by Latent Variables", x = "Number of Latent Variables", y = expression(R^2)) +
  theme_minimal()

rmse_plot + r2_plot + plot_annotation(title = "PLS Model Tuning Results")

Analysis: These plots let us visually inspect model performance across latent variables. The left panel shows RMSE decreasing until 6 components, while the right panel shows R² increasing and peaking at the same point. Together they validate the choice of 6 components as the optimal number.

#### (d) Predict the response for the test set. What is the test set estimate of R²?

predicted <- predict(pls_model, newdata = X_test)
postResample(pred = predicted, obs = y_test)
##      RMSE  Rsquared       MAE 
## 11.773628  0.471816  8.672468

Analysis: The trained PLS model is applied to the test data, and postResample() evaluates prediction accuracy. The test-set R² is approximately 0.472, with an RMSE of about 11.77. While not perfect, this suggests the model captures nearly half of the variance in permeability, which is promising.
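As a complementary check (a minimal sketch using only the objects defined above), a calibration plot shows how the predictions track the observed values; points near the dashed identity line indicate well-calibrated predictions:

# Observed vs. predicted permeability on the test set
obs_pred <- data.frame(Observed = y_test, Predicted = predicted)
ggplot(obs_pred, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(title = "PLS: Observed vs. Predicted Permeability (Test Set)",
       x = "Observed", y = "Predicted") +
  theme_minimal()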

#### (e) Try building other models discussed in this chapter. Do any have better predictive performance?

Ridge Regression:

set.seed(300)
ridge_model <- train(
  x = X_train,
  y = y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(-4, 1, length = 20)),
  trControl = trainControl(method = "cv", number = 10)
)

ridge_pred <- predict(ridge_model, newdata = X_test)
postResample(pred = ridge_pred, obs = y_test)
##      RMSE  Rsquared       MAE 
## 12.027195  0.444071  8.723667

PCR (Principal Components Regression):

set.seed(400)
pcr_model <- train(
  x = X_train,
  y = y_train,
  method = "pcr",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = trainControl(method = "cv", number = 10)
)

pcr_pred <- predict(pcr_model, newdata = X_test)
postResample(pred = pcr_pred, obs = y_test)
##       RMSE   Rsquared        MAE 
## 12.1993099  0.4352266  9.1541865

Analysis: We tested ridge regression and PCR as alternative models. Ridge had a slightly lower R² (≈0.444) and higher RMSE (≈12.03) than PLS, and PCR performed worst, with an R² of ≈0.435 and an RMSE of ≈12.20. This confirms that PLS outperformed both in predictive accuracy, likely because its latent variables are chosen with respect to the response (supervised dimensionality reduction) while still handling the heavy multicollinearity among the fingerprints.
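To compare the three models on their cross-validated resamples rather than a single test split, caret's resamples() can summarize them together. One caveat: the models were tuned with different seeds, so their folds are not identical; a strict comparison would share fold indices via trainControl(index = ...). With that caveat, a rough sketch:

# Rough comparison of resampled performance across the three models
resamps <- resamples(list(PLS = pls_model, Ridge = ridge_model, PCR = pcr_model))
summary(resamps)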

#### (f) Would you recommend any of your models to replace the permeability laboratory experiment?

Although the Partial Least Squares (PLS) model demonstrated the best predictive performance among the tested models, it still explains less than half of the variance in permeability in the test data, with an R² of approximately 0.47. This level of accuracy suggests that while the model captures some meaningful patterns in the data, it is not yet reliable enough to fully replace the laboratory experiment. The relatively high RMSE (around 11.77) indicates a significant margin of error in the predictions, which could lead to both false positives and false negatives if the model were used as the sole decision-making tool in drug development. In high-stakes settings like pharmaceutical research, where precision is critical, such predictive uncertainty may be unacceptable.

However, the model still holds potential value as a pre-screening or filtering tool. By using the PLS model in the early stages of drug discovery, researchers could prioritize compounds that are more likely to exhibit adequate permeability, thereby reducing the number of molecules subjected to costly and time-consuming lab testing. This hybrid approach—combining machine learning for rapid initial screening with experimental validation for final decisions—could significantly improve efficiency and reduce expenses. With more data, especially involving a broader range of compounds and permeability values, and further model tuning or feature engineering, it’s likely that the performance of the model could be improved to the point where its role in the decision pipeline becomes even more impactful.
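As a rough illustration of that pre-screening idea (a sketch only; the cutoff of 10 below is a hypothetical threshold, not a validated permeability criterion), one could rank new compounds by their predicted permeability and send only the most promising ones to the assay:

# Sketch: rank compounds by predicted permeability; the cutoff is hypothetical
screen <- data.frame(Predicted = predict(pls_model, newdata = X_test))
screen <- screen[order(-screen$Predicted), , drop = FALSE]
head(screen[screen$Predicted > 10, , drop = FALSE])  # candidates to assay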

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

#### (a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[, -1]

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

#### (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

library(caret)

# Check how many NA values exist
sum(is.na(processPredictors))
## [1] 106
# Impute using bagImpute (bagged trees)
set.seed(123)
imputer <- preProcess(processPredictors, method = "bagImpute")
imputed_data <- predict(imputer, newdata = processPredictors)

# Confirm no missing values remain
sum(is.na(imputed_data))
## [1] 0

Explanation:

There are 106 missing values scattered throughout the predictors. To handle this, we use preProcess() from the caret package with method = "bagImpute", which fits bagged regression trees to estimate the missing values. This method is robust to nonlinear relationships and is suitable for numeric predictors. After imputation, no missing values remain.
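As an alternative (a sketch only, not used in the steps below), k-nearest-neighbor imputation is also available through preProcess(); note that caret's "knnImpute" centers and scales the predictors as a side effect, so downstream code would receive standardized values:

# Alternative: k-NN imputation (also centers and scales the predictors)
knn_imputer <- preProcess(processPredictors, method = "knnImpute")
knn_imputed <- predict(knn_imputer, newdata = processPredictors)
sum(is.na(knn_imputed))  # should be 0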

#### (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

# Remove near-zero variance predictors
nzv <- nearZeroVar(imputed_data)
filtered_data <- imputed_data[, -nzv]

# Split data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
X_train <- filtered_data[trainIndex, ]
X_test  <- filtered_data[-trainIndex, ]
y_train <- yield[trainIndex]
y_test  <- yield[-trainIndex]

# Set up cross-validation control
ctrl <- trainControl(method = "cv", number = 10)

# Train a PLS model
set.seed(123)
pls_model <- train(
  x = X_train, y = y_train,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl
)

# Best PLS performance
pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, ]
##   ncomp     RMSE  Rsquared     MAE    RMSESD RsquaredSD     MAESD
## 3     3 1.365538 0.6033148 1.02937 0.8527882  0.2338247 0.4145153

Explanation:

We first remove near-zero variance predictors, which typically contribute little to model performance and may add noise. The data is then split into training (80%) and testing (20%) subsets using createDataPartition() for stratified sampling. We then use 10-fold cross-validation to tune a Partial Least Squares (PLS) model, a method well suited to multicollinearity and high-dimensional data, with centering and scaling as preprocessing. The optimal model used 3 latent variables, achieving a resampled RMSE of about 1.37 (the tuning metric) and a resampled R² of about 0.60.
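As a quick visual confirmation (a sketch; caret's plot method for train objects draws the resampling profile), RMSE should bottom out at 3 latent variables, mirroring the tuning plots from Problem 6.2:

# Resampling profile: RMSE across the candidate numbers of latent variables
plot(pls_model)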

#### (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

# Predict on test set
pls_pred <- predict(pls_model, newdata = X_test)
pls_perf <- postResample(pls_pred, y_test)
pls_perf
##      RMSE  Rsquared       MAE 
## 1.3740887 0.4724546 1.1672226

Explanation:

We evaluate the PLS model’s generalization on the unseen test data, computing RMSE, R², and MAE with postResample(). The test RMSE (≈1.37) is essentially the same as the resampled RMSE (≈1.37), which suggests the model generalizes reasonably well; however, the test R² (≈0.47) is noticeably lower than the resampled estimate (≈0.60), indicating that the cross-validated R² was somewhat optimistic.
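For a side-by-side view (a small sketch built entirely from objects already computed above), the resampled and test-set metrics can be tabulated directly:

# Compare resampled (CV) metrics at the chosen ncomp with the test-set metrics
cv_best <- pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, ]
rbind(CV   = c(RMSE = cv_best$RMSE, Rsquared = cv_best$Rsquared, MAE = cv_best$MAE),
      Test = pls_perf)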

#### (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

# Get and plot variable importance
pls_imp <- varImp(pls_model)
plot(pls_imp, top = 20, main = "Top 20 Important Predictors (PLS)")

# Count predictor types
top_vars <- rownames(pls_imp$importance)[order(pls_imp$importance$Overall, decreasing = TRUE)][1:20]
bio_count <- sum(grepl("^BiologicalMaterial", top_vars))
proc_count <- sum(grepl("^ManufacturingProcess", top_vars))

bio_count
## [1] 9
proc_count
## [1] 11

Explanation:

We use the varImp() function to determine the most influential predictors. Among the top 20, 11 are manufacturing process (adjustable) variables and 9 are biological (fixed) variables, so the process predictors slightly dominate. This is encouraging for yield improvement: process predictors can be adjusted during manufacturing, while biological predictors can only be used to assess input quality.
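To go beyond the top-20 count (a quick sketch using the importance scores already computed), we can average the importance by predictor type across all predictors:

# Mean importance by predictor type, over all predictors rather than the top 20
imp  <- pls_imp$importance
type <- ifelse(grepl("^BiologicalMaterial", rownames(imp)), "Biological", "Process")
tapply(imp$Overall, type, mean)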

#### (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

# Extract the top 5 most important predictors
top_vars <- rownames(pls_imp$importance)[order(pls_imp$importance$Overall, decreasing = TRUE)][1:5]

# Combine top predictors and Yield into a new data frame
top_data <- data.frame(Yield = yield, imputed_data[, top_vars])

# Plot scatterplots with smooth trend lines
library(ggplot2)
library(tidyr)

top_data_long <- pivot_longer(top_data, -Yield, names_to = "Predictor", values_to = "Value")

ggplot(top_data_long, aes(x = Value, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  facet_wrap(~ Predictor, scales = "free_x") +
  theme_minimal() +
  labs(title = "Yield vs Top 5 Predictors",
       x = "Predictor Value", y = "Yield")

Explanation:

In this analysis, we explore the relationship between the response variable (Yield) and the five most important predictors identified by the PLS model. To visualize this, we create individual scatterplots for each of the top predictors using ggplot2, along with smooth LOESS curves to highlight the general trend of each relationship. This helps in visually identifying whether the predictor has a positive or negative influence on yield, as well as any nonlinear patterns that may be present.

From the plots, we can observe the nature of each predictor’s impact on yield. For example, if a predictor shows an upward trend, increasing that variable might enhance yield. Conversely, a downward or flat trend might indicate limited or negative contribution. This graphical approach not only reinforces the numerical importance of the predictors but also provides intuitive insight for process engineers or decision-makers looking to adjust key variables for improved manufacturing performance.
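Since the LOESS curves suggest some nonlinearity, Pearson correlations capture only the linear component of these relationships, but they still put a sign and rough magnitude on each trend (a quick sketch using the objects above):

# Linear association between yield and each of the top 5 predictors
round(cor(top_data$Yield, top_data[, top_vars]), 2)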

Overall Findings and Conclusions

The goal of this analysis was to develop a predictive model for product yield in a chemical manufacturing process using a wide set of biological and process-related predictors. After handling missing values through bagged-tree imputation, we applied preprocessing techniques such as centering and scaling and tuned a Partial Least Squares (PLS) model with 10-fold cross-validation. PLS, which compared favorably to PCR and ridge regression on the permeability data in Problem 6.2, handles correlated predictors well, and here it delivered solid cross-validated performance while retaining interpretability.

Feature importance from the PLS model highlighted a subset of variables that strongly influence yield, allowing us to focus on the most impactful process parameters. Further visual exploration confirmed meaningful, often nonlinear relationships between yield and these top predictors. Overall, this analysis not only provides a reliable predictive model for yield but also delivers actionable insights into which process variables should be monitored or optimized to improve production outcomes. These findings can guide future experimental design, process control, and continuous improvement efforts within the manufacturing pipeline.