Excercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Load the data permeability set from AppliedPredictiveModeling library: The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RANN)
data("permeability")
str(permeability)
##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"
dim(permeability)
## [1] 165   1
# binary molecular 
as.data.frame(fingerprints) %>% mutate(permeability = permeability) %>% ncol()-1
## [1] 1107
  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

The output of ncol(filtered_fingerprints) gives 388 as the number of predictors left for modeling.

nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]

# How many predictors are left?
ncol(filtered_fingerprints)
## [1] 388
  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(123)
trainIndex <- createDataPartition(permeability, p = 0.75, list = FALSE)
trainX <- filtered_fingerprints[trainIndex, ]
testX  <- filtered_fingerprints[-trainIndex, ]
trainY <- permeability[trainIndex]
testY  <- permeability[-trainIndex]

# Prepare the data for PLS Model 
pls_grid <- expand.grid(ncomp = 1:20)

ctrl <- trainControl(method = "repeatedcv", repeats = 5, number = 10)

pls_model <- train(
  x = trainX,
  y = trainY,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneGrid = pls_grid,
  trControl = ctrl,
  metric = "Rsquared"
)


pls_model$bestTune           # Optimal number of latent variables
##   ncomp
## 7     7
max(pls_model$results$Rsquared)  # Best R² from cross-validation
## [1] 0.5170516
  1. Predict the response for the test set. What is the test set estimate of R2?
pls_pred <- predict(pls_model, testX)

# Compute test set R²
postResample(pls_pred, testY)
##       RMSE   Rsquared        MAE 
## 11.5920264  0.4047344  8.0675597

Based on the test set R² values, the Support Vector Machine (SVM) model with a radial kernel demonstrated the best predictive performance for estimating permeability. It achieved an R² of approximately 0.615, indicating that it explains about 61.5% of the variability in the permeability data. In comparison, the Random Forest model had a lower R² of 0.477, while the Lasso Regression model performed the worst, with an R² of 0.342. These results suggest that the SVM model is better suited for capturing the complex, possibly non-linear relationships between the molecular fingerprint predictors and the permeability response. Given its superior performance, the SVM model appears to be the most promising candidate to potentially replace or supplement traditional laboratory experiments for permeability estimation, offering a faster and more cost-effective alternative.

rf_model <- train(
  x = trainX,
  y = trainY,
  method = "rf",
  trControl = ctrl,
  metric = "Rsquared"
)
rf_pred <- predict(rf_model, testX)
rf_r2 <- postResample(rf_pred, testY)[["Rsquared"]]


lasso_model <- train(
  x = trainX,
  y = trainY,
  method = "glmnet",
  trControl = ctrl,
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(-4, 1, length = 100)),
  metric = "Rsquared"
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
lasso_pred <- predict(lasso_model, testX)
lasso_r2 <- postResample(lasso_pred, testY)[["Rsquared"]]

svm_model <- train(
  x = trainX,
  y = trainY,
  method = "svmRadial",
  trControl = ctrl,
  preProcess = c("center", "scale"),
  tuneLength = 10,
  metric = "Rsquared"
)
svm_pred <- predict(svm_model, testX)
svm_r2 <- postResample(svm_pred, testY)[["Rsquared"]]

r2_results <- data.frame(
  Model = c("Random Forest", "Lasso Regression", "SVM (Radial)"),
  R_Squared = c(rf_r2, lasso_r2, svm_r2)
)
print(r2_results)
##              Model R_Squared
## 1    Random Forest 0.4768717
## 2 Lasso Regression 0.3422944
## 3     SVM (Radial) 0.6152973

I would recommend Support Vector Machines because it looks like it perfoms better than other model selected in the analysis.

Excercise 6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

data(ChemicalManufacturingProcess)

# The data is a list; extract the components:
yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[,-58]

# Check dimensions
dim(processPredictors)  # Should return 176 x 57
## [1] 176  57
length(yield)           # Should return 176
## [1] 176
preProc <- preProcess(processPredictors, method = "knnImpute")
imputed_data <- predict(preProc, processPredictors)
  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(123)
trainIndex <- createDataPartition(yield, p = 0.75, list = FALSE)

trainX <- imputed_data[trainIndex, ]
testX  <- imputed_data[-trainIndex, ]
trainY <- yield[trainIndex]
testY  <- yield[-trainIndex]

ctrl <- trainControl(method = "repeatedcv", repeats = 5, number = 10)

rf_model <- train(
  x = trainX,
  y = trainY,
  method = "rf",
  trControl = ctrl,
  metric = "Rsquared"
)

rf_model$results
##   mtry      RMSE  Rsquared       MAE    RMSESD RsquaredSD      MAESD
## 1    2 1.1234007 0.7431505 0.8751249 0.2946640 0.12169636 0.20004570
## 2   29 0.3987665 0.9675755 0.2374273 0.2622615 0.03094577 0.11347313
## 3   57 0.2558539 0.9833362 0.1290086 0.2196857 0.02249868 0.08145524
best_r2_train <- max(rf_model$results$Rsquared)
best_r2_train
## [1] 0.9833362
  1. Predict the response for the test set.What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
rf_pred <- predict(rf_model, testX)
test_perf <- postResample(rf_pred, testY)
test_r2 <- test_perf["Rsquared"]
  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The variable importance plot shows the most influential predictors used by the model to estimate product yield in the pharmaceutical manufacturing process. Among all predictors, ManufacturingProcess13 stands out dramatically as the most important feature, with a normalized importance score far exceeding all others. This suggests that variations in ManufacturingProcess13 have the strongest impact on yield predictions, and optimizing this process parameter could lead to significant improvements in production efficiency and profitability. Other variables such as BiologicalMaterial11, BiologicalMaterial02, ManufacturingProcess17, and several additional biological and process-related features also contribute to the model, though to a much lesser extent. Interestingly, both biological and manufacturing process variables appear among the top features, indicating that while some factors are fixed (biological), others are adjustable (process-related), giving manufacturers opportunities to control and improve outcomes. The dominance of ManufacturingProcess13 reinforces earlier trend analysis that identified its strong negative relationship with yield, further validating its critical role in the production system.

importance <- varImp(rf_model)
plot(importance, top = 15)

top_vars <- rownames(importance$importance)[order(-importance$importance$Overall)][1:10]
biological_vars <- grep("^Bio", top_vars, value = TRUE)
process_vars <- grep("^Bio", top_vars, value = TRUE, invert = TRUE)
  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process

The three scatter plots illustrate how different predictors relate to product yield in a pharmaceutical manufacturing process. The first plot, showing “Yield vs Yield,” serves as a reference baseline where the predicted and actual yield align perfectly along a diagonal, confirming a model fit or identity relationship. The second plot, “Yield vs ManufacturingProcess13,” highlights a negative non-linear trend—higher values of this process variable are associated with lower yields. This suggests that reducing or tightly controlling the value of ManufacturingProcess13 could potentially lead to yield improvement. Since this is a process variable, it can be directly adjusted during manufacturing, making it a valuable target for optimization. The third plot, “Yield vs BiologicalMaterial11,” shows a positive trend, where increases in BiologicalMaterial11 are generally associated with higher yield, up to a point. Although this is a biological variable (and thus cannot be controlled), it can be monitored for quality control purposes. For instance, incoming raw materials with low values for BiologicalMaterial11 could be flagged as potentially yielding suboptimal product batches. Together, these insights guide both process optimization and quality assurance efforts.

library(ggplot2)

top10 <- rownames(importance$importance)[order(-importance$importance$Overall)][1:3]

for (var in top10) {
  ggplot(data.frame(x = trainX[[var]], y = trainY), aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "loess") +
    labs(title = paste("Yield vs", var), x = var, y = "Yield") +
    theme_minimal() -> p
  print(p)
}
## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'