Problem 6.2: Permeability Prediction Model

(a) Load the Data

library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(elasticnet)   # backend for method = "ridge" in part (e)
library(glmnet)       # backend for the lasso / elastic net fits in part (e)
data(permeability)

(b) Filter Low-Frequency Predictors

nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]
ncol(filtered_fingerprints)
## [1] 388

After filtering predictors with near-zero variance, we’re left with 388 predictors for modeling. This represents a substantial reduction from the original 1,107 predictors, which makes sense given the sparse nature of molecular fingerprints where many substructures are rarely present.
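
To see why so many columns drop out, nearZeroVar() can also return its per-predictor diagnostics via the saveMetrics argument; a minimal sketch (the retained-predictor count should agree with the 388 above):

dim(fingerprints)                                  # 165 compounds x 1,107 fingerprints
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics)                                  # freqRatio, percentUnique, zeroVar, nzv
sum(!nzv_metrics$nzv)                              # predictors kept after the nzv filter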

(c) Train PLS Model

set.seed(123)
train_index <- createDataPartition(permeability, p = 0.75, list = FALSE)

train_x <- filtered_fingerprints[train_index, ]
train_y <- permeability[train_index]
test_x <- filtered_fingerprints[-train_index, ]
test_y <- permeability[-train_index]

ctrl <- trainControl(method = "cv", number = 10)

pls_model <- train(train_x, train_y,
                   method = "pls",
                   tuneLength = 20,
                   trControl = ctrl,
                   preProcess = c("center", "scale"))

plot(pls_model, main = "PLS Model Performance Across Components")

pls_model$bestTune
##   ncomp
## 5     5
pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, ]
##   ncomp     RMSE  Rsquared      MAE  RMSESD RsquaredSD    MAESD
## 5     5 11.36902 0.5208998 8.524744 3.08177   0.254211 2.361927

The optimal PLS model uses 5 latent variables with a cross-validated R² of 0.521. It’s interesting that the model achieves reasonable performance with relatively few components, suggesting the molecular fingerprints contain structured information that can be efficiently compressed.
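
If a more parsimonious model is preferred, caret can apply the one-standard-error rule during tuning instead of picking the outright minimum; a sketch under the same setup (only the selection rule changes):

ctrl_1se <- trainControl(method = "cv", number = 10,
                         selectionFunction = "oneSE")
set.seed(123)
pls_1se <- train(train_x, train_y,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl_1se,
                 preProcess = c("center", "scale"))
pls_1se$bestTune   # typically fewer components than the RMSE-minimizing choice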

(d) Test Set Performance

pls_pred <- predict(pls_model, test_x)
pls_test_r2 <- cor(pls_pred, test_y)^2
pls_test_rmse <- sqrt(mean((pls_pred - test_y)^2))

plot(test_y, pls_pred, 
     xlab = "Observed Permeability", 
     ylab = "Predicted Permeability",
     main = "PLS: Predicted vs Observed on Test Set")
abline(0, 1, col = "red", lty = 2)
text(40, 5, paste("R² =", round(pls_test_r2, 3)))

pls_test_r2
## [1] 0.3593983

The test set R² is 0.359, noticeably below the cross-validated R² of 0.521. Some shrinkage is expected, since tuning selects whichever configuration looked best on the resamples, but the gap indicates the cross-validated estimate was somewhat optimistic; the model still generalizes usefully to new compounds.
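
Note that cor()^2 measures squared correlation only, which ignores calibration, whereas RMSE and MAE do not. caret's postResample() returns all three in one call (its Rsquared is also the squared correlation):

postResample(pred = pls_pred, obs = test_y)   # named vector: RMSE, Rsquared, MAE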

(e) Alternative Models

pcr_model <- train(train_x, train_y,
                   method = "pcr",
                   tuneLength = 20,
                   trControl = ctrl,
                   preProcess = c("center", "scale"))

ridge_model <- train(train_x, train_y,
                     method = "ridge",
                     tuneLength = 15,
                     trControl = ctrl,
                     preProcess = c("center", "scale"))

lasso_model <- train(train_x, train_y,
                     method = "glmnet",
                     tuneGrid = expand.grid(alpha = 1, lambda = seq(0.01, 2, length = 15)),
                     trControl = ctrl,
                     preProcess = c("center", "scale"))

elastic_model <- train(train_x, train_y,
                       method = "glmnet",
                       tuneGrid = expand.grid(alpha = 0.5, lambda = seq(0.01, 2, length = 10)),
                       trControl = ctrl,
                       preProcess = c("center", "scale"))

pcr_pred <- predict(pcr_model, test_x)
ridge_pred <- predict(ridge_model, test_x)
lasso_pred <- predict(lasso_model, test_x)
elastic_pred <- predict(elastic_model, test_x)

pcr_test_r2 <- cor(pcr_pred, test_y)^2
ridge_test_r2 <- cor(ridge_pred, test_y)^2
lasso_test_r2 <- cor(lasso_pred, test_y)^2
elastic_test_r2 <- cor(elastic_pred, test_y)^2

results_df <- data.frame(
  Model = c("PLS", "PCR", "Ridge", "Lasso", "Elastic Net"),
  Test_R2 = c(pls_test_r2, pcr_test_r2, ridge_test_r2, lasso_test_r2, elastic_test_r2),
  Test_RMSE = c(pls_test_rmse, 
                sqrt(mean((pcr_pred - test_y)^2)),
                sqrt(mean((ridge_pred - test_y)^2)),
                sqrt(mean((lasso_pred - test_y)^2)),
                sqrt(mean((elastic_pred - test_y)^2)))
)

results_df
##         Model   Test_R2 Test_RMSE
## 1         PLS 0.3593983  12.16543
## 2         PCR 0.3157953  12.00417
## 3       Ridge 0.3886867  12.91751
## 4       Lasso 0.3390555  11.49045
## 5 Elastic Net 0.3542840  11.37516

Comparing the linear regression methods from Chapter 6, ridge regression posts the best test set R² (0.389), while the elastic net has the lowest test RMSE (11.38); notably, ridge's RMSE is the worst of the five, a reminder that the two metrics can rank models differently. The differences across models are modest overall, suggesting all of these linear approaches capture similar underlying structure in the molecular fingerprint data.
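
One caveat on the comparison above: each train() call draws its own CV folds, so the resampled results are not strictly paired. If the same seed is set immediately before each fit (or a common index is passed to trainControl), caret's resamples() gives a fold-by-fold comparison; a sketch under that assumption:

# Assumes set.seed(...) was called before each train() fit so that all
# five models were tuned on identical CV folds.
cv_comparison <- resamples(list(PLS = pls_model, PCR = pcr_model,
                                Ridge = ridge_model, Lasso = lasso_model,
                                ElasticNet = elastic_model))
summary(cv_comparison)   # resampled RMSE / R-squared / MAE across models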

(f) Recommendation

Based on the test set R² of approximately 0.39, the model explains about 39% of the variance in permeability. The test RMSE of roughly 11-12 units is large relative to many of the observed permeability values, so individual predictions carry substantial uncertainty. For initial screening to eliminate clearly poor candidates this could be valuable, but for final decision-making laboratory experiments would still be necessary to confirm permeability values. The model is best used as a complement that reduces the number of compounds requiring full experimental testing, not as a replacement for the assay.


Problem 6.3: Chemical Manufacturing Process

(a) Load the Data

data(ChemicalManufacturingProcess)

(b) Impute Missing Values

sum(is.na(ChemicalManufacturingProcess))
## [1] 106
preproc_impute <- preProcess(ChemicalManufacturingProcess[, -1], 
                              method = c("knnImpute"))
imputed_predictors <- predict(preproc_impute, 
                               ChemicalManufacturingProcess[, -1])

sum(is.na(imputed_predictors))
## [1] 0

The dataset initially contained 106 missing values. After k-nearest-neighbors imputation, all missing values have been filled in. One side effect worth noting: caret's knnImpute method centers and scales the predictors as part of the procedure, so imputed_predictors is returned on the standardized scale (harmless here, since train() centers and scales again).
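
Before imputing, it can help to see where the gaps are concentrated; a quick base-R check:

na_counts <- colSums(is.na(ChemicalManufacturingProcess))
sort(na_counts[na_counts > 0], decreasing = TRUE)   # missing-cell counts by column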

(c) Train and Tune Model

set.seed(456)
train_idx <- createDataPartition(ChemicalManufacturingProcess$Yield, 
                                 p = 0.75, list = FALSE)

train_x_chem <- imputed_predictors[train_idx, ]
train_y_chem <- ChemicalManufacturingProcess$Yield[train_idx]
test_x_chem <- imputed_predictors[-train_idx, ]
test_y_chem <- ChemicalManufacturingProcess$Yield[-train_idx]

ctrl_chem <- trainControl(method = "cv", number = 10)

pls_chem <- train(train_x_chem, train_y_chem,
                  method = "pls",
                  tuneLength = 20,
                  trControl = ctrl_chem,
                  preProcess = c("center", "scale"))

plot(pls_chem, main = "Manufacturing Process: PLS Cross-Validation")

pls_chem$bestTune
##   ncomp
## 1     1
max(pls_chem$results$Rsquared)
## [1] 0.5597893

Cross-validation selects a single PLS component (ncomp = 1). One caution: max(pls_chem$results$Rsquared) returns the best R² anywhere on the tuning grid, which need not belong to the RMSE-optimal model; the 0.56 should therefore be read as the best resampled R² across component counts, not necessarily the R² of the one-component fit (the check below pulls the matching row directly). Overall, the resampled results indicate moderate predictive capability for yield from the process and biological predictors.
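
Because bestTune is selected on RMSE, the R² belonging to the chosen model should be read from the matching row of the results table; a minimal check:

pls_chem$results[pls_chem$results$ncomp == pls_chem$bestTune$ncomp,
                 c("ncomp", "RMSE", "Rsquared", "MAE")]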

(d) Test Set Performance

pred_chem <- predict(pls_chem, test_x_chem)
test_r2_chem <- cor(pred_chem, test_y_chem)^2
test_rmse_chem <- sqrt(mean((pred_chem - test_y_chem)^2))

plot(test_y_chem, pred_chem,
     xlab = "Actual Yield (%)",
     ylab = "Predicted Yield (%)",
     main = "Manufacturing Yield: Predicted vs Actual")
abline(0, 1, col = "blue", lty = 2)
text(42, 35, paste("Test R² =", round(test_r2_chem, 3)))

test_r2_chem
## [1] 0.2022605
test_rmse_chem
## [1] 1.899277

The test set R² is 0.202, well below the cross-validated R² of about 0.56. With only around 40 held-out samples the test estimate is itself noisy, but a gap this large suggests the resampling estimate was optimistic and that the model may be overfitting the training data.
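
With a training set of roughly 130 samples, a single 10-fold CV run is noisy; repeated cross-validation averages over several fold assignments and usually yields a steadier estimate. A hedged sketch (5 repeats is a conventional choice, not a tuned one):

ctrl_rep <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
set.seed(456)
pls_chem_rep <- train(train_x_chem, train_y_chem,
                      method = "pls",
                      tuneLength = 20,
                      trControl = ctrl_rep,
                      preProcess = c("center", "scale"))
pls_chem_rep$results[pls_chem_rep$results$ncomp == pls_chem_rep$bestTune$ncomp, ]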

(e) Predictor Importance

var_imp <- varImp(pls_chem)
plot(var_imp, top = 20, main = "Top 20 Most Influential Predictors")

top_predictors <- rownames(var_imp$importance)[order(var_imp$importance$Overall, 
                                                      decreasing = TRUE)[1:20]]

manufacturing_vars <- grep("^Manufacturing", top_predictors, value = TRUE)
biological_vars <- grep("^Biological", top_predictors, value = TRUE)

Among the top 20 predictors, 12 are manufacturing process variables and 8 are biological material measurements (the tally below confirms the split). Manufacturing predictors outnumber the biological ones, which is encouraging from a practical standpoint: they are the variables that can actually be controlled and adjusted during production.
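
A one-line check of that split, using the predictor-name prefixes in this dataset:

table(ifelse(grepl("^Manufacturing", top_predictors), "Manufacturing", "Biological"))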

(f) Relationships with Top Predictors

top_4_predictors <- rownames(var_imp$importance)[order(var_imp$importance$Overall, 
                                                        decreasing = TRUE)[1:4]]

par(mfrow = c(2, 2))
for (pred in top_4_predictors) {
  plot(imputed_predictors[, pred], ChemicalManufacturingProcess$Yield,
       xlab = pred,
       ylab = "Yield (%)",
       main = paste("Yield vs", pred))
  abline(lm(ChemicalManufacturingProcess$Yield ~ imputed_predictors[, pred]), 
         col = "red")
}

par(mfrow = c(1, 1))

The scatter plots show clear relationships between the top predictors and yield; the fitted red lines indicate that the dominant trends are roughly linear, with some predictors associated with higher yield and others with lower yield (see the correlation check below). This suggests practical optimization strategies: monitoring these key process parameters closely and adjusting them within operational constraints could lead to meaningful yield improvements. Since manufacturing variables dominate the importance rankings and are controllable, focusing on these specific process conditions offers the most direct path to boosting yield and revenue.
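
As a numeric complement to the plots, the direction and strength of each relationship can be read off the correlations with yield; a minimal sketch using the imputed (standardized) predictors:

# Pearson correlation of each top predictor with yield; the sign indicates
# whether increasing that variable is associated with higher or lower yield.
cor(imputed_predictors[, top_4_predictors], ChemicalManufacturingProcess$Yield)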