6.2

Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a)

Start R and use these commands to load the data:

library(AppliedPredictiveModeling) data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

data(permeability)

# store the predictors and response in separate objects
perm_x <- as.data.frame(fingerprints)
perm_y <- permeability

# check the dimensions of the predictor matrix and response vector
dim(perm_x)
## [1]  165 1107
length(perm_y)
## [1] 165
# compare the original response to the log-transformed response
par(mfrow = c(1, 2))
hist(perm_y, main = "Permeability", xlab = "Permeability")
hist(log10(perm_y), main = "log10(Permeability)", xlab = "log10(Permeability)")

par(mfrow = c(1, 1))

# use the log-transformed response for modeling
perm_y_log <- log10(perm_y)

The permeability data contain 165 compounds and 1107 binary fingerprint predictors. The response is strongly right-skewed on the original scale, so I used log10(permeability) for modeling because the transformed response is more symmetric and better suited for regression.

(b)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nzv_idx <- nearZeroVar(perm_x)
perm_x_nzv <- perm_x[, -nzv_idx]

length(nzv_idx)
## [1] 719
ncol(perm_x_nzv)
## [1] 388

After filtering the predictors with nearZeroVar(), 719 sparse predictors were removed, leaving 388 predictors for modeling.

(c)

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

# create a reproducible training/test split using the log-transformed response
set.seed(523)
train_rows_perm <- createDataPartition(perm_y_log, p = 0.75, list = FALSE)

perm_x_train <- perm_x_nzv[train_rows_perm, ]
perm_x_test <- perm_x_nzv[-train_rows_perm, ]
perm_y_train <- perm_y_log[train_rows_perm]
perm_y_test <- perm_y_log[-train_rows_perm]

# center and scale the predictors using the training set only
pp_perm <- preProcess(perm_x_train, method = c("center", "scale"))

perm_x_train_pp <- predict(pp_perm, perm_x_train)
perm_x_test_pp <- predict(pp_perm, perm_x_test)
# tune the number of PLS components using repeated train/test splits
set.seed(523)
ctrl_perm <- trainControl(method = "LGOCV", number = 25, p = 0.75)

pls_perm <- train(
  x = perm_x_train_pp,
  y = perm_y_train,
  method = "pls",
  tuneGrid = expand.grid(ncomp = 1:15),
  metric = "Rsquared",
  trControl = ctrl_perm
)

pls_perm
## Partial Least Squares 
## 
## 125 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 75%) 
## Summary of sample sizes: 96, 96, 96, 96, 96, 96, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.5886037  0.2694262  0.4826885
##    2     0.5514556  0.3560571  0.4328660
##    3     0.5207134  0.4348129  0.4120690
##    4     0.5231492  0.4410068  0.4139973
##    5     0.5172680  0.4616020  0.4070493
##    6     0.5257027  0.4557305  0.4103489
##    7     0.5198374  0.4814553  0.4002685
##    8     0.5177427  0.4937874  0.3950801
##    9     0.5136173  0.5078961  0.3964033
##   10     0.5120501  0.5119936  0.3949570
##   11     0.5172456  0.5144198  0.4015524
##   12     0.5251537  0.5092766  0.4104458
##   13     0.5396847  0.4923351  0.4225691
##   14     0.5451044  0.4893766  0.4253877
##   15     0.5563348  0.4777751  0.4317421
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 11.
# extract the tuning result with the largest resampled R-squared
best_pls_perm <- pls_perm$results %>%
  arrange(desc(Rsquared)) %>%
  slice(1)

best_pls_perm
##   ncomp      RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
## 1    11 0.5172456 0.5144198 0.4015524 0.06092767 0.08530535 0.05023191
plot(pls_perm, metric = "Rsquared")

I split the data into training and test sets, centered and scaled the predictors using the training set, and tuned a partial least squares model on the training data. The optimal model used 11 latent variables, and the corresponding resampled estimate of \(R^2\) was 0.5144.

(d)

Predict the response for the test set. What is the test set estimate of \(R^2\)?

# predict log permeability on the hold-out test set
pls_perm_pred <- predict(pls_perm, newdata = perm_x_test_pp)

# summarize the test-set performance
pls_perm_test_stats <- postResample(pred = pls_perm_pred, obs = perm_y_test)
pls_perm_test_stats
##      RMSE  Rsquared       MAE 
## 0.5254459 0.3911148 0.4264977

Using the tuned PLS model on the hold-out test set, the test-set \(R^2\) was 0.3911. This suggests moderate predictive ability on unseen data, although the test performance was lower than the resampled training estimate.

(e)

Try building other models discussed in this chapter. Do any have better predictive performance?

set.seed(523)
ridge_grid <- expand.grid(lambda = seq(0.02, 0.35, length.out = 9))

ridge_perm <- train(
  x = perm_x_train_pp,
  y = perm_y_train,
  method = "ridge",
  tuneGrid = ridge_grid,
  metric = "Rsquared",
  trControl = ctrl_perm
)

ridge_best <- ridge_perm$results %>%
  arrange(desc(Rsquared)) %>%
  slice(1)

ridge_best
##   lambda      RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
## 1   0.35 0.5534298 0.5239394 0.4319008 0.07019289 0.08423931 0.05952944
# fit an elastic net model using a tuning grid that avoids unstable zero-penalty fits
set.seed(523)
enet_grid <- expand.grid(
  lambda = c(0.01, 0.05, 0.10),
  fraction = seq(0.05, 0.95, by = 0.05)
)

enet_perm <- train(
  x = perm_x_train_pp,
  y = perm_y_train,
  method = "enet",
  tuneGrid = enet_grid,
  metric = "Rsquared",
  trControl = ctrl_perm
)

# remove any failed tuning combinations before selecting the best result
enet_results_clean <- enet_perm$results %>%
  filter(!is.na(Rsquared))

enet_best <- enet_results_clean %>%
  arrange(desc(Rsquared)) %>%
  slice(1)

enet_best
##   lambda fraction      RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 1    0.1      0.1 0.6330262 0.5313755 0.464279 0.7576115  0.1118589 0.4696797
# compare the models using hold-out test performance
ridge_perm_pred <- predict(ridge_perm, newdata = perm_x_test_pp)
enet_perm_pred <- predict(enet_perm, newdata = perm_x_test_pp)

ridge_test_stats <- postResample(pred = ridge_perm_pred, obs = perm_y_test)
enet_test_stats <- postResample(pred = enet_perm_pred, obs = perm_y_test)

perm_model_compare <- tibble(
  Model = c("PLS", "Ridge", "Elastic Net"),
  Best_Resampled_R2 = c(
    best_pls_perm$Rsquared, 
    ridge_best$Rsquared, 
    enet_best$Rsquared
  ),
  Test_R2 = c(
    unname(pls_perm_test_stats["Rsquared"]),
    unname(ridge_test_stats["Rsquared"]),
    unname(enet_test_stats["Rsquared"])
  ),
  Test_RMSE = c(
    unname(pls_perm_test_stats["RMSE"]),
    unname(ridge_test_stats["RMSE"]),
    unname(enet_test_stats["RMSE"])
  )
)

perm_model_compare <- perm_model_compare %>%
  arrange(desc(Test_R2))

perm_model_compare
## # A tibble: 3 × 4
##   Model       Best_Resampled_R2 Test_R2 Test_RMSE
##   <chr>                   <dbl>   <dbl>     <dbl>
## 1 Elastic Net             0.531   0.686     0.385
## 2 Ridge                   0.524   0.394     0.546
## 3 PLS                     0.514   0.391     0.525
# identify the best model by hold-out test R-Squared
best_perm_model <- perm_model_compare %>%
  slice(1)

best_perm_model
## # A tibble: 1 × 4
##   Model       Best_Resampled_R2 Test_R2 Test_RMSE
##   <chr>                   <dbl>   <dbl>     <dbl>
## 1 Elastic Net             0.531   0.686     0.385

I also fit ridge regression and elastic net models to compare with the PLS model. On this train/test split, Elastic Net had the best predictive performance, with the highest test-set \(R^2\) and the lowest test RMSE. However, because the dataset is relatively small and the Elastic Net test performance was noticeably higher than its resampled training estimate, I would interpret this result cautiously and view it as evidence that Elastic Net is promising rather than definitively superior in all settings.

(f)

Would you recommend any of your models to replace the permeability laboratory experiment?

Although Elastic Net performed best among the models I tried, I would not recommend replacing the permeability laboratory experiment entirely. The sample size is still relatively small, and even the best model makes prediction errors on unseen data. In practice, I would use the model as a screening tool to prioritize compounds for laboratory testing, reduce cost, and speed up early decision-making. However, final permeability assessment should still be confirmed by laboratory experiments.

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a)

Start R and use these commands to load the data:

library(AppliedPredictiveModeling) data(chemicalManufacturingProcess)

I loaded the ChemicalManufacturingProcess data from the AppliedPredictiveModeling package and separated the predictors from the response variable Yield.

# load the chemical manufacturing data
data(ChemicalManufacturingProcess)

# store this dataset as one data frame
chem_data <- ChemicalManufacturingProcess

# separate predictors and response for modeling
chem_x <- chem_data %>%
  select(-Yield)

chem_y <- chem_data$Yield

# inspect the dimensions of the predictor matrix
nrow(chem_x)
## [1] 176
ncol(chem_x)
## [1] 57

The chemical manufacturing data contain 176 runs and 57 predictors.

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values(e.g., Sect.3.8).

# summarize missingness in the predictor matrix
sum(is.na(chem_x))
## [1] 106
mean(is.na(as.matrix(chem_x))) * 100
## [1] 1.056619
# inspect the response distribution before modeling
hist(chem_y, main = "Yield", xlab = "Yield")

The predictor set contains 106 missing cells, which is about 1.0566% of all predictor values. Because imputation should be learned from the training data only, I used k-nearest neighbors imputation during preprocessing in part (c).

(c)

Split the data into a training and a test set, pre-process the data, and tuned a model of your choice from this chapter. What is the optimal value of the performance metric?

I chose a partial least squares (PLS) model because the manufacturing predictors are likely to be correlated, and PLS is well suited for handling correlated numeric predictors.

# create a reproducible training/test split
set.seed(711)
train_rows_chem <- createDataPartition(chem_y, p = 0.70, list = FALSE)

chem_x_train <- chem_x[train_rows_chem, , drop = FALSE]
chem_x_test  <- chem_x[-train_rows_chem, , drop = FALSE]
chem_y_train <- chem_y[train_rows_chem]
chem_y_test  <- chem_y[-train_rows_chem]
#  remove near-zero-variance predictors using training data only
nzv_chem <- nearZeroVar(chem_x_train)

if(length(nzv_chem) > 0) {
  chem_x_train <- chem_x_train[, -nzv_chem, drop = FALSE]
  chem_x_test <- chem_x_test[, -nzv_chem, drop = FALSE]
}
# preprocess the predictors using training data only
pp_chem <- preProcess(
  chem_x_train,
  method = c("knnImpute", "center", "scale")
)

chem_x_train_pp <- predict(pp_chem, chem_x_train)
chem_x_test_pp  <- predict(pp_chem, chem_x_test)
# remove highly correlated predictors after preprocessing to reduce redundancy
chem_corr <- cor(chem_x_train_pp)
high_corr_chem <- findCorrelation(chem_corr, cutoff = 0.90)

if (length(high_corr_chem) > 0) {
  chem_x_train_pp <- chem_x_train_pp[, -high_corr_chem, drop = FALSE]
  chem_x_test_pp  <- chem_x_test_pp[, -high_corr_chem, drop = FALSE]
}

ncol(chem_x_train_pp)
## [1] 46
# tune a PLS model using bootstrap resampling
set.seed(711)
ctrl_chem <- trainControl(method = "boot", number = 25)

max_comp_chem <- min(15, ncol(chem_x_train_pp))

pls_chem <- train(
  x = chem_x_train_pp,
  y = chem_y_train,
  method = "pls",
  tuneGrid = expand.grid(ncomp = 1:max_comp_chem),
  metric = "Rsquared",
  trControl = ctrl_chem
)

pls_chem
## Partial Least Squares 
## 
## 124 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.285277  0.4845803  1.039031
##    2     1.235730  0.5287046  1.011201
##    3     1.249662  0.5290953  1.023099
##    4     1.306022  0.5047897  1.062658
##    5     1.384069  0.4732810  1.120213
##    6     1.438796  0.4521674  1.158914
##    7     1.500446  0.4265575  1.209462
##    8     1.537667  0.4101771  1.241285
##    9     1.580676  0.3927867  1.274166
##   10     1.624889  0.3763657  1.308129
##   11     1.677362  0.3589233  1.344903
##   12     1.720009  0.3488622  1.367720
##   13     1.768729  0.3363045  1.398932
##   14     1.817138  0.3254607  1.425090
##   15     1.865197  0.3166620  1.445549
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.
# extract the best tuning result for reporting
best_pls_chem <- pls_chem$results %>%
  arrange(desc(Rsquared)) %>%
  slice(1)

best_pls_chem
##   ncomp     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 1     3 1.249662 0.5290953 1.023099 0.1238408 0.08504925 0.1000714
# visualize performance across the number of PLS components
plot(pls_chem, metric = "Rsquared")

After splitting the data into training and test sets, I removed near-zero-variance predictors using the training data, then applied k-nearest neighbors imputation, centering, and scaling. I also removed highly correlated predictors after preprocessing to reduce redundancy in the predictor set. Using bootstrap resampling, the optimal PLS model used 3 components and achieved a resampled \(R^2\) of 0.5291.

(d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

# generate predictions for the hold-out test set
pls_chem_pred <- predict(pls_chem, newdata = chem_x_test_pp)

# summarize test-set performance
pls_chem_test_stats <- postResample(pred = pls_chem_pred, obs = chem_y_test)
pls_chem_test_stats
##      RMSE  Rsquared       MAE 
## 1.2902007 0.5614766 1.0498534
# compare test-set R-squared to the resampled training R-squared
chem_r2_gap <- unname(pls_chem_test_stats["Rsquared"]) - best_pls_chem$Rsquared
chem_r2_gap
## [1] 0.03238133

On the hold-out test set, the tuned PLS model achieved an \(R^2\) of 0.5615 and an RMSE of 1.2902. The resampled training-set \(R^2\) from part (c) was 0.5291, so the test-set \(R^2\) was higher by 0.0324 compared with the resampled training estimate. This suggests that the model generalized reasonably well to unseen manufacturing runs, although the difference should be interpreted cautiously because it is based on one train/test split.

(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

# compute variable importance for the tuned PLS model
chem_varimp <- varImp(pls_chem, scale = FALSE)$importance %>%
  data.frame(Predictor = rownames(.), ., row.names = NULL) %>%
  arrange(desc(Overall))
## Warning: package 'pls' was built under R version 4.5.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
# classify predictors as biological or process variables
chem_varimp <- chem_varimp %>%
  mutate(
    Predictor_Type = if_else(
      grepl("^BiologicalMaterial", Predictor),
      "Biological",
      "Process"
    )
  )

# inspect the 10 most importance predictors
top10_chem <- chem_varimp %>%
  slice_head(n = 10)

top10_chem
##                 Predictor    Overall Predictor_Type
## 1  ManufacturingProcess32 0.15328137        Process
## 2  ManufacturingProcess36 0.12314752        Process
## 3  ManufacturingProcess13 0.12221124        Process
## 4  ManufacturingProcess09 0.11978856        Process
## 5  ManufacturingProcess17 0.11768643        Process
## 6    BiologicalMaterial06 0.10420115     Biological
## 7  ManufacturingProcess33 0.10329736        Process
## 8    BiologicalMaterial08 0.09362826     Biological
## 9    BiologicalMaterial01 0.09014032     Biological
## 10   BiologicalMaterial03 0.08934011     Biological
# visualize the top 10 variable importance scores
ggplot(top10_chem, aes(x = reorder(Predictor, Overall), y = Overall, fill = Predictor_Type)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Variable Importance Scores for the PLS Model",
    x = "Predictor",
    y = "Importance"
  )

The most important predictors in the tuned PLS model are shown above. Among the top 10 predictors, 4 are biological variables and 6 are process variables. The first five predictors in the ranking are all process variables, so process predictors clearly dominate the top of the importance list. This suggests that yield may be influenced more by variables that can be adjusted during manufacturing than by raw-material characteristics alone.

(f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

# choose the six most important predictors for relationship plots
top6_chem <- top10_chem %>%
  slice_head(n = min(6, nrow(top10_chem)))

top6_chem
##                Predictor   Overall Predictor_Type
## 1 ManufacturingProcess32 0.1532814        Process
## 2 ManufacturingProcess36 0.1231475        Process
## 3 ManufacturingProcess13 0.1222112        Process
## 4 ManufacturingProcess09 0.1197886        Process
## 5 ManufacturingProcess17 0.1176864        Process
## 6   BiologicalMaterial06 0.1042011     Biological
# calculate simple correlations between each top predictor and yield 
chem_relationships <- tibble(
  Predictor = top6_chem$Predictor,
  Correlation = sapply(
    top6_chem$Predictor,
    function(v) cor(chem_x_train_pp[[v]], chem_y_train, use = "complete.obs")
  )
) %>%
  mutate(
    Direction = if_else(Correlation >= 0, "Positive", "Negative"),
    Abs_Correlation = abs(Correlation)
  ) %>%
  arrange(desc(Abs_Correlation))

chem_relationships
## # A tibble: 6 × 4
##   Predictor              Correlation Direction Abs_Correlation
##   <chr>                        <dbl> <chr>               <dbl>
## 1 ManufacturingProcess32       0.618 Positive            0.618
## 2 ManufacturingProcess36      -0.518 Negative            0.518
## 3 BiologicalMaterial06         0.484 Positive            0.484
## 4 ManufacturingProcess09       0.473 Positive            0.473
## 5 ManufacturingProcess13      -0.470 Negative            0.470
## 6 ManufacturingProcess17      -0.415 Negative            0.415
# reshape the training data for faceted scatterplots
chem_plot_data <- chem_x_train_pp %>%
  select(all_of(top6_chem$Predictor)) %>%
  mutate(Yield = chem_y_train) %>%
  pivot_longer(
    cols = -Yield,
    names_to = "Predictor",
    values_to = "Value"
  )

chem_plot_data %>%
  head()
## # A tibble: 6 × 3
##   Yield Predictor               Value
##   <dbl> <chr>                   <dbl>
## 1    38 ManufacturingProcess32 -0.398
## 2    38 ManufacturingProcess36 -0.660
## 3    38 ManufacturingProcess13  1.07 
## 4    38 ManufacturingProcess09 -1.93 
## 5    38 ManufacturingProcess17  1.06 
## 6    38 BiologicalMaterial06   -1.36
# visualize the relationships between the top predictors and yield
ggplot(chem_plot_data, aes(x = Value, y = Yield)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~ Predictor, scales = "free_x") +
  labs(
    title = "Relationships Between the Top Predictors and Yield",
    x = "Preprocessed Predictor Value",
    y = "Yield"
  )
## `geom_smooth()` using formula = 'y ~ x'

The six most important predictors in the tuned PLS model were ManufacturingProcess32, ManufacturingProcess36, ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, and BiologicalMaterial06. Among these six predictors, five were process variables and one was a biological variable. The relationship plots and correlation results suggest that ManufacturingProcess32 has the strongest positive association with yield, while ManufacturingProcess36 has the strongest negative association. More generally, ManufacturingProcess32, ManufacturingProcess09, and BiologicalMaterial06 appear to be positively related to yield, whereas ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17 appear to be negatively related.

This information is useful because most of the top predictors are process variables, which means they may be adjusted during manufacturing. Variables with positive relationships to yield may point to operating conditions worth maintaining or increasing, while variables with negative relationships may indicate conditions that should be reduced or monitored more closely. The biological variable BiologicalMaterial06 is also important because it may help flag raw-material quality before production begins. Since the predictors were preprocessed, I would use these plots mainly to identify important patterns and candidate variables for process optimization rather than to recommend exact operating targets directly from the model.