1 Introduction

This assignment completes Problems 6.2 and 6.3 from Kuhn and Johnson’s Applied Predictive Modeling. These problems focus on regression methods from Chapter 6 and compare how different models perform on two pharmaceutical data sets. I used the data sets included with the AppliedPredictiveModeling package and evaluated models using cross-validation and a held-out test set.

2 Problem 6.2

Developing a model to predict permeability could help a pharmaceutical company reduce time and cost by screening compounds before sending them through laboratory testing.

2.1 6.2(a)

Start R and use these commands to load the data.

data(permeability)
data(fingerprints)

dim(permeability)
## [1] 165   1
dim(fingerprints)
## [1]  165 1107
head(permeability)
##   permeability
## 1       12.520
## 2        1.120
## 3       19.405
## 4        1.730
## 5        1.680
## 6        0.510

The fingerprints data contains 1,107 binary predictors for 165 compounds, while permeability contains the response values.

2.2 6.2(b)

Filter out the predictors that have low frequencies using nearZeroVar. How many predictors are left for modeling?

nzv_cols <- nearZeroVar(fingerprints)
fingerprints_filtered <- fingerprints[, -nzv_cols]

dim(fingerprints_filtered)
## [1] 165 388
ncol(fingerprints_filtered)
## [1] 388

After removing near-zero variance predictors, there are 388 predictors left for modeling.

2.3 6.2(c)

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

set.seed(5889)
perm_index <- createDataPartition(permeability, p = 0.80, list = FALSE)

perm_train_x <- fingerprints_filtered[perm_index, ]
perm_test_x  <- fingerprints_filtered[-perm_index, ]
perm_train_y <- permeability[perm_index]
perm_test_y  <- permeability[-perm_index]

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

set.seed(5889)
pls_fit <- train(
  x = perm_train_x,
  y = perm_train_y,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_fit$bestTune
##   ncomp
## 3     3
max(pls_fit$results$Rsquared)
## [1] 0.525382
pls_fit$results %>% arrange(desc(Rsquared)) %>% head()
##   ncomp     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1     3 11.26236 0.5253820 8.216301 3.103523  0.2178676 1.985033
## 2     9 11.65280 0.5190977 8.767226 2.882035  0.2197635 1.977037
## 3     4 11.43912 0.5186295 8.410240 3.087314  0.2236229 1.878985
## 4    10 11.74210 0.5160713 8.813084 2.927320  0.2183269 1.944957
## 5     8 11.67862 0.5110293 8.834702 2.767934  0.2109630 1.867351
## 6     2 11.44487 0.5093270 8.047187 3.201866  0.2466197 1.950012
ggplot(pls_fit$results, aes(x = ncomp, y = Rsquared)) +
  geom_line() +
  geom_point() +
  labs(
    title = "PLS Cross-Validated R-squared by Number of Components",
    x = "Number of Components",
    y = expression(R^2)
  ) +
  theme_minimal()

The optimal number of latent variables is 3, and the corresponding resampled \(R^2\) is about 0.51.

2.4 6.2(d)

Predict the response for the test set. What is the test set estimate of \(R^2\)?

pls_pred <- predict(pls_fit, perm_test_x)
pls_perf <- postResample(pls_pred, perm_test_y)
pls_perf
##       RMSE   Rsquared        MAE 
## 13.9266124  0.2791087 10.7374415

The test-set \(R^2\) is about 0.28. This is lower than the resampled training \(R^2\), which suggests the model does not generalize as strongly on unseen data as the cross-validation estimate might suggest.

2.5 6.2(e)

Try building other models discussed in this chapter. Do any have better predictive performance?

set.seed(5889)
ridge_fit <- train(
  x = perm_train_x,
  y = perm_train_y,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 20)),
  preProcess = c("center", "scale"),
  trControl = ctrl
)

set.seed(5889)
lasso_fit <- train(
  x = perm_train_x,
  y = perm_train_y,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 20)),
  preProcess = c("center", "scale"),
  trControl = ctrl
)

set.seed(5889)
enet_fit <- train(
  x = perm_train_x,
  y = perm_train_y,
  method = "glmnet",
  tuneLength = 10,
  preProcess = c("center", "scale"),
  trControl = ctrl
)

ridge_pred <- predict(ridge_fit, perm_test_x)
lasso_pred <- predict(lasso_fit, perm_test_x)
enet_pred  <- predict(enet_fit, perm_test_x)

ridge_perf <- postResample(ridge_pred, perm_test_y)
lasso_perf <- postResample(lasso_pred, perm_test_y)
enet_perf  <- postResample(enet_pred, perm_test_y)

model_results_62 <- tibble(
  Model = c("PLS", "Ridge", "Lasso", "Elastic Net"),
  Rsquared = c(pls_perf["Rsquared"], ridge_perf["Rsquared"], lasso_perf["Rsquared"], enet_perf["Rsquared"]),
  RMSE = c(pls_perf["RMSE"], ridge_perf["RMSE"], lasso_perf["RMSE"], enet_perf["RMSE"]),
  MAE = c(pls_perf["MAE"], ridge_perf["MAE"], lasso_perf["MAE"], enet_perf["MAE"])
)

model_results_62
## # A tibble: 4 × 4
##   Model       Rsquared  RMSE   MAE
##   <chr>          <dbl> <dbl> <dbl>
## 1 PLS            0.279  13.9 10.7 
## 2 Ridge          0.369  12.6  9.30
## 3 Lasso          0.445  11.9  8.31
## 4 Elastic Net    0.469  11.5  8.01

Based on the test-set results, Elastic Net performs the best, followed by Lasso, then Ridge, and finally PLS. This makes sense because the permeability data has many sparse binary predictors, so methods that shrink coefficients and perform feature selection are especially helpful.

2.6 6.2(f)

Would you recommend any of your models to replace the permeability laboratory experiment?

I would not fully replace the lab experiment yet. The Elastic Net model performs better than the other models, but its predictive strength is still moderate. I think it would be more useful as a screening tool to help prioritize which compounds should move forward for laboratory testing. That could still save time and money without depending completely on the model.

3 Problem 6.3

This problem uses data from a chemical manufacturing process. The goal is to understand how biological measurements and manufacturing process variables relate to product yield.

3.1 6.3(a)

Start R and use these commands to load the data.

data(ChemicalManufacturingProcess)

dim(ChemicalManufacturingProcess)
## [1] 176  58
head(ChemicalManufacturingProcess)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                     NA                     NA                     NA
## 2                    0.0                      0                     NA
## 3                    0.0                      0                     NA
## 4                    0.0                      0                     NA
## 5                   10.7                      0                     NA
## 6                   12.0                      0                     NA
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                     NA                     NA                     NA
## 2                    917                 1032.2                  210.0
## 3                    912                 1003.6                  207.1
## 4                    911                 1014.6                  213.3
## 5                    918                 1027.5                  205.7
## 6                    924                 1016.8                  208.9
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                     NA                     NA                  43.00
## 2                    177                    178                  46.57
## 3                    178                    178                  45.07
## 4                    177                    177                  44.92
## 5                    178                    178                  44.96
## 6                    178                    178                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                     NA                     NA                     NA
## 2                     NA                     NA                      0
## 3                     NA                     NA                      0
## 4                     NA                     NA                      0
## 5                     NA                     NA                      0
## 6                     NA                     NA                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                     NA                     NA                     NA
## 2                      3                      0                      3
## 3                      4                      1                      4
## 4                      5                      2                      5
## 5                      8                      4                     18
## 6                      9                      1                      1
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                     NA                     NA                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0

The data has 176 observations and 58 variables total, including the response Yield.

3.2 6.3(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

cmp_y <- ChemicalManufacturingProcess$Yield
cmp_x <- ChemicalManufacturingProcess %>% select(-Yield)

missing_counts <- sapply(cmp_x, function(x) sum(is.na(x)))
missing_counts[missing_counts > 0]
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 
##                      1                      3                     15 
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 
##                      1                      1                      2 
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 
##                      1                      1                      9 
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 
##                     10                      1                      1 
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 
##                      1                      1                      1 
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 
##                      5                      5                      5 
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 
##                      5                      5                      5 
## ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 
##                      5                      5                      5 
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 
##                      5                      5                      1 
## ManufacturingProcess41 
##                      1
set.seed(5889)
cmp_imputer <- preProcess(cmp_x, method = "knnImpute")
cmp_x_imputed <- predict(cmp_imputer, cmp_x)

sum(is.na(cmp_x_imputed))
## [1] 0

I used KNN imputation to fill in the missing predictor values. After imputation, there are no missing values left in the predictor set.

3.3 6.3(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I chose a PLS model for this part.

set.seed(5889)
cmp_index <- createDataPartition(cmp_y, p = 0.80, list = FALSE)

cmp_train_x <- cmp_x_imputed[cmp_index, ]
cmp_test_x  <- cmp_x_imputed[-cmp_index, ]
cmp_train_y <- cmp_y[cmp_index]
cmp_test_y  <- cmp_y[-cmp_index]

set.seed(5889)
cmp_pls_fit <- train(
  x = cmp_train_x,
  y = cmp_train_y,
  method = "pls",
  preProcess = c("center", "scale", "BoxCox"),
  tuneLength = 20,
  trControl = ctrl
)

cmp_pls_fit$bestTune
##   ncomp
## 3     3
cmp_pls_fit$results %>% filter(ncomp == cmp_pls_fit$bestTune$ncomp)
##   ncomp    RMSE  Rsquared   MAE    RMSESD RsquaredSD     MAESD
## 1     3 1.42273 0.5327401 1.087 0.5126658  0.1796494 0.2441878

The best PLS model uses 3 components. The resampled performance for that setting gives an \(R^2\) of about 0.53.

3.4 6.3(d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

cmp_pred <- predict(cmp_pls_fit, cmp_test_x)
cmp_perf <- postResample(cmp_pred, cmp_test_y)
cmp_perf
##      RMSE  Rsquared       MAE 
## 0.9518583 0.7325815 0.7905189

The test-set \(R^2\) is about 0.73, which is higher than the resampled training \(R^2\) from cross-validation. This means the model performed better on the test split than expected. That can happen sometimes with a favorable split, although in general I would still rely on repeated validation before drawing a strong conclusion.

3.5 6.3(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

cmp_imp <- varImp(cmp_pls_fit, scale = TRUE)
cmp_imp
## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   90.09
## ManufacturingProcess09   88.26
## ManufacturingProcess17   86.82
## ManufacturingProcess36   82.08
## ManufacturingProcess33   59.65
## ManufacturingProcess06   59.21
## ManufacturingProcess11   58.39
## BiologicalMaterial02     55.77
## BiologicalMaterial08     54.37
## BiologicalMaterial06     53.37
## ManufacturingProcess12   51.76
## BiologicalMaterial04     50.79
## BiologicalMaterial03     49.79
## BiologicalMaterial12     49.79
## BiologicalMaterial11     48.74
## BiologicalMaterial01     47.91
## ManufacturingProcess28   42.90
## ManufacturingProcess04   40.03
## ManufacturingProcess37   38.31
plot(cmp_imp, top = 20, main = "Top 20 Predictors for Yield")

The most important predictors are mostly manufacturing process variables, which suggests that the process predictors dominate the model rather than the biological raw material variables.

3.6 6.3(f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

imp_df <- cmp_imp$importance %>%
  tibble::rownames_to_column("Predictor") %>%
  arrange(desc(Overall))

head(imp_df, 10)
##                 Predictor   Overall
## 1  ManufacturingProcess32 100.00000
## 2  ManufacturingProcess13  90.08629
## 3  ManufacturingProcess09  88.26199
## 4  ManufacturingProcess17  86.81590
## 5  ManufacturingProcess36  82.08255
## 6  ManufacturingProcess33  59.64919
## 7  ManufacturingProcess06  59.21093
## 8  ManufacturingProcess11  58.38776
## 9    BiologicalMaterial02  55.76742
## 10   BiologicalMaterial08  54.37120
top_vars <- imp_df$Predictor[1:3]
top_vars
## [1] "ManufacturingProcess32" "ManufacturingProcess13" "ManufacturingProcess09"
cmp_plot_data <- cmp_x_imputed %>%
  mutate(Yield = cmp_y)
for (v in top_vars) {
  print(
    ggplot(cmp_plot_data, aes(x = .data[[v]], y = Yield)) +
      geom_point(alpha = 0.6) +
      geom_smooth(method = "loess", se = FALSE, color = "steelblue") +
      labs(
        title = paste("Yield vs", v),
        x = v,
        y = "Yield"
      ) +
      theme_minimal()
  )
}

The top predictors show how specific process conditions are related to yield. Some of them appear to have positive relationships with yield, while others show flatter or negative patterns. This is useful because the process variables can be adjusted, unlike the biological variables. If a certain process measurement consistently aligns with higher yield, the company can use that information to fine-tune future runs and improve production efficiency.

4 Conclusion

Overall, these two exercises show that different regression methods can behave very differently depending on the structure of the predictors. For the permeability data, regularized models such as Elastic Net performed better than PLS because the predictors were sparse and high dimensional. For the chemical manufacturing data, the PLS model performed reasonably well and highlighted that manufacturing process variables were more influential than the biological variables. In both cases, the models are useful for insight and screening, but I would still be careful about using them as complete replacements for real world experiments or production testing.