1 Introduction

In this report, I will be answering questions 2 and 3 in the 6th chapter from the Kuhn and Johnson book.

2 Libraries

library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(glmnet)
library(MASS)
library(dplyr)
library(ggplot2)
library(tidyverse)

2.1 Question 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

2.1.1 (a)

Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

set.seed(1234)

# data
data(permeability)

# dimensions of the data
fingerprint_dim <- dim(fingerprints)
response_length <- length(permeability)

fingerprint_dim
## [1]  165 1107
response_length
## [1] 165
str(fingerprints)
##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
summary(permeability)
##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60

fingerprints is a 165 x 1107 binary matrix where each column indicates whether a molecular substructure is absent or present. The object permeability is a continuous response vector of length 165.

The response variable appears to be right-skewed because the mean is much larger than the median and the maximum value is far above the third quartile. Because of this, I considered two versions of the response: the original permeability values and a log-transformed version.

Response distribution

response_df <- data.frame(
  permeability = permeability,
  log_permeability = log(permeability)
)

summary(response_df)
##   permeability   permeability.1   
##  Min.   : 0.06   Min.   :-2.8134  
##  1st Qu.: 1.55   1st Qu.: 0.4383  
##  Median : 4.91   Median : 1.5913  
##  Mean   :12.24   Mean   : 1.5464  
##  3rd Qu.:15.47   3rd Qu.: 2.7389  
##  Max.   :55.60   Max.   : 4.0182
response_df |>
  pivot_longer(cols = everything(), names_to = "Response", values_to = "Value") |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 25, color = "white") +
  facet_wrap(~ Response, scales = "free") +
  labs(
    title = "Distribution of Original and Log-Transformed Permeability",
    x = "Response value",
    y = "Count"
  ) +
  theme_minimal()

2.1.2 (b)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nzv_index <- nearZeroVar(fingerprints)

fingerprints_filtered <- fingerprints[, -nzv_index]

original_predictors <- ncol(fingerprints)
removed_predictors <- length(nzv_index)
remaining_predictors <- ncol(fingerprints_filtered)

original_predictors
## [1] 1107
removed_predictors
## [1] 719
remaining_predictors
## [1] 388

After filtering out near-zero variance predictors, 388 predictors are left for modeling.This step substantially reduces dimensionality while retaining the most informative molecular features, which is especially important given the sparsity of the fingerprint data.

2.1.3 (c)

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

I split the data (original and log-transformed) into a training set and a test set using createDataPartition() so that the distribution of the response variable is represented in both sets.

set.seed(1234)
train_index <- createDataPartition(permeability, p = 0.80, list = FALSE)

x_train <- fingerprints_filtered[train_index, ]
x_test  <- fingerprints_filtered[-train_index, ]

y_train <- permeability[train_index]
y_test  <- permeability[-train_index]

y_train_log <- log(y_train)
y_test_log  <- log(y_test)

nrow(x_train)
## [1] 133
nrow(x_test)
## [1] 32

Because the predictors are binary, centering and scaling are not always required in the same way as for continuous predictors. However, for PLS and other regression models, using consistent preprocessing is helpful. Here, I will center and scale the predictors inside the train() function.

Train Control

set.seed(1234)
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

pls_grid <- expand.grid(ncomp = 1:20)

PLS on the Original Response

set.seed(1234)
pls_fit <- train(
  x = x_train,
  y = y_train,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneGrid = pls_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

pls_fit
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 121, 120, 118, 120, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.77879  0.3855978  9.793839
##    2     11.19020  0.5462406  7.942535
##    3     11.12470  0.5514622  8.359664
##    4     11.20094  0.5504032  8.466549
##    5     11.25765  0.5556787  8.561954
##    6     11.14923  0.5596007  8.334588
##    7     11.05521  0.5713326  8.302687
##    8     11.25520  0.5631958  8.541518
##    9     11.37238  0.5623321  8.596139
##   10     11.58338  0.5540710  8.800999
##   11     11.78742  0.5401137  8.956146
##   12     11.97542  0.5332715  9.083828
##   13     12.20496  0.5235523  9.196912
##   14     12.62862  0.4958865  9.474099
##   15     12.79523  0.4818240  9.530474
##   16     13.01989  0.4766020  9.605137
##   17     13.17761  0.4673446  9.628623
##   18     13.22728  0.4650710  9.666151
##   19     13.16922  0.4753337  9.575973
##   20     13.21340  0.4741987  9.555740
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 7.
plot(pls_fit)

best_pls_ncomp <- pls_fit$bestTune$ncomp
best_pls_results <- pls_fit$results |>
  filter(ncomp == best_pls_ncomp)

best_pls_ncomp
## [1] 7
best_pls_results
##   ncomp     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1     7 11.05521 0.5713326 8.302687 2.640405   0.190672 1.745732

PLS on the Log-Transformed Response

set.seed(1234)
pls_log_fit <- train(
  x = x_train,
  y = y_train_log,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneGrid = pls_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

pls_log_fit
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 121, 120, 118, 120, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.280345  0.3417054  1.0459600
##    2     1.141772  0.4834199  0.8910315
##    3     1.108650  0.5049825  0.8563055
##    4     1.121508  0.5023783  0.8721127
##    5     1.113671  0.5127793  0.8429958
##    6     1.108965  0.5237371  0.8174632
##    7     1.080778  0.5528280  0.8056394
##    8     1.052732  0.5817502  0.7886534
##    9     1.031979  0.6075312  0.7790562
##   10     1.036579  0.6063386  0.7868955
##   11     1.061507  0.5948601  0.8118387
##   12     1.086647  0.5823590  0.8323885
##   13     1.119682  0.5656230  0.8709612
##   14     1.133211  0.5596291  0.8805557
##   15     1.131757  0.5589670  0.8825847
##   16     1.133892  0.5588487  0.8806515
##   17     1.146674  0.5500391  0.8936552
##   18     1.156976  0.5466606  0.9020677
##   19     1.155499  0.5509110  0.8982716
##   20     1.164532  0.5473877  0.9055269
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.
plot(pls_log_fit)

best_pls_log_ncomp <- pls_log_fit$bestTune$ncomp
best_pls_log_results <- pls_log_fit$results |>
  filter(ncomp == best_pls_log_ncomp)

best_pls_log_ncomp
## [1] 9
best_pls_log_results
##   ncomp     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1     9 1.031979 0.6075312 0.7790562 0.2494862  0.1709737 0.1878993

For the PLS model using the original response, the optimal number of latent variables was 7, with a cross-validated R² of approximately 0.57.

For the log-transformed response, the optimal number of components increased to 9, with a cross-validated R² of approximately 0.61 (on the log scale).

The slightly higher R² for the log-transformed response suggests that reducing skewness improves the model’s ability to capture underlying structure. However, these values are not directly comparable across scales and should be interpreted cautiously.

2.1.4 (d)

Predict the response for the test set. What is the test set estimate of R2?

For the original-response model, predictions are already on the original permeability scale. For the log-response model, predictions are first made on the log scale and then back-transformed with exp() so that test-set performance can also be evaluated on the original permeability scale.

get_metrics <- function(obs, pred, model_name, response_scale) {
  data.frame(
    Model = model_name,
    Response_Scale = response_scale,
    Test_R2 = cor(obs, pred)^2,
    Test_RMSE = RMSE(pred, obs),
    Test_MAE = MAE(pred, obs)
  )
}
# original-response PLS
pls_pred <- predict(pls_fit, newdata = x_test)

# log-response PLS
pls_log_pred_log_scale <- predict(pls_log_fit, newdata = x_test)
pls_log_pred_original_scale <- exp(pls_log_pred_log_scale)

pls_test_results <- bind_rows(
  get_metrics(y_test, pls_pred, "PLS", "Original response"),
  get_metrics(y_test_log, pls_log_pred_log_scale, "PLS", "Log response / log-scale test"),
  get_metrics(y_test, pls_log_pred_original_scale, "PLS", "Log response / back-transformed test")
)

pls_test_results
##   Model                       Response_Scale   Test_R2 Test_RMSE Test_MAE
## 1   PLS                    Original response 0.2212101 12.703290 8.802542
## 2   PLS        Log response / log-scale test 0.2408409  1.341785 1.025175
## 3   PLS Log response / back-transformed test 0.3093118 12.727392 7.881455

Test set performance provides a more realistic estimate of model generalization. When evaluated on the original permeability scale, the log-transformed PLS model (after back-transformation) achieved a higher R² (~0.31) than the model trained on the original response (~0.22).

This indicates that transforming the response improved predictive performance in practical terms, even after converting predictions back to the original scale. However, the overall R² values remain relatively low, suggesting that a substantial portion of variability is not captured by the model.

PLS Observed vs Predicted

pls_plot_data <- bind_rows(
  data.frame(
    Observed = y_test,
    Predicted = pls_pred,
    Model = "PLS: Original response"
  ),
  data.frame(
    Observed = y_test,
    Predicted = pls_log_pred_original_scale,
    Model = "PLS: Log response, back-transformed"
  )
)

ggplot(pls_plot_data, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~ Model) +
  labs(
    title = "PLS Models: Observed vs Predicted Permeability",
    x = "Observed Permeability",
    y = "Predicted Permeability"
  ) +
  theme_minimal()

2.1.5 (e)

Try building other models discussed in this chapter. Do any have better predictive performance?

I will compare PLS with ridge regression, lasso regression, elastic net, and principal component regression. Each model is fit twice: once using the original response and once using the log-transformed response.

Model Grids

ridge_grid <- expand.grid(
  alpha = 0,
  lambda = 10^seq(-4, 2, length = 50)
)

lasso_grid <- expand.grid(
  alpha = 1,
  lambda = 10^seq(-4, 2, length = 50)
)

enet_grid <- expand.grid(
  alpha = seq(0.1, 0.9, by = 0.2),
  lambda = 10^seq(-4, 2, length = 30)
)

pcr_grid <- expand.grid(ncomp = 1:20)

Helper Function to Train both Modelts

train_original_and_log <- function(model_label, method_name, tune_grid) {
  set.seed(1234)
  fit_original <- train(
    x = x_train,
    y = y_train,
    method = method_name,
    preProcess = c("center", "scale"),
    tuneGrid = tune_grid,
    trControl = ctrl,
    metric = "Rsquared"
  )
  
  set.seed(1234)
  fit_log <- train(
    x = x_train,
    y = y_train_log,
    method = method_name,
    preProcess = c("center", "scale"),
    tuneGrid = tune_grid,
    trControl = ctrl,
    metric = "Rsquared"
  )
  
  pred_original <- predict(fit_original, newdata = x_test)
  pred_log_scale <- predict(fit_log, newdata = x_test)
  pred_log_backtransformed <- exp(pred_log_scale)
  
  test_results <- bind_rows(
    get_metrics(y_test, pred_original, model_label, "Original response"),
    get_metrics(y_test_log, pred_log_scale, model_label, "Log response / log-scale test"),
    get_metrics(y_test, pred_log_backtransformed, model_label, "Log response / back-transformed test")
  )
  
  cv_results <- bind_rows(
    fit_original$results |>
      filter(Rsquared == max(Rsquared)) |>
      slice(1) |>
      mutate(Model = model_label, Response_Scale = "Original response"),
    fit_log$results |>
      filter(Rsquared == max(Rsquared)) |>
      slice(1) |>
      mutate(Model = model_label, Response_Scale = "Log response")
  ) |>
    select(Model, Response_Scale, everything())
  
  list(
    fit_original = fit_original,
    fit_log = fit_log,
    test_results = test_results,
    cv_results = cv_results
  )
}

Ridge Regression

ridge_models <- train_original_and_log("Ridge Regression", "glmnet", ridge_grid)

ridge_models$fit_original$bestTune
##    alpha  lambda
## 49     0 75.4312
ridge_models$fit_log$bestTune
##    alpha   lambda
## 41     0 7.906043
ridge_models$test_results
##              Model                       Response_Scale   Test_R2 Test_RMSE
## 1 Ridge Regression                    Original response 0.1757161 12.841140
## 2 Ridge Regression        Log response / log-scale test 0.1576768  1.364284
## 3 Ridge Regression Log response / back-transformed test 0.1755076 13.688151
##   Test_MAE
## 1 9.011121
## 2 1.048982
## 3 8.377415

Lasso Regression

lasso_models <- train_original_and_log("Lasso Regression", "glmnet", lasso_grid)

lasso_models$fit_original$bestTune
##    alpha   lambda
## 30     1 0.355648
lasso_models$fit_log$bestTune
##    alpha    lambda
## 27     1 0.1526418
lasso_models$test_results
##              Model                       Response_Scale   Test_R2 Test_RMSE
## 1 Lasso Regression                    Original response 0.2384971 12.232278
## 2 Lasso Regression        Log response / log-scale test 0.4178389  1.139078
## 3 Lasso Regression Log response / back-transformed test 0.3701371 12.513340
##    Test_MAE
## 1 8.7359505
## 2 0.9262542
## 3 7.3824850

Elastic Net

enet_models <- train_original_and_log("Elastic Net", "glmnet", enet_grid)

enet_models$fit_original$bestTune
##    alpha   lambda
## 23   0.1 3.562248
enet_models$fit_log$bestTune
##    alpha    lambda
## 49   0.3 0.5298317
enet_models$test_results
##         Model                       Response_Scale   Test_R2 Test_RMSE
## 1 Elastic Net                    Original response 0.2444813 12.200925
## 2 Elastic Net        Log response / log-scale test 0.3948823  1.169366
## 3 Elastic Net Log response / back-transformed test 0.3485761 12.839981
##    Test_MAE
## 1 8.5816126
## 2 0.9506977
## 3 7.6604013

Principal Component Regression

pcr_models <- train_original_and_log("PCR", "pcr", pcr_grid)

pcr_models$fit_original$bestTune
##    ncomp
## 10    10
pcr_models$fit_log$bestTune
##    ncomp
## 11    11
pcr_models$test_results
##   Model                       Response_Scale   Test_R2 Test_RMSE Test_MAE
## 1   PCR                    Original response 0.1536834 13.582955 9.259680
## 2   PCR        Log response / log-scale test 0.1185924  1.454988 1.136864
## 3   PCR Log response / back-transformed test 0.1595568 13.874028 8.711250
plot(pcr_models$fit_original)

plot(pcr_models$fit_log)

Model Comparison

all_test_results <- bind_rows(
  pls_test_results,
  ridge_models$test_results,
  lasso_models$test_results,
  enet_models$test_results,
  pcr_models$test_results
)

# predictions evaluated on original permeability scale
original_scale_comparison <- all_test_results |>
  filter(Response_Scale %in% c("Original response", "Log response / back-transformed test")) |>
  arrange(desc(Test_R2), Test_RMSE)

original_scale_comparison
##               Model                       Response_Scale   Test_R2 Test_RMSE
## 1  Lasso Regression Log response / back-transformed test 0.3701371  12.51334
## 2       Elastic Net Log response / back-transformed test 0.3485761  12.83998
## 3               PLS Log response / back-transformed test 0.3093118  12.72739
## 4       Elastic Net                    Original response 0.2444813  12.20093
## 5  Lasso Regression                    Original response 0.2384971  12.23228
## 6               PLS                    Original response 0.2212101  12.70329
## 7  Ridge Regression                    Original response 0.1757161  12.84114
## 8  Ridge Regression Log response / back-transformed test 0.1755076  13.68815
## 9               PCR Log response / back-transformed test 0.1595568  13.87403
## 10              PCR                    Original response 0.1536834  13.58295
##    Test_MAE
## 1  7.382485
## 2  7.660401
## 3  7.881455
## 4  8.581613
## 5  8.735951
## 6  8.802542
## 7  9.011121
## 8  8.377415
## 9  8.711250
## 10 9.259680
ggplot(original_scale_comparison, aes(x = reorder(paste(Model, Response_Scale, sep = " - "), Test_R2), y = Test_R2)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Comparison of Test Set R-squared Values on the Original Permeability Scale",
    x = "Model and response version",
    y = "Test Set R-squared"
  ) +
  theme_minimal()

cv_comparison <- bind_rows(
  best_pls_results |> mutate(Model = "PLS", Response_Scale = "Original response"),
  best_pls_log_results |> mutate(Model = "PLS", Response_Scale = "Log response"),
  ridge_models$cv_results,
  lasso_models$cv_results,
  enet_models$cv_results,
  pcr_models$cv_results
) |>
  select(Model, Response_Scale, Rsquared, RMSE, MAE, everything()) |>
  arrange(desc(Rsquared))

cv_comparison
##              Model    Response_Scale  Rsquared      RMSE       MAE ncomp
## 1              PLS      Log response 0.6075312  1.031979 0.7790562     9
## 2 Ridge Regression Original response 0.5799020 10.880559 8.0497479    NA
## 3              PLS Original response 0.5713326 11.055213 8.3026868     7
## 4              PCR Original response 0.5428816 11.230300 8.2116689    10
## 5 Ridge Regression      Log response 0.5294639  1.097478 0.8731481    NA
## 6              PCR      Log response 0.5061354  1.115715 0.8796130    11
##      RMSESD RsquaredSD     MAESD alpha lambda
## 1 0.2494862  0.1709737 0.1878993    NA     NA
## 2 2.7677718  0.1920138 1.9559785     0  1e-04
## 3 2.6404049  0.1906720 1.7457325    NA     NA
## 4 2.9587234  0.1812855 1.9598109    NA     NA
## 5 0.2466379  0.2035292 0.1879455     0  1e-04
## 6 0.2616569  0.2016651 0.1986388    NA     NA

The table above compares the best resampled/cross-validated results. It is important to remember that the original-response and log-response cross-validated \(R^2\) values are calculated on different response scales, so they should not be interpreted as perfectly equivalent. For the final practical comparison, the test-set results on the original permeability scale are the most useful.

Among all models evaluated, the lasso regression model trained on the log-transformed response (with predictions back-transformed to the original scale) achieved the best performance, with a test R² of approximately 0.37. Elastic net and PLS models followed closely, while ridge regression and PCR performed worse overall.

The relatively small differences between the top models suggest that, while regularization improves performance, the primary limitation lies in the data representation and underlying complexity of the problem rather than the choice of model alone.

2.1.6 (f)

Would you recommend any of your models to replace the permeability laboratory experiment?

Although some models achieved moderate predictive performance, the best test R² was approximately 0.37, indicating that a large portion of the variability in permeability remains unexplained. In a pharmaceutical setting, where accurate predictions are critical for decision-making, this level of performance is likely insufficient to replace laboratory experiments. However, these models could still be useful as a screening tool to prioritize compounds before experimental testing. The log transformation consistently improved model performance on the original scale, suggesting that addressing skewness was beneficial. Even with advanced techniques and regularization, predictive performance may remain limited when the signal-to-noise ratio is low or important nonlinear relationships are not captured.

2.2 Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), 6.5 Computing 139 measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

2.2.1 (a)

Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturing)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

data(ChemicalManufacturingProcess)

chem_df <- ChemicalManufacturingProcess

# split response and predictors
chem_y <- chem_df$Yield
chem_x <- chem_df[,-1]

chem_dim <- dim(chem_x)
yield_length <- length(chem_y)

chem_dim
## [1] 176  57
yield_length
## [1] 176
str(chem_x)
## 'data.frame':    176 obs. of  57 variables:
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.6 14.6 14.6 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
summary(chem_y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.25   38.75   39.97   40.18   41.48   46.34

ChemicalManufacturingProcess contains 176 manufacturing runs. The response variable is Yield, and the remaining 57 columns are predictors. The first 12 predictors describe the biological raw material, while the remaining 45 predictors describe the manufacturing process.

bio_predictors <- names(chem_x)[grepl("BiologicalMaterial", names(chem_x))]
process_predictors <- names(chem_x)[grepl("ManufacturingProcess", names(chem_x))]

length(bio_predictors)
## [1] 12
length(process_predictors)
## [1] 45
head(bio_predictors)
## [1] "BiologicalMaterial01" "BiologicalMaterial02" "BiologicalMaterial03"
## [4] "BiologicalMaterial04" "BiologicalMaterial05" "BiologicalMaterial06"
head(process_predictors)
## [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
## [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"

2.2.2 (b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8)

A small number of predictor values are missing. Since the models in this chapter require complete predictor data, I used KNN imputation. KNN imputation estimates missing values using similar observations, which makes sense here because the biological and manufacturing-process measurements are likely related to one another.

# total missing values 
sum(is.na(chem_x))
## [1] 106
# predictors with at least one missing value
sum(colSums(is.na(chem_x)) > 0)
## [1] 28
# Show the predictors with missing values
missing_summary <- data.frame(
  Predictor = names(chem_x),
  Missing_Count = colSums(is.na(chem_x))
) |>
  filter(Missing_Count > 0) |>
  arrange(desc(Missing_Count))

missing_summary
##                                     Predictor Missing_Count
## ManufacturingProcess03 ManufacturingProcess03            15
## ManufacturingProcess11 ManufacturingProcess11            10
## ManufacturingProcess10 ManufacturingProcess10             9
## ManufacturingProcess25 ManufacturingProcess25             5
## ManufacturingProcess26 ManufacturingProcess26             5
## ManufacturingProcess27 ManufacturingProcess27             5
## ManufacturingProcess28 ManufacturingProcess28             5
## ManufacturingProcess29 ManufacturingProcess29             5
## ManufacturingProcess30 ManufacturingProcess30             5
## ManufacturingProcess31 ManufacturingProcess31             5
## ManufacturingProcess33 ManufacturingProcess33             5
## ManufacturingProcess34 ManufacturingProcess34             5
## ManufacturingProcess35 ManufacturingProcess35             5
## ManufacturingProcess36 ManufacturingProcess36             5
## ManufacturingProcess02 ManufacturingProcess02             3
## ManufacturingProcess06 ManufacturingProcess06             2
## ManufacturingProcess01 ManufacturingProcess01             1
## ManufacturingProcess04 ManufacturingProcess04             1
## ManufacturingProcess05 ManufacturingProcess05             1
## ManufacturingProcess07 ManufacturingProcess07             1
## ManufacturingProcess08 ManufacturingProcess08             1
## ManufacturingProcess12 ManufacturingProcess12             1
## ManufacturingProcess14 ManufacturingProcess14             1
## ManufacturingProcess22 ManufacturingProcess22             1
## ManufacturingProcess23 ManufacturingProcess23             1
## ManufacturingProcess24 ManufacturingProcess24             1
## ManufacturingProcess40 ManufacturingProcess40             1
## ManufacturingProcess41 ManufacturingProcess41             1

To show that KNN imputation removes the missing values, I also apply it once to the full predictor set for checking purposes. The final model below still performs the preprocessing inside train() so that the same steps are applied correctly during resampling and prediction.

chem_knn_check <- preProcess(chem_x, method = c("center", "scale", "knnImpute"))
chem_x_knn_check <- predict(chem_knn_check, chem_x)

# confirm numb of missinf values
sum(is.na(chem_x_knn_check))
## [1] 0

For the modeling step, KNN imputation is performed inside the train() function by including knnImpute in the preProcess argument.For each missing cell, the algorithm finds the k most similar complete observations (distance on non-missing features) and replaces the missing value with their weighted mean. This preserves local correlation structure better than simple mean/median imputation.

2.2.3 (c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I used an elastic net regression model. Elastic net is a useful choice for this problem because it combines ridge and lasso penalties. This allows it to handle correlated predictors while also shrinking less useful predictors toward zero.

set.seed(1234)
chem_train_index <- createDataPartition(chem_y, p = 0.80, list = FALSE)

chem_x_train <- chem_x[chem_train_index, ]
chem_x_test <- chem_x[-chem_train_index, ]
chem_y_train <- chem_y[chem_train_index]
chem_y_test <- chem_y[-chem_train_index]

nrow(chem_x_train)
## [1] 144
nrow(chem_x_test)
## [1] 32
set.seed(1234)
chem_ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

chem_enet_grid <- expand.grid(
  alpha = seq(0.1, 0.9, by = 0.2),
  lambda = 10^seq(-4, 2, length = 40)
)
set.seed(1234)
chem_enet_fit <- train(
  x = chem_x_train,
  y = chem_y_train,
  method = "glmnet",
  preProcess = c("center", "scale", "knnImpute"),
  tuneGrid = chem_enet_grid,
  trControl = chem_ctrl,
  metric = "Rsquared"
)

chem_enet_fit
## glmnet 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57), nearest neighbor imputation (57) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 128, 129, 129, 128, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE      Rsquared    MAE      
##   0.1    1.000000e-04  4.487890  0.45097681  1.9466517
##   0.1    1.425103e-04  4.487890  0.45097681  1.9466517
##   0.1    2.030918e-04  4.487890  0.45097681  1.9466517
##   0.1    2.894266e-04  4.487890  0.45097681  1.9466517
##   0.1    4.124626e-04  4.487890  0.45097681  1.9466517
##   0.1    5.878016e-04  4.487890  0.45097681  1.9466517
##   0.1    8.376776e-04  4.487890  0.45097681  1.9466517
##   0.1    1.193777e-03  4.444877  0.45165520  1.9346201
##   0.1    1.701254e-03  4.136862  0.45676816  1.8486460
##   0.1    2.424462e-03  3.812800  0.46320839  1.7571196
##   0.1    3.455107e-03  3.462948  0.47200562  1.6560826
##   0.1    4.923883e-03  3.100731  0.48081675  1.5503064
##   0.1    7.017038e-03  2.762046  0.49127786  1.4496272
##   0.1    1.000000e-02  2.470293  0.50510716  1.3624669
##   0.1    1.425103e-02  2.251949  0.51536257  1.2924942
##   0.1    2.030918e-02  2.115850  0.52249731  1.2462167
##   0.1    2.894266e-02  2.038922  0.52841127  1.2181947
##   0.1    4.124626e-02  1.990305  0.53302621  1.1981109
##   0.1    5.878016e-02  1.896720  0.53662989  1.1691256
##   0.1    8.376776e-02  1.774481  0.54034211  1.1356990
##   0.1    1.193777e-01  1.658813  0.54454822  1.1072186
##   0.1    1.701254e-01  1.547175  0.55159290  1.0804813
##   0.1    2.424462e-01  1.417344  0.56901889  1.0431016
##   0.1    3.455107e-01  1.295448  0.59420886  1.0036118
##   0.1    4.923883e-01  1.207182  0.62194126  0.9728583
##   0.1    7.017038e-01  1.198188  0.62587673  0.9741158
##   0.1    1.000000e+00  1.246217  0.61007103  1.0042269
##   0.1    1.425103e+00  1.323118  0.58267819  1.0552223
##   0.1    2.030918e+00  1.372129  0.57283195  1.0999230
##   0.1    2.894266e+00  1.438005  0.56602041  1.1669133
##   0.1    4.124626e+00  1.536724  0.55898831  1.2622654
##   0.1    5.878016e+00  1.663828  0.55028510  1.3724658
##   0.1    8.376776e+00  1.806220  0.55489076  1.4812170
##   0.1    1.193777e+01  1.879543  0.08848137  1.5391161
##   0.1    1.701254e+01  1.879561         NaN  1.5391257
##   0.1    2.424462e+01  1.879561         NaN  1.5391257
##   0.1    3.455107e+01  1.879561         NaN  1.5391257
##   0.1    4.923883e+01  1.879561         NaN  1.5391257
##   0.1    7.017038e+01  1.879561         NaN  1.5391257
##   0.1    1.000000e+02  1.879561         NaN  1.5391257
##   0.3    1.000000e-04  5.052550  0.44400484  2.1028463
##   0.3    1.425103e-04  5.052550  0.44400484  2.1028463
##   0.3    2.030918e-04  5.052550  0.44400484  2.1028463
##   0.3    2.894266e-04  5.052550  0.44400484  2.1028463
##   0.3    4.124626e-04  4.999517  0.44468781  2.0880228
##   0.3    5.878016e-04  4.807703  0.44793924  2.0343146
##   0.3    8.376776e-04  4.497195  0.45233191  1.9486784
##   0.3    1.193777e-03  4.108095  0.45786266  1.8412523
##   0.3    1.701254e-03  3.761046  0.46466202  1.7428055
##   0.3    2.424462e-03  3.402062  0.47390875  1.6383181
##   0.3    3.455107e-03  2.968797  0.48461905  1.5121250
##   0.3    4.923883e-03  2.564139  0.49925009  1.3928974
##   0.3    7.017038e-03  2.290115  0.51288787  1.3059286
##   0.3    1.000000e-02  2.146454  0.52232476  1.2550761
##   0.3    1.425103e-02  2.065957  0.53110908  1.2201920
##   0.3    2.030918e-02  2.038570  0.53574831  1.2039834
##   0.3    2.894266e-02  2.003008  0.53928443  1.1891068
##   0.3    4.124626e-02  1.864294  0.54380581  1.1512356
##   0.3    5.878016e-02  1.700049  0.54657195  1.1120068
##   0.3    8.376776e-02  1.545217  0.55773679  1.0757610
##   0.3    1.193777e-01  1.338363  0.59347379  1.0130677
##   0.3    1.701254e-01  1.256760  0.60757536  0.9822264
##   0.3    2.424462e-01  1.286275  0.61524694  0.9871694
##   0.3    3.455107e-01  1.361021  0.58704735  1.0244400
##   0.3    4.923883e-01  1.410984  0.57630389  1.0572039
##   0.3    7.017038e-01  1.403963  0.57232892  1.0799618
##   0.3    1.000000e+00  1.412023  0.57248662  1.1170358
##   0.3    1.425103e+00  1.458033  0.57165745  1.1863612
##   0.3    2.030918e+00  1.572176  0.58363089  1.2929060
##   0.3    2.894266e+00  1.765789  0.55876765  1.4454913
##   0.3    4.124626e+00  1.879561         NaN  1.5391257
##   0.3    5.878016e+00  1.879561         NaN  1.5391257
##   0.3    8.376776e+00  1.879561         NaN  1.5391257
##   0.3    1.193777e+01  1.879561         NaN  1.5391257
##   0.3    1.701254e+01  1.879561         NaN  1.5391257
##   0.3    2.424462e+01  1.879561         NaN  1.5391257
##   0.3    3.455107e+01  1.879561         NaN  1.5391257
##   0.3    4.923883e+01  1.879561         NaN  1.5391257
##   0.3    7.017038e+01  1.879561         NaN  1.5391257
##   0.3    1.000000e+02  1.879561         NaN  1.5391257
##   0.5    1.000000e-04  5.577812  0.43988206  2.2419927
##   0.5    1.425103e-04  5.577812  0.43988206  2.2419927
##   0.5    2.030918e-04  5.577812  0.43988206  2.2419927
##   0.5    2.894266e-04  5.470912  0.44155821  2.2126414
##   0.5    4.124626e-04  5.188352  0.44492312  2.1357588
##   0.5    5.878016e-04  4.814459  0.44966152  2.0335840
##   0.5    8.376776e-04  4.370204  0.45547365  1.9119014
##   0.5    1.193777e-03  3.957279  0.46275879  1.7962033
##   0.5    1.701254e-03  3.547770  0.47179565  1.6786239
##   0.5    2.424462e-03  3.110269  0.48278460  1.5505484
##   0.5    3.455107e-03  2.637137  0.49768698  1.4128524
##   0.5    4.923883e-03  2.317966  0.51205425  1.3127573
##   0.5    7.017038e-03  2.158283  0.52347988  1.2569925
##   0.5    1.000000e-02  2.077082  0.53331742  1.2212322
##   0.5    1.425103e-02  2.045155  0.53811303  1.2026978
##   0.5    2.030918e-02  2.016241  0.54118681  1.1899458
##   0.5    2.894266e-02  1.894775  0.54545109  1.1563450
##   0.5    4.124626e-02  1.727386  0.54986610  1.1152621
##   0.5    5.878016e-02  1.531604  0.56772572  1.0652235
##   0.5    8.376776e-02  1.331871  0.60152498  1.0047676
##   0.5    1.193777e-01  1.304403  0.60193269  0.9935126
##   0.5    1.701254e-01  1.315961  0.60651473  1.0012011
##   0.5    2.424462e-01  1.419355  0.57696825  1.0454312
##   0.5    3.455107e-01  1.427279  0.56958960  1.0681024
##   0.5    4.923883e-01  1.393631  0.57156165  1.0809798
##   0.5    7.017038e-01  1.386103  0.57631238  1.1125752
##   0.5    1.000000e+00  1.445276  0.58940107  1.1853342
##   0.5    1.425103e+00  1.605963  0.60053158  1.3149978
##   0.5    2.030918e+00  1.832042  0.45728021  1.4986526
##   0.5    2.894266e+00  1.879561         NaN  1.5391257
##   0.5    4.124626e+00  1.879561         NaN  1.5391257
##   0.5    5.878016e+00  1.879561         NaN  1.5391257
##   0.5    8.376776e+00  1.879561         NaN  1.5391257
##   0.5    1.193777e+01  1.879561         NaN  1.5391257
##   0.5    1.701254e+01  1.879561         NaN  1.5391257
##   0.5    2.424462e+01  1.879561         NaN  1.5391257
##   0.5    3.455107e+01  1.879561         NaN  1.5391257
##   0.5    4.923883e+01  1.879561         NaN  1.5391257
##   0.5    7.017038e+01  1.879561         NaN  1.5391257
##   0.5    1.000000e+02  1.879561         NaN  1.5391257
##   0.7    1.000000e-04  5.644092  0.43789947  2.2610522
##   0.7    1.425103e-04  5.644092  0.43789947  2.2610522
##   0.7    2.030918e-04  5.545110  0.43937220  2.2338680
##   0.7    2.894266e-04  5.332524  0.44214517  2.1757759
##   0.7    4.124626e-04  5.034698  0.44647441  2.0938405
##   0.7    5.878016e-04  4.638506  0.45235728  1.9848410
##   0.7    8.376776e-04  4.226334  0.45879785  1.8703289
##   0.7    1.193777e-03  3.748440  0.46839147  1.7351460
##   0.7    1.701254e-03  3.298468  0.47853930  1.6049812
##   0.7    2.424462e-03  2.826739  0.49159467  1.4674161
##   0.7    3.455107e-03  2.385777  0.50790937  1.3349197
##   0.7    4.923883e-03  2.213062  0.52003625  1.2744593
##   0.7    7.017038e-03  2.089992  0.53155227  1.2279632
##   0.7    1.000000e-02  2.047413  0.53871735  1.2045769
##   0.7    1.425103e-02  2.035104  0.54109237  1.1948397
##   0.7    2.030918e-02  1.954814  0.54406089  1.1707407
##   0.7    2.894266e-02  1.799374  0.55039166  1.1305539
##   0.7    4.124626e-02  1.600155  0.56672539  1.0786429
##   0.7    5.878016e-02  1.371340  0.59375783  1.0181644
##   0.7    8.376776e-02  1.321280  0.59710108  0.9988113
##   0.7    1.193777e-01  1.305220  0.61173717  0.9935692
##   0.7    1.701254e-01  1.419539  0.57656753  1.0441206
##   0.7    2.424462e-01  1.438267  0.56796033  1.0682134
##   0.7    3.455107e-01  1.386410  0.56960243  1.0745058
##   0.7    4.923883e-01  1.355109  0.57916113  1.0922724
##   0.7    7.017038e-01  1.406588  0.60101821  1.1540992
##   0.7    1.000000e+00  1.564413  0.60839378  1.2786358
##   0.7    1.425103e+00  1.810086  0.47360547  1.4797546
##   0.7    2.030918e+00  1.879561         NaN  1.5391257
##   0.7    2.894266e+00  1.879561         NaN  1.5391257
##   0.7    4.124626e+00  1.879561         NaN  1.5391257
##   0.7    5.878016e+00  1.879561         NaN  1.5391257
##   0.7    8.376776e+00  1.879561         NaN  1.5391257
##   0.7    1.193777e+01  1.879561         NaN  1.5391257
##   0.7    1.701254e+01  1.879561         NaN  1.5391257
##   0.7    2.424462e+01  1.879561         NaN  1.5391257
##   0.7    3.455107e+01  1.879561         NaN  1.5391257
##   0.7    4.923883e+01  1.879561         NaN  1.5391257
##   0.7    7.017038e+01  1.879561         NaN  1.5391257
##   0.7    1.000000e+02  1.879561         NaN  1.5391257
##   0.9    1.000000e-04  5.468567  0.43947409  2.2139802
##   0.9    1.425103e-04  5.442777  0.43996386  2.2067142
##   0.9    2.030918e-04  5.316456  0.44213683  2.1719781
##   0.9    2.894266e-04  5.118071  0.44546068  2.1171216
##   0.9    4.124626e-04  4.810248  0.45007718  2.0322857
##   0.9    5.878016e-04  4.514525  0.45551039  1.9493738
##   0.9    8.376776e-04  4.074555  0.46411986  1.8255907
##   0.9    1.193777e-03  3.549270  0.47390564  1.6769592
##   0.9    1.701254e-03  3.069572  0.48500633  1.5379312
##   0.9    2.424462e-03  2.508821  0.50222282  1.3752095
##   0.9    3.455107e-03  2.269253  0.51452881  1.2958512
##   0.9    4.923883e-03  2.116143  0.52729600  1.2406407
##   0.9    7.017038e-03  2.049717  0.53807655  1.2085642
##   0.9    1.000000e-02  2.040582  0.54016527  1.1980077
##   0.9    1.425103e-02  2.004148  0.54287129  1.1838793
##   0.9    2.030918e-02  1.879983  0.54748034  1.1500060
##   0.9    2.894266e-02  1.709371  0.56129143  1.1044890
##   0.9    4.124626e-02  1.470246  0.58015427  1.0413661
##   0.9    5.878016e-02  1.327650  0.59391590  1.0034828
##   0.9    8.376776e-02  1.311294  0.60494382  0.9972479
##   0.9    1.193777e-01  1.367897  0.58569266  1.0255804
##   0.9    1.701254e-01  1.431671  0.57047178  1.0574590
##   0.9    2.424462e-01  1.395229  0.56622626  1.0706591
##   0.9    3.455107e-01  1.337373  0.57702591  1.0734749
##   0.9    4.923883e-01  1.355993  0.60311556  1.1115791
##   0.9    7.017038e-01  1.480221  0.61788645  1.2109469
##   0.9    1.000000e+00  1.721475  0.54593266  1.4050348
##   0.9    1.425103e+00  1.879561         NaN  1.5391257
##   0.9    2.030918e+00  1.879561         NaN  1.5391257
##   0.9    2.894266e+00  1.879561         NaN  1.5391257
##   0.9    4.124626e+00  1.879561         NaN  1.5391257
##   0.9    5.878016e+00  1.879561         NaN  1.5391257
##   0.9    8.376776e+00  1.879561         NaN  1.5391257
##   0.9    1.193777e+01  1.879561         NaN  1.5391257
##   0.9    1.701254e+01  1.879561         NaN  1.5391257
##   0.9    2.424462e+01  1.879561         NaN  1.5391257
##   0.9    3.455107e+01  1.879561         NaN  1.5391257
##   0.9    4.923883e+01  1.879561         NaN  1.5391257
##   0.9    7.017038e+01  1.879561         NaN  1.5391257
##   0.9    1.000000e+02  1.879561         NaN  1.5391257
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.7017038.
plot(chem_enet_fit)

chem_best_tune <- chem_enet_fit$bestTune
chem_best_results <- chem_enet_fit$results |>
  filter(alpha == chem_best_tune$alpha, lambda == chem_best_tune$lambda)

chem_best_tune
##    alpha    lambda
## 26   0.1 0.7017038
chem_best_results
##   alpha    lambda     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1   0.1 0.7017038 1.198188 0.6258767 0.9741158 0.2456553  0.1802082 0.1954139

The elastic net model achieved a maximum cross-validated R² of approximately 0.63. This indicates that the model is able to explain a substantial portion of the variability in yield during resampling. The optimal tuning parameters correspond to a relatively low-to-moderate value of alpha, suggesting that a combination of ridge and lasso regularization is beneficial for this dataset.

2.2.4 (d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

chem_test_pred <- predict(chem_enet_fit, newdata = chem_x_test)

chem_test_results <- data.frame(
  Model = "Elastic Net",
  Test_R2 = cor(chem_y_test, chem_test_pred)^2,
  Test_RMSE = RMSE(chem_test_pred, chem_y_test),
  Test_MAE = MAE(chem_test_pred, chem_y_test)
)

chem_test_results
##         Model   Test_R2 Test_RMSE Test_MAE
## 1 Elastic Net 0.4495545  1.194167 0.977465
chem_best_results |> select(alpha, lambda, RMSE, Rsquared, MAE)
##   alpha    lambda     RMSE  Rsquared       MAE
## 1   0.1 0.7017038 1.198188 0.6258767 0.9741158

The test set R² is lower than the cross-validated R², which is expected since resampling estimates are often optimistic. This difference suggests some degree of overfitting, although the gap is not excessively large. Overall, the model demonstrates reasonable generalization to unseen data.

chem_pred_df <- data.frame(
  Observed = chem_y_test,
  Predicted = chem_test_pred
)

ggplot(chem_pred_df, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Elastic Net Model: Observed vs Predicted Yield",
    x = "Observed Yield",
    y = "Predicted Yield"
  ) +
  theme_minimal()

2.2.5 (e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

chem_varimp <- varImp(chem_enet_fit, scale = TRUE)
chem_varimp
## glmnet variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09  73.354
## ManufacturingProcess36  61.426
## ManufacturingProcess13  58.776
## ManufacturingProcess17  52.598
## ManufacturingProcess06  33.270
## ManufacturingProcess37  33.107
## ManufacturingProcess11  28.042
## ManufacturingProcess34  25.049
## BiologicalMaterial03    23.924
## BiologicalMaterial02    21.031
## BiologicalMaterial06    19.201
## ManufacturingProcess15  18.054
## ManufacturingProcess45  18.034
## BiologicalMaterial05    15.664
## ManufacturingProcess43  14.287
## BiologicalMaterial07    14.257
## ManufacturingProcess07  11.849
## ManufacturingProcess24  10.273
## ManufacturingProcess39   9.726
chem_varimp_df <- chem_varimp$importance |>
  tibble::rownames_to_column("Predictor") |>
  arrange(desc(Overall)) |>
  mutate(
    Predictor_Type = case_when(
      Predictor %in% bio_predictors ~ "Biological material",
      Predictor %in% process_predictors ~ "Manufacturing process",
      TRUE ~ "Other"
    )
  )

chem_top_predictors <- chem_varimp_df |>
  slice_head(n = 15)

chem_top_predictors
##                 Predictor   Overall        Predictor_Type
## 1  ManufacturingProcess32 100.00000 Manufacturing process
## 2  ManufacturingProcess09  73.35402 Manufacturing process
## 3  ManufacturingProcess36  61.42589 Manufacturing process
## 4  ManufacturingProcess13  58.77603 Manufacturing process
## 5  ManufacturingProcess17  52.59795 Manufacturing process
## 6  ManufacturingProcess06  33.27021 Manufacturing process
## 7  ManufacturingProcess37  33.10740 Manufacturing process
## 8  ManufacturingProcess11  28.04226 Manufacturing process
## 9  ManufacturingProcess34  25.04910 Manufacturing process
## 10   BiologicalMaterial03  23.92385   Biological material
## 11   BiologicalMaterial02  21.03072   Biological material
## 12   BiologicalMaterial06  19.20147   Biological material
## 13 ManufacturingProcess15  18.05402 Manufacturing process
## 14 ManufacturingProcess45  18.03441 Manufacturing process
## 15   BiologicalMaterial05  15.66375   Biological material
chem_top_predictors |>
  count(Predictor_Type)
##          Predictor_Type  n
## 1   Biological material  4
## 2 Manufacturing process 11

The most important predictors are primarily manufacturing process variables rather than biological material variables. So, the yield is more strongly influenced by controllable process conditions than by the initial properties of the raw materials. While biological variables provide useful baseline information, the process variables appear to play a dominant role in determining final yield. So, operators can adjust those process parameters.

ggplot(chem_top_predictors, aes(x = reorder(Predictor, Overall), y = Overall, fill = Predictor_Type)) +
  geom_col(show.legend = TRUE) +
  coord_flip() +
  labs(
    title = "Top Predictors of Manufacturing Yield",
    x = "Predictor",
    y = "Variable Importance"
  ) +
  theme_minimal()

2.2.6 (f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

To explore how the top predictors relate to yield, I plotted each of the top predictors against the response to help identify whether higher or lower values of a predictor tend to be associated with higher product yield.

chem_pp <- preProcess(chem_x, method = c("center", "scale", "knnImpute"))
chem_x_imputed_scaled <- predict(chem_pp, chem_x)

chem_plot_df <- chem_x_imputed_scaled |>
  as.data.frame() |>
  mutate(Yield = chem_y)
top_6_predictors <- chem_top_predictors$Predictor[1:6]

chem_plot_df |>
  select(Yield, all_of(top_6_predictors)) |>
  pivot_longer(cols = all_of(top_6_predictors), names_to = "Predictor", values_to = "Value") |>
  ggplot(aes(x = Value, y = Yield)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ Predictor, scales = "free_x") +
  labs(
    title = "Relationships Between Top Predictors and Yield",
    x = "Predictor value (imputed, centered, and scaled)",
    y = "Yield"
  ) +
  theme_minimal()

The relationships between the top predictors and yield reveal how specific variables influence the manufacturing outcome. Several process variables show clear trends with yield, indicating that adjusting these parameters could lead to improved performance. This is particularly valuable because manufacturing process variables can be controlled. By identifying which variables have the strongest impact, engineers can optimize process settings to increase yield. Even small improvements are economically significant, as a 1% increase in yield corresponds to a substantial increase in revenue per batch. On another hand, biological material variables can’t be directly controlled, but they can be used to assess the quality of incoming raw materials and anticipate potential yield issues before processing begins. The results highlight that while predictive modeling can explain a significant portion of yield variability, its greatest value lies in identifying actionable process variables that can be optimized to improve manufacturing outcomes.