Exercise 6.2

(a) Load the data

data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

glimpse(fingerprints)
##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
glimpse(permeability)
##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"

(b) Remove near-zero variance variables

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

Answer: Of the 1107 original predictors, 719 had near-zero variance, which leaves 388 for modeling.

# Indexes of predictors with near-zero variance
zero_variance_col_idx <- caret::nearZeroVar(fingerprints)
sprintf('There are %d near-zero variance predictors', length(zero_variance_col_idx))
## [1] "There are 719 near-zero variance predictors"
# Remove zero-variance variables
fingerprints2 <- fingerprints[, -zero_variance_col_idx]

# Number of predictors remaining
predictors_remaining <- ncol(fingerprints) - length(zero_variance_col_idx)
sprintf('There are %d predictors remaining', predictors_remaining)
## [1] "There are 388 predictors remaining"

Partial least squares model

(c) Training

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

Answer: RMSE is minimized when there are 6 latent components. For the training dataset, \(R^2 = 0.57\).

set.seed(144)

# Split data 70:30 using permeability as the response variable (Y) and
# and the fingerprint variables with non-zero variance as predictors (X)
train_idx <- caret::createDataPartition(permeability, p = 0.7, list = FALSE)
trainX <- fingerprints2[train_idx, ]
testX <- fingerprints2[-train_idx, ]
trainY <- permeability[train_idx]
testY <- permeability[-train_idx]
set.seed(144)

# Tune partial least squares model using training data
plsTune <- train(
  x = trainX,
  y = trainY,
  method = 'pls',
  # Center and scale data prior to tuning
  preProcess = c('center', 'scale'),
  # Evaluate up to 20 values of hyperparameter (latent components)
  tuneLength = 20,
  # 10-fold cross validation
  trControl = trainControl(method = 'cv', number = 10)
)

plsTune
## Partial Least Squares 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.82735  0.3664386   9.539715
##    2     11.12034  0.5072545   8.021733
##    3     11.14211  0.5093928   8.498432
##    4     10.98911  0.5367780   8.481169
##    5     10.88829  0.5619681   8.278530
##    6     10.62656  0.5658886   8.132241
##    7     10.79510  0.5529893   8.278695
##    8     11.26775  0.5348785   8.622683
##    9     11.26959  0.5295104   8.575702
##   10     11.46784  0.5148290   8.789975
##   11     11.51577  0.5185152   8.770103
##   12     11.99108  0.4846067   9.152932
##   13     12.39265  0.4636449   9.463919
##   14     12.50018  0.4614354   9.524699
##   15     12.73235  0.4468071   9.736424
##   16     13.09652  0.4321113   9.951540
##   17     13.28473  0.4331505  10.179938
##   18     13.46916  0.4312645  10.284784
##   19     13.66999  0.4197236  10.312308
##   20     13.79573  0.4133405  10.331832
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
# Visualize number of latent components that minimizes RMSE
plot(plsTune)

# Extract R^2 for best model
slice_max(plsTune$results, order_by = Rsquared) %>%
  pull(Rsquared)
## [1] 0.5658886

(d) Test

Predict the response for the test set. What is the test set estimate of \(R^2\)?

Answer: \(R^2 = 0.34\)

# Make predictions using test data
plsPredict <- predict(plsTune, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(plsPredict_metrics <- caret::postResample(pred = plsPredict, obs = testY))
##       RMSE   Rsquared        MAE 
## 13.5161089  0.3401654  9.7779312

(e) Additional models

Try building other models discussed in this chapter. Do any have better predictive performance?

Answer: I compared four other regression models (ordinary and robust linear regression, ridge regression, and elastic net regression). As shown below, the predictive performance of all of these models on the test dataset was similar to that of the PLS model.

Ordinary linear regression

Of note, the training dataset is rank deficient, which prevents linear regression from working properly. To circumvent this, I pre-processed the data using PCA to reduce dimensionality.

# By definition, a matrix is rank deficient if its rank is less than the
# maximum possible rank. This can be assessed by comparing its rank via
# QR decomposition versus the maximum possible rank of the matrix (number of columns)
# Reference: https://codesignal.com/learn/courses/fundamentals-of-vectors-and-matrices-with-r/lessons/matrix-rank-in-r
(matrix_is_rank_deficient <- qr(trainX)$rank < ncol(trainX))
## [1] TRUE
set.seed(144)

lmPCA <- train(
  x = trainX,
  y = trainY,
  method = 'lm',
  preProcess = c('center', 'scale', 'pca'),
  trControl = trainControl(method = 'cv', number = 10)  
)

lmPCA
## Linear Regression 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388), principal component
##  signal extraction (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   10.93587  0.5316605  8.249603
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Make predictions using test data
lmPredict <- predict(lmPCA, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(lmPredict_metrics <- caret::postResample(pred = lmPredict, obs = testY))
##       RMSE   Rsquared        MAE 
## 13.9091764  0.3027811  9.6828307

Robust linear regression

set.seed(144)

rlmPCA <- train(
  x = trainX,
  y = trainY,
  method = 'rlm',
  preProcess = c('center', 'scale', 'pca'),
  maxit = 100,
  trControl = trainControl(method = 'cv', number = 10)  
)

rlmPCA
## Robust Linear Model 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388), principal component
##  signal extraction (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE      Rsquared   MAE      
##   FALSE      psi.huber     16.64032  0.5402025  13.669605
##   FALSE      psi.hampel    16.63778  0.5376294  13.635641
##   FALSE      psi.bisquare  16.69247  0.5388262  13.662185
##    TRUE      psi.huber     10.48630  0.5547350   7.313938
##    TRUE      psi.hampel    10.85642  0.5452367   7.411143
##    TRUE      psi.bisquare  10.76021  0.5334245   6.907199
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.
plot(rlmPCA)

# Make predictions using test data
rlmPredict <- predict(rlmPCA, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(rlmPredict_metrics <- caret::postResample(pred = rlmPredict, obs = testY))
##       RMSE   Rsquared        MAE 
## 14.5744155  0.2972605  9.2516546

Ridge regression

set.seed(144)

# Note: In AMR textbook chapter 6, the example tunes the ridge regression model 
# with lambda ranging from 0 to 0.1; however, this produces warnings on my computer  
# that the model fit fails when lambda=0 due to missing values in resampled
# performance measures. Google search/Gemini queries did not provide any
# insight into this issue. The only workaround I came up with (not using Google/Gemini)
# was to exclude lambda=0 from the tuning process.
ridgeGrid <- data.frame(.lambda = seq(0.01, 1, length = 20))

ridgeTune <- train(
  x = trainX,
  y = trainY,
  method = 'ridge',
  preProcess = c('center', 'scale'),
  tuneGrid = ridgeGrid,
  trControl = trainControl(method = 'cv', number = 10)
)

ridgeTune
## Ridge Regression 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE      
##   0.01000000  15.34100  0.3575719  11.021800
##   0.06210526  12.63832  0.4687580   9.444068
##   0.11421053  12.15505  0.5015648   9.168843
##   0.16631579  12.00128  0.5202751   9.086311
##   0.21842105  11.98702  0.5329231   9.114263
##   0.27052632  12.05585  0.5420228   9.223177
##   0.32263158  12.17585  0.5494491   9.378903
##   0.37473684  12.33696  0.5554017   9.558935
##   0.42684211  12.53214  0.5601159   9.750346
##   0.47894737  12.75434  0.5640591   9.961042
##   0.53105263  12.99982  0.5673933  10.187967
##   0.58315789  13.26504  0.5702201  10.418751
##   0.63526316  13.54878  0.5725926  10.651036
##   0.68736842  13.84486  0.5746919  10.897812
##   0.73947368  14.15561  0.5764592  11.160609
##   0.79157895  14.47971  0.5779639  11.425370
##   0.84368421  14.81126  0.5792741  11.692189
##   0.89578947  15.15379  0.5803846  11.961301
##   0.94789474  15.50478  0.5813303  12.229480
##   1.00000000  15.86403  0.5821285  12.507234
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.2184211.
plot(ridgeTune)

# Make predictions using test data
ridgePredict <- predict(ridgeTune, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(ridgePredict_metrics <- caret::postResample(pred = ridgePredict, obs = testY))
##       RMSE   Rsquared        MAE 
## 14.5163325  0.3612944 10.2580152

Elastic net regression

set.seed(144)

enetGrid <- expand.grid(.lambda = c(0.001, 0.01, 0.1), .fraction = seq(0.05, 1, length = 20))

enetTune <- train(
  x = trainX,
  y = trainY,
  method = 'enet',
  preProcess = c('center', 'scale'),
  tuneGrid = enetGrid,
  trControl = trainControl(method = 'cv', number = 10)
)

enetTune
## Elasticnet 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE       
##   0.001   0.05       21.92516  0.4408144   14.233950
##   0.001   0.10       33.39217  0.4045580   21.080584
##   0.001   0.15       44.86406  0.4151169   27.681653
##   0.001   0.20       56.50263  0.4057260   34.378142
##   0.001   0.25       68.05735  0.3791952   41.084955
##   0.001   0.30       79.69451  0.3609058   47.816233
##   0.001   0.35       91.44968  0.3356446   54.596944
##   0.001   0.40      102.95813  0.3255008   61.236422
##   0.001   0.45      114.42770  0.3164243   68.036454
##   0.001   0.50      125.99964  0.2994123   74.624188
##   0.001   0.55      137.45221  0.2868641   81.108481
##   0.001   0.60      148.62304  0.2774371   87.660864
##   0.001   0.65      159.56278  0.2741550   94.152319
##   0.001   0.70      170.23242  0.2756948  100.448713
##   0.001   0.75      180.84112  0.2765503  106.742011
##   0.001   0.80      191.47665  0.2784298  113.052184
##   0.001   0.85      201.99557  0.2762665  119.342284
##   0.001   0.90      212.45542  0.2723294  125.564235
##   0.001   0.95      222.74113  0.2660977  131.625085
##   0.001   1.00      232.94793  0.2597495  137.609210
##   0.010   0.05       11.46432  0.4807434    8.363457
##   0.010   0.10       10.90218  0.5176640    7.952392
##   0.010   0.15       11.29710  0.4821903    8.452190
##   0.010   0.20       11.49502  0.4738032    8.600224
##   0.010   0.25       11.57232  0.4782690    8.677517
##   0.010   0.30       11.75276  0.4745424    8.785162
##   0.010   0.35       12.05546  0.4598227    8.966394
##   0.010   0.40       12.34161  0.4467214    9.165087
##   0.010   0.45       12.60567  0.4354373    9.302387
##   0.010   0.50       12.90784  0.4252802    9.459857
##   0.010   0.55       13.23975  0.4153923    9.631252
##   0.010   0.60       13.59133  0.4040686    9.879696
##   0.010   0.65       13.89397  0.3943074   10.105854
##   0.010   0.70       14.19974  0.3846975   10.319497
##   0.010   0.75       14.46418  0.3778002   10.479181
##   0.010   0.80       14.72863  0.3706790   10.667527
##   0.010   0.85       14.91601  0.3665217   10.774013
##   0.010   0.90       15.05572  0.3639233   10.858132
##   0.010   0.95       15.17722  0.3618514   10.928652
##   0.010   1.00       15.34109  0.3575732   11.021611
##   0.100   0.05       12.51813  0.4274538    9.704841
##   0.100   0.10       10.90631  0.5282895    7.768255
##   0.100   0.15       10.54565  0.5668596    7.537557
##   0.100   0.20       10.72017  0.5513935    7.869559
##   0.100   0.25       10.94242  0.5334051    8.152821
##   0.100   0.30       11.07308  0.5264534    8.271870
##   0.100   0.35       11.13752  0.5261122    8.296813
##   0.100   0.40       11.20933  0.5243942    8.324347
##   0.100   0.45       11.29740  0.5221722    8.391041
##   0.100   0.50       11.42774  0.5175032    8.511105
##   0.100   0.55       11.56820  0.5120208    8.628309
##   0.100   0.60       11.69991  0.5068339    8.746064
##   0.100   0.65       11.79654  0.5035850    8.834430
##   0.100   0.70       11.88514  0.5004241    8.903107
##   0.100   0.75       11.95359  0.4987767    8.974108
##   0.100   0.80       12.01123  0.4975360    9.037600
##   0.100   0.85       12.06627  0.4971584    9.086457
##   0.100   0.90       12.12368  0.4965675    9.131381
##   0.100   0.95       12.17505  0.4960982    9.170227
##   0.100   1.00       12.23585  0.4947262    9.209491
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.15 and lambda = 0.1.
plot(enetTune)

# Make predictions using test data
enetPredict <- predict(enetTune, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(enetPredict_metrics <- caret::postResample(pred = enetPredict, obs = testY))
##       RMSE   Rsquared        MAE 
## 12.9498384  0.3283974  8.5961223

Model comparison

The elastic net model has the lowest RMSE and MAE, which indicates that it has the lowest prediction error. Although the ridge regression model has the highest \(R^2\), it is only ~0.03 greater than the \(R^2\) for the elastic net model. So, I would select the elastic net model as the best predictive model.

models <- c('Ordinary LR', 'Robust LR', 'PLS', 'Ridge regression', 'Elastic net')

model_comparison_df <- bind_rows(lmPredict_metrics, rlmPredict_metrics, plsPredict_metrics,
                                 ridgePredict_metrics, enetPredict_metrics) %>%
  add_column(model = models, .before = 'RMSE') %>%
  arrange(RMSE)

model_comparison_df
## # A tibble: 5 × 4
##   model             RMSE Rsquared   MAE
##   <chr>            <dbl>    <dbl> <dbl>
## 1 Elastic net       12.9    0.328  8.60
## 2 PLS               13.5    0.340  9.78
## 3 Ordinary LR       13.9    0.303  9.68
## 4 Ridge regression  14.5    0.361 10.3 
## 5 Robust LR         14.6    0.297  9.25

(f) Recommendation

Would you recommend any of your models to replace the permeability laboratory experiment?

Answer: No. All of the models have \(0.3 < R^2 < 0.36\), meaning that they only explain approximately one-third of the variance in permeability. I don’t think this would justify using a model instead of experimentally determining permeability values, especially for a commercial application.

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately $100,000 per batch:

(a) Load the data

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

For plotting purposes, I shortened some column names.

ChemicalManufacturingProcess <- ChemicalManufacturingProcess %>%
  rename_with(~ str_replace_all(., 'BiologicalMaterial', 'Material')) %>%
  rename_with(~ str_replace_all(., 'ManufacturingProcess', 'Process'))

In addition, although not mentioned in the exercise instructions, I checked for and removed 1 variable with near-zero variance.

# Remove zero-variance variables
zero_variance_col_idx <- caret::nearZeroVar(ChemicalManufacturingProcess)
ChemicalManufacturingProcess2 <- ChemicalManufacturingProcess[, -zero_variance_col_idx]
sprintf('There are %d near-zero variance predictors', length(zero_variance_col_idx))
## [1] "There are 1 near-zero variance predictors"
setdiff(colnames(ChemicalManufacturingProcess), colnames(ChemicalManufacturingProcess2))
## [1] "Material07"

(b) Missing data and imputation

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

To determine the best imputation method, it is helpful to assess any patterns in the missing data. The plot below shows that missing data only occurs in the Process variables but not the Material variables or the target variable (Yield). In particular, Process03 is missing the most data (9%), followed by Process11 (6%) and Process10 (5%). This suggests that the missing data mechanism is “missing at random” (MAR), which would be best addressed using multiple imputation (eg, multivariate imputation by chained equations [MICE]).1

vis_miss(ChemicalManufacturingProcess2) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Model development

(c) Training

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Answer: I used MICE imputation with predictive mean matching2 and partial least squares to develop five imputation models, which had mean \(R^2 = 0.56\) for the training dataset.

set.seed(144)

# Split data 70:30
train_idx <- caret::createDataPartition(ChemicalManufacturingProcess2$Yield, 
                                        p = 0.7, list = FALSE)
train_df <- ChemicalManufacturingProcess2[train_idx, ]
test_df <- ChemicalManufacturingProcess2[-train_idx, ]
# Note that imputation needs to be applied to the training and test data separately
# to avoid target leakage. This can be implemented by creating a logical vector that 
# tells mice() which rows to ignore
# Reference: https://stackoverflow.com/questions/33500047/r-mice-machine-learning-re-use-imputation-scheme-from-train-to-test-set

# Initialize a vector with TRUE everywhere, then replace train indexes with FALSE
ignore_idx <- c(rep(TRUE, nrow(ChemicalManufacturingProcess2)))
ignore_idx[train_idx] <- FALSE  # Impute (ie, do not ignore) these rows

# Create 5 imputation datasets
n_impute_datasets <- 5
CMP_imputations <- mice(ChemicalManufacturingProcess2,
                        m = n_impute_datasets, 
                        method = 'pmm',  # predictive mean matching
                        ignore = ignore_idx,
                        seed = 144, 
                        printFlag = FALSE)
set.seed(144)

# Create list to store models
models <- list()

# Create vectors to store R squared and RMSE values for all models
r2_vec <- vector(mode = 'numeric', length = n_impute_datasets)
rmse_vec <- vector(mode = 'numeric', length = n_impute_datasets)

# Loop through the imputed datasets and tune PLS model for each
for (i in 1 : n_impute_datasets) {
  # Extract the imputed dataset
  full_imputed <- mice::complete(CMP_imputations, i)
  
  # Subset training rows
  train_imputed <- full_imputed[train_idx, ]
  
  # Tune PLS model
  plsTune <- train(
    Yield ~ ., data = train_imputed,
    method = 'pls',
    preProcess = c('center', 'scale'),
    tuneLength = 20,
    trControl = trainControl(method = 'cv', number = 10)
  )
  models[[i]] <- plsTune
  
  # Extract max R squared value
  r2_vec[[i]] <- slice_max(plsTune$results, order_by = Rsquared) %>%
    pull(Rsquared)
  
  # Extract corresponding RMSE value
  rmse_vec[[i]] <- slice_max(plsTune$results, order_by = Rsquared) %>%
    pull(RMSE)
}
# Average R squared value for all models (per Rubin's rules for integrating MICE datasets)
# Reference: https://stefvanbuuren.name/fimd/sec-nutshell.html
mean(r2_vec)
## [1] 0.5559764
# Average R squared value for all models
mean(rmse_vec)
## [1] 1.710271

(d) Test

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Answer: For the test dataset, \(R^2 = 0.58\), which was similar to that for the training dataset. The RMSE decreased from 1.71 in the training dataset to 1.21. Together, these results suggest that the imputed models are well-fitted to the data.

# Create matrix to store predictions for all models
predictions_mat <- matrix(NA, nrow = nrow(test_df), ncol = n_impute_datasets)

for (i in 1 : n_impute_datasets) {
  # Extract the imputed dataset
  full_imputed <- mice::complete(CMP_imputations, i)
  
  # Subset test rows
  test_imputed <- full_imputed[-train_idx, ]
  
  # Predict test data using imputed training data
  predictions_mat[, i] <- predict(models[[i]], newdata = test_imputed)
}

# Average the predictions (following Rubin's rules for integrating MICE datasets)
# and then calculate performance metrics
mean_predictions <- rowMeans(predictions_mat)
caret::postResample(pred = mean_predictions, obs = test_imputed$Yield)
##      RMSE  Rsquared       MAE 
## 1.2130215 0.5866628 0.9846066

(e) Predictor importance

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Answer:

Because the MICE imputation process generates multiple imputed datasets, each of which has an associated PLS model, I compared variable importance for each model.

The top 10 important variables in all models included an approximately equal number of biological material and manufacturing process predictors. The most important predictor in all 5 models was Process32. The second most important predictor across models was Process36 (3 models) and Material06 (2 models). The third most important predictor was Process17 (3 models) and Material02 (2 models).

varImp(models[[1]])
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##            Overall
## Process32   100.00
## Process36    68.98
## Process17    64.78
## Process13    62.99
## Process09    61.77
## Process33    52.47
## Material06   50.02
## Material02   46.83
## Material03   46.64
## Material08   44.49
## Material11   43.30
## Material12   42.95
## Process12    42.14
## Material04   41.81
## Material01   40.74
## Process28    40.42
## Process04    38.83
## Process06    38.39
## Process34    37.91
## Process11    31.84
varImp(models[[2]])
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##            Overall
## Process32   100.00
## Material06   81.20
## Material02   80.87
## Process36    80.84
## Material03   72.10
## Material04   66.66
## Process09    64.90
## Process13    64.60
## Process33    63.39
## Material01   58.96
## Process17    58.85
## Material08   58.64
## Material12   54.07
## Material11   50.58
## Process06    50.29
## Process04    45.30
## Process12    45.15
## Process28    42.72
## Process11    38.60
## Process15    38.27
varImp(models[[3]])
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##            Overall
## Process32   100.00
## Process36    66.95
## Process17    65.49
## Process13    64.24
## Process09    60.92
## Material06   51.64
## Process33    50.45
## Material03   48.26
## Material02   47.60
## Material08   45.49
## Material11   44.61
## Material12   43.49
## Process12    42.25
## Material04   42.11
## Material01   41.00
## Process06    39.99
## Process28    39.16
## Process11    37.55
## Process04    36.77
## Process34    35.19
varImp(models[[4]])
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##            Overall
## Process32   100.00
## Material06   81.20
## Material02   80.87
## Process36    80.38
## Material03   72.10
## Material04   66.66
## Process09    64.90
## Process13    64.60
## Process33    63.39
## Material01   58.96
## Process17    58.85
## Material08   58.64
## Material12   54.07
## Material11   50.58
## Process06    50.41
## Process12    45.15
## Process28    42.72
## Process04    42.44
## Process15    38.27
## Process02    37.99
varImp(models[[5]])
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##            Overall
## Process32   100.00
## Process36    64.81
## Process17    63.58
## Process13    61.51
## Process09    60.15
## Material06   48.74
## Process33    46.79
## Material03   45.76
## Material02   44.92
## Material08   43.50
## Material11   42.57
## Material12   41.88
## Process12    40.71
## Material04   40.22
## Material01   39.12
## Process06    38.70
## Process28    38.48
## Process04    37.54
## Process34    35.19
## Process11    30.90

(f) Real-world implications

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Answer: Of the five top important variables identified in part (e), manufacturing Process32 has the strongest linear relationship with the target variable Yield (Pearson \(r=0.66\)). The best univariate linear model has an adjusted \(R^2 = 0.43\), meaning that the variable explains approximately 43% of the variance in Yield.

Among biological material predictors, Material06 and Material02 have moderately strong linear relationship with Yield (Pearson \(r=0.54\) and \(r=0.53\), respectively. The best univariate linear model for both of these predictors have adjusted \(R^2 = 0.28\), meaning that each of these variables can explain ~28% of the variance in Yield.

These results suggest that optimization (maximization) of Process32 is most important to increase product yield and, in turn, revenue. Although biological predictors cannot be changed (per exercise instructions), Material06 and Material02 may be useful for quality control of the raw material.

Relationship between Process32 and Yield

Process32 and Yield have a moderately strong Pearson correlation coefficient (\(r=0.66\)), and the scatterplot is consistent with a linear relationship. In the univariate linear model, the adjusted \(R^2 = 0.43\), which indicates that Process32 accounts for ~43% of the variance in Yield.

pearson_r <- round(cor(train_df$Process32, train_df$Yield, 
                       method = c("pearson"), use = 'complete.obs'), 2)

ggplot(drop_na(train_df, Process32), aes(x = Process32, y = Yield)) +
  geom_point() +
  labs(
    x = 'Process32',
    y = 'Yield',
    title = 'Relationship between Manufacturing Process32 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 170, y = 44,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )

process32_yield_lm <- lm(Yield ~ Process32, data = train_df)
summary(process32_yield_lm)
## 
## Call:
## lm(formula = Yield ~ Process32, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0278 -0.9197  0.0285  0.7453  4.3623 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.49328    3.48874   1.861   0.0651 .  
## Process32    0.21248    0.02198   9.666   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.383 on 122 degrees of freedom
## Multiple R-squared:  0.4337, Adjusted R-squared:  0.4291 
## F-statistic: 93.43 on 1 and 122 DF,  p-value: < 2.2e-16

Relationship between Process36 and Yield

Although Process36 and Yield have a moderately strong Pearson correlation coefficient (\(r=-0.54\)), the scatterplot shows that Process36 values are discrete rather than continuous, so it is not actually linear, which could make a linear model unsuitable.

pearson_r <- round(cor(train_df$Process36, train_df$Yield, 
                       method = c("pearson"), use = 'complete.obs'), 2)

ggplot(drop_na(train_df, Process36), aes(x = Process36, y = Yield)) +
  geom_point() +
  labs(
    x = 'Process36',
    y = 'Yield',
    title = 'Relationship between Manufacturing Process36 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 0.0175, y = 44,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )

process36_yield_lm <- lm(Yield ~ Process36, data = drop_na(train_df, Process36))
summary(process36_yield_lm)
## 
## Call:
## lm(formula = Yield ~ Process36, data = drop_na(train_df, Process36))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5609 -1.0209  0.0412  1.0513  4.4405 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    61.547      3.095  19.889  < 2e-16 ***
## Process36   -1091.389    158.054  -6.905  2.7e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.561 on 118 degrees of freedom
## Multiple R-squared:  0.2878, Adjusted R-squared:  0.2818 
## F-statistic: 47.68 on 1 and 118 DF,  p-value: 2.697e-10

Relationship between Process17 and Yield

Although Process17 and Yield have a moderate Pearson correlation coefficient (\(r=-0.39\)), the scatterplot below suggests that the best fit linear model is influenced by several outliers/leverage points. Among other data points, the linear relationship is not as clear.

pearson_r <- round(cor(train_df$Process17, train_df$Yield, 
                       method = c("pearson"), use = 'complete.obs'), 2)

ggplot(drop_na(train_df, Process17), aes(x = Process17, y = Yield)) +
  geom_point() +
  labs(
    x = 'Process17',
    y = 'Yield',
    title = 'Relationship between Manufacturing Process17 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 39, y = 38.5,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )

process17_yield_lm <- lm(Yield ~ Process17, data = drop_na(train_df, Process17))
summary(process17_yield_lm)
## 
## Call:
## lm(formula = Yield ~ Process17, data = drop_na(train_df, Process17))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2401 -1.2786 -0.0765  1.1542  4.3784 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.7629     4.5908  13.454  < 2e-16 ***
## Process17    -0.6286     0.1337  -4.701 6.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.69 on 122 degrees of freedom
## Multiple R-squared:  0.1534, Adjusted R-squared:  0.1464 
## F-statistic:  22.1 on 1 and 122 DF,  p-value: 6.874e-06

Relationship between Material06 and Yield

Material06 and Yield have a moderately strong Pearson correlation coefficient (\(r=0.54\)), and the scatterplot is consistent with a linear relationship. In the univariate linear model, the adjusted \(R^2 = 0.28\), which indicates that Material06 accounts for ~28% of the variance in Yield.

pearson_r <- round(cor(train_df$Material06, train_df$Yield, 
                       method = c("pearson"), use = 'complete.obs'), 2)

ggplot(drop_na(train_df, Material06), aes(x = Material06, y = Yield)) +
  geom_point() +
  labs(
    x = 'Material06',
    y = 'Yield',
    title = 'Relationship between Biological Material06 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 55, y = 44,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )

material06_yield_lm <- lm(Yield ~ Material06, data = drop_na(train_df, Material06))
summary(material06_yield_lm)
## 
## Call:
## lm(formula = Yield ~ Material06, data = drop_na(train_df, Material06))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4822 -1.0046 -0.0941  0.9965  4.2775 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.48936    1.81427  15.152  < 2e-16 ***
## Material06   0.25922    0.03691   7.023 1.33e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.55 on 122 degrees of freedom
## Multiple R-squared:  0.2879, Adjusted R-squared:  0.2821 
## F-statistic: 49.33 on 1 and 122 DF,  p-value: 1.333e-10

Relationship between Material02 and Yield

Material02 and Yield have a moderately strong Pearson correlation coefficient (\(r=0.53\)), and the scatterplot is consistent with a linear relationship. In the univariate linear model, the adjusted \(R^2 = 0.28\), which indicates that Material02 accounts for ~28% of the variance in Yield.

pearson_r <- round(cor(train_df$Material02, train_df$Yield, 
                       method = c("pearson"), use = 'complete.obs'), 2)

ggplot(drop_na(train_df, Material02), aes(x = Material02, y = Yield)) +
  geom_point() +
  labs(
    x = 'Material02',
    y = 'Yield',
    title = 'Relationship between Biological Material02 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 55, y = 43,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )

material02_yield_lm <- lm(Yield ~ Material02, data = drop_na(train_df, Material02))
summary(material02_yield_lm)
## 
## Call:
## lm(formula = Yield ~ Material02, data = drop_na(train_df, Material02))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5713 -1.0240 -0.1391  1.0325  4.4461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.00597    1.89330  14.264  < 2e-16 ***
## Material02   0.23618    0.03382   6.985 1.62e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.553 on 122 degrees of freedom
## Multiple R-squared:  0.2856, Adjusted R-squared:  0.2798 
## F-statistic: 48.78 on 1 and 122 DF,  p-value: 1.623e-10

Session Details

pander::pander(sessionInfo())

R version 4.5.2 (2025-10-31)

Platform: aarch64-apple-darwin20

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: pander(v.0.6.6), mice(v.3.19.0), naniar(v.1.1.0), elasticnet(v.1.3), lars(v.1.3), pls(v.2.9-0), caret(v.7.0-1), lattice(v.0.22-7), lubridate(v.1.9.5), forcats(v.1.0.1), stringr(v.1.6.0), dplyr(v.1.2.0), purrr(v.1.2.1), readr(v.2.1.6), tidyr(v.1.3.2), tibble(v.3.3.1), ggplot2(v.4.0.2), tidyverse(v.2.0.0) and AppliedPredictiveModeling(v.1.1-7)

loaded via a namespace (and not attached): Rdpack(v.2.6.5), pROC(v.1.19.0.1), rlang(v.1.1.7), magrittr(v.2.0.4), rpart.plot(v.3.1.4), otel(v.0.2.0), compiler(v.4.5.2), mgcv(v.1.9-4), vctrs(v.0.7.2), reshape2(v.1.4.5), pkgconfig(v.2.0.3), shape(v.1.4.6.1), fastmap(v.1.2.0), backports(v.1.5.0), labeling(v.0.4.3), utf8(v.1.2.6), rmarkdown(v.2.30), prodlim(v.2025.04.28), tzdb(v.0.5.0), nloptr(v.2.2.1), visdat(v.0.6.0), xfun(v.0.56), glmnet(v.4.1-10), jomo(v.2.7-6), cachem(v.1.1.0), jsonlite(v.2.0.0), recipes(v.1.3.1), pan(v.1.9), broom(v.1.0.12), parallel(v.4.5.2), cluster(v.2.1.8.1), R6(v.2.6.1), CORElearn(v.1.57.3.1), bslib(v.0.10.0), stringi(v.1.8.7), RColorBrewer(v.1.1-3), parallelly(v.1.46.1), boot(v.1.3-32), rpart(v.4.1.24), jquerylib(v.0.1.4), Rcpp(v.1.1.1), iterators(v.1.0.14), knitr(v.1.51), future.apply(v.1.20.1), Matrix(v.1.7-4), splines(v.4.5.2), nnet(v.7.3-20), timechange(v.0.4.0), tidyselect(v.1.2.1), rstudioapi(v.0.18.0), yaml(v.2.3.12), timeDate(v.4052.112), codetools(v.0.2-20), listenv(v.0.10.0), plyr(v.1.8.9), withr(v.3.0.2), S7(v.0.2.1), evaluate(v.1.0.5), future(v.1.69.0), survival(v.3.8-6), pillar(v.1.11.1), foreach(v.1.5.2), stats4(v.4.5.2), reformulas(v.0.4.3.1), ellipse(v.0.5.0), generics(v.0.1.4), hms(v.1.1.4), scales(v.1.4.0), minqa(v.1.2.8), globals(v.0.18.0), class(v.7.3-23), glue(v.1.8.0), tools(v.4.5.2), data.table(v.1.18.2.1), lme4(v.1.1-38), ModelMetrics(v.1.2.2.2), gower(v.1.0.2), grid(v.4.5.2), plotrix(v.3.8-14), rbibutils(v.2.4.1), ipred(v.0.9-15), nlme(v.3.1-168), cli(v.3.6.5), lava(v.1.8.2), gtable(v.0.3.6), sass(v.0.4.10), digest(v.0.6.39), farver(v.2.1.2), htmltools(v.0.5.9), lifecycle(v.1.0.5), hardhat(v.1.4.2), mitml(v.0.4-5) and MASS(v.7.3-65)


  1. Tuzen MF (2025). Handling missing data in R: A comprehensive guide. https://www.r-bloggers.com/2025/08/handling-missing-data-in-r-a-comprehensive-guide/↩︎

  2. van Buuren S. Flexible imputation of missing data, 2nd ed. https://stefvanbuuren.name/fimd/sec-pmm.html↩︎