data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
glimpse(fingerprints)
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
glimpse(permeability)
## num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr "permeability"
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Answer: Of the 1,107 original predictors, 719 had near-zero variance, which leaves 388 for modeling.
# Indexes of predictors with near-zero variance
zero_variance_col_idx <- caret::nearZeroVar(fingerprints)
sprintf('There are %d near-zero variance predictors', length(zero_variance_col_idx))
## [1] "There are 719 near-zero variance predictors"
# Remove zero-variance variables
fingerprints2 <- fingerprints[, -zero_variance_col_idx]
# Number of predictors remaining
predictors_remaining <- ncol(fingerprints) - length(zero_variance_col_idx)
sprintf('There are %d predictors remaining', predictors_remaining)
## [1] "There are 388 predictors remaining"
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
Answer: RMSE is minimized when there are 6 latent components. For the training dataset, \(R^2 = 0.57\).
set.seed(144)
# Split data 70:30 using permeability as the response variable (Y) and
# the fingerprint variables with non-zero variance as predictors (X)
train_idx <- caret::createDataPartition(permeability, p = 0.7, list = FALSE)
trainX <- fingerprints2[train_idx, ]
testX <- fingerprints2[-train_idx, ]
trainY <- permeability[train_idx]
testY <- permeability[-train_idx]
set.seed(144)
# Tune partial least squares model using training data
plsTune <- train(
  x = trainX,
  y = trainY,
  method = 'pls',
  # Center and scale data prior to tuning
  preProcess = c('center', 'scale'),
  # Evaluate up to 20 values of the hyperparameter (number of latent components)
  tuneLength = 20,
  # 10-fold cross-validation
  trControl = trainControl(method = 'cv', number = 10)
)
plsTune
## Partial Least Squares
##
## 117 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.82735 0.3664386 9.539715
## 2 11.12034 0.5072545 8.021733
## 3 11.14211 0.5093928 8.498432
## 4 10.98911 0.5367780 8.481169
## 5 10.88829 0.5619681 8.278530
## 6 10.62656 0.5658886 8.132241
## 7 10.79510 0.5529893 8.278695
## 8 11.26775 0.5348785 8.622683
## 9 11.26959 0.5295104 8.575702
## 10 11.46784 0.5148290 8.789975
## 11 11.51577 0.5185152 8.770103
## 12 11.99108 0.4846067 9.152932
## 13 12.39265 0.4636449 9.463919
## 14 12.50018 0.4614354 9.524699
## 15 12.73235 0.4468071 9.736424
## 16 13.09652 0.4321113 9.951540
## 17 13.28473 0.4331505 10.179938
## 18 13.46916 0.4312645 10.284784
## 19 13.66999 0.4197236 10.312308
## 20 13.79573 0.4133405 10.331832
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
# Visualize number of latent components that minimizes RMSE
plot(plsTune)
# Extract R^2 for best model
slice_max(plsTune$results, order_by = Rsquared) %>%
  pull(Rsquared)
## [1] 0.5658886
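The same value can be read off at the RMSE-optimal setting via the stored bestTune slot (a minimal sketch of an equivalent check; output not shown):
# Resampled metrics at the RMSE-optimal number of components (sketch)
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)[, c('ncomp', 'RMSE', 'Rsquared')]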
Predict the response for the test set. What is the test set estimate of \(R^2\)?
Answer: \(R^2 = 0.34\)
# Make predictions using test data
plsPredict <- predict(plsTune, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(plsPredict_metrics <- caret::postResample(pred = plsPredict, obs = testY))
## RMSE Rsquared MAE
## 13.5161089 0.3401654 9.7779312
Try building other models discussed in this chapter. Do any have better predictive performance?
Answer: I compared four other regression models (ordinary and robust linear regression, ridge regression, and elastic net regression). As shown below, the predictive performance of all of these models on the test dataset was similar to that of the PLS model.
Of note, the training dataset is rank deficient because there are more predictors (388) than samples (117), so ordinary least squares cannot estimate all of the coefficients. To circumvent this, I pre-processed the data using PCA to reduce dimensionality.
# By definition, a matrix is rank deficient if its rank is less than the
# maximum possible rank. This can be assessed by comparing its rank via
# QR decomposition versus the maximum possible rank of the matrix (number of columns)
# Reference: https://codesignal.com/learn/courses/fundamentals-of-vectors-and-matrices-with-r/lessons/matrix-rank-in-r
(matrix_is_rank_deficient <- qr(trainX)$rank < ncol(trainX))
## [1] TRUE
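For reference, the actual rank can be printed next to the column count (a minimal sketch; output not shown):
# Compare the matrix rank with the number of columns (sketch)
c(rank = qr(trainX)$rank, n_cols = ncol(trainX))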
set.seed(144)
lmPCA <- train(
  x = trainX,
  y = trainY,
  method = 'lm',
  preProcess = c('center', 'scale', 'pca'),
  trControl = trainControl(method = 'cv', number = 10)
)
lmPCA
## Linear Regression
##
## 117 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388), principal component
## signal extraction (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 10.93587 0.5316605 8.249603
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Make predictions using test data
lmPredict <- predict(lmPCA, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(lmPredict_metrics <- caret::postResample(pred = lmPredict, obs = testY))
## RMSE Rsquared MAE
## 13.9091764 0.3027811 9.6828307
set.seed(144)
rlmPCA <- train(
  x = trainX,
  y = trainY,
  method = 'rlm',
  preProcess = c('center', 'scale', 'pca'),
  maxit = 100,
  trControl = trainControl(method = 'cv', number = 10)
)
rlmPCA
## Robust Linear Model
##
## 117 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388), principal component
## signal extraction (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 16.64032 0.5402025 13.669605
## FALSE psi.hampel 16.63778 0.5376294 13.635641
## FALSE psi.bisquare 16.69247 0.5388262 13.662185
## TRUE psi.huber 10.48630 0.5547350 7.313938
## TRUE psi.hampel 10.85642 0.5452367 7.411143
## TRUE psi.bisquare 10.76021 0.5334245 6.907199
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.
plot(rlmPCA)
# Make predictions using test data
rlmPredict <- predict(rlmPCA, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(rlmPredict_metrics <- caret::postResample(pred = rlmPredict, obs = testY))
## RMSE Rsquared MAE
## 14.5744155 0.2972605 9.2516546
set.seed(144)
# Note: In the APM textbook (chapter 6), the ridge regression example tunes lambda
# over the range 0 to 0.1; however, on my computer the model fit fails at lambda = 0
# with warnings about missing values in resampled performance measures. I could not
# find an explanation for this behavior, so as a workaround I exclude lambda = 0
# from the tuning grid.
ridgeGrid <- data.frame(.lambda = seq(0.01, 1, length = 20))
ridgeTune <- train(
  x = trainX,
  y = trainY,
  method = 'ridge',
  preProcess = c('center', 'scale'),
  tuneGrid = ridgeGrid,
  trControl = trainControl(method = 'cv', number = 10)
)
ridgeTune
## Ridge Regression
##
## 117 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.01000000 15.34100 0.3575719 11.021800
## 0.06210526 12.63832 0.4687580 9.444068
## 0.11421053 12.15505 0.5015648 9.168843
## 0.16631579 12.00128 0.5202751 9.086311
## 0.21842105 11.98702 0.5329231 9.114263
## 0.27052632 12.05585 0.5420228 9.223177
## 0.32263158 12.17585 0.5494491 9.378903
## 0.37473684 12.33696 0.5554017 9.558935
## 0.42684211 12.53214 0.5601159 9.750346
## 0.47894737 12.75434 0.5640591 9.961042
## 0.53105263 12.99982 0.5673933 10.187967
## 0.58315789 13.26504 0.5702201 10.418751
## 0.63526316 13.54878 0.5725926 10.651036
## 0.68736842 13.84486 0.5746919 10.897812
## 0.73947368 14.15561 0.5764592 11.160609
## 0.79157895 14.47971 0.5779639 11.425370
## 0.84368421 14.81126 0.5792741 11.692189
## 0.89578947 15.15379 0.5803846 11.961301
## 0.94789474 15.50478 0.5813303 12.229480
## 1.00000000 15.86403 0.5821285 12.507234
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.2184211.
plot(ridgeTune)
# Make predictions using test data
ridgePredict <- predict(ridgeTune, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(ridgePredict_metrics <- caret::postResample(pred = ridgePredict, obs = testY))
## RMSE Rsquared MAE
## 14.5163325 0.3612944 10.2580152
set.seed(144)
enetGrid <- expand.grid(.lambda = c(0.001, 0.01, 0.1), .fraction = seq(0.05, 1, length = 20))
enetTune <- train(
  x = trainX,
  y = trainY,
  method = 'enet',
  preProcess = c('center', 'scale'),
  tuneGrid = enetGrid,
  trControl = trainControl(method = 'cv', number = 10)
)
enetTune
## Elasticnet
##
## 117 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.001 0.05 21.92516 0.4408144 14.233950
## 0.001 0.10 33.39217 0.4045580 21.080584
## 0.001 0.15 44.86406 0.4151169 27.681653
## 0.001 0.20 56.50263 0.4057260 34.378142
## 0.001 0.25 68.05735 0.3791952 41.084955
## 0.001 0.30 79.69451 0.3609058 47.816233
## 0.001 0.35 91.44968 0.3356446 54.596944
## 0.001 0.40 102.95813 0.3255008 61.236422
## 0.001 0.45 114.42770 0.3164243 68.036454
## 0.001 0.50 125.99964 0.2994123 74.624188
## 0.001 0.55 137.45221 0.2868641 81.108481
## 0.001 0.60 148.62304 0.2774371 87.660864
## 0.001 0.65 159.56278 0.2741550 94.152319
## 0.001 0.70 170.23242 0.2756948 100.448713
## 0.001 0.75 180.84112 0.2765503 106.742011
## 0.001 0.80 191.47665 0.2784298 113.052184
## 0.001 0.85 201.99557 0.2762665 119.342284
## 0.001 0.90 212.45542 0.2723294 125.564235
## 0.001 0.95 222.74113 0.2660977 131.625085
## 0.001 1.00 232.94793 0.2597495 137.609210
## 0.010 0.05 11.46432 0.4807434 8.363457
## 0.010 0.10 10.90218 0.5176640 7.952392
## 0.010 0.15 11.29710 0.4821903 8.452190
## 0.010 0.20 11.49502 0.4738032 8.600224
## 0.010 0.25 11.57232 0.4782690 8.677517
## 0.010 0.30 11.75276 0.4745424 8.785162
## 0.010 0.35 12.05546 0.4598227 8.966394
## 0.010 0.40 12.34161 0.4467214 9.165087
## 0.010 0.45 12.60567 0.4354373 9.302387
## 0.010 0.50 12.90784 0.4252802 9.459857
## 0.010 0.55 13.23975 0.4153923 9.631252
## 0.010 0.60 13.59133 0.4040686 9.879696
## 0.010 0.65 13.89397 0.3943074 10.105854
## 0.010 0.70 14.19974 0.3846975 10.319497
## 0.010 0.75 14.46418 0.3778002 10.479181
## 0.010 0.80 14.72863 0.3706790 10.667527
## 0.010 0.85 14.91601 0.3665217 10.774013
## 0.010 0.90 15.05572 0.3639233 10.858132
## 0.010 0.95 15.17722 0.3618514 10.928652
## 0.010 1.00 15.34109 0.3575732 11.021611
## 0.100 0.05 12.51813 0.4274538 9.704841
## 0.100 0.10 10.90631 0.5282895 7.768255
## 0.100 0.15 10.54565 0.5668596 7.537557
## 0.100 0.20 10.72017 0.5513935 7.869559
## 0.100 0.25 10.94242 0.5334051 8.152821
## 0.100 0.30 11.07308 0.5264534 8.271870
## 0.100 0.35 11.13752 0.5261122 8.296813
## 0.100 0.40 11.20933 0.5243942 8.324347
## 0.100 0.45 11.29740 0.5221722 8.391041
## 0.100 0.50 11.42774 0.5175032 8.511105
## 0.100 0.55 11.56820 0.5120208 8.628309
## 0.100 0.60 11.69991 0.5068339 8.746064
## 0.100 0.65 11.79654 0.5035850 8.834430
## 0.100 0.70 11.88514 0.5004241 8.903107
## 0.100 0.75 11.95359 0.4987767 8.974108
## 0.100 0.80 12.01123 0.4975360 9.037600
## 0.100 0.85 12.06627 0.4971584 9.086457
## 0.100 0.90 12.12368 0.4965675 9.131381
## 0.100 0.95 12.17505 0.4960982 9.170227
## 0.100 1.00 12.23585 0.4947262 9.209491
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.15 and lambda = 0.1.
plot(enetTune)
# Make predictions using test data
enetPredict <- predict(enetTune, newdata = testX)
# Evaluate predictive performance (predicted vs actual)
(enetPredict_metrics <- caret::postResample(pred = enetPredict, obs = testY))
## RMSE Rsquared MAE
## 12.9498384 0.3283974 8.5961223
The elastic net model has the lowest RMSE and MAE, which indicates that it has the lowest prediction error. Although the ridge regression model has the highest \(R^2\), it is only ~0.03 greater than the \(R^2\) for the elastic net model. So, I would select the elastic net model as the best predictive model.
models <- c('Ordinary LR', 'Robust LR', 'PLS', 'Ridge regression', 'Elastic net')
model_comparison_df <- bind_rows(lmPredict_metrics, rlmPredict_metrics, plsPredict_metrics,
                                 ridgePredict_metrics, enetPredict_metrics) %>%
  add_column(model = models, .before = 'RMSE') %>%
  arrange(RMSE)
model_comparison_df
## # A tibble: 5 × 4
## model RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 Elastic net 12.9 0.328 8.60
## 2 PLS 13.5 0.340 9.78
## 3 Ordinary LR 13.9 0.303 9.68
## 4 Ridge regression 14.5 0.361 10.3
## 5 Robust LR 14.6 0.297 9.25
Would you recommend any of your models to replace the permeability laboratory experiment?
Answer: No. All of the models have test-set \(R^2\) between roughly 0.30 and 0.36, meaning that they explain only about one-third of the variance in permeability. I don't think this would justify replacing experimentally determined permeability values with model predictions, especially for a commercial application.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately $100,000 per batch:
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors
(12 describing the input biological material and 45 describing the
process predictors) for the 176 manufacturing runs. yield
contains the percent yield for each run.
For plotting purposes, I shortened some column names.
ChemicalManufacturingProcess <- ChemicalManufacturingProcess %>%
  rename_with(~ str_replace_all(., 'BiologicalMaterial', 'Material')) %>%
  rename_with(~ str_replace_all(., 'ManufacturingProcess', 'Process'))
In addition, although not mentioned in the exercise instructions, I checked for and removed 1 variable with near-zero variance.
# Remove zero-variance variables
zero_variance_col_idx <- caret::nearZeroVar(ChemicalManufacturingProcess)
ChemicalManufacturingProcess2 <- ChemicalManufacturingProcess[, -zero_variance_col_idx]
sprintf('There are %d near-zero variance predictors', length(zero_variance_col_idx))
## [1] "There are 1 near-zero variance predictors"
setdiff(colnames(ChemicalManufacturingProcess), colnames(ChemicalManufacturingProcess2))
## [1] "Material07"
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
To determine the best imputation method, it is helpful to assess any
patterns in the missing data. The plot below shows that missing data
only occurs in the Process variables but not the
Material variables or the target variable
(Yield). In particular, Process03 is missing
the most data (9%), followed by Process11 (6%) and
Process10 (5%). This suggests that the missing data
mechanism is “missing at random” (MAR), which would be best addressed
using multiple imputation (eg, multivariate imputation by chained
equations [MICE]).1
vis_miss(ChemicalManufacturingProcess2) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
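To back up the percentages quoted above, the per-column missingness rates can also be computed directly (a minimal sketch; output not shown):
# Percentage of missing values per column, highest first (sketch)
miss_pct <- sort(colMeans(is.na(ChemicalManufacturingProcess2)) * 100, decreasing = TRUE)
round(head(miss_pct, 5), 1)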
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Answer: I used MICE imputation with predictive mean matching2 to generate five imputed datasets and tuned a partial least squares model on each; the five models had a mean resampled \(R^2 = 0.56\) on the training dataset.
set.seed(144)
# Split data 70:30
train_idx <- caret::createDataPartition(ChemicalManufacturingProcess2$Yield,
                                        p = 0.7, list = FALSE)
train_df <- ChemicalManufacturingProcess2[train_idx, ]
test_df <- ChemicalManufacturingProcess2[-train_idx, ]
# Note that imputation needs to be applied to the training and test data separately
# to avoid target leakage. This can be implemented by creating a logical vector that
# tells mice() which rows to ignore
# Reference: https://stackoverflow.com/questions/33500047/r-mice-machine-learning-re-use-imputation-scheme-from-train-to-test-set
# Initialize a vector with TRUE everywhere, then replace train indexes with FALSE
ignore_idx <- rep(TRUE, nrow(ChemicalManufacturingProcess2))
ignore_idx[train_idx] <- FALSE # Impute (ie, do not ignore) these rows
# Create 5 imputation datasets
n_impute_datasets <- 5
CMP_imputations <- mice(ChemicalManufacturingProcess2,
                        m = n_impute_datasets,
                        method = 'pmm', # predictive mean matching
                        ignore = ignore_idx,
                        seed = 144,
                        printFlag = FALSE)
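As a sanity check, each completed dataset should contain no remaining missing values (a minimal sketch checking the first imputation; output not shown):
# Confirm that no NA values remain after imputation (sketch)
anyNA(mice::complete(CMP_imputations, 1))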
set.seed(144)
# Create list to store models
models <- list()
# Create vectors to store R squared and RMSE values for all models
r2_vec <- vector(mode = 'numeric', length = n_impute_datasets)
rmse_vec <- vector(mode = 'numeric', length = n_impute_datasets)
# Loop through the imputed datasets and tune PLS model for each
for (i in 1:n_impute_datasets) {
  # Extract the imputed dataset
  full_imputed <- mice::complete(CMP_imputations, i)
  # Subset training rows
  train_imputed <- full_imputed[train_idx, ]
  # Tune PLS model
  plsTune <- train(
    Yield ~ ., data = train_imputed,
    method = 'pls',
    preProcess = c('center', 'scale'),
    tuneLength = 20,
    trControl = trainControl(method = 'cv', number = 10)
  )
  models[[i]] <- plsTune
  # Extract max R squared value
  r2_vec[[i]] <- slice_max(plsTune$results, order_by = Rsquared) %>%
    pull(Rsquared)
  # Extract corresponding RMSE value
  rmse_vec[[i]] <- slice_max(plsTune$results, order_by = Rsquared) %>%
    pull(RMSE)
}
# Average R squared value for all models (per Rubin's rules for integrating MICE datasets)
# Reference: https://stefvanbuuren.name/fimd/sec-nutshell.html
mean(r2_vec)
## [1] 0.5559764
# Average RMSE value for all models
mean(rmse_vec)
## [1] 1.710271
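For transparency, the per-imputation estimates behind these pooled averages can be tabulated (a minimal sketch; output not shown):
# Per-imputation resampled metrics before pooling (sketch)
data.frame(imputation = seq_len(n_impute_datasets),
           Rsquared = round(r2_vec, 3),
           RMSE = round(rmse_vec, 3))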
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Answer: For the test dataset, \(R^2 = 0.58\), which is similar to the resampled estimate for the training dataset. The RMSE decreased from 1.71 on the training resamples to 1.21 on the test set. Together, these results suggest that the models fit to the imputed data generalize well and are not overfit.
# Create matrix to store predictions for all models
predictions_mat <- matrix(NA, nrow = nrow(test_df), ncol = n_impute_datasets)
for (i in 1:n_impute_datasets) {
  # Extract the imputed dataset
  full_imputed <- mice::complete(CMP_imputations, i)
  # Subset test rows
  test_imputed <- full_imputed[-train_idx, ]
  # Predict test data using imputed training data
  predictions_mat[, i] <- predict(models[[i]], newdata = test_imputed)
}
# Average the predictions (following Rubin's rules for integrating MICE datasets)
# and then calculate performance metrics
mean_predictions <- rowMeans(predictions_mat)
caret::postResample(pred = mean_predictions, obs = test_imputed$Yield)
## RMSE Rsquared MAE
## 1.2130215 0.5866628 0.9846066
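It is also worth checking how much the individual imputations disagree before pooling, by scoring each model's predictions separately (a minimal sketch; output not shown):
# Test-set R^2 for each imputation's model before averaging (sketch)
sapply(seq_len(n_impute_datasets), function(i)
  caret::postResample(pred = predictions_mat[, i], obs = test_df$Yield)['Rsquared'])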
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Answer:
Because the MICE imputation process generates multiple imputed datasets, each of which has an associated PLS model, I compared variable importance for each model.
The top 10 important variables in all models included an
approximately equal number of biological material and manufacturing
process predictors. The most important predictor in all 5 models was
Process32. The second most important predictor across
models was Process36 (3 models) and Material06
(2 models). The third most important predictor was
Process17 (3 models) and Material02 (2
models).
varImp(models[[1]])
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## Process32 100.00
## Process36 68.98
## Process17 64.78
## Process13 62.99
## Process09 61.77
## Process33 52.47
## Material06 50.02
## Material02 46.83
## Material03 46.64
## Material08 44.49
## Material11 43.30
## Material12 42.95
## Process12 42.14
## Material04 41.81
## Material01 40.74
## Process28 40.42
## Process04 38.83
## Process06 38.39
## Process34 37.91
## Process11 31.84
varImp(models[[2]])
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## Process32 100.00
## Material06 81.20
## Material02 80.87
## Process36 80.84
## Material03 72.10
## Material04 66.66
## Process09 64.90
## Process13 64.60
## Process33 63.39
## Material01 58.96
## Process17 58.85
## Material08 58.64
## Material12 54.07
## Material11 50.58
## Process06 50.29
## Process04 45.30
## Process12 45.15
## Process28 42.72
## Process11 38.60
## Process15 38.27
varImp(models[[3]])
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## Process32 100.00
## Process36 66.95
## Process17 65.49
## Process13 64.24
## Process09 60.92
## Material06 51.64
## Process33 50.45
## Material03 48.26
## Material02 47.60
## Material08 45.49
## Material11 44.61
## Material12 43.49
## Process12 42.25
## Material04 42.11
## Material01 41.00
## Process06 39.99
## Process28 39.16
## Process11 37.55
## Process04 36.77
## Process34 35.19
varImp(models[[4]])
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## Process32 100.00
## Material06 81.20
## Material02 80.87
## Process36 80.38
## Material03 72.10
## Material04 66.66
## Process09 64.90
## Process13 64.60
## Process33 63.39
## Material01 58.96
## Process17 58.85
## Material08 58.64
## Material12 54.07
## Material11 50.58
## Process06 50.41
## Process12 45.15
## Process28 42.72
## Process04 42.44
## Process15 38.27
## Process02 37.99
varImp(models[[5]])
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## Process32 100.00
## Process36 64.81
## Process17 63.58
## Process13 61.51
## Process09 60.15
## Material06 48.74
## Process33 46.79
## Material03 45.76
## Material02 44.92
## Material08 43.50
## Material11 42.57
## Material12 41.88
## Process12 40.71
## Material04 40.22
## Material01 39.12
## Process06 38.70
## Process28 38.48
## Process04 37.54
## Process34 35.19
## Process11 30.90
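To summarize the five tables above, the importance scores can be averaged across the models (a minimal sketch, assuming varImp returns the predictors in the same order for each model; output not shown):
# Mean variable importance across the five PLS models (sketch)
imp_mat <- sapply(models, function(m) caret::varImp(m)$importance$Overall)
rownames(imp_mat) <- rownames(caret::varImp(models[[1]])$importance)
head(sort(rowMeans(imp_mat), decreasing = TRUE), 10)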
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Answer: Of the five most important variables identified in part (e), manufacturing Process32 has the strongest linear relationship with the target variable Yield (Pearson \(r=0.66\)). Its univariate linear model has an adjusted \(R^2 = 0.43\), meaning that this variable alone explains approximately 43% of the variance in Yield.
Among the biological material predictors, Material06 and Material02 have moderately strong linear relationships with Yield (Pearson \(r=0.54\) and \(r=0.53\), respectively). The univariate linear models for these predictors each have an adjusted \(R^2 = 0.28\), meaning that each of these variables explains ~28% of the variance in Yield.
These results suggest that optimization (maximization) of
Process32 is most important to increase product yield and,
in turn, revenue. Although biological predictors cannot be changed (per
exercise instructions), Material06 and
Material02 may be useful for quality control of the raw
material.
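Before examining each predictor individually, the correlations of all five top predictors with Yield can be computed in one call (a minimal sketch using the training data; output not shown):
# Pearson correlations of the top five predictors with Yield (sketch)
top_vars <- c('Process32', 'Process36', 'Process17', 'Material06', 'Material02')
round(cor(train_df[, top_vars], train_df$Yield, use = 'pairwise.complete.obs'), 2)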
Process32 and Yield

Process32 and Yield have a moderately strong Pearson correlation coefficient (\(r=0.66\)), and the scatterplot is consistent with a linear relationship. In the univariate linear model, the adjusted \(R^2 = 0.43\), which indicates that Process32 accounts for ~43% of the variance in Yield.
pearson_r <- round(cor(train_df$Process32, train_df$Yield,
                       method = 'pearson', use = 'complete.obs'), 2)
ggplot(drop_na(train_df, Process32), aes(x = Process32, y = Yield)) +
  geom_point() +
  labs(
    x = 'Process32',
    y = 'Yield',
    title = 'Relationship between Manufacturing Process32 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 170, y = 44,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )
process32_yield_lm <- lm(Yield ~ Process32, data = train_df)
summary(process32_yield_lm)
##
## Call:
## lm(formula = Yield ~ Process32, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0278 -0.9197 0.0285 0.7453 4.3623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.49328 3.48874 1.861 0.0651 .
## Process32 0.21248 0.02198 9.666 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.383 on 122 degrees of freedom
## Multiple R-squared: 0.4337, Adjusted R-squared: 0.4291
## F-statistic: 93.43 on 1 and 122 DF, p-value: < 2.2e-16
Process36 and Yield

Although Process36 and Yield have a moderately strong Pearson correlation coefficient (\(r=-0.54\)), the scatterplot shows that Process36 takes discrete rather than continuous values, so the relationship is not truly linear and a linear model may be unsuitable.
pearson_r <- round(cor(train_df$Process36, train_df$Yield,
                       method = 'pearson', use = 'complete.obs'), 2)
ggplot(drop_na(train_df, Process36), aes(x = Process36, y = Yield)) +
  geom_point() +
  labs(
    x = 'Process36',
    y = 'Yield',
    title = 'Relationship between Manufacturing Process36 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 0.0175, y = 44,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )
process36_yield_lm <- lm(Yield ~ Process36, data = drop_na(train_df, Process36))
summary(process36_yield_lm)
##
## Call:
## lm(formula = Yield ~ Process36, data = drop_na(train_df, Process36))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5609 -1.0209 0.0412 1.0513 4.4405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.547 3.095 19.889 < 2e-16 ***
## Process36 -1091.389 158.054 -6.905 2.7e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.561 on 118 degrees of freedom
## Multiple R-squared: 0.2878, Adjusted R-squared: 0.2818
## F-statistic: 47.68 on 1 and 118 DF, p-value: 2.697e-10
Process17 and Yield

Although Process17 and Yield have a moderate Pearson correlation coefficient (\(r=-0.39\)), the scatterplot below suggests that the best-fit line is influenced by several outliers/leverage points. Among the remaining data points, the linear relationship is much less clear.
pearson_r <- round(cor(train_df$Process17, train_df$Yield,
                       method = 'pearson', use = 'complete.obs'), 2)
ggplot(drop_na(train_df, Process17), aes(x = Process17, y = Yield)) +
  geom_point() +
  labs(
    x = 'Process17',
    y = 'Yield',
    title = 'Relationship between Manufacturing Process17 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 39, y = 38.5,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )
process17_yield_lm <- lm(Yield ~ Process17, data = drop_na(train_df, Process17))
summary(process17_yield_lm)
##
## Call:
## lm(formula = Yield ~ Process17, data = drop_na(train_df, Process17))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2401 -1.2786 -0.0765 1.1542 4.3784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.7629 4.5908 13.454 < 2e-16 ***
## Process17 -0.6286 0.1337 -4.701 6.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.69 on 122 degrees of freedom
## Multiple R-squared: 0.1534, Adjusted R-squared: 0.1464
## F-statistic: 22.1 on 1 and 122 DF, p-value: 6.874e-06
Material06 and Yield

Material06 and Yield have a moderately strong Pearson correlation coefficient (\(r=0.54\)), and the scatterplot is consistent with a linear relationship. In the univariate linear model, the adjusted \(R^2 = 0.28\), which indicates that Material06 accounts for ~28% of the variance in Yield.
pearson_r <- round(cor(train_df$Material06, train_df$Yield,
                       method = 'pearson', use = 'complete.obs'), 2)
ggplot(drop_na(train_df, Material06), aes(x = Material06, y = Yield)) +
  geom_point() +
  labs(
    x = 'Material06',
    y = 'Yield',
    title = 'Relationship between Biological Material06 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 55, y = 44,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )
material06_yield_lm <- lm(Yield ~ Material06, data = drop_na(train_df, Material06))
summary(material06_yield_lm)
##
## Call:
## lm(formula = Yield ~ Material06, data = drop_na(train_df, Material06))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4822 -1.0046 -0.0941 0.9965 4.2775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.48936 1.81427 15.152 < 2e-16 ***
## Material06 0.25922 0.03691 7.023 1.33e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.55 on 122 degrees of freedom
## Multiple R-squared: 0.2879, Adjusted R-squared: 0.2821
## F-statistic: 49.33 on 1 and 122 DF, p-value: 1.333e-10
Material02 and Yield

Material02 and Yield have a moderately strong Pearson correlation coefficient (\(r=0.53\)), and the scatterplot is consistent with a linear relationship. In the univariate linear model, the adjusted \(R^2 = 0.28\), which indicates that Material02 accounts for ~28% of the variance in Yield.
pearson_r <- round(cor(train_df$Material02, train_df$Yield,
                       method = 'pearson', use = 'complete.obs'), 2)
ggplot(drop_na(train_df, Material02), aes(x = Material02, y = Yield)) +
  geom_point() +
  labs(
    x = 'Material02',
    y = 'Yield',
    title = 'Relationship between Biological Material02 and Yield'
  ) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_text(x = 55, y = 43,
            label = paste0('r = ', pearson_r),
            color = 'blue') +
  theme_classic() +
  theme(
    plot.title = element_text(face = 'bold'),
    axis.title = element_text(face = 'bold')
  )
material02_yield_lm <- lm(Yield ~ Material02, data = drop_na(train_df, Material02))
summary(material02_yield_lm)
##
## Call:
## lm(formula = Yield ~ Material02, data = drop_na(train_df, Material02))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5713 -1.0240 -0.1391 1.0325 4.4461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.00597 1.89330 14.264 < 2e-16 ***
## Material02 0.23618 0.03382 6.985 1.62e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.553 on 122 degrees of freedom
## Multiple R-squared: 0.2856, Adjusted R-squared: 0.2798
## F-statistic: 48.78 on 1 and 122 DF, p-value: 1.623e-10
pander::pander(sessionInfo())
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: pander(v.0.6.6), mice(v.3.19.0), naniar(v.1.1.0), elasticnet(v.1.3), lars(v.1.3), pls(v.2.9-0), caret(v.7.0-1), lattice(v.0.22-7), lubridate(v.1.9.5), forcats(v.1.0.1), stringr(v.1.6.0), dplyr(v.1.2.0), purrr(v.1.2.1), readr(v.2.1.6), tidyr(v.1.3.2), tibble(v.3.3.1), ggplot2(v.4.0.2), tidyverse(v.2.0.0) and AppliedPredictiveModeling(v.1.1-7)
loaded via a namespace (and not attached): Rdpack(v.2.6.5), pROC(v.1.19.0.1), rlang(v.1.1.7), magrittr(v.2.0.4), rpart.plot(v.3.1.4), otel(v.0.2.0), compiler(v.4.5.2), mgcv(v.1.9-4), vctrs(v.0.7.2), reshape2(v.1.4.5), pkgconfig(v.2.0.3), shape(v.1.4.6.1), fastmap(v.1.2.0), backports(v.1.5.0), labeling(v.0.4.3), utf8(v.1.2.6), rmarkdown(v.2.30), prodlim(v.2025.04.28), tzdb(v.0.5.0), nloptr(v.2.2.1), visdat(v.0.6.0), xfun(v.0.56), glmnet(v.4.1-10), jomo(v.2.7-6), cachem(v.1.1.0), jsonlite(v.2.0.0), recipes(v.1.3.1), pan(v.1.9), broom(v.1.0.12), parallel(v.4.5.2), cluster(v.2.1.8.1), R6(v.2.6.1), CORElearn(v.1.57.3.1), bslib(v.0.10.0), stringi(v.1.8.7), RColorBrewer(v.1.1-3), parallelly(v.1.46.1), boot(v.1.3-32), rpart(v.4.1.24), jquerylib(v.0.1.4), Rcpp(v.1.1.1), iterators(v.1.0.14), knitr(v.1.51), future.apply(v.1.20.1), Matrix(v.1.7-4), splines(v.4.5.2), nnet(v.7.3-20), timechange(v.0.4.0), tidyselect(v.1.2.1), rstudioapi(v.0.18.0), yaml(v.2.3.12), timeDate(v.4052.112), codetools(v.0.2-20), listenv(v.0.10.0), plyr(v.1.8.9), withr(v.3.0.2), S7(v.0.2.1), evaluate(v.1.0.5), future(v.1.69.0), survival(v.3.8-6), pillar(v.1.11.1), foreach(v.1.5.2), stats4(v.4.5.2), reformulas(v.0.4.3.1), ellipse(v.0.5.0), generics(v.0.1.4), hms(v.1.1.4), scales(v.1.4.0), minqa(v.1.2.8), globals(v.0.18.0), class(v.7.3-23), glue(v.1.8.0), tools(v.4.5.2), data.table(v.1.18.2.1), lme4(v.1.1-38), ModelMetrics(v.1.2.2.2), gower(v.1.0.2), grid(v.4.5.2), plotrix(v.3.8-14), rbibutils(v.2.4.1), ipred(v.0.9-15), nlme(v.3.1-168), cli(v.3.6.5), lava(v.1.8.2), gtable(v.0.3.6), sass(v.0.4.10), digest(v.0.6.39), farver(v.2.1.2), htmltools(v.0.5.9), lifecycle(v.1.0.5), hardhat(v.1.4.2), mitml(v.0.4-5) and MASS(v.7.3-65)
1. Tuzen MF (2025). Handling missing data in R: A comprehensive guide. https://www.r-bloggers.com/2025/08/handling-missing-data-in-r-a-comprehensive-guide/
2. van Buuren S. Flexible Imputation of Missing Data, 2nd ed. https://stefvanbuuren.name/fimd/sec-pmm.html