This assignment completes Problems 6.2 and 6.3 from Kuhn and
Johnson’s Applied Predictive Modeling. These problems focus on
regression methods from Chapter 6 and compare how different models
perform on two pharmaceutical data sets. I used the data sets included
with the AppliedPredictiveModeling package and evaluated
models using cross-validation and a held-out test set.
Developing a model to predict permeability could help a pharmaceutical company reduce time and cost by screening compounds before sending them through laboratory testing.
Start R and use these commands to load the data.
data(permeability)
data(fingerprints)
dim(permeability)
## [1] 165 1
dim(fingerprints)
## [1] 165 1107
head(permeability)
## permeability
## 1 12.520
## 2 1.120
## 3 19.405
## 4 1.730
## 5 1.680
## 6 0.510
The fingerprints data contains 1,107 binary predictors
for 165 compounds, while permeability contains the response
values.
Filter out the predictors that have low frequencies using
nearZeroVar. How many predictors are left for modeling?
nzv_cols <- nearZeroVar(fingerprints)
fingerprints_filtered <- fingerprints[, -nzv_cols]
dim(fingerprints_filtered)
## [1] 165 388
ncol(fingerprints_filtered)
## [1] 388
After removing near-zero variance predictors, there are 388 predictors left for modeling.
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
set.seed(5889)
perm_index <- createDataPartition(permeability, p = 0.80, list = FALSE)
perm_train_x <- fingerprints_filtered[perm_index, ]
perm_test_x <- fingerprints_filtered[-perm_index, ]
perm_train_y <- permeability[perm_index]
perm_test_y <- permeability[-perm_index]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
set.seed(5889)
pls_fit <- train(
x = perm_train_x,
y = perm_train_y,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = ctrl
)
pls_fit$bestTune
## ncomp
## 3 3
max(pls_fit$results$Rsquared)
## [1] 0.525382
pls_fit$results %>% arrange(desc(Rsquared)) %>% head()
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 3 11.26236 0.5253820 8.216301 3.103523 0.2178676 1.985033
## 2 9 11.65280 0.5190977 8.767226 2.882035 0.2197635 1.977037
## 3 4 11.43912 0.5186295 8.410240 3.087314 0.2236229 1.878985
## 4 10 11.74210 0.5160713 8.813084 2.927320 0.2183269 1.944957
## 5 8 11.67862 0.5110293 8.834702 2.767934 0.2109630 1.867351
## 6 2 11.44487 0.5093270 8.047187 3.201866 0.2466197 1.950012
ggplot(pls_fit$results, aes(x = ncomp, y = Rsquared)) +
geom_line() +
geom_point() +
labs(
title = "PLS Cross-Validated R-squared by Number of Components",
x = "Number of Components",
y = expression(R^2)
) +
theme_minimal()
The optimal number of latent variables is 3, and the corresponding resampled \(R^2\) is about 0.51.
Predict the response for the test set. What is the test set estimate of \(R^2\)?
pls_pred <- predict(pls_fit, perm_test_x)
pls_perf <- postResample(pls_pred, perm_test_y)
pls_perf
## RMSE Rsquared MAE
## 13.9266124 0.2791087 10.7374415
The test-set \(R^2\) is about 0.28. This is lower than the resampled training \(R^2\), which suggests the model does not generalize as strongly on unseen data as the cross-validation estimate might suggest.
Try building other models discussed in this chapter. Do any have better predictive performance?
set.seed(5889)
ridge_fit <- train(
x = perm_train_x,
y = perm_train_y,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 20)),
preProcess = c("center", "scale"),
trControl = ctrl
)
set.seed(5889)
lasso_fit <- train(
x = perm_train_x,
y = perm_train_y,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 20)),
preProcess = c("center", "scale"),
trControl = ctrl
)
set.seed(5889)
enet_fit <- train(
x = perm_train_x,
y = perm_train_y,
method = "glmnet",
tuneLength = 10,
preProcess = c("center", "scale"),
trControl = ctrl
)
ridge_pred <- predict(ridge_fit, perm_test_x)
lasso_pred <- predict(lasso_fit, perm_test_x)
enet_pred <- predict(enet_fit, perm_test_x)
ridge_perf <- postResample(ridge_pred, perm_test_y)
lasso_perf <- postResample(lasso_pred, perm_test_y)
enet_perf <- postResample(enet_pred, perm_test_y)
model_results_62 <- tibble(
Model = c("PLS", "Ridge", "Lasso", "Elastic Net"),
Rsquared = c(pls_perf["Rsquared"], ridge_perf["Rsquared"], lasso_perf["Rsquared"], enet_perf["Rsquared"]),
RMSE = c(pls_perf["RMSE"], ridge_perf["RMSE"], lasso_perf["RMSE"], enet_perf["RMSE"]),
MAE = c(pls_perf["MAE"], ridge_perf["MAE"], lasso_perf["MAE"], enet_perf["MAE"])
)
model_results_62
## # A tibble: 4 × 4
## Model Rsquared RMSE MAE
## <chr> <dbl> <dbl> <dbl>
## 1 PLS 0.279 13.9 10.7
## 2 Ridge 0.369 12.6 9.30
## 3 Lasso 0.445 11.9 8.31
## 4 Elastic Net 0.469 11.5 8.01
Based on the test-set results, Elastic Net performs the best, followed by Lasso, then Ridge, and finally PLS. This makes sense because the permeability data has many sparse binary predictors, so methods that shrink coefficients and perform feature selection are especially helpful.
Would you recommend any of your models to replace the permeability laboratory experiment?
I would not fully replace the lab experiment yet. The Elastic Net model performs better than the other models, but its predictive strength is still moderate. I think it would be more useful as a screening tool to help prioritize which compounds should move forward for laboratory testing. That could still save time and money without depending completely on the model.
This problem uses data from a chemical manufacturing process. The goal is to understand how biological measurements and manufacturing process variables relate to product yield.
Start R and use these commands to load the data.
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176 58
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
The data has 176 observations and 58 variables total, including the
response Yield.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
cmp_y <- ChemicalManufacturingProcess$Yield
cmp_x <- ChemicalManufacturingProcess %>% select(-Yield)
missing_counts <- sapply(cmp_x, function(x) sum(is.na(x)))
missing_counts[missing_counts > 0]
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 3 15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 1 2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10
## 1 1 9
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14
## 10 1 1
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 5 5 5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 5 5 5
## ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34
## 5 5 5
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40
## 5 5 1
## ManufacturingProcess41
## 1
set.seed(5889)
cmp_imputer <- preProcess(cmp_x, method = "knnImpute")
cmp_x_imputed <- predict(cmp_imputer, cmp_x)
sum(is.na(cmp_x_imputed))
## [1] 0
I used KNN imputation to fill in the missing predictor values. After imputation, there are no missing values left in the predictor set.
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
I chose a PLS model for this part.
set.seed(5889)
cmp_index <- createDataPartition(cmp_y, p = 0.80, list = FALSE)
cmp_train_x <- cmp_x_imputed[cmp_index, ]
cmp_test_x <- cmp_x_imputed[-cmp_index, ]
cmp_train_y <- cmp_y[cmp_index]
cmp_test_y <- cmp_y[-cmp_index]
set.seed(5889)
cmp_pls_fit <- train(
x = cmp_train_x,
y = cmp_train_y,
method = "pls",
preProcess = c("center", "scale", "BoxCox"),
tuneLength = 20,
trControl = ctrl
)
cmp_pls_fit$bestTune
## ncomp
## 3 3
cmp_pls_fit$results %>% filter(ncomp == cmp_pls_fit$bestTune$ncomp)
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 3 1.42273 0.5327401 1.087 0.5126658 0.1796494 0.2441878
The best PLS model uses 3 components. The resampled performance for that setting gives an \(R^2\) of about 0.53.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
cmp_pred <- predict(cmp_pls_fit, cmp_test_x)
cmp_perf <- postResample(cmp_pred, cmp_test_y)
cmp_perf
## RMSE Rsquared MAE
## 0.9518583 0.7325815 0.7905189
The test-set \(R^2\) is about 0.73, which is higher than the resampled training \(R^2\) from cross-validation. This means the model performed better on the test split than expected. That can happen sometimes with a favorable split, although in general I would still rely on repeated validation before drawing a strong conclusion.
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
cmp_imp <- varImp(cmp_pls_fit, scale = TRUE)
cmp_imp
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 90.09
## ManufacturingProcess09 88.26
## ManufacturingProcess17 86.82
## ManufacturingProcess36 82.08
## ManufacturingProcess33 59.65
## ManufacturingProcess06 59.21
## ManufacturingProcess11 58.39
## BiologicalMaterial02 55.77
## BiologicalMaterial08 54.37
## BiologicalMaterial06 53.37
## ManufacturingProcess12 51.76
## BiologicalMaterial04 50.79
## BiologicalMaterial03 49.79
## BiologicalMaterial12 49.79
## BiologicalMaterial11 48.74
## BiologicalMaterial01 47.91
## ManufacturingProcess28 42.90
## ManufacturingProcess04 40.03
## ManufacturingProcess37 38.31
plot(cmp_imp, top = 20, main = "Top 20 Predictors for Yield")
The most important predictors are mostly manufacturing process variables, which suggests that the process predictors dominate the model rather than the biological raw material variables.
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
imp_df <- cmp_imp$importance %>%
tibble::rownames_to_column("Predictor") %>%
arrange(desc(Overall))
head(imp_df, 10)
## Predictor Overall
## 1 ManufacturingProcess32 100.00000
## 2 ManufacturingProcess13 90.08629
## 3 ManufacturingProcess09 88.26199
## 4 ManufacturingProcess17 86.81590
## 5 ManufacturingProcess36 82.08255
## 6 ManufacturingProcess33 59.64919
## 7 ManufacturingProcess06 59.21093
## 8 ManufacturingProcess11 58.38776
## 9 BiologicalMaterial02 55.76742
## 10 BiologicalMaterial08 54.37120
top_vars <- imp_df$Predictor[1:3]
top_vars
## [1] "ManufacturingProcess32" "ManufacturingProcess13" "ManufacturingProcess09"
cmp_plot_data <- cmp_x_imputed %>%
mutate(Yield = cmp_y)
for (v in top_vars) {
print(
ggplot(cmp_plot_data, aes(x = .data[[v]], y = Yield)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE, color = "steelblue") +
labs(
title = paste("Yield vs", v),
x = v,
y = "Yield"
) +
theme_minimal()
)
}
The top predictors show how specific process conditions are related to yield. Some of them appear to have positive relationships with yield, while others show flatter or negative patterns. This is useful because the process variables can be adjusted, unlike the biological variables. If a certain process measurement consistently aligns with higher yield, the company can use that information to fine-tune future runs and improve production efficiency.
Overall, these two exercises show that different regression methods can behave very differently depending on the structure of the predictors. For the permeability data, regularized models such as Elastic Net performed better than PLS because the predictors were sparse and high dimensional. For the chemical manufacturing data, the PLS model performed reasonably well and highlighted that manufacturing process variables were more influential than the biological variables. In both cases, the models are useful for insight and screening, but I would still be careful about using them as complete replacements for real world experiments or production testing.