This is homework seven of the Fall 2024 edition of DATA 624. The assignment covers questions 6.2 and 6.3 from the exercise section of chapter 6 in Applied Predictive Modeling by Max Kuhn and Kjell Johnson
First, most of the requried libraries.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(lars)
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:caret':
##
## R2
##
## The following object is masked from 'package:stats':
##
## loadings
library(stats)
library(corrplot)
## corrplot 0.95 loaded
##
## Attaching package: 'corrplot'
##
## The following object is masked from 'package:pls':
##
## corrplot
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(robustbase)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
This exercise consists of component questions A through F that are about developing permeability prediction models for a pharmaceutical company.
The first question just says to load the library for the book and the dataset used.
library(AppliedPredictiveModeling)
data(permeability)
The question states: ” The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?”
Answer: there are 388 predictors left after filtering out those with low frequencies.
nzv_fingers <- preProcess(fingerprints, method = c("nzv")) |>
predict(fingerprints)
prem_predict <- ncol(nzv_fingers)
prem_predict
## [1] 388
The question states: “Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?”
Answer:
There are 10 optimal latent variables and the corresponding resampled estimate of R2 is 0.36914.
nzv_perm_s <- cbind(nzv_fingers, permeability)
nzv_perm <- preProcess(nzv_perm_s, method = c("BoxCox", "center", "scale", "knnImpute")) |>
predict(nzv_perm_s)
part <- createDataPartition(nzv_perm[permeability], p = 0.75, list = FALSE)
nzv_perm_train <- nzv_perm[part,]
nzv_perm_test <- nzv_perm[-part,]
pls_control <- trainControl(method = "LOOCV")
pls_train <- train(permeability ~ .,
data = nzv_perm_train,
method = "pls",
tuneLength = 20,
trControl = pls_control)
pls_train
## Partial Least Squares
##
## 102 samples
## 388 predictors
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8781172 0.1819578 0.6729183
## 2 0.8292031 0.2712637 0.6167052
## 3 0.8111174 0.3106445 0.6081974
## 4 0.8358035 0.2981609 0.6292403
## 5 0.8134162 0.3507931 0.5976846
## 6 0.8137202 0.3396873 0.5801619
## 7 0.7865501 0.3760534 0.5630216
## 8 0.7704538 0.4047623 0.5462394
## 9 0.7563150 0.4337708 0.5623967
## 10 0.7428234 0.4578516 0.5524178
## 11 0.7429511 0.4672472 0.5597181
## 12 0.7598164 0.4636119 0.5857559
## 13 0.7712966 0.4587113 0.5939703
## 14 0.7833992 0.4554922 0.5997252
## 15 0.7976450 0.4455459 0.6057548
## 16 0.8235224 0.4323917 0.6299468
## 17 0.8265127 0.4379625 0.6334531
## 18 0.8366675 0.4321632 0.6458152
## 19 0.8308357 0.4389615 0.6446405
## 20 0.8291036 0.4427953 0.6515345
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
opt_lat_vars <- pls_train$bestTune$ncomp
rs_r2 <- max(pls_train$results$Rsquared)
cat("There are ", opt_lat_vars, " optimal latent variables and the corresponding resampled estimate of R2 is ", rs_r2)
## There are 10 optimal latent variables and the corresponding resampled estimate of R2 is 0.4672472
Answer:
The test set estimate of R2 is 0.5593098
predict(pls_train, nzv_perm_test) %>%
data.frame(pred = .,obs = nzv_perm_test[,"permeability"]) %>%
defaultSummary()
## RMSE Rsquared MAE
## 0.9063238 0.4024881 0.7284953
Answer:
First model: robust linear regression
Results: Samples: 102 Predictors: 388
The robust linear model yielded an RMSE of 0.9446 and an R² of 0.3589, which is lower than the previous model’s R² of 0.5593, indicating that the model did not perform as well in terms of explained variance. The MAE of 0.7090 is slightly higher, suggesting less accuracy in predicting permeability compared to the earlier model. Overall, the previous model with a better R² and lower RMSE was more effective in capturing the relationship between predictors and permeability.
nzv_fingers_lr <- preProcess(fingerprints, method = c("nzv")) |>
predict(fingerprints)
nzv_perm_rl <- cbind(nzv_fingers_lr, permeability)
nzv_perm_lr <- preProcess(nzv_perm_rl, method = c("BoxCox", "center", "scale", "knnImpute")) |>
predict(nzv_perm_rl)
part_lr <- createDataPartition(nzv_perm_rl[permeability], p = 0.75, list = FALSE)
perm_train_rl <- nzv_perm_lr[part_lr,]
perm_test_rl <- nzv_perm_lr[-part_lr,]
control_lr <- trainControl(method = "LOOCV")
robust_lines <- train(permeability ~ .,
data = perm_train_rl,
method = "rlm",
trControl = control_lr,
preProcess = "pca"
)
robust_lines
## Robust Linear Model
##
## 102 samples
## 388 predictors
##
## Pre-processing: principal component signal extraction (388), centered
## (388), scaled (388)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 0.8976401 0.2227545 0.6643225
## FALSE psi.hampel 0.8872977 0.2296714 0.6624136
## FALSE psi.bisquare 0.8772084 0.2463988 0.6370917
## TRUE psi.huber 0.8494446 0.2596366 0.5852497
## TRUE psi.hampel 0.8498867 0.2595142 0.5938601
## TRUE psi.bisquare 0.8557224 0.2713834 0.5979869
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.
predict(robust_lines, perm_test_rl) %>%
data.frame(pred = .,obs = perm_test_rl[,"permeability"]) %>%
defaultSummary()
## RMSE Rsquared MAE
## 1.0322024 0.2818927 0.8391725
Second model: Elastic net regression Results: Samples: 102 Predictors: 388
The Elastic Net model, using bootstrapped resampling with 10 repetitions, resulted in an RMSE of 0.6883, an R² of 0.5943, and an MAE of 0.5342. This model explored a grid of lambda and fraction parameters and selected the optimal values of lambda = 0.1 and fraction = 0.15 based on the smallest RMSE. The resampling process indicates that the Elastic Net model performs moderately well, with an R² value showing that it captures about 59.4% of the variance in the data. The relatively low MAE further indicates that the model is accurate in predicting permeability values with minimal error.
ctrl_enet <- trainControl(method = "boot", number = 10)
enet_g <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
enet_reg <- train(permeability ~ .,
data = nzv_perm_train,
method = "enet",
tuneGrid = enet_g,
trControl = ctrl_enet)
## Warning: model fit failed for Resample01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample02: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample03: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample07: lambda=0.10, fraction=1 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning: model fit failed for Resample07: lambda=0.01, fraction=1 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning: model fit failed for Resample07: lambda=0.00, fraction=1 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) :
## Some of the columns of x have zero variance
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enet_reg
## Elasticnet
##
## 102 samples
## 388 predictors
##
## No pre-processing
## Resampling: Bootstrapped (10 reps)
## Summary of sample sizes: 102, 102, 102, 102, 102, 102, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 4.394198e+18 0.13963134 2.524463e+18
## 0.00 0.10 8.793506e+18 0.11951013 5.061322e+18
## 0.00 0.15 1.319305e+19 0.11328248 7.598182e+18
## 0.00 0.20 1.759264e+19 0.11084286 1.013504e+19
## 0.00 0.25 2.199226e+19 0.11246765 1.267190e+19
## 0.00 0.30 2.639190e+19 0.11519907 1.520876e+19
## 0.00 0.35 3.079153e+19 0.11690819 1.774563e+19
## 0.00 0.40 3.519118e+19 0.11846028 2.028249e+19
## 0.00 0.45 3.959082e+19 0.11950352 2.281935e+19
## 0.00 0.50 4.399047e+19 0.11780550 2.535622e+19
## 0.00 0.55 4.839012e+19 0.11561806 2.789308e+19
## 0.00 0.60 5.278977e+19 0.10844410 3.042994e+19
## 0.00 0.65 5.718942e+19 0.09850609 3.296680e+19
## 0.00 0.70 6.158907e+19 0.08941738 3.550367e+19
## 0.00 0.75 6.598872e+19 0.08106742 3.804053e+19
## 0.00 0.80 7.038838e+19 0.07299187 4.057739e+19
## 0.00 0.85 7.478803e+19 0.06617802 4.311426e+19
## 0.00 0.90 7.918768e+19 0.06120767 4.565112e+19
## 0.00 0.95 8.358734e+19 0.05570198 4.818798e+19
## 0.00 1.00 8.798699e+19 0.05585483 5.072484e+19
## 0.01 0.05 8.040407e-01 0.38399608 6.163561e-01
## 0.01 0.10 9.057823e-01 0.32839687 6.578943e-01
## 0.01 0.15 1.012421e+00 0.32255602 7.070616e-01
## 0.01 0.20 1.100088e+00 0.31823323 7.460490e-01
## 0.01 0.25 1.179027e+00 0.31748344 7.862242e-01
## 0.01 0.30 1.252522e+00 0.32413778 8.313183e-01
## 0.01 0.35 1.336961e+00 0.31805691 8.900324e-01
## 0.01 0.40 1.403742e+00 0.33177944 9.327543e-01
## 0.01 0.45 1.484727e+00 0.32659500 9.804213e-01
## 0.01 0.50 1.574956e+00 0.31245560 1.039180e+00
## 0.01 0.55 1.666594e+00 0.29357814 1.101546e+00
## 0.01 0.60 1.761351e+00 0.26982362 1.167418e+00
## 0.01 0.65 1.850694e+00 0.24952984 1.229821e+00
## 0.01 0.70 1.937829e+00 0.23145432 1.289463e+00
## 0.01 0.75 2.020336e+00 0.21890626 1.346155e+00
## 0.01 0.80 2.097705e+00 0.21032750 1.401407e+00
## 0.01 0.85 2.171083e+00 0.20339907 1.455422e+00
## 0.01 0.90 2.244306e+00 0.19712141 1.510278e+00
## 0.01 0.95 2.318920e+00 0.19051065 1.564348e+00
## 0.01 1.00 2.388428e+00 0.18357246 1.613712e+00
## 0.10 0.05 7.846397e-01 0.41544994 6.204690e-01
## 0.10 0.10 7.477181e-01 0.39263831 5.717002e-01
## 0.10 0.15 7.632911e-01 0.36531357 5.731313e-01
## 0.10 0.20 7.872249e-01 0.36000783 5.846228e-01
## 0.10 0.25 8.069199e-01 0.35900920 5.960359e-01
## 0.10 0.30 8.221232e-01 0.36275591 6.051782e-01
## 0.10 0.35 8.292357e-01 0.36581136 6.074162e-01
## 0.10 0.40 8.352146e-01 0.36508725 6.113573e-01
## 0.10 0.45 8.371041e-01 0.36716001 6.116030e-01
## 0.10 0.50 8.407872e-01 0.36738913 6.130232e-01
## 0.10 0.55 8.433480e-01 0.36693080 6.157443e-01
## 0.10 0.60 8.467317e-01 0.36564692 6.200721e-01
## 0.10 0.65 8.490593e-01 0.36503848 6.240524e-01
## 0.10 0.70 8.528130e-01 0.36298615 6.291960e-01
## 0.10 0.75 8.582206e-01 0.35890671 6.356003e-01
## 0.10 0.80 8.643251e-01 0.35355880 6.415157e-01
## 0.10 0.85 8.708280e-01 0.34759902 6.473002e-01
## 0.10 0.90 8.769254e-01 0.34186247 6.523038e-01
## 0.10 0.95 8.829158e-01 0.33655499 6.569560e-01
## 0.10 1.00 8.880407e-01 0.33182394 6.611402e-01
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1 and lambda = 0.1.
predict(enet_reg, nzv_perm_test) %>%
data.frame(pred = .,obs = nzv_perm_test[,"permeability"]) %>%
defaultSummary()
## RMSE Rsquared MAE
## 0.7336849 0.5555956 0.6215807
Overall conclusion: When comparing the models overall, the Elastic Net model performed the best, with the highest R² (0.5943) and the lowest RMSE (0.6883). The robust linear model, with an R² of 0.3589 and RMSE of 0.9446, was less effective in explaining the variance and predicting permeability accurately compared to the Elastic Net.
Answer: I wouldn’t recommend using any of the models trained here to replace lab experiments for permeability due to a lot of variability. Pharmaceutical companies need much tighter standards than the results here provide.
This exercise consists of component questions A through F
data(ChemicalManufacturingProcess)
The question states: “A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.”
Answer:
I used median impute to insert the median value where a missing value is.
resolve_miss <- preProcess(ChemicalManufacturingProcess, method = c("medianImpute")) |>
predict(ChemicalManufacturingProcess)
The question states: “Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?”
Answer:
I will use Ridge Regression. This is a model that which imposes a penalty on the coefficients based on the different lambdas that we will be training over.
First, preprocessing and setting up the train and test sets:
model_tuner <- preProcess(resolve_miss, method = c("center", "scale")) |>
predict(resolve_miss)
ridge_slices <- createDataPartition(model_tuner$Yield, p = 0.75, list = FALSE)
ridge_train <- model_tuner[ridge_slices,]
ridge_test <- model_tuner[-ridge_slices,]
cat("The training set has dimensions of: \n")
## The training set has dimensions of:
dim(ridge_train)
## [1] 132 58
Finding optimal performance metric: The optimal value of the performance metric RMSE is 1.831127, with a lambda of 0.1.
ridge_control <- trainControl(method = "boot", number = 10)
ridge_roller <- expand.grid(.alpha = 0, .lambda = seq(0, .1, length = 15))
ridge_reg <- train(Yield ~ .,
data = ridge_train,
method = "glmnet",
tuneGrid = ridge_roller,
trControl = ridge_control)
ridge_reg
## glmnet
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (10 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 1.196050 0.3507894 0.7094885
## 0.007142857 1.196050 0.3507894 0.7094885
## 0.014285714 1.196050 0.3507894 0.7094885
## 0.021428571 1.196050 0.3507894 0.7094885
## 0.028571429 1.196050 0.3507894 0.7094885
## 0.035714286 1.196050 0.3507894 0.7094885
## 0.042857143 1.196050 0.3507894 0.7094885
## 0.050000000 1.196246 0.3506916 0.7098193
## 0.057142857 1.196356 0.3503859 0.7096693
## 0.064285714 1.193111 0.3505462 0.7080116
## 0.071428571 1.188541 0.3490222 0.7047708
## 0.078571429 1.176216 0.3468999 0.7002409
## 0.085714286 1.161984 0.3455241 0.6954936
## 0.092857143 1.148870 0.3447713 0.6910561
## 0.100000000 1.136762 0.3443828 0.6869802
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.1.
The question states: Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Answer: The RMSE for the test set is 0.7467, which is significantly lower than the resampled RMSE of 1.8311 on the training set. This suggests that the model generalizes well to the test set, showing better performance than during the resampling on the training set.
predict(ridge_reg, ridge_test) %>%
data.frame(pred = .,obs = ridge_test$Yield) %>%
defaultSummary()
## RMSE Rsquared MAE
## 1.0316564 0.3028784 0.6387433
The question states: Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Answer: There are 8 process predictors and 7 biological predictors in the top 15. Neither process dominates the list.
plot(varImp(ridge_reg), top = 15)
ggtitle("Top 15 predictors")
## $title
## [1] "Top 15 predictors"
##
## attr(,"class")
## [1] "labels"
The question states: “Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?”
Answer: The relationships between the top predictors and yield show some clear trends. For example, ManufacturingProcess32 and BiologicalMaterial06 have strong positive correlations, meaning boosting them could improve yield in future runs. On the other hand, ManufacturingProcess36 and ManufacturingProcess13 negatively impact yield, so controlling those might help reduce their drag on the process.
ridge_15 <- varImp(ridge_reg)$importance
ridge_15 <- as.data.frame(ridge_15)
ridge_predict <- rownames(ridge_15)[order(ridge_15[, 1], decreasing = TRUE)[1:15]]
for (rider in ridge_predict) {
yields <- ggplot(ridge_train, aes(x = .data[[rider]], y = .data[["Yield"]])) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "skyblue") +
ggtitle(paste("Relationship between", rider, "and Yield")) +
theme_minimal()
print(yields)
}
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
ridge_corrs <- sapply(ridge_predict, function(x) cor(ridge_train[[x]], ridge_train$Yield, use = "complete.obs"))
ridges <- data.frame(Predictor = ridge_predict, Correlation = ridge_corrs)
ridges
## Predictor Correlation
## ManufacturingProcess16 ManufacturingProcess16 -0.11954627
## ManufacturingProcess32 ManufacturingProcess32 0.61144797
## ManufacturingProcess09 ManufacturingProcess09 0.52399357
## ManufacturingProcess34 ManufacturingProcess34 0.27984300
## ManufacturingProcess04 ManufacturingProcess04 -0.20106926
## ManufacturingProcess45 ManufacturingProcess45 0.09796569
## ManufacturingProcess19 ManufacturingProcess19 0.08636538
## BiologicalMaterial03 BiologicalMaterial03 0.43941513
## ManufacturingProcess13 ManufacturingProcess13 -0.48442916
## ManufacturingProcess36 ManufacturingProcess36 -0.53070122
## ManufacturingProcess15 ManufacturingProcess15 0.19536228
## ManufacturingProcess28 ManufacturingProcess28 0.24254129
## ManufacturingProcess24 ManufacturingProcess24 -0.17173179
## ManufacturingProcess11 ManufacturingProcess11 0.37170717
## ManufacturingProcess37 ManufacturingProcess37 -0.10184077