DATA 624 Homework 7
1-) Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
fingerprints_filtered <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fingerprints_filtered)
[1] 165 388
Comment: Of the 1,107 predictors originally in the fingerprints matrix, 388 remain for modeling after the near-zero-variance fingerprints are filtered out.
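For context on why so many fingerprints are dropped, nearZeroVar can return its diagnostics rather than just the offending column indices. A minimal sketch, assuming caret is loaded and fingerprints is in the workspace:
nzv_stats <- nearZeroVar(fingerprints, saveMetrics = TRUE)
table(nzv_stats$nzv)                                # TRUE = flagged as near-zero variance
head(nzv_stats[order(-nzv_stats$freqRatio), ], 5)   # the most imbalanced fingerprints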
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(0001)
# index for training
index <- createDataPartition(permeability, p = .8, list = FALSE)
# training data
training_perm <- permeability[index, ]
training_fp <- fingerprints[index, ]
# test
test_perm <- permeability[-index, ]
test_fingerprints <- fingerprints[-index, ]
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- train(training_fp, training_perm, method = "pls", metric = "Rsquared",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(plsTune)
plsTune
Partial Least Squares
133 samples
1107 predictors
Pre-processing: centered (1107), scaled (1107)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 121, 119, 120, 120, 120, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 12.12952 0.4011904 9.305278
2 10.94876 0.5105796 7.679658
3 11.00854 0.5290113 7.754137
4 11.20289 0.5225587 7.893784
5 11.24215 0.4983011 7.954592
6 11.20096 0.4969518 8.044244
7 11.19472 0.5018788 7.934196
8 11.25831 0.4982870 8.022500
9 11.10065 0.5071009 7.863240
10 10.91092 0.5152740 7.929696
11 10.85914 0.5196915 7.953547
12 10.85129 0.5185189 7.980026
13 11.02004 0.5057632 8.088382
14 11.16722 0.4951365 8.259389
15 11.21854 0.4901999 8.296766
16 11.19973 0.4921694 8.268655
17 11.15026 0.4970592 8.248743
18 11.27166 0.4879151 8.291501
19 11.33344 0.4805361 8.271034
20 11.39877 0.4740136 8.247344
Rsquared was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 3.
Comment: The optimal tuning used 3 latent variables (ncomp = 3), with a corresponding resampled R2 of 0.5290113.
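The chosen number of components and its resampled metrics can also be pulled from the train object directly; a small sketch assuming plsTune from above:
plsTune$bestTune                                          # ncomp selected by the Rsquared metric
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)  # resampled RMSE, Rsquared, MAE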
(d) Predict the response for the test set. What is the test set estimate of R2?
fingerprints_predict <- predict(plsTune, test_fingerprints)
postResample(fingerprints_predict, test_perm)
      RMSE  Rsquared       MAE
15.9087452 0.1281726 10.9640796
Comment: The test set estimate of R2 is 0.1281726, well below the resampled estimate of 0.529.
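A quick way to see why the test-set R2 drops is to plot observed against predicted permeability; a sketch reusing the objects above (base graphics, no extra packages):
plot(test_perm, fingerprints_predict,
     xlab = "Observed permeability", ylab = "Predicted permeability",
     main = "PLS test-set predictions")
abline(0, 1, lty = 2)   # points far from this line drive the low test-set R2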
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
- Elastic Net Regression
library(elasticnet)
library(caret)
set.seed(0000004)
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
#Removing zero-variance predictors from training_fp
nzv <- nearZeroVar(training_fp)
training_fp_clean <- training_fp[, -nzv]
# tuning penalized regression model
enetTune <- train(training_fp_clean, training_perm, method = "enet",
tuneGrid = enetGrid, trControl = ctrl, preProc = c("center", "scale"))
plot(enetTune)
enetTune
Elasticnet
133 samples
384 predictors
Pre-processing: centered (384), scaled (384)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 120, 119, 118, 120, 120, ...
Resampling results across tuning parameters:
lambda fraction RMSE Rsquared MAE
0.00 0.05 12.33567 0.4676084 9.329956
0.00 0.10 11.60878 0.4912699 8.481412
0.00 0.15 11.45539 0.4999618 8.249311
0.00 0.20 11.47138 0.4947464 8.387189
0.00 0.25 11.42483 0.4915525 8.397655
0.00 0.30 11.43015 0.4837780 8.391405
0.00 0.35 11.51210 0.4727124 8.448593
0.00 0.40 11.64681 0.4615121 8.475189
0.00 0.45 11.73717 0.4579426 8.470615
0.00 0.50 11.90697 0.4512992 8.503301
0.00 0.55 12.03958 0.4494149 8.572351
0.00 0.60 12.15125 0.4468859 8.636492
0.00 0.65 12.28727 0.4435424 8.718164
0.00 0.70 12.40138 0.4418729 8.797107
0.00 0.75 12.52158 0.4406467 8.886665
0.00 0.80 12.65177 0.4399997 8.988894
0.00 0.85 12.76168 0.4415999 9.063262
0.00 0.90 12.87882 0.4431133 9.150393
0.00 0.95 13.03417 0.4434389 9.273398
0.00 1.00 13.20537 0.4427072 9.402417
0.01 0.05 18.32558 0.4803515 12.223604
0.01 0.10 24.08812 0.4867543 15.524638
0.01 0.15 29.68211 0.5190491 18.853631
0.01 0.20 34.29551 0.5440964 20.887165
0.01 0.25 38.52580 0.5552071 23.053455
0.01 0.30 42.76244 0.5708897 25.241745
0.01 0.35 47.21529 0.5829363 27.440069
0.01 0.40 51.84910 0.5928855 30.671362
0.01 0.45 56.62818 0.5993495 34.043848
0.01 0.50 61.59247 0.6058822 37.475776
0.01 0.55 66.61483 0.6136996 40.917738
0.01 0.60 71.74613 0.6183458 44.407092
0.01 0.65 76.98639 0.6178066 47.916930
0.01 0.70 82.30906 0.6135315 51.441291
0.01 0.75 87.66894 0.6074345 54.903148
0.01 0.80 92.87197 0.6026133 58.194595
0.01 0.85 98.08663 0.5975690 61.506843
0.01 0.90 103.33599 0.5903806 64.839742
0.01 0.95 108.51112 0.5857670 68.118289
0.01 1.00 113.65206 0.5813243 71.366463
0.10 0.05 11.87540 0.4999528 8.968315
0.10 0.10 11.37970 0.4790711 8.129369
0.10 0.15 11.28786 0.4828563 8.225508
0.10 0.20 11.19156 0.4925086 8.174445
0.10 0.25 11.00415 0.5120630 8.064748
0.10 0.30 10.85888 0.5250280 7.939745
0.10 0.35 10.80365 0.5341819 7.841756
0.10 0.40 10.75136 0.5444253 7.689580
0.10 0.45 10.67963 0.5561593 7.598058
0.10 0.50 10.58480 0.5680503 7.511252
0.10 0.55 10.50778 0.5772374 7.422204
0.10 0.60 10.43172 0.5853835 7.338467
0.10 0.65 10.35096 0.5932400 7.282791
0.10 0.70 10.30909 0.5988273 7.253586
0.10 0.75 10.27973 0.6033158 7.242195
0.10 0.80 10.28814 0.6045892 7.265009
0.10 0.85 10.31172 0.6042252 7.298486
0.10 0.90 10.33246 0.6037145 7.324348
0.10 0.95 10.34068 0.6042237 7.329113
0.10 1.00 10.34134 0.6049299 7.324218
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.75 and lambda = 0.1.
enet_predict <- predict(enetTune, test_fingerprints)
postResample(enet_predict, test_perm)
      RMSE  Rsquared       MAE
13.9743213 0.3564037 10.6458697
Comment: Here, the test-set R2 (0.3564037) is higher and the RMSE (13.9743213) and MAE (10.6458697) are both lower than for the PLS model. Since we want a smaller RMSE and a higher R2, the elastic net performs better on the test set.
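As an aside, the fitted elasticnet object can be queried for its coefficients at the selected fraction, which shows how many fingerprints the penalty actually keeps. A sketch assuming enetTune from above (the fraction value 0.75 is the one caret selected; element names follow the elasticnet package's predict.enet output):
beta <- predict(enetTune$finalModel, s = 0.75, mode = "fraction",
                type = "coefficients")$coefficients
sum(beta != 0)   # number of fingerprints with non-zero coefficients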
- Least Angle Regression (LARS)
set.seed(0101)
larsTune <- train(training_fp, training_perm, method = "lars", metric = "Rsquared",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"), use.Gram = FALSE)
plot(larsTune)
larsTune
Least Angle Regression
133 samples
1107 predictors
Pre-processing: centered (1107), scaled (1107)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 121, 120, 119, 120, 119, ...
Resampling results across tuning parameters:
fraction RMSE Rsquared MAE
0.05 11.81749 0.5527987 9.175989
0.10 11.17743 0.5249322 8.055655
0.15 11.03644 0.5177277 7.638767
0.20 11.04605 0.5205102 7.478941
0.25 11.01684 0.5290439 7.552544
0.30 10.92316 0.5405664 7.660368
0.35 10.95592 0.5377995 7.811794
0.40 11.16505 0.5208305 7.963949
0.45 11.30381 0.5081337 8.103849
0.50 11.48684 0.4959009 8.269503
0.55 11.70571 0.4837794 8.425487
0.60 12.04788 0.4653065 8.619035
0.65 12.38947 0.4539916 8.811673
0.70 12.63420 0.4497880 8.961094
0.75 12.95054 0.4427984 9.155413
0.80 13.41827 0.4243703 9.492906
0.85 13.94331 0.4054214 9.904694
0.90 14.50708 0.3892031 10.346220
0.95 14.95799 0.3781455 10.718062
1.00 15.51268 0.3625228 11.147956
Rsquared was used to select the optimal model using the largest value.
The final value used for the model was fraction = 0.05.
lars_predict <- predict(larsTune, test_fingerprints)
postResample(lars_predict, test_perm)
      RMSE  Rsquared       MAE
12.2557777 0.5142676  9.4099644
Comment: In this case, the test-set R2 (0.5142676) is the highest and the RMSE (12.2557777) and MAE (9.4099644) are the lowest of the three models. Since we want a smaller RMSE and a higher R2, LARS has the best predictive performance here; the three test-set results are stacked side by side in the sketch below.
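A small sketch that simply stacks the postResample calls already made above, so the comparison can be read in one table:
rbind(
  PLS  = postResample(fingerprints_predict, test_perm),
  ENET = postResample(enet_predict, test_perm),
  LARS = postResample(lars_predict, test_perm)
)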
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
I would recommend the Least Angle Regression (LARS) model, as it delivered the best predictive performance of the models tried, with the highest test-set R2 and the lowest RMSE and MAE.
2-) A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
#Determining missing values
sum(is.na(ChemicalManufacturingProcess))
[1] 106
missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missing, ChemicalManufacturingProcess)
sum(is.na(Chemical))
[1] 0
Comment: The ChemicalManufacturingProcess dataset contained 106 missing values, which were imputed using bagged trees. This method leverages all other available variables to predict and fill in the missing data.
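Before imputing, it can be useful to see where the 106 missing cells sit; a minimal sketch run on the raw data:
na_counts <- colSums(is.na(ChemicalManufacturingProcess))
sort(na_counts[na_counts > 0], decreasing = TRUE)   # predictors that contain missing values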
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
- The LARS Method will be used
# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(1122)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_chemical <- Chemical[index, ]
# test
test_chemical <- Chemical[-index, ]
set.seed(9987)
larsTune <- train(Yield ~ ., Chemical , method = "lars", metric = "Rsquared",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(larsTune)
larsTune
Least Angle Regression
176 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 158, 158, 158, 158, 158, 159, ...
Resampling results across tuning parameters:
fraction RMSE Rsquared MAE
0.05 1.272531 0.6045198 1.0446041
0.10 1.167676 0.5970666 0.9406112
0.15 1.212814 0.5834683 0.9629513
0.20 1.458800 0.5455864 1.0276118
0.25 1.556449 0.5325864 1.0610097
0.30 1.746726 0.5040319 1.1245147
0.35 1.900979 0.5245469 1.1536580
0.40 2.028951 0.5104204 1.1954196
0.45 2.009926 0.5105183 1.1925722
0.50 1.972459 0.5086036 1.1893247
0.55 1.922096 0.5145047 1.1809228
0.60 1.908505 0.5097956 1.1876830
0.65 1.903838 0.5059597 1.1933662
0.70 1.891772 0.5089519 1.1946916
0.75 1.958601 0.5150667 1.2130396
0.80 2.008774 0.5209597 1.2265797
0.85 2.194891 0.5120148 1.2727253
0.90 2.672036 0.4822419 1.4058865
0.95 3.204179 0.4659763 1.5445333
1.00 3.537950 0.4491486 1.6293659
Rsquared was used to select the optimal model using the largest value.
The final value used for the model was fraction = 0.05.
Comment: Rsquared was used to select the optimal model using the largest value. The final value used for the model was fraction = 0.05, giving RMSE = 1.272531, R2 = 0.6045198, and MAE = 1.0446041.
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
lars_predict <- predict(larsTune, test_chemical[ ,-1])
postResample(lars_predict, test_chemical[ ,1])
    RMSE Rsquared      MAE
1.468911 0.686915 1.168253
Comment: On the test set, all three metrics are higher than the resampled training estimates: RMSE (1.469 vs 1.273) and MAE (1.168 vs 1.045) are worse, while R2 (0.687 vs 0.605) is actually better.
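caret's getTrainPerf returns the resampled metrics for the selected tuning value in one call, which makes the training-versus-test comparison explicit; a sketch assuming larsTune and the prediction objects above:
getTrainPerf(larsTune)                            # resampled RMSE, Rsquared, MAE for the chosen fraction
postResample(lars_predict, test_chemical[ ,1])    # test-set counterparts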
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
varImp(larsTune)
loess r-squared variable importance
only 20 most important variables shown (out of 56)
Overall
ManufacturingProcess32 100.00
ManufacturingProcess13 90.03
BiologicalMaterial06 84.57
ManufacturingProcess36 75.80
ManufacturingProcess17 74.89
BiologicalMaterial03 73.54
ManufacturingProcess09 70.38
BiologicalMaterial12 67.99
BiologicalMaterial02 65.34
ManufacturingProcess06 58.06
ManufacturingProcess33 49.82
BiologicalMaterial11 48.13
BiologicalMaterial04 47.15
BiologicalMaterial08 41.90
ManufacturingProcess11 41.54
BiologicalMaterial01 39.16
ManufacturingProcess31 35.59
ManufacturingProcess12 33.01
BiologicalMaterial09 32.44
ManufacturingProcess25 23.51
Comment: Only the 20 most important variables (out of 56) are shown. The manufacturing process predictors dominate the list (11 of the 20) and carry the highest overall importance, led by ManufacturingProcess32; a quick count is sketched below.
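The split between process and biological predictors in the top 20 can be counted directly from the importance scores; a sketch assuming larsTune from above:
imp <- varImp(larsTune)$importance
top20 <- head(imp[order(-imp$Overall), , drop = FALSE], 20)
table(ifelse(grepl("^ManufacturingProcess", rownames(top20)),
             "Process", "Biological"))   # 11 process vs 9 biological predictors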
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
library(corrplot)
library(dplyr)
top10 <- varImp(larsTune)$importance |>
arrange(-Overall) |>
head(10)
Chemical |>
select(c("Yield", row.names(top10))) |>
cor() |>
corrplot()
Comment: The correlation plot indicates that ManufacturingProcess32 has the strongest positive correlation with Yield. Among the top ten influential variables, three show a negative correlation with Yield. These insights can guide future manufacturing runs, as they highlight the key predictors that impact yield outcomes. To enhance or maximize yield, it may be beneficial to improve the precision of both the manufacturing process measurements and the biological assessments of the raw materials.
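The signs and sizes of those relationships can also be read off numerically; a sketch reusing the top10 object created above:
yield_cors <- cor(Chemical[, c("Yield", rownames(top10))])["Yield", -1]
sort(yield_cors, decreasing = TRUE)   # signed correlation of each top-10 predictor with Yield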