6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
- Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
library(caret)
library(AppliedPredictiveModeling)data(permeability)
str(permeability)## num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr "permeability"
- The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
The data presents that 388 predictors were left for modeling.
low_freq <- nearZeroVar(fingerprints)
X <- fingerprints[,-low_freq]
str(X)## num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...
- Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
For the data below, it was splitted into train/test set, using the createDataPartition function.
set.seed(17)
t_Row <- createDataPartition(permeability, p=0.8, list=FALSE)
X_train <- X[t_Row, ]
y_train <- permeability[t_Row, ]
X_test <- X[-t_Row, ]
y_test <- permeability[-t_Row, ]set.seed(17)
pls_Fit <- train(x=X_train,
y=y_train,
method='pls',
metric='Rsquared',
tuneLength=20,
trControl=trainControl(method='cv'),
preProcess=c('center', 'scale')
)
pls_Result <- pls_Fit$results
pls_Fit## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 121, 120, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.78558 0.3512671 9.984752
## 2 11.48566 0.4770397 8.255759
## 3 11.68110 0.4609739 8.677375
## 4 11.85819 0.4512404 8.909198
## 5 11.74985 0.4588907 9.022578
## 6 11.88420 0.4591695 9.075945
## 7 12.06412 0.4541519 9.255652
## 8 12.11218 0.4698237 9.401114
## 9 12.15479 0.4783946 9.271606
## 10 12.42207 0.4654890 9.326995
## 11 12.70168 0.4581171 9.406034
## 12 12.94741 0.4575447 9.663124
## 13 13.10568 0.4468662 9.735696
## 14 13.27580 0.4408783 9.883621
## 15 13.49070 0.4393861 9.954010
## 16 13.92057 0.4236587 10.308067
## 17 14.30686 0.4144228 10.473196
## 18 14.73955 0.3977299 10.818888
## 19 14.91533 0.3963792 11.034378
## 20 15.29482 0.3823758 11.384072
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.
- Predict the response for the test set. What is the test set estimate of R2?
R2 test set is 0.2986943.
pls_Pred <- predict(pls_Fit, newdata=X_test)
postResample(pred=pls_Pred, obs=y_test)## RMSE Rsquared MAE
## 11.455821 0.530198 8.381798
- Try building other models discussed in this chapter. Do any have better prediective performance?
FOr these problem, there will be 3 models. The model with the MAX R2 will be the Lasso model showing us R2 0.5459518
set.seed(17)
r_Fit <- train(x=X_train,
y=y_train,
method='ridge',
metric='Rsquared',
tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)## Warning: model fit failed for Fold02: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold10: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
r_Fit## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 121, 120, 119, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 17.69824 0.3387828 12.323778
## 0.1 13.16484 0.4586224 9.573189
## 0.2 12.90509 0.4751686 9.501242
## 0.3 13.03168 0.4820743 9.650479
## 0.4 13.30036 0.4863153 9.935894
## 0.5 13.67743 0.4894886 10.256849
## 0.6 14.14675 0.4912908 10.638401
## 0.7 14.66246 0.4929628 11.020308
## 0.8 15.21765 0.4943953 11.426074
## 0.9 15.81805 0.4954898 11.900490
## 1.0 16.44342 0.4964945 12.452772
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
plot(r_Fit)set.seed(17)
lasso_Fit <- train(x=X_train,
y=y_train,
method='lasso',
metric='Rsquared',
tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)## Warning: model fit failed for Fold02: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold10: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = X_train, y = y_train, method = "lasso", metric =
## "Rsquared", : missing values found in aggregated results
lasso_Fit## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 121, 120, 119, 119, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 14.99248 NaN 11.899869
## 0.05 12.04587 0.5004433 8.961563
## 0.10 12.58317 0.4619831 8.828892
## 0.15 14.10187 0.4392581 9.776441
## 0.20 15.02759 0.4202044 10.174406
## 0.25 15.08830 0.4083345 10.103200
## 0.30 15.14589 0.4052072 10.229122
## 0.35 15.18624 0.4047464 10.293555
## 0.40 15.35506 0.3980124 10.405897
## 0.45 15.58076 0.3886588 10.627774
## 0.50 15.84203 0.3764416 10.839115
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
plot(lasso_Fit)set.seed(1)
enet_Fit <- train(x=X_train,
y=y_train,
method='enet',
metric='Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = X_train, y = y_train, method = "enet", metric =
## "Rsquared", : missing values found in aggregated results
enet_Fit## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 15.41970 NaN 12.091419
## 0.0 0.1 11.61723 0.4611141 8.257619
## 0.0 0.2 12.14064 0.4505336 8.783101
## 0.0 0.3 12.78913 0.4447343 9.268132
## 0.0 0.4 13.58610 0.4374408 9.827298
## 0.0 0.5 14.48706 0.4210673 10.382041
## 0.0 0.6 15.26000 0.4102524 10.671864
## 0.0 0.7 15.93989 0.4048366 10.913214
## 0.0 0.8 16.69682 0.3992267 11.321393
## 0.0 0.9 17.48924 0.3871120 11.763618
## 0.0 1.0 18.58494 0.3778306 12.330318
## 0.1 0.0 15.41970 NaN 12.091419
## 0.1 0.1 11.35996 0.4915684 8.017335
## 0.1 0.2 11.27531 0.5015887 8.043442
## 0.1 0.3 11.20513 0.5110566 8.027647
## 0.1 0.4 11.26273 0.5110763 8.129069
## 0.1 0.5 11.49214 0.5022354 8.305879
## 0.1 0.6 11.71000 0.4927793 8.430194
## 0.1 0.7 11.90351 0.4878547 8.538342
## 0.1 0.8 12.11296 0.4819953 8.661963
## 0.1 0.9 12.33892 0.4753494 8.822856
## 0.1 1.0 12.55343 0.4696305 8.992569
## 0.2 0.0 15.41970 NaN 12.091419
## 0.2 0.1 11.29819 0.4971055 8.081225
## 0.2 0.2 11.34551 0.5072120 7.928544
## 0.2 0.3 11.28566 0.5174924 7.990192
## 0.2 0.4 11.30461 0.5220751 8.080592
## 0.2 0.5 11.46627 0.5177649 8.298248
## 0.2 0.6 11.62510 0.5109649 8.433006
## 0.2 0.7 11.82620 0.5055078 8.564932
## 0.2 0.8 12.00071 0.5017981 8.651009
## 0.2 0.9 12.22749 0.4948431 8.813655
## 0.2 1.0 12.42486 0.4892806 8.962422
## 0.3 0.0 15.41970 NaN 12.091419
## 0.3 0.1 11.27466 0.5009573 8.091070
## 0.3 0.2 11.47359 0.5099557 7.881453
## 0.3 0.3 11.47396 0.5213435 8.063961
## 0.3 0.4 11.49929 0.5262336 8.167894
## 0.3 0.5 11.65714 0.5240146 8.410613
## 0.3 0.6 11.79819 0.5187685 8.587430
## 0.3 0.7 11.98769 0.5141657 8.760497
## 0.3 0.8 12.19169 0.5096006 8.902773
## 0.3 0.9 12.39220 0.5049634 9.058682
## 0.3 1.0 12.59417 0.4995325 9.233599
## 0.4 0.0 15.41970 NaN 12.091419
## 0.4 0.1 11.25404 0.5056775 8.080373
## 0.4 0.2 11.63823 0.5113009 7.858957
## 0.4 0.3 11.71973 0.5236318 8.190912
## 0.4 0.4 11.76623 0.5286027 8.344216
## 0.4 0.5 11.94615 0.5268705 8.576430
## 0.4 0.6 12.09957 0.5230824 8.833440
## 0.4 0.7 12.29631 0.5190477 9.046553
## 0.4 0.8 12.51110 0.5146867 9.232418
## 0.4 0.9 12.72124 0.5100153 9.400000
## 0.4 1.0 12.93584 0.5044749 9.573351
## 0.5 0.0 15.41970 NaN 12.091419
## 0.5 0.1 11.24428 0.5077260 8.072662
## 0.5 0.2 11.80485 0.5129863 7.828495
## 0.5 0.3 12.00793 0.5251863 8.351482
## 0.5 0.4 12.09280 0.5299907 8.561475
## 0.5 0.5 12.30083 0.5288039 8.822076
## 0.5 0.6 12.47467 0.5261406 9.117000
## 0.5 0.7 12.68591 0.5224979 9.369244
## 0.5 0.8 12.90625 0.5186852 9.569376
## 0.5 0.9 13.13479 0.5136638 9.763197
## 0.5 1.0 13.35206 0.5085455 9.954409
## 0.6 0.0 15.41970 NaN 12.091419
## 0.6 0.1 11.24510 0.5086957 8.058453
## 0.6 0.2 12.00165 0.5143831 7.848258
## 0.6 0.3 12.34253 0.5257645 8.540670
## 0.6 0.4 12.47138 0.5308852 8.805267
## 0.6 0.5 12.71386 0.5299948 9.132140
## 0.6 0.6 12.91580 0.5278109 9.424597
## 0.6 0.7 13.14201 0.5245287 9.706237
## 0.6 0.8 13.37205 0.5213038 9.919503
## 0.6 0.9 13.61274 0.5164102 10.144165
## 0.6 1.0 13.83953 0.5114326 10.351894
## 0.7 0.0 15.41970 NaN 12.091419
## 0.7 0.1 11.24866 0.5092526 8.040058
## 0.7 0.2 12.21310 0.5152180 7.909587
## 0.7 0.3 12.71338 0.5259876 8.746622
## 0.7 0.4 12.89518 0.5311812 9.102461
## 0.7 0.5 13.17641 0.5305080 9.466376
## 0.7 0.6 13.41332 0.5285888 9.771724
## 0.7 0.7 13.65402 0.5258617 10.054410
## 0.7 0.8 13.89839 0.5229935 10.301344
## 0.7 0.9 14.14595 0.5185389 10.538146
## 0.7 1.0 14.38595 0.5135687 10.753672
## 0.8 0.0 15.41970 NaN 12.091419
## 0.8 0.1 11.25438 0.5092670 8.022478
## 0.8 0.2 12.43212 0.5159349 7.999328
## 0.8 0.3 13.10814 0.5258492 8.960724
## 0.8 0.4 13.35914 0.5307945 9.435246
## 0.8 0.5 13.67805 0.5304771 9.816573
## 0.8 0.6 13.95180 0.5287587 10.124423
## 0.8 0.7 14.20887 0.5266314 10.431117
## 0.8 0.8 14.47080 0.5239292 10.705678
## 0.8 0.9 14.72472 0.5199382 10.947697
## 0.8 1.0 14.97598 0.5152163 11.169986
## 0.9 0.0 15.41970 NaN 12.091419
## 0.9 0.1 11.26398 0.5091136 8.000174
## 0.9 0.2 12.67388 0.5162232 8.143464
## 0.9 0.3 13.51715 0.5256316 9.245476
## 0.9 0.4 13.85627 0.5302268 9.805245
## 0.9 0.5 14.20852 0.5303995 10.197555
## 0.9 0.6 14.51987 0.5288872 10.527307
## 0.9 0.7 14.79832 0.5270899 10.845497
## 0.9 0.8 15.07545 0.5245864 11.125941
## 0.9 0.9 15.33779 0.5210748 11.373539
## 0.9 1.0 15.60098 0.5165384 11.612292
## 1.0 0.0 15.41970 NaN 12.091419
## 1.0 0.1 11.27752 0.5088380 7.975301
## 1.0 0.2 12.93414 0.5159634 8.304355
## 1.0 0.3 13.93146 0.5254618 9.522955
## 1.0 0.4 14.37678 0.5295917 10.179910
## 1.0 0.5 14.76285 0.5301598 10.590262
## 1.0 0.6 15.11221 0.5288528 10.965464
## 1.0 0.7 15.41130 0.5273473 11.279058
## 1.0 0.8 15.70466 0.5250636 11.577120
## 1.0 0.9 15.98121 0.5216979 11.869047
## 1.0 1.0 16.25240 0.5175524 12.137198
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.7.
plot(enet_Fit) f. Would you recommend any of your models to replace the permeability laboratory experiment?
I would not recommend any of the models used to replace the permeability laboratory experiment. The histo graph below will show us the majority of the results, which are below 5.
hist(permeability)6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is the understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch.
- Start R and use these commands to load the data:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
df.raw <- ChemicalManufacturingProcess- A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
library(mice) ## Warning: package 'mice' was built under R version 4.0.5
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
imp_Data <- mice(df.raw, meth='pmm', seed=500, printFlag = FALSE)## Warning: Number of logged events: 675
df <- complete(imp_Data,1)
# Split df into predictor columns and yield column data frames
df.pred <- df[, -which(names(df) == "Yield")]
df.yield <- df[, which(names(df) == "Yield")]- Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# Split the data into training and test sets
set.seed(139)
trn.rows <- createDataPartition(df$Yield, p = 0.80, list = FALSE)
# Pre-processing, removed highly correlated and near-zero predictors
df.pp <- preProcess(df.pred, method = c("BoxCox", "center", "scale", "nzv", "corr"))
df.pred_trans <- predict(df.pp, df.pred)
# Train model
df.yield <- as.data.frame(df.yield)
df.pred_train <- df.pred_trans[trn.rows, ]
df.yield_train <- df.yield[trn.rows, ]
df.pred_test <- df.pred_trans[-trn.rows, ]
df.yield_test <- df.yield[-trn.rows, ]
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
model <- train(x = df.pred_train, y = df.yield_train,
method = "lmStepAIC", trControl = train.control, trace = FALSE)
model$results## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 1.690168 0.5082275 1.198341 1.274591 0.2478585 0.5173695
model$finalModel##
## Call:
## lm(formula = .outcome ~ BiologicalMaterial05 + ManufacturingProcess01 +
## ManufacturingProcess03 + ManufacturingProcess04 + ManufacturingProcess09 +
## ManufacturingProcess10 + ManufacturingProcess13 + ManufacturingProcess18 +
## ManufacturingProcess20 + ManufacturingProcess26 + ManufacturingProcess28 +
## ManufacturingProcess29 + ManufacturingProcess32 + ManufacturingProcess33 +
## ManufacturingProcess35 + ManufacturingProcess37 + ManufacturingProcess43 +
## ManufacturingProcess45, data = dat)
##
## Coefficients:
## (Intercept) BiologicalMaterial05 ManufacturingProcess01
## 40.1278 0.2944 0.1446
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess09
## -0.1290 0.4066 0.8204
## ManufacturingProcess10 ManufacturingProcess13 ManufacturingProcess18
## -0.3212 -0.4319 2.1384
## ManufacturingProcess20 ManufacturingProcess26 ManufacturingProcess28
## -2.0278 -1.3721 -0.2547
## ManufacturingProcess29 ManufacturingProcess32 ManufacturingProcess33
## 1.5687 1.6208 -0.7606
## ManufacturingProcess35 ManufacturingProcess37 ManufacturingProcess43
## -0.1428 -0.1946 0.1735
## ManufacturingProcess45
## 0.2959
- Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
df.pred_test_results <- predict(model, df.pred_test)
df.compare <- defaultSummary(data.frame(obs = df.yield_test, pred = df.pred_test_results))
df.compare## RMSE Rsquared MAE
## 1.1619742 0.6380871 0.9782865
df.compare2 <- defaultSummary(data.frame(obs = df.yield_test[2:32], pred = df.pred_test_results[2:32]))
df.compare2## RMSE Rsquared MAE
## 1.1777534 0.6289377 0.9952139
- Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
model.imp <- varImp(model)
ggplot(model.imp, top = 10) + ggtitle("Top 10 Predictors in Model")- Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
From looking at the top 10, I noticed that 7 are manufacturing processes and 3 of them are biological material.
par(mfrow=c(2,3))
plot(df$ManufacturingProcess32, df$Yield, xlab = "ManufacturingProcess32", ylab = "Yield")
plot(df$BiologicalMaterial06, df$Yield, xlab = "BiologicalMaterial06", ylab = "Yield")
plot(df$ManufacturingProcess13, df$Yield, xlab = "ManufacturingProcess13", ylab = "Yield")
plot(df$BiologicalMaterial03, df$Yield, xlab = "BiologicalMaterial03", ylab = "Yield")
plot(df$ManufacturingProcess17, df$Yield, xlab = "ManufacturingProcess17", ylab = "Yield")
plot(df$ManufacturingProcess36, df$Yield, xlab = "ManufacturingProcess36", ylab = "Yield")