Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Start R and use these commands to load the data:
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package.How many predictors are left for modeling?
remove<- nearZeroVar(fingerprints)
fingerprints_remove <- fingerprints[,-remove]
1107-length(remove)
## [1] 388
388 Predictors are left for modeling.
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\) ?
set.seed(1)
trainRow <- createDataPartition(permeability, p=0.8, list=FALSE)
fingerprints.train <- fingerprints_remove[trainRow, ]
perm.train <- permeability[trainRow, ]
fingerprints.test <- fingerprints_remove[-trainRow, ]
perm.test <- permeability[-trainRow, ]
ctrl<-trainControl(method = "cv",number=10)
set.seed(1)
plsTune <- train(x=fingerprints.train,
y=perm.train,
method='pls',
metric='Rsquared',
tuneLength=20,
trControl=ctrl,
preProcess=c('center', 'scale')
)
plsTune
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.66375 0.3461521 9.603300
## 2 11.66557 0.4884962 8.364131
## 3 11.86090 0.4462082 9.081715
## 4 12.15438 0.4254096 9.389547
## 5 12.00549 0.4492658 9.076973
## 6 11.70875 0.4618761 8.826006
## 7 11.42799 0.4775314 8.747634
## 8 11.25481 0.4925856 8.533191
## 9 11.03884 0.5066896 8.344461
## 10 10.88506 0.5187689 8.028416
## 11 10.80216 0.5271005 8.012505
## 12 10.81509 0.5247495 8.034195
## 13 10.74326 0.5245241 8.003891
## 14 10.70988 0.5238880 7.892343
## 15 10.87078 0.5198265 8.032440
## 16 11.25383 0.4999003 8.375209
## 17 11.52949 0.4886010 8.619614
## 18 11.46969 0.4907743 8.541789
## 19 11.46414 0.4906053 8.591222
## 20 11.38025 0.5003197 8.484932
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 11.
plot(plsTune)
The optimal number of the model is 11 and the corresponding \(R^2\) is 0.5271005.
Predict the response for the test set. What is the test set estimate of \(R^2\) ?
pre_pls<-predict(plsTune,fingerprints.test)
plsEval<-postResample(perm.test,pre_pls)
plsEval
## RMSE Rsquared MAE
## 15.4341377 0.2899935 11.6639869
The test set estimate of \(R^2\) is 0.2899935.
Try building other models discussed in this chapter. Do any have better predictive performance?
# Ridge Regression Model
ridgeGrid<-data.frame(.lambda=seq(0, 1, by=0.1))
set.seed(100)
ridgeTune <- train(x=fingerprints.train,
y=perm.train,
method='ridge',
tuneGrid=ridgeGrid,
trControl=ctrl,
preProcess=c('center', 'scale')
)
## Warning: model fit failed for Fold02: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
ridgeTune
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 16.50084 0.4546017 11.573579
## 0.1 10.05743 0.6326680 7.480219
## 0.2 10.12322 0.6341198 7.650604
## 0.3 10.42128 0.6294136 7.981911
## 0.4 10.84725 0.6232755 8.411012
## 0.5 11.36024 0.6173390 8.905333
## 0.6 11.94311 0.6119499 9.425962
## 0.7 12.57245 0.6063708 9.963344
## 0.8 13.24342 0.6017165 10.568040
## 0.9 13.95206 0.5968604 11.188737
## 1.0 14.68431 0.5926265 11.805196
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
plot(ridgeTune)
pre_ridge<-predict(ridgeTune,fingerprints.test)
ridgeEval<-postResample(perm.test,pre_ridge)
ridgeEval
## RMSE Rsquared MAE
## 15.0822363 0.3047575 11.4460367
# Lasso
lassoGrid<-data.frame(.fraction = seq(0, 1, length=15))
set.seed(100)
lassoTune <- train(x=fingerprints.train,
y=perm.train,
method='lasso',
metric='Rsquared',
tuneGrid = lassoGrid,
trControl=ctrl,
preProcess=c('center', 'scale')
)
## Warning: model fit failed for Fold02: fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x = fingerprints.train, y = perm.train, method =
## "lasso", : missing values found in aggregated results
lassoTune
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00000000 15.21986 NaN 12.147734
## 0.07142857 11.92336 0.5235776 8.952903
## 0.14285714 11.59693 0.5017420 8.344428
## 0.21428571 11.75040 0.4782170 8.216624
## 0.28571429 11.98869 0.4845000 8.396859
## 0.35714286 12.58079 0.4779456 8.808119
## 0.42857143 13.05193 0.4791460 9.029293
## 0.50000000 13.29917 0.4877874 9.157739
## 0.57142857 13.55496 0.4908656 9.370046
## 0.64285714 13.92496 0.4893982 9.711711
## 0.71428571 14.44127 0.4824490 10.092412
## 0.78571429 14.95779 0.4772738 10.527560
## 0.85714286 15.37833 0.4724820 10.874131
## 0.92857143 15.84402 0.4637750 11.184983
## 1.00000000 16.50084 0.4546017 11.573579
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.07142857.
plot(lassoTune)
pre_lasso<-predict(lassoTune,fingerprints.test)
lassoEval<-postResample(perm.test,pre_lasso)
lassoEval
## RMSE Rsquared MAE
## 11.5653430 0.4652193 8.4497749
# Elastic net model
enetGrid<-expand.grid(.lambda=c(0,0.01,.1),
.fraction=seq(.05,1,length=20))
set.seed(100)
enetTune <- train(x=fingerprints.train,
y=perm.train,
method='enet',
tuneGrid = enetGrid,
trControl=ctrl,
preProcess=c('center', 'scale')
)
## Warning: model fit failed for Fold02: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
enetTune
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 12.024841 0.5513574 9.186678
## 0.00 0.10 11.827952 0.5084157 8.735324
## 0.00 0.15 11.589189 0.4999243 8.305935
## 0.00 0.20 11.675716 0.4804038 8.186218
## 0.00 0.25 11.800389 0.4846398 8.193339
## 0.00 0.30 12.117972 0.4823334 8.504219
## 0.00 0.35 12.533039 0.4777591 8.782175
## 0.00 0.40 12.879531 0.4780507 8.951166
## 0.00 0.45 13.153864 0.4814678 9.065555
## 0.00 0.50 13.299166 0.4877874 9.157739
## 0.00 0.55 13.460915 0.4899097 9.286841
## 0.00 0.60 13.702404 0.4905555 9.507235
## 0.00 0.65 13.973514 0.4886576 9.749367
## 0.00 0.70 14.326698 0.4838837 10.008189
## 0.00 0.75 14.715544 0.4791799 10.322219
## 0.00 0.80 15.048529 0.4762092 10.599880
## 0.00 0.85 15.333798 0.4731119 10.839859
## 0.00 0.90 15.653917 0.4675183 11.067612
## 0.00 0.95 16.014838 0.4605197 11.292103
## 0.00 1.00 16.500836 0.4546017 11.573579
## 0.01 0.05 11.609580 0.4968994 8.291314
## 0.01 0.10 11.532781 0.4901951 8.086809
## 0.01 0.15 11.057345 0.5174952 7.831186
## 0.01 0.20 10.504522 0.5550969 7.531925
## 0.01 0.25 10.308664 0.5719308 7.448272
## 0.01 0.30 10.381179 0.5709778 7.484887
## 0.01 0.35 10.500756 0.5656639 7.529464
## 0.01 0.40 10.595755 0.5651465 7.571646
## 0.01 0.45 10.542463 0.5751054 7.546266
## 0.01 0.50 10.485403 0.5852606 7.519345
## 0.01 0.55 10.474674 0.5910366 7.501623
## 0.01 0.60 10.514679 0.5932466 7.536307
## 0.01 0.65 10.568773 0.5932853 7.606838
## 0.01 0.70 10.669847 0.5877317 7.669878
## 0.01 0.75 10.797136 0.5783334 7.738966
## 0.01 0.80 10.926647 0.5669895 7.825939
## 0.01 0.85 11.045931 0.5570111 7.903453
## 0.01 0.90 11.136319 0.5496761 7.975449
## 0.01 0.95 11.194900 0.5441324 8.045975
## 0.01 1.00 11.219003 0.5412863 8.096688
## 0.10 0.05 12.013457 0.5085162 9.005731
## 0.10 0.10 11.503435 0.5011261 7.976228
## 0.10 0.15 11.549143 0.5003183 7.984338
## 0.10 0.20 11.332394 0.5104083 7.951524
## 0.10 0.25 10.944072 0.5364371 7.709298
## 0.10 0.30 10.569466 0.5609265 7.501385
## 0.10 0.35 10.390329 0.5744176 7.370364
## 0.10 0.40 10.259650 0.5860805 7.294416
## 0.10 0.45 10.159169 0.5952657 7.245888
## 0.10 0.50 10.092269 0.6019772 7.236091
## 0.10 0.55 10.057030 0.6074831 7.253177
## 0.10 0.60 9.999928 0.6163368 7.245178
## 0.10 0.65 9.968883 0.6220938 7.278809
## 0.10 0.70 9.950607 0.6258279 7.301450
## 0.10 0.75 9.966800 0.6274508 7.338499
## 0.10 0.80 9.978597 0.6295832 7.377226
## 0.10 0.85 10.003690 0.6305064 7.410184
## 0.10 0.90 10.024686 0.6313960 7.437714
## 0.10 0.95 10.043811 0.6319760 7.465305
## 0.10 1.00 10.057426 0.6326680 7.480219
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.7 and lambda = 0.1.
plot(enetTune)
pre_enet<-predict(enetTune,fingerprints.test)
enetEval<-postResample(perm.test,pre_enet)
enetEval
## RMSE Rsquared MAE
## 14.1795387 0.3415212 10.7570758
Would you recommend any of your models to replace the permeability laboratory experiment?
eval<-rbind(plsEval,ridgeEval,lassoEval,enetEval)
eval
## RMSE Rsquared MAE
## plsEval 15.43414 0.2899935 11.663987
## ridgeEval 15.08224 0.3047575 11.446037
## lassoEval 11.56534 0.4652193 8.449775
## enetEval 14.17954 0.3415212 10.757076
I would recommend the model of lasso because it has the lowest RMSE, 11.56534 and highest \(R^2\),0.4652193.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand,manufacturing process predictors can be changed in the manufacturing process.Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors)for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
I use KNN() to impute the data.
chemical<-data.frame(ChemicalManufacturingProcess)
chemical<-kNN(chemical) %>%
select(Yield:ManufacturingProcess45)
missmap(chemical)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
indx<-createDataPartition(chemical$Yield,p=0.8,list=FALSE)
chemical_train<-chemical[indx,]
chemical_test<-chemical[-indx,]
#pls model
ctrl<-trainControl(method = "cv",number=10)
set.seed(100)
chemicalTune <- train(x=chemical_train[,-chemical_train$Yield],
y=chemical_train$Yield,
method='pls',
metric='Rsquared',
tuneLength=20,
trControl=ctrl,
preProcess=c('center', 'scale')
)
chemicalTune
## Partial Least Squares
##
## 144 samples
## 48 predictor
##
## Pre-processing: centered (48), scaled (48)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.25622540 0.5348006 0.98058481
## 2 1.18615347 0.6744680 0.82941758
## 3 0.77148156 0.8038311 0.62203254
## 4 0.90005248 0.7708044 0.57929060
## 5 0.81900787 0.8152887 0.46748508
## 6 0.55617593 0.8973282 0.32065388
## 7 0.47166948 0.9131598 0.25032239
## 8 0.48847741 0.9023947 0.21542826
## 9 0.44468505 0.9112637 0.18943482
## 10 0.35339471 0.9283237 0.14801182
## 11 0.30734138 0.9346357 0.12095786
## 12 0.26438977 0.9413105 0.09968859
## 13 0.23621004 0.9467387 0.08703166
## 14 0.20887379 0.9542907 0.07653071
## 15 0.19151855 0.9605248 0.06853747
## 16 0.14451545 0.9763464 0.05152206
## 17 0.09807242 0.9893936 0.03658180
## 18 0.06748535 0.9952778 0.02711116
## 19 0.04198628 0.9981986 0.01882880
## 20 0.02059588 0.9995890 0.01092757
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 20.
plot(chemicalTune)
I choose PLS model. The optimal number of the model is 20 and the corresponding \(R^2\) is 0.9995890.
Predict the response for the test set.What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
predictChemical<-predict(chemicalTune,chemical_test)
result<-postResample(predictChemical,chemical_test$Yield)
result
## RMSE Rsquared MAE
## 0.00964045 0.99997899 0.00758506
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
I use varImp()function to find the important variables in the model.
valuePre<-varImp(chemicalTune)
## Warning: package 'pls' was built under R version 3.6.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
valuePre
## pls variable importance
##
## only 20 most important variables shown (out of 48)
##
## Overall
## Yield 100.00
## ManufacturingProcess32 45.66
## ManufacturingProcess36 39.11
## ManufacturingProcess13 36.53
## ManufacturingProcess09 34.60
## ManufacturingProcess17 34.20
## BiologicalMaterial02 31.17
## BiologicalMaterial06 29.81
## BiologicalMaterial08 27.49
## ManufacturingProcess06 25.99
## BiologicalMaterial01 25.76
## ManufacturingProcess12 25.69
## BiologicalMaterial03 25.29
## BiologicalMaterial12 25.29
## BiologicalMaterial04 24.79
## BiologicalMaterial11 24.62
## ManufacturingProcess11 24.36
## ManufacturingProcess04 23.67
## ManufacturingProcess02 17.88
## BiologicalMaterial10 16.82
plot(valuePre)
Yield is the response value, so the most important predictor is ManufacturingProcess32. From the table, we can see that process predictors account for the top 5 most important predictors in the list.
Explore the relationships between each of the top predictors and the response.How could this information be helpful in improving yield in future runs of the manufacturing process?
chemical%>%
select(c('Yield','ManufacturingProcess32','ManufacturingProcess13','ManufacturingProcess36',
'ManufacturingProcess09','ManufacturingProcess17'))%>%
cor()
## Yield ManufacturingProcess32
## Yield 1.0000000 0.60833215
## ManufacturingProcess32 0.6083321 1.00000000
## ManufacturingProcess13 -0.5036797 -0.10120679
## ManufacturingProcess36 -0.5257340 -0.78800943
## ManufacturingProcess09 0.5034705 0.04100301
## ManufacturingProcess17 -0.4258069 0.01604178
## ManufacturingProcess13 ManufacturingProcess36
## Yield -0.5036797 -0.52573401
## ManufacturingProcess32 -0.1012068 -0.78800943
## ManufacturingProcess13 1.0000000 0.09558950
## ManufacturingProcess36 0.0955895 1.00000000
## ManufacturingProcess09 -0.7913537 -0.05005281
## ManufacturingProcess17 0.7824135 -0.01226725
## ManufacturingProcess09 ManufacturingProcess17
## Yield 0.50347051 -0.42580687
## ManufacturingProcess32 0.04100301 0.01604178
## ManufacturingProcess13 -0.79135366 0.78241345
## ManufacturingProcess36 -0.05005281 -0.01226725
## ManufacturingProcess09 1.00000000 -0.71545604
## ManufacturingProcess17 -0.71545604 1.00000000
I create a correlation table for the top 5 most important predictors, and find out some positive and negative correlations among those variables.ManufacturingProcess13 and ManufacturingProcess09 has a very high negative correlation, -0.7913537. Maybe we can use PCA to combine those variables with the high correlations.