Kuhn and Johnson
6.2) Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
library(ggplot2)
library(caret)
data(permeability)
The permeability data set consists of permeability values for 165 compounds, along with 1,107 binary molecular fingerprints that serve as predictors.
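For reference, these dimensions can be verified directly (a quick check, not part of the original output):
dim(fingerprints)   # 165 compounds x 1,107 binary fingerprint predictors
dim(permeability)   # 165 x 1 matrix of permeability values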
# drop fingerprint predictors with near-zero variance
fingerprints_pred <- fingerprints[, -nearZeroVar(fingerprints)]
ncol(fingerprints_pred)
## [1] 388
There are 388 predictors left for modeling.
# simple sequential split: first 130 compounds for training, last 35 for testing
train <- fingerprints_pred[1:130,]
test <- fingerprints_pred[131:165,]
perm_train <- permeability[1:130]
perm_test <- permeability[131:165]
Pre-processing the Data: Remove highly correlated predictors
## [1] 337
## [1] 51
After removing predictors with pairwise correlations above 0.70, there are 51 predictors remaining (337 predictors were removed).
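The chunk that produced the counts above is not echoed. A minimal sketch of how the filtered training set used below might have been built with caret::findCorrelation, assuming the correlations are computed on the training rows:
# sketch (assumed): drop predictors with pairwise correlations above 0.70
high_corr <- findCorrelation(cor(train), cutoff = 0.70)
length(high_corr)        # the original analysis removed 337 predictors here
filtered_train <- train[, -high_corr]
ncol(filtered_train)     # 51 predictors remain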
#library(pls)
set.seed(100)
#permeability_factor <- factor(perm_train)
ctrl <- trainControl(method="cv", number=10)
# tune a PLS model over 1-20 components with 10-fold cross-validation
plsTune <- train(filtered_train, perm_train,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center", "scale"))
plsTune$bestTune
## ncomp
## 6 6
plsTune$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 11.72611 0.4540572 9.012387 2.953374 0.2431859 1.915306
## 2 2 11.47829 0.4880072 8.952498 2.257391 0.2540143 1.300561
## 3 3 11.33301 0.5068382 8.664895 2.471456 0.2585736 1.568554
## 4 4 11.16332 0.5192035 8.564489 2.493549 0.2556703 1.709089
## 5 5 11.10975 0.5264137 8.636208 2.610961 0.2598653 1.923886
## 6 6 11.09803 0.5336531 8.439225 2.395265 0.2467953 1.848771
## 7 7 11.48336 0.5169864 8.800891 2.002553 0.2270718 1.561122
## 8 8 11.83481 0.5071111 9.198143 1.787063 0.2209411 1.294123
## 9 9 12.05611 0.4986067 9.326616 1.711448 0.2212461 1.310682
## 10 10 12.34111 0.4853267 9.526789 1.805035 0.2185356 1.461593
## 11 11 12.41641 0.4820316 9.591620 1.903479 0.2185578 1.549909
## 12 12 12.70801 0.4733312 9.773184 2.160841 0.2199437 1.751309
## 13 13 12.88747 0.4715066 9.878199 2.203631 0.2188165 1.753641
## 14 14 13.02409 0.4653322 9.938856 2.127580 0.2164901 1.743219
## 15 15 13.21747 0.4601848 10.008844 2.287943 0.2151912 1.873761
## 16 16 13.31960 0.4553963 10.113993 2.333647 0.2132939 1.947461
## 17 17 13.31612 0.4566779 10.138810 2.366780 0.2107587 1.928633
## 18 18 13.35162 0.4550333 10.167580 2.422803 0.2093736 2.008751
## 19 19 13.30279 0.4577714 10.119411 2.437498 0.2090291 2.004324
## 20 20 13.34659 0.4576896 10.126281 2.733897 0.2165527 2.270845
summary(plsTune)
## Data: X dimension: 130 51
## Y dimension: 130 1
## Fit method: oscorespls
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 8.467 13.84 19.45 24.04 29.38 35.31
## .outcome 50.646 64.33 70.37 73.79 74.88 75.87
The optimal model uses 6 latent variables, with a cross-validated RMSE = 11.10 and R\(^2\) = 0.53.
prediction_model <- predict(plsTune, newdata=test)
cor(as.numeric(prediction_model), perm_test) ^ 2
## [1] 0.4365137
R\(^2\) = 0.44 for the test set
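Test-set RMSE and MAE can be obtained the same way with caret::postResample (a sketch; this output is not part of the original write-up):
# sketch: RMSE, R^2, and MAE on the held-out compounds
postResample(pred = as.numeric(prediction_model), obs = perm_test)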
Ordinary Linear Regression
# fit an ordinary least squares model on the filtered training predictors
lm_permeability <- lm(perm_train ~ ., data = data.frame(filtered_train))
# predict() selects the needed columns from the test set by name
predict_olr <- predict(lm_permeability, data.frame(test))
summary(lm_permeability)
##
## Call:
## lm(formula = perm_train ~ ., data = data.frame(filtered_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1307 -3.7464 -0.2306 3.5310 23.2987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9388 7.4204 -0.666 0.50765
## X6 5.7616 2.1757 2.648 0.00979 **
## X11 -3.4212 5.1522 -0.664 0.50863
## X12 -11.1294 7.6506 -1.455 0.14976
## X15 7.2624 6.0661 1.197 0.23485
## X54 -4.9533 2.9117 -1.701 0.09290 .
## X93 -4.0623 2.7311 -1.487 0.14094
## X97 22.3864 8.2017 2.729 0.00784 **
## X98 28.6653 11.5877 2.474 0.01554 *
## X99 -22.3819 8.0957 -2.765 0.00711 **
## X108 -0.8113 5.6805 -0.143 0.88681
## X118 -25.4249 7.5576 -3.364 0.00119 **
## X126 23.6048 5.4457 4.335 4.31e-05 ***
## X141 5.5183 2.0392 2.706 0.00836 **
## X146 -4.3179 13.1439 -0.329 0.74340
## X225 -9.2219 6.4445 -1.431 0.15643
## X230 5.7430 7.0057 0.820 0.41485
## X231 9.3839 5.5819 1.681 0.09674 .
## X237 7.9687 7.9792 0.999 0.32104
## X257 -7.8163 10.1908 -0.767 0.44540
## X258 -34.7656 11.6183 -2.992 0.00371 **
## X278 -4.9290 2.9780 -1.655 0.10191
## X298 -3.0791 7.6935 -0.400 0.69009
## X302 -1.2517 15.2819 -0.082 0.93493
## X306 -0.6484 8.5406 -0.076 0.93967
## X310 17.4500 14.4082 1.211 0.22951
## X315 -4.6924 4.3447 -1.080 0.28346
## X334 4.5317 5.8908 0.769 0.44405
## X337 5.1697 7.5757 0.682 0.49700
## X339 -1.6280 4.6666 -0.349 0.72814
## X345 -5.3291 4.2543 -1.253 0.21409
## X361 -6.1258 5.7336 -1.068 0.28863
## X370 7.0781 5.3352 1.327 0.18848
## X374 6.1117 4.8016 1.273 0.20685
## X403 7.9048 7.4641 1.059 0.29285
## X503 15.8566 6.5797 2.410 0.01831 *
## X507 3.8895 10.6912 0.364 0.71699
## X510 -16.8709 9.7097 -1.738 0.08624 .
## X511 12.3958 10.3243 1.201 0.23352
## X522 -25.0874 17.5697 -1.428 0.15732
## X551 -7.1729 7.9554 -0.902 0.37002
## X561 6.2734 8.6196 0.728 0.46892
## X568 -12.4175 11.1000 -1.119 0.26670
## X571 -3.3140 12.3033 -0.269 0.78836
## X592 8.3698 8.2154 1.019 0.31145
## X597 4.8275 10.2173 0.472 0.63790
## X679 5.7171 7.9480 0.719 0.47410
## X698 3.9105 5.3757 0.727 0.46913
## X704 7.0596 5.8378 1.209 0.23021
## X732 1.7618 5.3396 0.330 0.74232
## X806 -19.7573 9.1619 -2.156 0.03413 *
## X812 8.6987 11.1200 0.782 0.43643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.895 on 78 degrees of freedom
## Multiple R-squared: 0.8028, Adjusted R-squared: 0.6739
## F-statistic: 6.226 on 51 and 78 DF, p-value: 4.316e-13
cor((predict_olr), perm_test) ^ 2
## [1] 0.3324975
Adjusted R\(^2\) = 0.67 on the training set; R\(^2\) = 0.33 on the test set.
Ridge Regression
#library(elasticnet)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
ridgeRegFit <- train(filtered_train, perm_train,
                     method = "ridge",
                     tuneGrid = ridgeGrid,
                     trControl = ctrl,
                     preProc = c("center", "scale"))
ridgeRegFit
## Ridge Regression
##
## 130 samples
## 51 predictor
##
## Pre-processing: centered (51), scaled (51)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 116, 117, 118, 118, 117, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 13.90384 0.4328858 10.401992
## 0.007142857 12.82931 0.4536563 9.593021
## 0.014285714 12.59287 0.4594242 9.422923
## 0.021428571 12.46063 0.4629300 9.333541
## 0.028571429 12.36447 0.4657437 9.262933
## 0.035714286 12.28781 0.4681881 9.208732
## 0.042857143 12.22425 0.4703615 9.163062
## 0.050000000 12.17046 0.4723098 9.126238
## 0.057142857 12.12437 0.4740626 9.096748
## 0.064285714 12.08457 0.4756434 9.071188
## 0.071428571 12.05001 0.4770719 9.047777
## 0.078571429 12.01990 0.4783652 9.027480
## 0.085714286 11.99360 0.4795382 9.009945
## 0.092857143 11.97061 0.4806037 8.999248
## 0.100000000 11.95052 0.4815733 8.990590
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
#summary(ridgeRegFit)
ridgeRegFit$bestTune
## lambda
## 15 0.1
ridgeRegFit$results
## lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.000000000 13.90384 0.4328858 10.401992 4.646651 0.2947083 2.969082
## 2 0.007142857 12.82931 0.4536563 9.593021 3.609178 0.2526136 2.424169
## 3 0.014285714 12.59287 0.4594242 9.422923 3.400985 0.2426287 2.281848
## 4 0.021428571 12.46063 0.4629300 9.333541 3.304443 0.2392667 2.225072
## 5 0.028571429 12.36447 0.4657437 9.262933 3.247645 0.2377684 2.192965
## 6 0.035714286 12.28781 0.4681881 9.208732 3.211615 0.2370077 2.169663
## 7 0.042857143 12.22425 0.4703615 9.163062 3.188404 0.2366152 2.153294
## 8 0.050000000 12.17046 0.4723098 9.126238 3.173790 0.2364374 2.152184
## 9 0.057142857 12.12437 0.4740626 9.096748 3.165219 0.2363986 2.161078
## 10 0.064285714 12.08457 0.4756434 9.071188 3.161030 0.2364563 2.169906
## 11 0.071428571 12.05001 0.4770719 9.047777 3.160086 0.2365843 2.180324
## 12 0.078571429 12.01990 0.4783652 9.027480 3.161580 0.2367653 2.191679
## 13 0.085714286 11.99360 0.4795382 9.009945 3.164929 0.2369873 2.203812
## 14 0.092857143 11.97061 0.4806037 8.999248 3.169698 0.2372413 2.214958
## 15 0.100000000 11.95052 0.4815733 8.990590 3.175559 0.2375208 2.225974
The optimal model has lambda = 0.1, with a cross-validated R\(^2\) = 0.48 and RMSE = 11.95. Because this lambda sits at the upper edge of the tuning grid, a wider grid might improve the fit further, as sketched below.
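The range below is an assumption for illustration, not part of the original analysis:
# sketch: extend the lambda grid beyond 0.1 to check whether RMSE keeps improving
ridgeGridWide <- data.frame(.lambda = seq(0, 0.3, length = 31))
ridgeRegFitWide <- train(filtered_train, perm_train,
                         method = "ridge",
                         tuneGrid = ridgeGridWide,
                         trControl = ctrl,
                         preProc = c("center", "scale"))
ridgeRegFitWide$bestTune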
ridge_prediction_model <- predict(ridgeRegFit, newdata=test)
cor(as.numeric(ridge_prediction_model), perm_test) ^ 2
## [1] 0.4275137
R\(^2\) = 0.43 for the test set
Comparing the three models on the test set, PLS has the highest R\(^2\) (0.44, versus 0.43 for ridge and 0.33 for OLS) and the best predictive performance.
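A compact side-by-side view of the three test-set fits (a sketch built from the predictions computed above):
# sketch: test-set RMSE, R^2, and MAE for each model
rbind(PLS   = postResample(as.numeric(prediction_model), perm_test),
      OLS   = postResample(predict_olr, perm_test),
      Ridge = postResample(as.numeric(ridge_prediction_model), perm_test))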
6.3) A chemical manufacturing process for a pharmaceutical product was discussed in Section 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed during the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
#head(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
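The imputation chunk itself is not echoed. Below is a minimal sketch of one way to produce the completedData object used later; the mice package and predictive mean matching are assumptions based on the object name, and caret::preProcess with knnImpute would be another option.
# sketch (assumed approach): check how much is missing, then impute with mice
mean(is.na(ChemicalManufacturingProcess))   # fraction of missing cells
library(mice)
set.seed(100)
imputed <- mice(ChemicalManufacturingProcess, m = 5, method = "pmm", printFlag = FALSE)
completedData <- complete(imputed, 1)       # take one completed data set
sum(is.na(completedData))                   # should be 0 after imputation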
#chem_pred <- completedData[, -nearZeroVar(completedData)]
#ncol(chem_pred)
#head(chem_pred)
train_chem <- completedData[1:130,]
test_chem <- completedData[131:176,]
Pre-processing the Data: Remove highly correlated predictors
## [1] 19
## [1] 39
There are 19 highly correlated predictors that I am removing, which leaves 38 predictors; together with the Yield column, the filtered data frame has 39 columns.
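The filtering chunk is not echoed here either. A sketch of how filtered_train_chem might be built with caret::findCorrelation (the cutoff value is an assumption; whatever value was actually used removed 19 of the 57 predictors):
# sketch (assumed): drop highly correlated predictors, keeping Yield in column 1
high_corr_chem <- findCorrelation(cor(train_chem[, -1]), cutoff = 0.90)
length(high_corr_chem)       # the original analysis flagged 19 predictors
filtered_train_chem <- train_chem[, -(high_corr_chem + 1)]   # +1 offsets the Yield column
ncol(filtered_train_chem)    # 39 columns: Yield plus 38 predictors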
set.seed(100)
# column 1 of filtered_train_chem is Yield; columns 2:39 are the filtered predictors
plsTune_chem <- train(filtered_train_chem[, 2:39], filtered_train_chem[, 1],
                      method = "pls",
                      tuneLength = 20,
                      trControl = ctrl,
                      preProc = c("center", "scale"))
plsTune_chem$bestTune
## ncomp
## 5 5
plsTune_chem$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 1.978451 0.7650070 1.5478702 0.7045911 0.1336206 0.3925377
## 2 2 2.199356 0.7405841 1.4647704 1.2499217 0.2119798 0.5607437
## 3 3 1.885282 0.7937955 1.2880121 1.1407381 0.1675791 0.5543885
## 4 4 1.891212 0.8119017 1.2679752 1.6491992 0.1867405 0.7252319
## 5 5 1.757979 0.8373102 1.1777604 1.6158007 0.1787803 0.6680151
## 6 6 1.998671 0.8159033 1.2059371 1.9342066 0.2004384 0.7469778
## 7 7 2.349187 0.7759785 1.3192842 2.4000964 0.2338215 0.8182041
## 8 8 2.275282 0.7762730 1.2730384 2.3427283 0.2380256 0.7986450
## 9 9 1.999307 0.8242078 1.1508536 2.4031460 0.2180645 0.8069380
## 10 10 1.874186 0.8460245 1.0910656 2.3984525 0.2179124 0.7861001
## 11 11 1.929406 0.8300246 1.0709868 2.3611373 0.2289541 0.7718576
## 12 12 1.901694 0.8274486 1.0348524 2.2983107 0.2382114 0.7699849
## 13 13 1.938567 0.8168646 1.0260835 2.2663003 0.2464884 0.7667808
## 14 14 1.955562 0.8133945 1.0249662 2.3048674 0.2549575 0.7844938
## 15 15 1.888448 0.8216938 0.9956461 2.2321657 0.2506471 0.7722854
## 16 16 1.864541 0.8223586 0.9902119 2.0983564 0.2436425 0.7366781
## 17 17 1.888672 0.8173438 0.9993337 2.0729646 0.2471155 0.7355206
## 18 18 1.867688 0.8185216 0.9948316 2.0352808 0.2463349 0.7270629
## 19 19 1.838288 0.8187710 0.9770530 2.0126545 0.2518560 0.7151173
## 20 20 1.815811 0.8193667 0.9677411 1.9549900 0.2507441 0.6964364
summary(plsTune_chem)
## Data: X dimension: 130 38
## Y dimension: 130 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 15.48 20.49 25.28 30.64 34.47
## .outcome 80.42 88.63 91.45 92.70 93.59
The optimal model uses 5 latent components, chosen for the lowest cross-validated RMSE (1.76). Its cross-validated R\(^2\) is 0.84, meaning the model explains about 84% of the variation in yield during resampling.
prediction_model_chem <- predict(plsTune_chem, newdata=test_chem[,2:58])
cor(as.numeric(prediction_model_chem), test_chem[,1]) ^ 2
## [1] 0.1182948
The test-set R\(^2\) is 0.12, so the model explains only about 12% of the variation in yield on the held-out runs, a substantial drop from the cross-validated estimate of 0.84.
Imp <- varImp(plsTune_chem)
plot(Imp, top = 60)
The four most important predictors are biological predictors. However, the vast majority of the important predictors are manufacturing process predictors.
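The ranking behind the plot can also be inspected numerically (a sketch using the Imp object created above):
# sketch: list the ten largest importance scores
head(Imp$importance[order(-Imp$importance$Overall), , drop = FALSE], 10)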
Yield is negatively correlated with ManufacturingProcess36 and positively correlated with BiologicalMaterial06. The biological predictors cannot be changed, but the manufacturing process settings can. I would recommend minimizing ManufacturingProcess36 to enhance the yield.
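The direction of those relationships can be checked directly from the imputed data (a quick sanity-check sketch; completedData is the imputed data set from above):
# sketch: sign of the raw correlations between yield and the two named predictors
cor(completedData$Yield, completedData$ManufacturingProcess36)
cor(completedData$Yield, completedData$BiologicalMaterial06)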