6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

A

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

B

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)
x <- nearZeroVar(data.frame(fingerprints))
fp <- fingerprints[,-x]

paste("Dimensions of Fingerprints:")
## [1] "Dimensions of Fingerprints:"
dim(fingerprints)
## [1]  165 1107
paste("Dimensions of Fingerprints without low frequencies:")
## [1] "Dimensions of Fingerprints without low frequencies:"
dim(fp)
## [1] 165 388

As you can see from the dimensions, the fingerprint dataset with low frequency predictors filtered has 388 predictors remaining.

C

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

#load PLS package
library(pls,warn.conflicts = FALSE)
#set data as a dataframe
fp <- as.data.frame(fp)
#split the data
set.seed(12)
trainingRows <- createDataPartition(permeability, p=.80, list = FALSE)
fp.train <- fp[trainingRows,]
fp.test <- fp[-trainingRows,]
p.train <- permeability[trainingRows,]
p.test <- permeability[-trainingRows,]

Prior to performing PLS, the predictors should be centered and scaled, especially if the predictors are on scales of differing magnitude.

#PLS Model
set.seed(13)

p.pls <- train(fp.train,
               p.train,
              method = 'pls',
              metric = 'Rsquared',
              tuneLength = 10,
              preProcess = c('center','scale'),
              trControl = trainControl(method = 'repeatedcv',repeats = 5))

plot(p.pls)

p.pls$results
##    ncomp     RMSE  Rsquared       MAE   RMSESD RsquaredSD    MAESD
## 1      1 12.72462 0.3651445 10.047501 2.631606  0.2234657 1.940926
## 2      2 11.26119 0.5079742  8.002020 3.187312  0.2568022 1.992533
## 3      3 11.29178 0.5127242  8.389938 2.848592  0.2341558 1.780225
## 4      4 11.75818 0.4879665  8.843052 2.689239  0.2169731 1.917375
## 5      5 11.68253 0.4996245  8.519151 2.929627  0.2188679 1.947097
## 6      6 11.64158 0.4994828  8.358427 2.899145  0.2209665 1.927925
## 7      7 11.52283 0.5071795  8.362761 2.781527  0.2079151 1.978294
## 8      8 11.65642 0.4988509  8.612291 2.619304  0.2015002 1.907011
## 9      9 11.89904 0.4866130  8.909316 2.784417  0.2140803 1.861240
## 10    10 12.16455 0.4752184  9.094156 2.775450  0.2099491 1.888054

Based on our results, ncomp3 has the lowest RMSE(11.49) and the highest rsquared value (0.502) indicating that it is the strongest model.

D

Predict the response for the test set. What is the test set estimate of R2?

p.pred <- predict(p.pls, newdata=fp.test)
postResample(pred = p.pred, obs = p.test)
##       RMSE   Rsquared        MAE 
## 13.8155969  0.3141046  9.8310535

E

Try building other models discussed in this chapter. Do any have better predictive performance? ####Ridget

ctrl <- trainControl(method = "cv", number = 10)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeRegFit <- train(fp.train,p.train,
                     method = "ridge",
                     ## Fir the model over many penalty values
                     tuneGrid = ridgeGrid,
                     trControl = ctrl,
                     ## put the predictors on the same scale
                     preProc = c("center", "scale"))

ridgeRegFit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.000000000   19.06358  0.3197872  12.238583
##   0.007142857   15.72473  0.3306688  11.653505
##   0.014285714   14.75215  0.3600286  11.071735
##   0.021428571  152.32779  0.3869760  87.526855
##   0.028571429   13.65908  0.4008254  10.368303
##   0.035714286   13.36385  0.4139591  10.161625
##   0.042857143   13.13638  0.4250001   9.995798
##   0.050000000   12.96552  0.4339068   9.865352
##   0.057142857   12.81701  0.4422135   9.750288
##   0.064285714   12.80165  0.4434259   9.724615
##   0.071428571   12.60919  0.4550532   9.573428
##   0.078571429   12.52866  0.4607153   9.503284
##   0.085714286   12.47022  0.4652653   9.454446
##   0.092857143   12.39962  0.4702314   9.398336
##   0.100000000   12.35983  0.4738991   9.363776
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

Lasso

Elasticnet

library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                        .fraction = seq(.05, 1, length = 15))
set.seed(100)
enetTune <- train(fp.train,p.train,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))

enetTune
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction   RMSE      Rsquared   MAE      
##   0.00    0.0500000  10.97653  0.5477488   8.029136
##   0.00    0.1178571  10.76896  0.5495103   7.434702
##   0.00    0.1857143  10.97972  0.5386672   7.708130
##   0.00    0.2535714  11.22744  0.5168881   8.102942
##   0.00    0.3214286  11.53964  0.4929626   8.414226
##   0.00    0.3892857  12.14795  0.4532626   8.952964
##   0.00    0.4571429  12.93934  0.4142961   9.452206
##   0.00    0.5250000  13.39868  0.3947928   9.819215
##   0.00    0.5928571  14.09072  0.3737067  10.167608
##   0.00    0.6607143  14.66681  0.3611112  10.422076
##   0.00    0.7285714  15.32796  0.3529545  10.703084
##   0.00    0.7964286  16.27133  0.3467691  11.068167
##   0.00    0.8642857  17.19271  0.3373228  11.456761
##   0.00    0.9321429  18.12846  0.3275404  11.848944
##   0.00    1.0000000  19.06358  0.3197872  12.238583
##   0.01    0.0500000  10.55477  0.5830915   7.469326
##   0.01    0.1178571  10.82296  0.5546742   7.672218
##   0.01    0.1857143  11.21445  0.5119579   8.390975
##   0.01    0.2535714  11.80771  0.4682625   8.961662
##   0.01    0.3214286  12.30230  0.4393598   9.388717
##   0.01    0.3892857  12.58326  0.4292055   9.634180
##   0.01    0.4571429  12.85809  0.4188562   9.864308
##   0.01    0.5250000  13.06542  0.4109832  10.009855
##   0.01    0.5928571  13.23831  0.4071533  10.134343
##   0.01    0.6607143  13.54514  0.3981232  10.321681
##   0.01    0.7285714  13.92872  0.3855687  10.566047
##   0.01    0.7964286  14.34889  0.3738676  10.833367
##   0.01    0.8642857  14.74801  0.3627486  11.073660
##   0.01    0.9321429  15.05710  0.3536017  11.255867
##   0.01    1.0000000  15.36628  0.3435692  11.425724
##   0.10    0.0500000  11.85352  0.5366824   8.994292
##   0.10    0.1178571  10.51394  0.5814344   7.315997
##   0.10    0.1857143  10.55873  0.5825807   7.279600
##   0.10    0.2535714  10.62470  0.5770193   7.477477
##   0.10    0.3214286  10.75754  0.5637897   7.758130
##   0.10    0.3892857  10.93363  0.5503113   7.981392
##   0.10    0.4571429  11.12765  0.5393982   8.247257
##   0.10    0.5250000  11.34912  0.5248881   8.516625
##   0.10    0.5928571  11.53072  0.5137987   8.701154
##   0.10    0.6607143  11.65658  0.5068645   8.805328
##   0.10    0.7285714  11.80452  0.4987927   8.926092
##   0.10    0.7964286  11.95940  0.4908311   9.052030
##   0.10    0.8642857  12.11341  0.4834499   9.173058
##   0.10    0.9321429  12.24253  0.4782647   9.273371
##   0.10    1.0000000  12.35983  0.4738991   9.363776
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1178571 and lambda = 0.1.

Interesting to find that the lambda of 1 has the strongest Rsquared. According to the booklambda of 0, fits the lasso model. Since the lambda of 0.1 has the best fit, the elastnic net has the best prediction.

6.3

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch: ###A

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

B

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(DataExplorer)
plot_missing(ChemicalManufacturingProcess)

It looks like 3 manufacturing process has over 3% of the values missing.

library("caret")
library("rnn")
PP <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
cmp <- predict(PP, ChemicalManufacturingProcess)
library(DataExplorer)
plot_missing(cmp)

###C Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

y <- ChemicalManufacturingProcess$Yield
cmp <- cmp %>% select(-Yield)

trainingrows <- createDataPartition(y, p=.80, list = FALSE)


#Traing and test rows for complete cases
x.train <- cmp[trainingrows,]
x.test <- cmp[-trainingrows,]
y.train <- y[trainingrows]
y.test <- y[-trainingrows]
#PLS Model
set.seed(13)

p.pls <- train(x.train,
               y.train,
              method = 'pls',
              metric = 'Rsquared',
              tuneLength = 10,
              preProcess = c('center','scale'),
              trControl = trainControl(method = 'repeatedcv',repeats = 5))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
plot(p.pls)

p.pls$results
##    ncomp     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 1      1 1.423043 0.4394708 1.134340 0.2969004  0.1632513 0.2061634
## 2      2 1.801038 0.4841826 1.207533 1.3081317  0.2324248 0.4915311
## 3      3 1.334308 0.5648355 1.048120 0.4926768  0.1709773 0.2637704
## 4      4 1.504981 0.5416215 1.109215 0.8501903  0.2022379 0.3579844
## 5      5 1.849288 0.5100506 1.226166 1.4966719  0.2321061 0.5405090
## 6      6 1.906421 0.5147433 1.229845 1.7317057  0.2433972 0.6177219
## 7      7 2.083584 0.5062454 1.284964 2.0213285  0.2475283 0.7295149
## 8      8 2.252770 0.4982705 1.350616 2.4463451  0.2498341 0.8837983
## 9      9 2.593900 0.4831696 1.455706 3.0048306  0.2569507 1.0569759
## 10    10 2.839170 0.4699322 1.525045 3.3996646  0.2598900 1.1768309
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                        .fraction = seq(.05, 1, length = 15))
ctrl <- trainControl(method = "cv", number = 10)

set.seed(100)
enetTune <- train(x.train,y.train,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))

enetTune
## Elasticnet 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction   RMSE      Rsquared   MAE      
##   0.00    0.0500000  1.194023  0.6012936  0.9635360
##   0.00    0.1178571  1.122666  0.6287469  0.8992816
##   0.00    0.1857143  1.424786  0.5362659  1.0374346
##   0.00    0.2535714  2.253200  0.5023912  1.2875214
##   0.00    0.3214286  2.982601  0.4613981  1.5048995
##   0.00    0.3892857  3.296479  0.4547221  1.6110154
##   0.00    0.4571429  3.293536  0.4583287  1.6272604
##   0.00    0.5250000  3.147376  0.5070916  1.5923946
##   0.00    0.5928571  3.306887  0.4688437  1.6594489
##   0.00    0.6607143  3.631948  0.4195279  1.7760754
##   0.00    0.7285714  3.766038  0.4085364  1.8319052
##   0.00    0.7964286  3.710676  0.4060130  1.8314146
##   0.00    0.8642857  3.728410  0.4055051  1.8494402
##   0.00    0.9321429  3.969112  0.3880442  1.9302811
##   0.00    1.0000000  4.311444  0.3761358  2.0335878
##   0.01    0.0500000  1.460847  0.5842655  1.1983025
##   0.01    0.1178571  1.205917  0.6007919  0.9736601
##   0.01    0.1857143  1.160208  0.6150736  0.9447116
##   0.01    0.2535714  1.174283  0.6084475  0.9520994
##   0.01    0.3214286  1.134544  0.6176079  0.9187663
##   0.01    0.3892857  1.363851  0.5497650  1.0047982
##   0.01    0.4571429  1.717500  0.5392663  1.1057608
##   0.01    0.5250000  2.039045  0.5212217  1.2103792
##   0.01    0.5928571  2.336098  0.5209717  1.3031113
##   0.01    0.6607143  2.682101  0.4870149  1.4203374
##   0.01    0.7285714  2.973478  0.4743905  1.5128186
##   0.01    0.7964286  3.159862  0.4688631  1.5733159
##   0.01    0.8642857  3.277601  0.4640925  1.6143278
##   0.01    0.9321429  3.374909  0.4600189  1.6483681
##   0.01    1.0000000  3.439166  0.4564505  1.6736043
##   0.10    0.0500000  1.599563  0.4888145  1.3127413
##   0.10    0.1178571  1.394000  0.5919333  1.1437660
##   0.10    0.1857143  1.251720  0.5991533  1.0183158
##   0.10    0.2535714  1.181089  0.6027306  0.9547308
##   0.10    0.3214286  1.177239  0.6034753  0.9505441
##   0.10    0.3892857  1.171457  0.6103310  0.9519020
##   0.10    0.4571429  1.163312  0.6113846  0.9492304
##   0.10    0.5250000  1.190275  0.5907541  0.9638741
##   0.10    0.5928571  1.296361  0.5561948  1.0035676
##   0.10    0.6607143  1.421959  0.5451032  1.0487662
##   0.10    0.7285714  1.550349  0.5347742  1.0999093
##   0.10    0.7964286  1.721472  0.5251099  1.1567160
##   0.10    0.8642857  1.854713  0.5186700  1.2001285
##   0.10    0.9321429  1.945102  0.5159651  1.2303014
##   0.10    1.0000000  2.006981  0.5141554  1.2535822
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1178571 and lambda = 0.

The Lasso model has the highest R squared value and the loest RMSE.

D

Predict the response for the test set. What is the value of the performancemetric and how does this compare with the resampled performance metric on the training set?

cmp_pred <- predict(enetTune,x.test)
cmp.eval <- data.frame(obs = y.test, pred = cmp_pred)
defaultSummary(cmp.eval)
##      RMSE  Rsquared       MAE 
## 1.1451611 0.6837186 0.8799826

E

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

important <- varImp(enetTune, scale = FALSE)
plot(important,top=10, scales = list(y = list(cex = 0.8)))

Manufacturing Process32 has the strongest impact on the model, while Manufacturing Process 13 is a close second. The top 3 most important predictores are process predictors.

F

The top 3 most important predictors are examined in the relationship with the response. In the manufacturing process, if we can control these predictors, they can make a bigger impact to yield than the rest of variables. Only concern here is that ManufacturingProcess43 has an outlier, which causes the strong relationship with the response. This outlier should be investigated, and it might affect the relationship with yield.

top <- order(abs(important$importance),decreasing=TRUE)
top3 = rownames(important$importance)[top[c(1:3)]]

featurePlot(x.train[, top3],
            y.train,
            plot = "scatter",
            type = c("g", "p", "smooth"))