6.2a

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug: (a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.4.4
library(knitr)
data(permeability)
dim(permeability)
## [1] 165   1

6.2b

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

ANS

388 Predictors rare left

library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2020d.
## 1.0/zoneinfo/America/New_York'
#str(fingerprints)
low_freq = nearZeroVar(fingerprints)
head(low_freq,20)
##  [1]  7  8  9 10 13 14 17 18 19 22 23 24 30 31 32 33 34 45 77 81
dim(fingerprints)
## [1]  165 1107
fingerprints_mod <- fingerprints[, -low_freq]
dim(fingerprints_mod)
## [1] 165 388

6.2c

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

ANS

Optimal latent variable = 6 R2 = 0.4134156

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
split <- createDataPartition(permeability, p = .8, list = FALSE)
x_train <- fingerprints_mod[split, ]
x_test <- fingerprints_mod[-split, ]
y_train <- permeability[split, ]
y_test <- permeability[-split, ]

plsModel <- train(x = x_train, y = y_train,
                method = "pls",
                metric='Rsquared',
                #tuneGrid = expand.grid(ncomp = 1:20),
                tuneLength = 25,
                 preProcess=c('center', 'scale'),
                trControl = trainControl(method = "cv")
          )

plsModel$results
##    ncomp     RMSE  Rsquared       MAE   RMSESD RsquaredSD    MAESD
## 1      1 13.31831 0.3510139 10.056137 3.048949  0.3080094 2.340990
## 2      2 12.11637 0.4681143  8.683817 3.865193  0.3427076 2.733703
## 3      3 12.27746 0.4287951  9.286901 3.381632  0.3290341 2.399297
## 4      4 12.29808 0.4056076  9.572933 3.373612  0.3158981 2.105239
## 5      5 12.45136 0.4101553  9.560444 3.518779  0.3107306 2.166172
## 6      6 12.28314 0.4176184  9.335474 3.345252  0.2868268 2.274462
## 7      7 12.29332 0.4207393  9.396583 3.363152  0.2871234 2.552239
## 8      8 12.48948 0.4278259  9.634706 3.364080  0.2616385 2.575149
## 9      9 12.43078 0.4382165  9.641899 3.647312  0.2802843 2.815822
## 10    10 12.71323 0.4283256  9.873832 3.609001  0.2715464 2.727247
## 11    11 12.94289 0.4203995  9.772981 3.615176  0.2632383 2.623715
## 12    12 13.02096 0.4153005  9.887467 3.576483  0.2647847 2.739921
## 13    13 13.13881 0.4160452 10.007913 3.911930  0.2796092 2.826050
## 14    14 13.12442 0.4173972 10.028739 3.911703  0.2704674 2.750586
## 15    15 13.18062 0.4136883 10.060093 4.040629  0.2699734 2.971478
## 16    16 13.34378 0.4112354 10.226867 4.089526  0.2651834 2.932879
## 17    17 13.43831 0.4130573 10.254919 4.312595  0.2659651 3.029609
## 18    18 13.37039 0.4156534 10.166224 3.967276  0.2531463 2.790825
## 19    19 13.47148 0.4156267 10.272169 3.948411  0.2529685 2.764084
## 20    20 13.72332 0.3979978 10.334129 3.886329  0.2536122 2.792335
## 21    21 13.97621 0.3827927 10.454817 3.821791  0.2480859 2.709723
## 22    22 14.19810 0.3647156 10.609059 3.463789  0.2346457 2.551320
## 23    23 14.39272 0.3555223 10.654922 3.335432  0.2264955 2.477234
## 24    24 14.49462 0.3536899 10.788655 3.170498  0.2177835 2.286006
## 25    25 14.55363 0.3561616 10.890812 3.256004  0.2160580 2.311935
plot(plsModel)

6.2d

Predict the response for the test set. What is the test set estimate of R2?

# Predict using pls
pls_Predict <- predict(plsModel, newdata=x_test)
# Finding R squared for prediction
postResample(pred=pls_Predict, obs=y_test)
##     RMSE Rsquared      MAE 
## 9.600764 0.566260 7.060612

6.2e

Try building other models discussed in this chapter. Do any have better predictive performance?

Ridge Regression

ridgeModel <- train(x=x_train,
                  y=y_train,
                  method='ridge',
                  metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
## Warning: model fit failed for Fold06: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
ridgeModel
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 118, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     13.69443  0.4174436  10.223431
##   0.1     12.74328  0.4451309   9.698201
##   0.2     12.76773  0.4673060   9.764792
##   0.3     13.02276  0.4770641  10.002382
##   0.4     13.38561  0.4834519  10.336600
##   0.5     13.81463  0.4872222  10.687869
##   0.6     14.30204  0.4898365  11.079783
##   0.7     14.84287  0.4918793  11.528626
##   0.8     15.41394  0.4929513  11.994964
##   0.9     16.02262  0.4937894  12.504361
##   1.0     16.66100  0.4943765  13.054912
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
plot(ridgeModel)

Elasticnet

EnetModel <- train(x=x_train,
                 y=y_train,
                 method='enet',
                 metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.2), 
                                      .lambda = seq(0, 1, by=0.2)),
                 trControl=trainControl(method='cv',number = 10),
                 preProcess=c('center','scale')
                  )
## Warning: model fit failed for Fold01: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x = x_train, y = y_train, method = "enet", metric
## = "Rsquared", : missing values found in aggregated results
EnetModel
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 120, 120, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE       
##   0.0     0.0        15.33200        NaN   12.264531
##   0.0     0.2       181.61093  0.5006206   79.232433
##   0.0     0.4       377.34007  0.4562549  157.821727
##   0.0     0.6       570.25913  0.4281670  233.569898
##   0.0     0.8       765.24697  0.3914445  308.991946
##   0.0     1.0       953.08771  0.3486669  385.634422
##   0.2     0.0        15.61342        NaN   12.519571
##   0.2     0.2        11.48961  0.5133987    8.392898
##   0.2     0.4        11.70541  0.5210781    8.912043
##   0.2     0.6        12.03273  0.5165124    9.195574
##   0.2     0.8        12.31190  0.5083282    9.536378
##   0.2     1.0        12.68389  0.4932131    9.842629
##   0.4     0.0        15.61342        NaN   12.519571
##   0.4     0.2        11.65580  0.5066712    8.246826
##   0.4     0.4        12.08351  0.5240873    9.061063
##   0.4     0.6        12.44097  0.5237892    9.483765
##   0.4     0.8        12.88938  0.5162779   10.008867
##   0.4     1.0        13.30904  0.5046888   10.413628
##   0.6     0.0        15.61342        NaN   12.519571
##   0.6     0.2        11.91737  0.4970200    8.250184
##   0.6     0.4        12.66908  0.5219582    9.420102
##   0.6     0.6        13.23198  0.5204417   10.044822
##   0.6     0.8        13.75836  0.5176946   10.614714
##   0.6     1.0        14.24591  0.5086338   11.120193
##   0.8     0.0        15.61342        NaN   12.519571
##   0.8     0.2        12.21003  0.4908036    8.267310
##   0.8     0.4        13.38426  0.5192736    9.900780
##   0.8     0.6        14.22172  0.5165188   10.775442
##   0.8     0.8        14.85172  0.5169858   11.434564
##   0.8     1.0        15.40918  0.5097863   12.050619
##   1.0     0.0        15.61342        NaN   12.519571
##   1.0     0.2        12.54519  0.4862380    8.338754
##   1.0     0.4        14.23547  0.5157809   10.555123
##   1.0     0.6        15.33993  0.5132102   11.698039
##   1.0     0.8        16.09025  0.5150767   12.482319
##   1.0     1.0        16.71524  0.5098920   13.148758
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.4.
plot(EnetModel)

Lasso Regression

lassoModel <- train(x=x_train,
                  y=y_train,
                  method='lasso',
                  metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(.01, .5, .02)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
## Warning: model fit failed for Fold02: fraction=0.49 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold03: fraction=0.49 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold04: fraction=0.49 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
lassoModel
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 121, 121, 120, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.01      14.45928  0.3924549  11.506603
##   0.03      13.16883  0.4321263  10.358607
##   0.05      12.53464  0.4566478   9.676124
##   0.07      12.33564  0.4484233   9.233382
##   0.09      12.24431  0.4361742   8.943501
##   0.11      12.23273  0.4229047   8.780282
##   0.13      12.28223  0.4090724   8.758722
##   0.15      12.39036  0.3980402   8.790822
##   0.17      12.47679  0.3913770   8.885504
##   0.19      12.56915  0.3868593   9.000616
##   0.21      12.59689  0.3877780   9.057625
##   0.23      12.63076  0.3900039   9.083509
##   0.25      12.67865  0.3922171   9.117568
##   0.27      12.73338  0.3940052   9.164527
##   0.29      12.82377  0.3938140   9.254206
##   0.31      12.92441  0.3934767   9.348935
##   0.33      13.02625  0.3934781   9.443282
##   0.35      13.13278  0.3932617   9.556290
##   0.37      13.23222  0.3932985   9.665364
##   0.39      13.31589  0.3929378   9.728507
##   0.41      13.40832  0.3922120   9.785678
##   0.43      13.49990  0.3909769   9.851165
##   0.45      13.59580  0.3895129   9.914293
##   0.47      13.68749  0.3874632   9.967952
##   0.49      13.77272  0.3857790  10.007907
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
plot(lassoModel)

6.2f

Would you recommend any of your models to replace the permeability laboratory experiment?

Ans:

The PLS model seems to perform the best with R2 0f 0.62. If I had to recommend a model it would be PLS, however an R2 of .62 is not very encouraging and would continue to look for additional models that would produce an r2 of at least .70

# The PLS model seems to perform the best with R2 0f 0.62. If I had to recommend a model it would be PLS, however an R2 of .62 is not very encouring and would continue to look for additional models that would produce an r2 of at least .70

6.3a

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch: (a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturing) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176  58

6.3b

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

# remove near zero variables
chem_low_freq <- nearZeroVar(ChemicalManufacturingProcess)
chemManProcess_mod <- ChemicalManufacturingProcess[, -chem_low_freq]
dim(chemManProcess_mod)
## [1] 176  57
# Checking for missing values and impute.
#impute using knn
table(is.na(chemManProcess_mod))
## 
## FALSE  TRUE 
##  9926   106
chemManProcess_mod_Impute <- preProcess(chemManProcess_mod[,-c(1)], method=c('knnImpute'))
ChemManProcess_Final <- predict(chemManProcess_mod_Impute, ChemicalManufacturingProcess[,-c(1)])

6.3c

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

cmp_split <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
x_train <- ChemManProcess_Final[cmp_split, ]
y_train <- ChemicalManufacturingProcess$Yield[cmp_split]
x_test <- ChemManProcess_Final[-cmp_split, ]
y_test <- ChemicalManufacturingProcess$Yield[-cmp_split]

6.3d

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

EnetModel <- train(x=x_train,
                 y=y_train,
                 method='enet',
                 metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.2), 
                                      .lambda = seq(0, 1, by=0.2)),
                 trControl=trainControl(method='cv',number = 10),
                 preProcess=c('center','scale')
                  )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x = x_train, y = y_train, method = "enet", metric
## = "Rsquared", : missing values found in aggregated results
EnetModel
## Elasticnet 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 131, 130, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.0     0.0       1.807180        NaN  1.4855211
##   0.0     0.2       1.963422  0.4877692  1.1891600
##   0.0     0.4       3.491484  0.4408354  1.6554474
##   0.0     0.6       3.462514  0.4253402  1.6987124
##   0.0     0.8       3.039518  0.4125657  1.6204022
##   0.0     1.0       2.462089  0.4047948  1.4957632
##   0.2     0.0       1.807180        NaN  1.4855211
##   0.2     0.2       1.277240  0.5836335  1.0590494
##   0.2     0.4       1.157380  0.5926030  0.9463185
##   0.2     0.6       1.493290  0.5701047  1.0482312
##   0.2     0.8       1.762857  0.5330229  1.1453829
##   0.2     1.0       1.942509  0.4966184  1.2132904
##   0.4     0.0       1.807180        NaN  1.4855211
##   0.4     0.2       1.293645  0.5811742  1.0738433
##   0.4     0.4       1.172937  0.5857248  0.9485363
##   0.4     0.6       1.458645  0.5648819  1.0539227
##   0.4     0.8       1.729016  0.5383721  1.1513518
##   0.4     1.0       1.849290  0.5009173  1.2135353
##   0.6     0.0       1.807180        NaN  1.4855211
##   0.6     0.2       1.289571  0.5801150  1.0697302
##   0.6     0.4       1.186378  0.5832192  0.9538751
##   0.6     0.6       1.435633  0.5636273  1.0673757
##   0.6     0.8       1.684293  0.5400377  1.1698086
##   0.6     1.0       1.861634  0.5052551  1.2605026
##   0.8     0.0       1.807180        NaN  1.4855211
##   0.8     0.2       1.282603  0.5782920  1.0632733
##   0.8     0.4       1.198524  0.5825707  0.9604851
##   0.8     0.6       1.434967  0.5632098  1.0911774
##   0.8     0.8       1.692416  0.5420618  1.2197898
##   0.8     1.0       1.917090  0.5121017  1.3260410
##   1.0     0.0       1.807180        NaN  1.4855211
##   1.0     0.2       1.274588  0.5765192  1.0552914
##   1.0     0.4       1.213783  0.5810262  0.9682351
##   1.0     0.6       1.460624  0.5630218  1.1305573
##   1.0     0.8       1.728334  0.5483696  1.2795496
##   1.0     1.0       2.002102  0.5219743  1.4040937
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.2.
plot(EnetModel)

# Predict using pls
Enet_Predict <- predict(EnetModel, newdata=x_test)
# Finding R squared for prediction
postResample(pred=Enet_Predict, obs=y_test)
##      RMSE  Rsquared       MAE 
## 1.1208794 0.6463099 0.9020061

6.3e

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Ans:

The process predictors ManfaturingProcess32, BiologicalMaterial06,ManfaturingProcess13 are on top of this list.

EnetImp <- varImp(EnetModel, scale = FALSE)
plot(EnetImp, top=20, scales = list(y = list(cex = 0.8)))

6.3f

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Ans:

These top 3 predictors are showing very good relationship to the response. If you know your top predictors, you can continue to develop the process and upgrade the material to produce better yields.

viporder <- order(abs(EnetImp$importance),decreasing=TRUE)
topPred = rownames(EnetImp$importance)[viporder[c(1:3)]]

featurePlot(x_train[, topPred],
            y_train,
            plot = "scatter",
            between = list(x = 1, y = 1),
            type = c("g", "p", "smooth"),
            layout = c(3,1),
            labels = rep("", 2))