Question 6.2


Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:


a.

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)  # provides the permeability data
library(caret)                      # nearZeroVar(), createDataPartition(), train(), postResample()
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
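As a quick sanity check (my own addition, not part of the exercise), the stated dimensions can be confirmed before modeling:

dim(fingerprints)   # expect 165 rows (compounds) and 1107 columns (fingerprints)
dim(permeability)   # expect 165 rows and 1 column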

b.

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

# Flag near-zero-variance fingerprints and drop them
remove <- nearZeroVar(fingerprints)
fingerprints_remove <- fingerprints[, -remove]
1107 - length(remove)
## [1] 388

388 predictors are left for modeling.
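To see why a column gets flagged, nearZeroVar can return its screening metrics instead of indices; a short sketch using caret's saveMetrics argument (with the default frequency-ratio and unique-value cutoffs):

# Inspect the metrics behind the near-zero-variance filter
nzvMetrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzvMetrics)       # freqRatio, percentUnique, zeroVar, nzv for each predictor
sum(nzvMetrics$nzv)    # should equal length(remove), i.e. 1107 - 388 = 719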

c.

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\) ?

set.seed(1)
trainRow <- createDataPartition(permeability, p = 0.8, list = FALSE)  # stratified 80/20 split
fingerprints.train <- fingerprints_remove[trainRow, ]
perm.train <- permeability[trainRow, ]
fingerprints.test <- fingerprints_remove[-trainRow, ]
perm.test <- permeability[-trainRow, ]

ctrl <- trainControl(method = "cv", number = 10)
set.seed(1)
plsTune <- train(x = fingerprints.train,
                 y = perm.train,
                 method = "pls",
                 metric = "Rsquared",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProcess = c("center", "scale"))

plsTune
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.66375  0.3461521  9.603300
##    2     11.66557  0.4884962  8.364131
##    3     11.86090  0.4462082  9.081715
##    4     12.15438  0.4254096  9.389547
##    5     12.00549  0.4492658  9.076973
##    6     11.70875  0.4618761  8.826006
##    7     11.42799  0.4775314  8.747634
##    8     11.25481  0.4925856  8.533191
##    9     11.03884  0.5066896  8.344461
##   10     10.88506  0.5187689  8.028416
##   11     10.80216  0.5271005  8.012505
##   12     10.81509  0.5247495  8.034195
##   13     10.74326  0.5245241  8.003891
##   14     10.70988  0.5238880  7.892343
##   15     10.87078  0.5198265  8.032440
##   16     11.25383  0.4999003  8.375209
##   17     11.52949  0.4886010  8.619614
##   18     11.46969  0.4907743  8.541789
##   19     11.46414  0.4906053  8.591222
##   20     11.38025  0.5003197  8.484932
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 11.
plot(plsTune)

The optimal number of latent variables is 11, and the corresponding resampled \(R^2\) is 0.527.
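If a more parsimonious fit were preferred, caret can instead pick the smallest ncomp whose resampled performance is within one standard error of the best; a variant sketch (not used for the results above):

# Variant: prefer the simplest ncomp within one SE of the best
ctrlOneSE <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
# then pass trControl = ctrlOneSE to the train() call above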

d.

Predict the response for the test set. What is the test set estimate of \(R^2\) ?

pre_pls <- predict(plsTune, fingerprints.test)

plsEval <- postResample(pred = pre_pls, obs = perm.test)  # postResample expects predictions first
plsEval
##       RMSE   Rsquared        MAE 
## 15.4341377  0.2899935 11.6639869

The test set estimate of \(R^2\) is 0.29, noticeably lower than the resampled estimate of 0.53.
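A predicted-versus-observed plot (a quick diagnostic of my own, not part of the original output) makes that drop easy to see:

# Observed vs. predicted permeability on the test set
plot(pre_pls, perm.test,
     xlab = "Predicted permeability", ylab = "Observed permeability")
abline(0, 1, lty = 2)  # points on the dashed line would be perfect predictions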

e.

Try building other models discussed in this chapter. Do any have better predictive performance?

# Ridge regression: an L2 penalty that shrinks coefficients toward zero
ridgeGrid <- data.frame(.lambda = seq(0, 1, by = 0.1))
set.seed(100)
ridgeTune <- train(x = fingerprints.train,
                   y = perm.train,
                   method = "ridge",
                   tuneGrid = ridgeGrid,
                   trControl = ctrl,
                   preProcess = c("center", "scale"))
## Warning: model fit failed for Fold02: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
ridgeTune 
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     16.50084  0.4546017  11.573579
##   0.1     10.05743  0.6326680   7.480219
##   0.2     10.12322  0.6341198   7.650604
##   0.3     10.42128  0.6294136   7.981911
##   0.4     10.84725  0.6232755   8.411012
##   0.5     11.36024  0.6173390   8.905333
##   0.6     11.94311  0.6119499   9.425962
##   0.7     12.57245  0.6063708   9.963344
##   0.8     13.24342  0.6017165  10.568040
##   0.9     13.95206  0.5968604  11.188737
##   1.0     14.68431  0.5926265  11.805196
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
plot(ridgeTune)

pre_ridge <- predict(ridgeTune, fingerprints.test)
ridgeEval <- postResample(pred = pre_ridge, obs = perm.test)
ridgeEval
##       RMSE   Rsquared        MAE 
## 15.0822363  0.3047575 11.4460367
# Lasso: an L1 penalty that can zero out uninformative fingerprints

lassoGrid <- data.frame(.fraction = seq(0, 1, length = 15))

set.seed(100)
lassoTune <- train(x = fingerprints.train,
                   y = perm.train,
                   method = "lasso",
                   metric = "Rsquared",
                   tuneGrid = lassoGrid,
                   trControl = ctrl,
                   preProcess = c("center", "scale"))
## Warning: model fit failed for Fold02: fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x = fingerprints.train, y = perm.train, method =
## "lasso", : missing values found in aggregated results
lassoTune
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   fraction    RMSE      Rsquared   MAE      
##   0.00000000  15.21986        NaN  12.147734
##   0.07142857  11.92336  0.5235776   8.952903
##   0.14285714  11.59693  0.5017420   8.344428
##   0.21428571  11.75040  0.4782170   8.216624
##   0.28571429  11.98869  0.4845000   8.396859
##   0.35714286  12.58079  0.4779456   8.808119
##   0.42857143  13.05193  0.4791460   9.029293
##   0.50000000  13.29917  0.4877874   9.157739
##   0.57142857  13.55496  0.4908656   9.370046
##   0.64285714  13.92496  0.4893982   9.711711
##   0.71428571  14.44127  0.4824490  10.092412
##   0.78571429  14.95779  0.4772738  10.527560
##   0.85714286  15.37833  0.4724820  10.874131
##   0.92857143  15.84402  0.4637750  11.184983
##   1.00000000  16.50084  0.4546017  11.573579
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.07142857.
plot(lassoTune)

pre_lasso <- predict(lassoTune, fingerprints.test)
lassoEval <- postResample(pred = pre_lasso, obs = perm.test)
lassoEval
##       RMSE   Rsquared        MAE 
## 11.5653430  0.4652193  8.4497749
# Elastic net: blends the ridge (lambda) and lasso (fraction) penalties

enetGrid <- expand.grid(.lambda = c(0, 0.01, 0.1),
                        .fraction = seq(0.05, 1, length = 20))
set.seed(100)
enetTune <- train(x = fingerprints.train,
                  y = perm.train,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProcess = c("center", "scale"))
## Warning: model fit failed for Fold02: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
enetTune
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.05      12.024841  0.5513574   9.186678
##   0.00    0.10      11.827952  0.5084157   8.735324
##   0.00    0.15      11.589189  0.4999243   8.305935
##   0.00    0.20      11.675716  0.4804038   8.186218
##   0.00    0.25      11.800389  0.4846398   8.193339
##   0.00    0.30      12.117972  0.4823334   8.504219
##   0.00    0.35      12.533039  0.4777591   8.782175
##   0.00    0.40      12.879531  0.4780507   8.951166
##   0.00    0.45      13.153864  0.4814678   9.065555
##   0.00    0.50      13.299166  0.4877874   9.157739
##   0.00    0.55      13.460915  0.4899097   9.286841
##   0.00    0.60      13.702404  0.4905555   9.507235
##   0.00    0.65      13.973514  0.4886576   9.749367
##   0.00    0.70      14.326698  0.4838837  10.008189
##   0.00    0.75      14.715544  0.4791799  10.322219
##   0.00    0.80      15.048529  0.4762092  10.599880
##   0.00    0.85      15.333798  0.4731119  10.839859
##   0.00    0.90      15.653917  0.4675183  11.067612
##   0.00    0.95      16.014838  0.4605197  11.292103
##   0.00    1.00      16.500836  0.4546017  11.573579
##   0.01    0.05      11.609580  0.4968994   8.291314
##   0.01    0.10      11.532781  0.4901951   8.086809
##   0.01    0.15      11.057345  0.5174952   7.831186
##   0.01    0.20      10.504522  0.5550969   7.531925
##   0.01    0.25      10.308664  0.5719308   7.448272
##   0.01    0.30      10.381179  0.5709778   7.484887
##   0.01    0.35      10.500756  0.5656639   7.529464
##   0.01    0.40      10.595755  0.5651465   7.571646
##   0.01    0.45      10.542463  0.5751054   7.546266
##   0.01    0.50      10.485403  0.5852606   7.519345
##   0.01    0.55      10.474674  0.5910366   7.501623
##   0.01    0.60      10.514679  0.5932466   7.536307
##   0.01    0.65      10.568773  0.5932853   7.606838
##   0.01    0.70      10.669847  0.5877317   7.669878
##   0.01    0.75      10.797136  0.5783334   7.738966
##   0.01    0.80      10.926647  0.5669895   7.825939
##   0.01    0.85      11.045931  0.5570111   7.903453
##   0.01    0.90      11.136319  0.5496761   7.975449
##   0.01    0.95      11.194900  0.5441324   8.045975
##   0.01    1.00      11.219003  0.5412863   8.096688
##   0.10    0.05      12.013457  0.5085162   9.005731
##   0.10    0.10      11.503435  0.5011261   7.976228
##   0.10    0.15      11.549143  0.5003183   7.984338
##   0.10    0.20      11.332394  0.5104083   7.951524
##   0.10    0.25      10.944072  0.5364371   7.709298
##   0.10    0.30      10.569466  0.5609265   7.501385
##   0.10    0.35      10.390329  0.5744176   7.370364
##   0.10    0.40      10.259650  0.5860805   7.294416
##   0.10    0.45      10.159169  0.5952657   7.245888
##   0.10    0.50      10.092269  0.6019772   7.236091
##   0.10    0.55      10.057030  0.6074831   7.253177
##   0.10    0.60       9.999928  0.6163368   7.245178
##   0.10    0.65       9.968883  0.6220938   7.278809
##   0.10    0.70       9.950607  0.6258279   7.301450
##   0.10    0.75       9.966800  0.6274508   7.338499
##   0.10    0.80       9.978597  0.6295832   7.377226
##   0.10    0.85      10.003690  0.6305064   7.410184
##   0.10    0.90      10.024686  0.6313960   7.437714
##   0.10    0.95      10.043811  0.6319760   7.465305
##   0.10    1.00      10.057426  0.6326680   7.480219
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.7 and lambda = 0.1.
plot(enetTune)

pre_enet <- predict(enetTune, fingerprints.test)
enetEval <- postResample(pred = pre_enet, obs = perm.test)
enetEval
##       RMSE   Rsquared        MAE 
## 14.1795387  0.3415212 10.7570758

f.

Would you recommend any of your models to replace the permeability laboratory experiment?

eval <- rbind(plsEval, ridgeEval, lassoEval, enetEval)
eval
##               RMSE  Rsquared       MAE
## plsEval   15.43414 0.2899935 11.663987
## ridgeEval 15.08224 0.3047575 11.446037
## lassoEval 11.56534 0.4652193  8.449775
## enetEval  14.17954 0.3415212 10.757076

I would recommend the lasso model because it has the lowest test-set RMSE (11.57) and the highest \(R^2\) (0.465). Even so, an \(R^2\) below 0.5 means the model explains less than half of the variation in permeability, so it is better suited to cheaply screening compounds before laboratory testing than to replacing the permeability experiment outright.
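For a comparison that does not rest on a single 20% holdout, caret's resamples() can pool the cross-validation results; a sketch (note that the models above were tuned with different seeds, so the folds are not perfectly aligned and this comparison is only approximate):

# Compare resampled performance across the four fits
resamps <- resamples(list(PLS = plsTune, Ridge = ridgeTune,
                          Lasso = lassoTune, ENet = enetTune))
summary(resamps)
bwplot(resamps, metric = "Rsquared")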

Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

a.

Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

b.

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

I use kNN() from the VIM package to impute the missing values, and missmap() from the Amelia package to confirm that none remain.

library(VIM)     # kNN() imputation
library(dplyr)   # select()
library(Amelia)  # missmap()

chemical <- data.frame(ChemicalManufacturingProcess)
chemical <- kNN(chemical) %>%
         select(Yield:ManufacturingProcess45)  # drop the *_imp indicator columns kNN() appends
missmap(chemical)
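Before imputing, it is worth quantifying how small that percentage actually is (a quick check of my own, not in the original):

# Count the missing cells in the raw data
sum(is.na(ChemicalManufacturingProcess))                    # total missing cells
round(100 * mean(is.na(ChemicalManufacturingProcess)), 2)   # as a percentage of all cells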

c.

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(1)  # make the split reproducible
indx <- createDataPartition(chemical$Yield, p = 0.8, list = FALSE)
chemical_train <- chemical[indx, ]
chemical_test <- chemical[-indx, ]


# PLS model
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
chemicalTune <- train(x = chemical_train[, -1],  # drop Yield (column 1): the response must not sit among the predictors
                y = chemical_train$Yield,
                method = "pls",
                metric = "Rsquared",
                tuneLength = 20,
                trControl = ctrl,
                preProcess = c("center", "scale"))

chemicalTune
## Partial Least Squares 
## 
## 144 samples
##  48 predictor
## 
## Pre-processing: centered (48), scaled (48) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE        Rsquared   MAE       
##    1     1.25622540  0.5348006  0.98058481
##    2     1.18615347  0.6744680  0.82941758
##    3     0.77148156  0.8038311  0.62203254
##    4     0.90005248  0.7708044  0.57929060
##    5     0.81900787  0.8152887  0.46748508
##    6     0.55617593  0.8973282  0.32065388
##    7     0.47166948  0.9131598  0.25032239
##    8     0.48847741  0.9023947  0.21542826
##    9     0.44468505  0.9112637  0.18943482
##   10     0.35339471  0.9283237  0.14801182
##   11     0.30734138  0.9346357  0.12095786
##   12     0.26438977  0.9413105  0.09968859
##   13     0.23621004  0.9467387  0.08703166
##   14     0.20887379  0.9542907  0.07653071
##   15     0.19151855  0.9605248  0.06853747
##   16     0.14451545  0.9763464  0.05152206
##   17     0.09807242  0.9893936  0.03658180
##   18     0.06748535  0.9952778  0.02711116
##   19     0.04198628  0.9981986  0.01882880
##   20     0.02059588  0.9995890  0.01092757
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 20.
plot(chemicalTune)

I chose a PLS model. The results shown select ncomp = 20 with a resampled \(R^2\) of 0.9996. An \(R^2\) that close to 1, still improving at the largest ncomp tried, is a red flag rather than good news: the importance table in part (e), where Yield itself tops the list, shows that the fit that produced these numbers had the response among its predictors. With Yield correctly excluded from the predictor matrix, expect a much smaller optimal ncomp and a resampled \(R^2\) well below 1.
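A cheap guard against this failure mode (my own suggestion, not part of the exercise) is to assert that the response is absent from the predictor matrix before calling train():

# Fail fast if the response sneaks into the predictors
predictorsTrain <- chemical_train[, -1]            # everything except Yield
stopifnot(!"Yield" %in% colnames(predictorsTrain))
ncol(predictorsTrain)                              # expect 57 predictor columns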

d.

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

predictChemical <- predict(chemicalTune, chemical_test)
result <- postResample(pred = predictChemical, obs = chemical_test$Yield)
result
##       RMSE   Rsquared        MAE 
## 0.00964045 0.99997899 0.00758506

The test-set \(R^2\) (0.99998) is essentially as high as the resampled estimate, which again points to the response leaking into the predictors rather than a genuinely near-perfect model.

e.

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

I use the varImp() function to rank the predictors used in the model.

valuePre <- varImp(chemicalTune)
valuePre
## pls variable importance
## 
##   only 20 most important variables shown (out of 48)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   45.66
## ManufacturingProcess36   39.11
## ManufacturingProcess13   36.53
## ManufacturingProcess09   34.60
## ManufacturingProcess17   34.20
## BiologicalMaterial02     31.17
## BiologicalMaterial06     29.81
## BiologicalMaterial08     27.49
## ManufacturingProcess06   25.99
## BiologicalMaterial01     25.76
## ManufacturingProcess12   25.69
## BiologicalMaterial03     25.29
## BiologicalMaterial12     25.29
## BiologicalMaterial04     24.79
## BiologicalMaterial11     24.62
## ManufacturingProcess11   24.36
## ManufacturingProcess04   23.67
## ManufacturingProcess02   17.88
## BiologicalMaterial10     16.82
plot(valuePre)

Yield is the response and should not appear in an importance table at all; its rank-one position confirms the leakage flagged in part (c). Setting it aside, the most important predictor is ManufacturingProcess32, and process predictors fill the top five positions, so the process measurements dominate the list.
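To quantify "dominate" rather than eyeball it (a hypothetical helper, not in the original), the importance scores can be tallied by predictor type:

# Count biological vs. process predictors among the 20 highest scores
imp <- varImp(chemicalTune)$importance
imp <- imp[rownames(imp) != "Yield", , drop = FALSE]   # ignore the leaked response
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Biological", top20), "Biological", "Process"))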

f.

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

chemical %>%
       select(Yield, ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess36,
              ManufacturingProcess09, ManufacturingProcess17) %>%
       cor()
##                             Yield ManufacturingProcess32
## Yield                   1.0000000             0.60833215
## ManufacturingProcess32  0.6083321             1.00000000
## ManufacturingProcess13 -0.5036797            -0.10120679
## ManufacturingProcess36 -0.5257340            -0.78800943
## ManufacturingProcess09  0.5034705             0.04100301
## ManufacturingProcess17 -0.4258069             0.01604178
##                        ManufacturingProcess13 ManufacturingProcess36
## Yield                              -0.5036797            -0.52573401
## ManufacturingProcess32             -0.1012068            -0.78800943
## ManufacturingProcess13              1.0000000             0.09558950
## ManufacturingProcess36              0.0955895             1.00000000
## ManufacturingProcess09             -0.7913537            -0.05005281
## ManufacturingProcess17              0.7824135            -0.01226725
##                        ManufacturingProcess09 ManufacturingProcess17
## Yield                              0.50347051            -0.42580687
## ManufacturingProcess32             0.04100301             0.01604178
## ManufacturingProcess13            -0.79135366             0.78241345
## ManufacturingProcess36            -0.05005281            -0.01226725
## ManufacturingProcess09             1.00000000            -0.71545604
## ManufacturingProcess17            -0.71545604             1.00000000

The correlation table for the five most important process predictors shows that ManufacturingProcess32 and ManufacturingProcess09 are positively correlated with Yield (0.61 and 0.50), while ManufacturingProcess13, 36, and 17 are negatively correlated (-0.50, -0.53, and -0.43). Because process predictors can be adjusted, future runs could raise the settings that correlate positively with yield and lower those that correlate negatively. The predictors are also strongly correlated with one another, e.g. -0.79 between ManufacturingProcess13 and ManufacturingProcess09, so PCA could be used to combine the highly correlated variables before modeling.
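To check whether these relationships look roughly linear, caret's featurePlot() gives a quick panel of scatterplots (a visual follow-up of my own, not part of the original output):

# Scatterplots of each top process predictor against Yield
topVars <- c("ManufacturingProcess32", "ManufacturingProcess13", "ManufacturingProcess36",
             "ManufacturingProcess09", "ManufacturingProcess17")
featurePlot(x = chemical[, topVars], y = chemical$Yield,
            plot = "scatter", layout = c(3, 2))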