Exercise 6.2

Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
#install.packages('AppliedPredictiveModeling')
library(AppliedPredictiveModeling)
data(permeability)
  2. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

After applying the nearZeroVar function, 388 of the original 1107 fingerprint predictors remain for modeling.

dim(fingerprints)
## [1]  165 1107
library(caret)  # provides nearZeroVar
near_zero = nearZeroVar(fingerprints)
high_freq = fingerprints[, -near_zero]
dim(high_freq)
## [1] 165 388
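
To make the filtering rule concrete, here is a minimal base-R sketch of the check that nearZeroVar applies with its default cutoffs (freqCut = 95/5, uniqueCut = 10). The helper `is_near_zero_var` and the toy matrix are hypothetical and exist only for illustration; the real function handles more edge cases.

```r
# Approximate version of the nearZeroVar rule with default cutoffs.
# `is_near_zero_var` is a hypothetical helper, not part of caret.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)            # constant column: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]         # most common vs. second most common value
  pct_unique <- 100 * length(counts) / length(x)  # distinct values as a % of samples
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Synthetic fingerprint matrix: 100 molecules, 3 substructures
toy <- cbind(
  rare     = c(rep(0, 98), rep(1, 2)),  # ratio 49, 2% unique: flagged
  balanced = rep(c(0, 1), each = 50),   # ratio 1: kept
  constant = rep(0, 100)                # zero variance: flagged
)
flagged <- apply(toy, 2, is_near_zero_var)
print(flagged)
```

A substructure present in only 2 of 100 molecules is flagged, while one present in half of them is kept, which is the behavior the exercise relies on.
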
  3. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R-squared?

PLS was tuned over 1 to 20 components with 10-fold cross-validation on the 133 training samples. RMSE was used to select the optimal model: the smallest resampled RMSE (11.43) occurs at ncomp = 9, which is also where the resampled R-squared peaks (0.51).

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 119, 121, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.71638  0.3726546   9.803750
##    2     11.43482  0.4812433   8.287554
##    3     11.56869  0.4775294   8.635061
##    4     11.82234  0.4622131   9.000811
##    5     11.96746  0.4722405   9.004607
##    6     11.80342  0.4763650   8.712800
##    7     11.75711  0.4782825   8.794521
##    8     11.57750  0.4943179   8.741823
##    9     11.42685  0.5092455   8.573922
##   10     11.65230  0.5024878   8.601296
##   11     12.19187  0.4755751   8.866223
##   12     12.24185  0.4766699   8.927797
##   13     12.56914  0.4611196   9.122023
##   14     12.87329  0.4497174   9.246870
##   15     13.26387  0.4322739   9.658887
##   16     13.21115  0.4341617   9.699480
##   17     13.45490  0.4270706   9.834210
##   18     13.65659  0.4172002  10.008847
##   19     13.68960  0.4115089  10.210977
##   20     13.50473  0.4179000  10.122616
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.
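
The code behind this output is not shown. A minimal sketch of how such a fit could be produced with caret follows. The helper `tune_pls`, the 80/20 split, and the seed are assumptions, chosen only because they are roughly consistent with the 133 training samples, the centering and scaling, and the 10-fold cross-validation reported above.

```r
# Hypothetical helper: split, pre-process, and tune PLS with caret.
# Requires the caret and pls packages to be installed before it is called.
tune_pls <- function(x, y, p = 0.8, max_comp = 20, seed = 1) {
  y <- as.vector(y)
  set.seed(seed)  # arbitrary seed; the original seed is unknown
  in_train <- caret::createDataPartition(y, p = p, list = FALSE)[, 1]
  fit <- caret::train(
    x = x[in_train, , drop = FALSE],
    y = y[in_train],
    method = "pls",
    preProcess = c("center", "scale"),
    tuneGrid = data.frame(ncomp = seq_len(max_comp)),
    trControl = caret::trainControl(method = "cv", number = 10)
  )
  list(fit = fit, in_train = in_train)
}

# Usage, assuming high_freq and permeability from the steps above:
# res <- tune_pls(high_freq, permeability)
# res$fit$bestTune   # number of components with the smallest resampled RMSE
```

Because the split and the folds are random, a different seed will generally select a different number of components and give somewhat different resampled values.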

  4. Predict the response for the test set. What is the test set estimate of R-squared?

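The test set R-squared (0.80) is higher than the resampled estimate from the training set (0.51). With only 32 samples held out for testing, the test set estimate is itself quite variable.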

##      RMSE  Rsquared       MAE 
## 6.9841682 0.7979941 5.1395247
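
The three values above have the form returned by caret's postResample. As a rough illustration of what each one measures, the sketch below computes them by hand on synthetic numbers (not the permeability data). Note that the R-squared here is the squared correlation between predictions and observations.

```r
# Synthetic observed and predicted values, for illustration only
obs  <- c(1, 2, 3, 4, 5)
pred <- c(1.5, 1.5, 3.5, 3.5, 5.5)

rmse      <- sqrt(mean((pred - obs)^2))  # root mean squared error
r_squared <- cor(pred, obs)^2            # squared correlation
mae       <- mean(abs(pred - obs))       # mean absolute error

print(c(RMSE = rmse, Rsquared = r_squared, MAE = mae))
```

With the real model, the same quantities would come from comparing the test set predictions against the held-out permeability values.
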
  5. Try building other models discussed in this chapter. Do any have better predictive performance?

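PLS gave the lowest resampled RMSE (11.43), ahead of elastic net (11.70), ridge regression (12.04), and ordinary linear regression (31.07). Ridge regression had a marginally higher resampled R-squared (0.52 versus 0.51 for PLS), but overall PLS performed best.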

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Linear Regression 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 120, 120, 119, 119, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   31.07298  0.1988006  20.87865
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Warning: model fit failed for Fold08: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 119, 121, 119, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE      
##   0.000000000  14.73314  0.4598381  11.137947
##   0.007142857  14.57374  0.3832517  11.033327
##   0.014285714  13.59025  0.4288226  10.348159
##   0.021428571  13.18649  0.4493503  10.074866
##   0.028571429  12.94008  0.4634247   9.896530
##   0.035714286  12.69788  0.4770145   9.725875
##   0.042857143  12.56784  0.4854845   9.629233
##   0.050000000  12.45986  0.4920169   9.549563
##   0.057142857  12.39794  0.4976161   9.488381
##   0.064285714  12.31467  0.5031968   9.409630
##   0.071428571  12.23896  0.5082542   9.345222
##   0.078571429  12.17124  0.5133017   9.281102
##   0.085714286  12.11938  0.5170257   9.235251
##   0.092857143  12.09821  0.5195938   9.210768
##   0.100000000  12.04437  0.5232864   9.159174
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
## Warning: model fit failed for Fold01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold02: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold06: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 121, 120, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      12.05500  0.4972413   8.469496
##   0.00    0.10      11.99768  0.4681608   8.400393
##   0.00    0.15      12.21350  0.4524943   8.778776
##   0.00    0.20      12.39718  0.4533848   9.096279
##   0.00    0.25      12.50963  0.4569811   9.241537
##   0.00    0.30      12.59068  0.4630397   9.345331
##   0.00    0.35      12.73778  0.4641009   9.531024
##   0.00    0.40      12.89001  0.4629444   9.720604
##   0.00    0.45      13.00617  0.4627482   9.923570
##   0.00    0.50      13.31246  0.4490701  10.186072
##   0.00    0.55      13.64578  0.4383642  10.483213
##   0.00    0.60      13.95987  0.4326433  10.766972
##   0.00    0.65      14.33329  0.4273412  11.073196
##   0.00    0.70      14.70119  0.4224003  11.342834
##   0.00    0.75      15.02981  0.4181737  11.556056
##   0.00    0.80      15.34786  0.4135673  11.769453
##   0.00    0.85      15.61921  0.4092783  11.982167
##   0.00    0.90      15.98723  0.4046158  12.239297
##   0.00    0.95      16.21398  0.4012868  12.397839
##   0.00    1.00      16.34694  0.3994760  12.490594
##   0.01    0.05      12.00311  0.4432811   8.541659
##   0.01    0.10      11.73877  0.4577646   8.442520
##   0.01    0.15      12.01307  0.4527905   8.702248
##   0.01    0.20      12.20135  0.4478607   8.840800
##   0.01    0.25      12.19961  0.4498527   8.864502
##   0.01    0.30      12.25606  0.4497001   8.928818
##   0.01    0.35      12.29561  0.4519955   8.961331
##   0.01    0.40      12.37765  0.4535000   9.025607
##   0.01    0.45      12.46253  0.4529305   9.094534
##   0.01    0.50      12.52663  0.4538468   9.202575
##   0.01    0.55      12.61490  0.4533859   9.305514
##   0.01    0.60      12.71660  0.4519459   9.409789
##   0.01    0.65      12.79307  0.4498799   9.448750
##   0.01    0.70      12.84677  0.4484607   9.534262
##   0.01    0.75      12.93775  0.4457568   9.619744
##   0.01    0.80      13.08526  0.4408351   9.727770
##   0.01    0.85      13.28313  0.4328238   9.902945
##   0.01    0.90      13.53842  0.4219862  10.122445
##   0.01    0.95      13.80846  0.4096135  10.349692
##   0.01    1.00      14.05687  0.3982597  10.555394
##   0.10    0.05      12.13615  0.4897295   9.166732
##   0.10    0.10      11.87082  0.4503997   8.353546
##   0.10    0.15      11.77572  0.4521499   8.335728
##   0.10    0.20      11.70048  0.4624965   8.332907
##   0.10    0.25      11.71634  0.4668777   8.387695
##   0.10    0.30      11.77893  0.4676413   8.521179
##   0.10    0.35      11.82911  0.4679777   8.609616
##   0.10    0.40      11.84028  0.4696330   8.659048
##   0.10    0.45      11.88772  0.4693870   8.741836
##   0.10    0.50      11.98615  0.4666372   8.855073
##   0.10    0.55      12.06613  0.4646154   8.914644
##   0.10    0.60      12.10204  0.4647323   8.932619
##   0.10    0.65      12.11097  0.4665819   8.930454
##   0.10    0.70      12.13252  0.4675700   8.930057
##   0.10    0.75      12.17634  0.4665709   8.932311
##   0.10    0.80      12.23772  0.4646346   8.987128
##   0.10    0.85      12.31147  0.4625688   9.060522
##   0.10    0.90      12.39566  0.4598667   9.148148
##   0.10    0.95      12.50449  0.4557333   9.256571
##   0.10    1.00      12.61255  0.4518290   9.359723
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.1.
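
The tuning grids can be read off the tables above: 15 ridge penalties from 0 to 0.1, and 3 elastic net penalties crossed with 20 fractions. The sketch below rebuilds those grids and wraps the three fits in a hypothetical helper, `fit_linear_models`. The seed and the choice to reuse it for every model are assumptions; the original code is not shown.

```r
# Grids matching the lambda and fraction values in the output above
ridge_grid <- data.frame(lambda = seq(0, 0.1, length.out = 15))
enet_grid  <- expand.grid(lambda = c(0, 0.01, 0.1),
                          fraction = seq(0.05, 1, length.out = 20))

# Hypothetical helper; needs caret, plus elasticnet for "ridge" and "enet".
fit_linear_models <- function(train_x, train_y, seed = 1) {
  ctrl <- caret::trainControl(method = "cv", number = 10)
  fit_one <- function(method, ...) {
    set.seed(seed)  # same seed so each model is resampled on the same folds
    caret::train(x = as.data.frame(train_x), y = train_y,
                 method = method, trControl = ctrl, ...)
  }
  list(
    lm    = fit_one("lm"),  # no pre-processing, as in the output above
    ridge = fit_one("ridge", tuneGrid = ridge_grid,
                    preProcess = c("center", "scale")),
    enet  = fit_one("enet", tuneGrid = enet_grid,
                    preProcess = c("center", "scale"))
  )
}

# Usage, assuming train_x and train_y hold the training split:
# models <- fit_linear_models(train_x, train_y)
# sapply(models, function(m) min(m$results$RMSE, na.rm = TRUE))
```

With 388 predictors and only 133 training samples, the unpenalized fits are rank deficient, which is consistent with the warnings and failed folds at lambda = 0 shown above.
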
  6. Would you recommend any of your models to replace the permeability laboratory experiment?

If I had to recommend one model, it would be PLS, which had the lowest resampled RMSE and reached a test set RMSE of 6.98 and a test set R-squared of 0.80.

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:

  2. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
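
No imputation code or output is shown for this step, so the method actually used is unknown. As one simple possibility, the sketch below fills each missing value with its column median using a hypothetical helper, `impute_median`, on synthetic data. caret's preProcess also offers "medianImpute", "knnImpute", and "bagImpute" methods that could be used instead.

```r
# Hypothetical helper: replace NAs in each numeric column with the column median.
impute_median <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  df
}

# Synthetic predictors; with the real data one would first run
#   library(AppliedPredictiveModeling); data(ChemicalManufacturingProcess)
toy <- data.frame(a = c(1, NA, 3, 5), b = c(10, 20, NA, 40))
filled <- impute_median(toy)
print(filled)
```

In a full analysis the medians would normally be computed on the training set only and then applied to the test set, so that no information leaks from the held-out samples.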

  3. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I will use a PLS model since it performed best in the previous exercise. RMSE was used to select the optimal model: the smallest resampled RMSE (0.70) occurs at ncomp = 10, which is also where the resampled R-squared is highest (0.61).

## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 130, 130, 130, 129, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8312122  0.4582919  0.6368723
##    2     0.8784130  0.5065290  0.6044327
##    3     0.7449064  0.5860351  0.5387970
##    4     0.7191004  0.5974733  0.5526837
##    5     0.7404487  0.5932325  0.5659655
##    6     0.7452959  0.5893215  0.5681318
##    7     0.7684710  0.5780940  0.5925164
##    8     0.7513039  0.5867866  0.5912885
##    9     0.7132363  0.6101240  0.5655940
##   10     0.7044069  0.6113084  0.5601250
##   11     0.7084133  0.6090025  0.5526973
##   12     0.7206937  0.6059508  0.5488486
##   13     0.7350137  0.6045316  0.5580296
##   14     0.7261172  0.6036587  0.5626825
##   15     0.7348801  0.5972221  0.5747648
##   16     0.7197914  0.6006316  0.5705944
##   17     0.7221599  0.5943178  0.5710099
##   18     0.7256506  0.5867697  0.5691657
##   19     0.7400848  0.5720982  0.5808337
##   20     0.7650570  0.5470784  0.5935269
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.

  4. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

On the test set, R-squared was higher (0.78 versus 0.61) and RMSE was lower (0.47 versus 0.70) than the resampled estimates from the training set.

##      RMSE  Rsquared       MAE 
## 0.4702021 0.7831245 0.3666030
  5. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

ManufacturingProcess32 is the most important predictor in the model. Process predictors dominate the list: they account for 13 of the 20 variables shown, including the top five. Below those five, the importance scores are fairly similar, ranging from about 47 to 60.

## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   88.06
## ManufacturingProcess13   86.60
## ManufacturingProcess17   79.38
## ManufacturingProcess36   77.46
## BiologicalMaterial02     59.84
## ManufacturingProcess06   57.66
## BiologicalMaterial08     57.62
## ManufacturingProcess33   57.34
## BiologicalMaterial06     56.95
## BiologicalMaterial11     55.19
## ManufacturingProcess12   54.80
## BiologicalMaterial12     54.05
## ManufacturingProcess11   53.01
## BiologicalMaterial03     50.57
## ManufacturingProcess30   49.55
## ManufacturingProcess31   47.83
## ManufacturingProcess29   47.60
## ManufacturingProcess28   47.23
## BiologicalMaterial04     46.75
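
The split between process and biological predictors can be checked directly from the names in the table above. This short sketch copies the 20 names in ranked order and counts them by type; only base R is used, and the variable names `top20` and `type` are introduced here for illustration.

```r
# The 20 predictors from the importance table above, in ranked order
top20 <- c(
  "ManufacturingProcess32", "ManufacturingProcess09", "ManufacturingProcess13",
  "ManufacturingProcess17", "ManufacturingProcess36", "BiologicalMaterial02",
  "ManufacturingProcess06", "BiologicalMaterial08",   "ManufacturingProcess33",
  "BiologicalMaterial06",   "BiologicalMaterial11",   "ManufacturingProcess12",
  "BiologicalMaterial12",   "ManufacturingProcess11", "BiologicalMaterial03",
  "ManufacturingProcess30", "ManufacturingProcess31", "ManufacturingProcess29",
  "ManufacturingProcess28", "BiologicalMaterial04"
)

type   <- sub("[0-9]+$", "", top20)  # strip the trailing number to get the type
counts <- table(type)
print(counts)
print(type[1:5])                     # types of the five highest-ranked predictors
```

The same grouping could be applied to the full importance table from a fitted model rather than to names typed in by hand.
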
  6. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

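The most important predictors have a strong positive or negative correlation with the Yield response variable. Because the process predictors can be changed, the direction of each correlation suggests which settings to raise or lower in future runs. Knowing which variables correlate with yield can also suggest which variables to remove to improve the model.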
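
The correlations themselves are not shown. One way they could be computed is sketched below with a hypothetical helper, `cor_with_response`, run on synthetic data in which one predictor rises with yield and another falls. The column names and values are made up for illustration and are not from the manufacturing data.

```r
# Hypothetical helper: correlation of each named predictor with the response.
cor_with_response <- function(data, predictors, response = "Yield") {
  sapply(predictors, function(p) {
    cor(data[[p]], data[[response]], use = "pairwise.complete.obs")
  })
}

# Synthetic stand-in for the real data
set.seed(1)
toy <- data.frame(up = 1:20, down = 20:1)
toy$Yield <- 40 + 0.1 * toy$up + rnorm(20, sd = 0.1)

cors <- cor_with_response(toy, c("up", "down"))
print(round(cors, 2))
```

With the real data, the predictor names would be the top entries of the importance table, and the sign of each correlation would indicate the direction in which a process setting is associated with higher yield.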