Questions

6.2:

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a)

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.0.4

library(caret)

## Loading required package: lattice

library(elasticnet)

## Loading required package: lars

## Loaded lars 1.2

data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

(b)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package.

No_low_prints <- fingerprints[,-nearZeroVar(fingerprints)]

How many predictors are left for modeling?

We can see that 388 predictors of the original 1107 remain.

(c)

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

set.seed(4)
pre_proc_opt <- c( "scale","nzv",  "corr", "medianImpute")
train_row <- sort(sample(nrow(No_low_prints), nrow(No_low_prints)*.7))
train_x_set <- No_low_prints[train_row,]
test_x_set <- No_low_prints[-train_row,]
train_y_set <- permeability[train_row,]
test_y_set <- permeability[-train_row,]

#figure out what this does
perm_model <- train(train_x_set, train_y_set, method = 'pls', tuneLength = 20, preProcess = pre_proc_opt)
perm_model

## Partial Least Squares 
## 
## 115 samples
## 388 predictors
## 
## Pre-processing: scaled (386), median imputation (386), remove (2) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 115, 115, 115, 115, 115, 115, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.50442  0.3717815   9.923906
##    2     11.39381  0.4931459   8.121219
##    3     11.48760  0.4877560   8.273475
##    4     11.63757  0.4831159   8.501135
##    5     11.49887  0.4960501   8.356625
##    6     11.36282  0.5101725   8.186009
##    7     11.30655  0.5142884   8.112126
##    8     11.25266  0.5203064   8.185179
##    9     11.32247  0.5194914   8.311607
##   10     11.51455  0.5147289   8.514509
##   11     11.77010  0.5054601   8.697358
##   12     12.00567  0.4983576   8.884370
##   13     12.26310  0.4885414   9.063848
##   14     12.52222  0.4734699   9.269464
##   15     12.73720  0.4641109   9.433073
##   16     12.94948  0.4527701   9.592024
##   17     13.15640  0.4408205   9.742899
##   18     13.37134  0.4310308   9.876247
##   19     13.59463  0.4225643  10.063085
##   20     13.81806  0.4146572  10.240061
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.

The optimal number of components selected is 8 and the corresponding \(R^{2}\) is 0.5203064.

(d)

Predict the response for the test set. What is the test set estimate of R2?

perm_pred <- predict(perm_model, test_x_set)
postResample(perm_pred, test_y_set)

##       RMSE   Rsquared        MAE 
## 14.7183235  0.3402496 10.2871388

The \(R^{2}\) for the test set is a less impressive 0.34025.

(e)

Try building other models discussed in this chapter. Do any have better predictive performance?

ridge_lambdas <- seq(0.1,1,by=0.1)
perm_ridge_model <- train(train_x_set,
                          train_y_set,
                          method='ridge',
                          metric='Rsquared',
                          tuneGrid=data.frame(.lambda = ridge_lambdas),
                          trControl=trainControl(method='cv'),
                          preProcess=pre_proc_opt
                  )
perm_ridge_model

## Ridge Regression 
## 
## 115 samples
## 388 predictors
## 
## Pre-processing: scaled (386), median imputation (386), remove (2) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 103, 103, 105, 104, 103, 103, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.1     10.63897  0.5558039   7.724500
##   0.2     10.42253  0.5835223   7.495573
##   0.3     10.55908  0.5941645   7.594759
##   0.4     10.87297  0.5988681   7.846585
##   0.5     11.30943  0.6005228   8.155334
##   0.6     11.83929  0.6005872   8.506909
##   0.7     12.44350  0.5997490   8.882475
##   0.8     13.10166  0.5985093   9.317398
##   0.9     13.80993  0.5970388   9.850548
##   1.0     14.54942  0.5954249  10.421656
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.6.

The Ridge model selects for a \(\lambda = 1\) results in an \(R^{2}\) of 0.61544.

plot(perm_ridge_model)

perm_ridge_pred <- predict(perm_ridge_model, test_x_set)
postResample(perm_ridge_pred, test_y_set)

##      RMSE  Rsquared       MAE 
## 16.506577  0.349986 11.679768

perm_lasso_model <- train(train_x_set,
                          train_y_set,
                          method='lasso',
                          metric='Rsquared',
                          tuneGrid = data.frame(.fraction = seq(0,0.9,by=0.1)),
                          trControl=trainControl(method='cv'),
                          preProcess=pre_proc_opt
                  )

## Warning: model fit failed for Fold09: fraction=0.9 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

## Warning in train.default(train_x_set, train_y_set, method = "lasso", metric =
## "Rsquared", : missing values found in aggregated results

perm_lasso_model

## The lasso 
## 
## 115 samples
## 388 predictors
## 
## Pre-processing: scaled (386), median imputation (386), remove (2) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 103, 103, 103, 104, 104, 103, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.0       14.11733        NaN  11.058984
##   0.1        9.46410  0.6258182   6.757854
##   0.2       10.00701  0.5494658   7.053366
##   0.3       10.37778  0.5400501   7.346476
##   0.4       11.09833  0.5081900   7.842371
##   0.5       11.58552  0.4923993   8.142333
##   0.6       12.06804  0.4766097   8.462424
##   0.7       12.57129  0.4612538   8.720223
##   0.8       13.02599  0.4497222   8.970678
##   0.9       13.48448  0.4375562   9.273523
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.1.

With a selected fraction of 0.1, the \(R^{2}=0.6944\) for the lasso model.

plot(perm_lasso_model)

perm_lasso_pred <- predict(perm_lasso_model, test_x_set)
postResample(perm_lasso_pred, test_y_set)

##       RMSE   Rsquared        MAE 
## 14.9623101  0.3048541  9.5855534

(f)

Would you recommend any of your models to replace the permeability laboratory experiment?

Personally, I wouldn’t replace a lab experiment with any of these models, but I’m not very good at modeling yet. There is certainly promise here and additional info, a larger training set and further refining of models could move us in the direction of phasing out the lab experiments. Keeping both going at first to cross-reference and further train models would provide a lot of benefit.

6.3:

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),6.5 Computing 139 measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a)

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

(Chem_Imp <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('knnImpute')))

## Created from 152 samples and 57 variables
## 
## Pre-processing:
##   - centered (57)
##   - ignored (0)
##   - 5 nearest neighbor imputation (57)
##   - scaled (57)

(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

library(RANN)

## Warning: package 'RANN' was built under R version 4.0.4

chem_mod <- predict(Chem_Imp, ChemicalManufacturingProcess[,-c(1)])
train_row <- sort(sample(nrow(chem_mod), nrow(chem_mod)*.7))
train_x_set <- chem_mod[train_row,]
test_x_set <- chem_mod[-train_row,]
train_y_set <- ChemicalManufacturingProcess[train_row,1]
test_y_set <- ChemicalManufacturingProcess[-train_row,1]

ridge_lambdas <- seq(0.5,2,by=0.25)
pre_proc_opt <- c( "scale","nzv",  "corr", "knnImpute")
chem_ridge_model <- train(train_x_set,
                          train_y_set,
                          method='ridge',
                          metric='Rsquared',
                          tuneGrid=data.frame(.lambda = ridge_lambdas),
                          trControl=trainControl(method='cv'),
                          preProcess=pre_proc_opt
                  )
chem_ridge_model

## Ridge Regression 
## 
## 123 samples
##  57 predictor
## 
## Pre-processing: scaled (56), nearest neighbor imputation (56), centered
##  (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 110, 111, 110, 111, 111, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.50    1.798339  0.5335353  1.241686
##   0.75    1.900270  0.5362526  1.324535
##   1.00    2.029999  0.5366977  1.417125
##   1.25    2.174020  0.5359990  1.529031
##   1.50    2.324800  0.5346803  1.643254
##   1.75    2.477871  0.5330204  1.762665
##   2.00    2.630529  0.5311797  1.890504
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.

The Ridge model selects for a \(\lambda = 1.5\) results in an \(R^{2}\) of 0.6086476.

plot(chem_ridge_model)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

chem_ridge_pred <- predict(chem_ridge_model, test_x_set)
postResample(chem_ridge_pred, test_y_set)

##      RMSE  Rsquared       MAE 
## 3.0310168 0.3229974 1.5449255

The \(R^{2}\) for our test dataset is 0.4004 - not quite as good as it was for the training set.

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.0.4

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-1

chem_ridge_f <- glmnet(as.matrix(train_x_set),
                          train_y_set,
                          alpha = 0,
                          lambda = 1.5
                       )
coef(chem_ridge_f)

## 58 x 1 sparse Matrix of class "dgCMatrix"
##                                  s0
## (Intercept)            40.107995666
## BiologicalMaterial01    0.029187755
## BiologicalMaterial02    0.033798588
## BiologicalMaterial03    0.117685300
## BiologicalMaterial04    0.017493304
## BiologicalMaterial05    0.015997030
## BiologicalMaterial06    0.096403273
## BiologicalMaterial07   -0.071689997
## BiologicalMaterial08    0.015133273
## BiologicalMaterial09   -0.015668782
## BiologicalMaterial10   -0.081060316
## BiologicalMaterial11    0.003641695
## BiologicalMaterial12    0.041882677
## ManufacturingProcess01  0.085428035
## ManufacturingProcess02 -0.012211049
## ManufacturingProcess03 -0.018971418
## ManufacturingProcess04  0.023194947
## ManufacturingProcess05 -0.044292568
## ManufacturingProcess06  0.097343350
## ManufacturingProcess07 -0.043774264
## ManufacturingProcess08  0.025105894
## ManufacturingProcess09  0.203256752
## ManufacturingProcess10  0.019439912
## ManufacturingProcess11  0.094513980
## ManufacturingProcess12  0.083477605
## ManufacturingProcess13 -0.170885031
## ManufacturingProcess14  0.029244151
## ManufacturingProcess15  0.097957567
## ManufacturingProcess16 -0.194184070
## ManufacturingProcess17 -0.187610875
## ManufacturingProcess18  0.013452195
## ManufacturingProcess19  0.054021083
## ManufacturingProcess20  0.009954535
## ManufacturingProcess21 -0.053631633
## ManufacturingProcess22 -0.057140500
## ManufacturingProcess23 -0.045777925
## ManufacturingProcess24 -0.123061169
## ManufacturingProcess25 -0.126065906
## ManufacturingProcess26  1.035198860
## ManufacturingProcess27 -0.235103214
## ManufacturingProcess28 -0.071247239
## ManufacturingProcess29  0.287504591
## ManufacturingProcess30  0.097145485
## ManufacturingProcess31 -0.151665474
## ManufacturingProcess32  0.278094802
## ManufacturingProcess33  0.081020955
## ManufacturingProcess34  0.181942832
## ManufacturingProcess35  0.005860643
## ManufacturingProcess36 -0.161999484
## ManufacturingProcess37 -0.148905306
## ManufacturingProcess38 -0.002042031
## ManufacturingProcess39  0.072789842
## ManufacturingProcess40  0.002401224
## ManufacturingProcess41  0.011550535
## ManufacturingProcess42  0.012230242
## ManufacturingProcess43  0.049275154
## ManufacturingProcess44  0.072171153
## ManufacturingProcess45  0.058562136

The strongest 3 predictors in our model appear to be ManufacturingProcess20 with a strong negative correlation coefficient of -0.325, ManufacturingProcess32 with a positive coef of 0.2857 and ManufacturingProcess06 with a coef of 0.2669. The next few are: ManufacturingProcess09, ManufacturingProcess13 and ManufacturingProcess36.

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

From the information above, it appears that the manufacturing processes are stronger predictors than the Biological Materials involved, suggesting that focusing on including certain processes (32, 6 and 9) while eliminating the use of others (20, 13, 36) may be the best way to improve the overall yield.

DATA624 HW7

M Kollontai

3/28/2021

Questions

6.2:

(a)

(b)

(c)

(d)

(e)

(f)

6.3:

(a)

(b)

(c)