Resources:
https://topepo.github.io/caret/model-training-and-tuning.html
https://daviddalpiaz.github.io/r4sl/the-caret-package.html
https://towardsdatascience.com/create-predictive-models-in-r-with-caret-12baf9941236

6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.

a.

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

b.

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

Notes on what nearZeroVar does: it diagnoses predictors that have one unique value (i.e., zero-variance predictors), or predictors that have both of the following characteristics: very few unique values relative to the number of samples, and a large ratio of the frequency of the most common value to the frequency of the second most common value. The result is a vector of integers indicating which columns should be removed.
https://www.rdocumentation.org/packages/caret/versions/6.0-92/topics/nearZeroVar
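To see those criteria in action, nearZeroVar can return its per-column diagnostics with saveMetrics = TRUE. A small sketch, assuming caret's defaults (freqCut = 95/5, uniqueCut = 10):

library(caret)

# Per-column diagnostics behind nearZeroVar()
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics)     # freqRatio, percentUnique, zeroVar, nzv for each predictor
sum(nzv_metrics$nzv)  # how many columns would be flagged for removal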

This reduced the predictor set from 1107 columns to 388.

nZ_f <- nearZeroVar(fingerprints)  # column indices of near-zero-variance predictors
fp2 <- fingerprints[, -nZ_f]       # 165 compounds x 388 remaining predictors

c. 

Split the data into a training and a test set, pre–process the data, and tune a partial least squares model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

A separate preProcess() run showed that PCA needed 27 components to capture 95 percent of the variance. The tuned PLS model itself (centered and scaled, no PCA) selected 12 latent variables, with a corresponding resampled \(R^2\) of 0.5364373.

#Split the data (createDataPartition gives a stratified 80/20 split on the outcome)
set.seed(888)
train_rw <- createDataPartition(permeability, 
                                p = .80,
                                list = FALSE)
train_fp <- fp2[train_rw,]
train_perm <- permeability[train_rw]
test_fp <- fp2[-train_rw,]
test_perm <- permeability[-train_rw]
# preProcess object summarizing Box-Cox, centering, scaling, and PCA on the training predictors
train_fp2 <- preProcess(train_fp,
                        method = c("BoxCox", "center", "scale", "pca"))

train_fp2
## Created from 133 samples and 388 variables
## 
## Pre-processing:
##   - centered (388)
##   - ignored (0)
##   - principal component signal extraction (388)
##   - scaled (388)
## 
## PCA needed 27 components to capture 95 percent of the variance
#pre–process the data (inside train) & tune a partial least squares model
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10) # repeated 10-fold cross-validation

pls_fp <- train(x = train_fp,
                 y = train_perm,
                 method = "pls",
                 metric = "Rsquared",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProcess = c("center", "scale"))

pls_fp
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 120, 120, 118, 121, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.21780  0.3852210  9.384064
##    2     10.88838  0.5152888  7.643714
##    3     11.21725  0.4971614  8.434656
##    4     11.47374  0.4870101  8.918321
##    5     11.27540  0.5015980  8.541304
##    6     11.11734  0.5128541  8.403436
##    7     10.97480  0.5231071  8.280814
##    8     11.01305  0.5245794  8.440170
##    9     11.04921  0.5213340  8.386952
##   10     11.03752  0.5222550  8.363731
##   11     10.95903  0.5311582  8.262164
##   12     10.99336  0.5364373  8.254835
##   13     11.16201  0.5234876  8.314936
##   14     11.40712  0.5072845  8.525777
##   15     11.56262  0.5007992  8.673189
##   16     11.68678  0.4972684  8.777597
##   17     11.81182  0.4927122  8.856997
##   18     11.96554  0.4860936  8.968375
##   19     12.01798  0.4857320  9.049894
##   20     12.13898  0.4805915  9.165832
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 12.
plot(pls_fp$results$Rsquared)
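caret's plot method for train objects draws the same resampling profile with the tuning parameter on the x-axis; a small alternative to plotting the raw Rsquared column:

# Resampled R^2 across ncomp values, via plot.train
plot(pls_fp, metric = "Rsquared")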

d. 

Predict the response for the test set. What is the test set estimate of \(R^2\)?

The test set estimate of \(R^2\) was 0.3618712, less than the resampled estimate on the training data (0.5364373).

pred_fp <- predict(pls_fp, test_fp)

postResample(pred=pred_fp, obs = test_perm)
##       RMSE   Rsquared        MAE 
## 15.4606148  0.3618712 11.2793360
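postResample() defines \(R^2\) as the squared correlation between predictions and observations; a quick check by hand reproduces the same numbers:

# Reproduce postResample()'s metrics manually
cor(pred_fp, test_perm)^2            # R^2 (squared Pearson correlation)
sqrt(mean((pred_fp - test_perm)^2))  # RMSE
mean(abs(pred_fp - test_perm))       # MAE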

e.

Try building other models discussed in this chapter. Do any have better predictive performance?

Below I fit ridge regression, the lasso, and an elastic net. The elastic net performed best on the test set (\(R^2\) = 0.4242363 vs. 0.3618712 for PLS), though none of the models performed strongly.

ridge<- train(x = train_fp,
              y = train_perm,
              method = "ridge",
              trControl = ctrl,
              preProcess = c("center", "scale"))

ridge
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 121, 120, 121, 120, 121, 118, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE          Rsquared   MAE         
##   0e+00       19.49514  0.3824814  1.365134e+01
##   1e-04   245217.28192  0.2233628  1.296777e+05
##   1e-01       11.26457  0.5146406  8.292739e+00
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
pred_fp_ridge <- predict(ridge, test_fp)

postResample(pred=pred_fp_ridge, obs = test_perm)
##       RMSE   Rsquared        MAE 
## 15.6980392  0.3611984 11.1377879
#tuning ridge over a wider lambda grid
#this takes a while to run; set the chunk option {r, cache = TRUE} so it does not rerun when knitting
lambdaGrid <- expand.grid(lambda = 5^seq(5, -2, length=25))

ridge_tune<- train(x = train_fp,
                   y = train_perm, 
                   method = "ridge",
                   trControl = ctrl,
                   preProcess = c("center", "scale"),
# Test all the lambda values in the lambdaGrid data frame    
                   tuneGrid = lambdaGrid,   
                   na.action = na.omit)  

ridge_tune
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 121, 121, 119, 119, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE       Rsquared   MAE       
##   4.000000e-02   11.45786  0.5119509    8.527641
##   6.396262e-02   11.28150  0.5244551    8.336605
##   1.022804e-01   11.89871  0.5291534    8.684450
##   1.635531e-01   11.26451  0.5411190    8.239649
##   2.615321e-01   11.52041  0.5459928    8.475253
##   4.182070e-01   12.14980  0.5481449    9.084921
##   6.687403e-01   13.48898  0.5475048   10.273594
##   1.069360e+00   16.14932  0.5443186   12.461898
##   1.709976e+00   21.03681  0.5396353   16.403463
##   2.734364e+00   29.29566  0.5342033   22.897422
##   4.372426e+00   42.29032  0.5286751   32.676444
##   6.991796e+00   61.60812  0.5229477   46.933155
##   1.118034e+01   88.90323  0.5163281   66.923115
##   1.787810e+01  125.51540  0.5077004   93.652067
##   2.858825e+01  171.96666  0.4959090  128.159153
##   4.571448e+01  227.47306  0.4804742  170.083574
##   7.310044e+01  289.51314  0.4623039  217.215787
##   1.168924e+02  353.66793  0.4435859  267.632232
##   1.869186e+02  414.38703  0.4266781  317.020107
##   2.988951e+02  466.80390  0.4130115  360.756194
##   4.779528e+02  508.36400  0.4028597  396.142227
##   7.642778e+02  539.06858  0.3957556  422.713268
##   1.222130e+03  560.56458  0.3909804  441.477590
##   1.954266e+03  575.04612  0.3878539  454.193715
##   3.125000e+03  584.54938  0.3858412  462.568857
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1635531.
pred_fp_ridge2 <- predict(ridge_tune, test_fp)

postResample(pred=pred_fp_ridge2, obs = test_perm)
##       RMSE   Rsquared        MAE 
## 15.6985045  0.3787032 11.0347931
lasso_cv <- train(x = train_fp,
                  y = train_perm,
                  method = "lasso",
                  trControl = ctrl,
                  preProcess = c("center", "scale"))
lasso_cv
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 121, 117, 119, 120, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE          Rsquared   MAE         
##   0.1       3.136064e+15  0.5371098  8.381485e+14
##   0.5       1.545908e+16  0.4960255  4.131614e+15
##   0.9       2.778211e+16  0.4585198  7.425080e+15
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
pred_fp_lasso <- predict(lasso_cv, test_fp)

postResample(pred=pred_fp_lasso, obs = test_perm)
##       RMSE   Rsquared        MAE 
## 14.2848666  0.3408836 10.8034913
enet_cv <- train(x = train_fp,
                 y = train_perm,
                 method = "glmnet",
                 trControl = ctrl)  # no preProcess: glmnet standardizes predictors internally by default
enet_cv
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 119, 119, 121, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda     RMSE      Rsquared   MAE      
##   0.10   0.6354172  10.72563  0.5209538   7.899483
##   0.10   2.0093656  10.47341  0.5361194   7.652397
##   0.10   6.3541719  10.53302  0.5436366   7.532187
##   0.55   0.6354172  10.49749  0.5358297   7.695583
##   0.55   2.0093656  10.61913  0.5377893   7.515236
##   0.55   6.3541719  11.42736  0.5270486   8.782490
##   1.00   0.6354172  10.72573  0.5213515   7.701661
##   1.00   2.0093656  10.83441  0.5309045   7.892880
##   1.00   6.3541719  12.77227  0.4988502  10.159245
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 2.009366.
pred_fp_enet <- predict(enet_cv, test_fp)

postResample(pred=pred_fp_enet, obs = test_perm)
##       RMSE   Rsquared        MAE 
## 13.4729493  0.4242363 10.1623342
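For a side-by-side view of the cross-validated models, caret's resamples() can collect the fitted train objects. A sketch; for a strictly fair comparison the models should share resampling indices (e.g. the same set.seed() before each train() call, or an explicit index in trainControl()):

# Collect resampling results from the fitted train objects
res <- resamples(list(PLS = pls_fp,
                      Ridge = ridge_tune,
                      Lasso = lasso_cv,
                      ENet = enet_cv))
summary(res)                       # resampled RMSE / R^2 / MAE per model
dotplot(res, metric = "Rsquared")  # distributions of resampled R^2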

f. 

Would you recommend any of your models to replace the permeability laboratory experiment?

No. I do not think the models performed well enough to replace the laboratory experiment; the best test-set \(R^2\) was only about 0.42 (elastic net).

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors) and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

a.

Start R and use these commands to load the data:

#library(AppliedPredictiveModeling) # done in 6.2
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material, and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
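A quick sanity check of the dimensions (the loaded data frame bundles Yield together with the 57 predictors):

dim(ChemicalManufacturingProcess)  # 176 runs x 58 columns (Yield + 57 predictors)
# 12 biological vs. 45 process predictors, by name prefix
table(grepl("^Biological", names(ChemicalManufacturingProcess)[-1]))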

b.

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g. see Sect. 3.8).

# k-nearest-neighbor imputation; note that preProcess also centers and scales
# the data whenever knnImpute is requested, so cmp (including Yield) is
# returned in standardized units
cmp_trans <- preProcess(ChemicalManufacturingProcess,
                        method = "knnImpute")
cmp <- predict(cmp_trans, ChemicalManufacturingProcess)
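As a quick check of how much imputation was actually needed, a small sketch counting missing cells per column in the raw data:

# Count missing cells per predictor before imputation
miss <- colSums(is.na(ChemicalManufacturingProcess))
sum(miss)                            # total missing cells
head(sort(miss, decreasing = TRUE))  # columns with the most missing values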

c. 

Split the data into a training and a test set, pre–process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

PCA needed 26 components to capture 95 percent of the variance. The tuned PLS model selected 3 latent variables, with a resampled \(R^2\) of 0.7789439.

#split (stratified 80/20 on Yield)
set.seed(888)
train_rw2 <- createDataPartition(ChemicalManufacturingProcess$Yield,
                                 p = 0.8,
                                 list = FALSE)

# note: cmp still contains the Yield column, so it is carried along in the
# predictor matrices below (ideally it would be dropped before training)
train_cmp <- cmp[train_rw2,]
train_yld <- cmp$Yield[train_rw2]

test_cmp <- cmp[-train_rw2,]
test_yld <- cmp$Yield[-train_rw2]
#preprocess & tune
pproc <- preProcess(train_cmp,
                    method = c("BoxCox", "center", "scale", "knnImpute", "pca"))
pproc
## Created from 144 samples and 58 variables
## 
## Pre-processing:
##   - centered (58)
##   - ignored (0)
##   - 5 nearest neighbor imputation (58)
##   - principal component signal extraction (58)
##   - scaled (58)
## 
## PCA needed 26 components to capture 95 percent of the variance
train_cmp2 <- predict(pproc, train_cmp)  # PCA scores (not used by the PLS fit below,
test_cmp2 <- predict(pproc, test_cmp)    # which pre-processes train_cmp directly)

nZ_f <- nearZeroVar(train_cmp2)
if (length(nZ_f) > 0) {  # guard: x[, -integer(0)] would drop every column
  train_cmp2 <- train_cmp2[, -nZ_f]
  test_cmp2 <- test_cmp2[, -nZ_f]
}
#tune model (default tuneLength = 3, so only ncomp = 1:3 are evaluated)
set.seed(888)

pls_cmp <- train(x = train_cmp, y = train_yld,  # note: train_cmp still includes Yield (see above)
                 method = "pls",
                 metric = "Rsquared",
                 trControl = ctrl,
                 preProcess = c("center", "scale"))

pls_cmp
## Partial Least Squares 
## 
## 144 samples
##  58 predictor
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 132, 130, 129, 128, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##   1      0.7311866  0.5611505  0.5623948
##   2      0.8789732  0.6307552  0.5420330
##   3      0.5085050  0.7789439  0.3895012
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.

d. 

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The test-set \(R^2\) was 0.8108199, slightly higher than the resampled training estimate (0.7789439). Both figures are optimistic, since the Yield column was carried along in the predictor matrix.

pred_cmp <- predict(pls_cmp, test_cmp)

postResample(pred=pred_cmp, obs = test_yld)
##      RMSE  Rsquared       MAE 
## 0.3966586 0.8108199 0.3169431

e.

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

See the list below: the process predictors dominate. (Yield itself tops the list only because the full data frame, including the response, was passed as x; that row should be disregarded.)

varImp(pls_cmp)
## pls variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   50.76
## ManufacturingProcess13   49.96
## ManufacturingProcess09   45.68
## ManufacturingProcess17   44.72
## ManufacturingProcess36   42.71
## BiologicalMaterial02     38.05
## BiologicalMaterial06     36.68
## ManufacturingProcess12   33.04
## BiologicalMaterial03     32.93
## ManufacturingProcess33   32.88
## BiologicalMaterial08     32.82
## ManufacturingProcess06   32.32
## BiologicalMaterial12     32.08
## BiologicalMaterial11     30.29
## ManufacturingProcess11   30.07
## BiologicalMaterial01     29.21
## BiologicalMaterial04     28.74
## ManufacturingProcess28   24.89
## ManufacturingProcess04   23.55
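As a quick tally of process vs. biological predictors among the top 20 scores, a small sketch using the varImp importance table (the leaked Yield row is dropped first):

# Tally Manufacturing vs. Biological predictors among the top 20
imp <- varImp(pls_cmp)$importance
imp$predictor <- rownames(imp)
imp <- imp[imp$predictor != "Yield", ]  # drop the leaked response row
top20 <- head(imp[order(-imp$Overall), ], 20)
table(ifelse(grepl("^Manufacturing", top20$predictor),
             "process", "biological"))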

f. 

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The correlation plot below shows how each top predictor relates to Yield. Since the process predictors are controllable, the sign and strength of their correlations with Yield suggest which settings to raise or lower in future runs; the biological predictors, while fixed, could still be used to screen raw material before processing.

library(dplyr)    # select(), rename()
library(corrplot) # corrplot.mixed()

top <- cmp |>
  select(Yield,
         ManufacturingProcess32,
         ManufacturingProcess13,
         ManufacturingProcess17,
         ManufacturingProcess36,
         ManufacturingProcess09,
         ManufacturingProcess12,
         ManufacturingProcess33,
         BiologicalMaterial02,
         BiologicalMaterial06) |>
  rename(MP32 = ManufacturingProcess32,
         MP13 = ManufacturingProcess13,
         MP17 = ManufacturingProcess17,
         MP36 = ManufacturingProcess36,
         MP09 = ManufacturingProcess09,
         MP12 = ManufacturingProcess12,
         MP33 = ManufacturingProcess33,
         BM02 = BiologicalMaterial02,
         BM06 = BiologicalMaterial06)

top |>
  cor() |>
  corrplot.mixed(tl.srt = 45, tl.cex = 0.5,
                 na.label = "square", na.label.col = "lightgrey",
                 tl.col = "black", number.cex = 0.5)
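To see the direction of each relationship rather than just its strength, a scatterplot matrix against Yield can help. A sketch using caret's featurePlot and the top data frame built above (Yield is in standardized units because of the knnImpute pre-processing):

# Scatterplots of each renamed top predictor against Yield,
# with a smoother to show the direction of each relationship
featurePlot(x = top[, -1],
            y = top$Yield,
            plot = "scatter",
            type = c("p", "smooth"),
            layout = c(3, 3))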