Exercises:

6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a) Start R and use these commands to load the data:

data( "permeability")

b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

Answer:

  • fingerprints predictors in the original dataset is 1107 for 165 compounds.

  • fingerprints predictors in the dataset after using nearZeroVar function is 338 for 165 compounds.

# Check the initial observations and predictors
dim(fingerprints)
## [1]  165 1107
# Filter out the predictors that have low frequencies
fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]

# Display the predictors left
dim(fingerprints)
## [1] 165 388

c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

Answer:

  • The training data took 133 samples .
  • Using 10 fold Cross validation technique,using R squared as the metric, the PLS method provided 8 latent variables to be optimal.Hence the final value used for the model was ncomp = 8.
  • The corresponding resampled estimate of R2 is 0.5051883 .
set.seed(100)

# sample data for training
sample <- permeability |> createDataPartition(p = .8, list = FALSE)

# train  data for both datasets
train_pm <- permeability[sample, ]
train_fp <- fingerprints[sample, ]

# test data for both datasets
test_pm <- permeability[-sample, ]
test_fp <- fingerprints [-sample, ]

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

plsTune <- train(train_fp, train_pm, method = "pls", metric = "Rsquared",
             tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))

plot(plsTune) 

plsTune
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 119, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.89902  0.3366671   9.723079
##    2     11.57073  0.4528188   8.008647
##    3     11.74004  0.4540994   8.638716
##    4     11.93784  0.4435949   8.785540
##    5     11.72550  0.4550101   8.696446
##    6     11.50509  0.4693658   8.546633
##    7     11.39297  0.4871664   8.653850
##    8     11.13762  0.5051883   8.558643
##    9     11.18859  0.5013221   8.642897
##   10     11.29239  0.4985107   8.727017
##   11     11.47258  0.4836285   8.824869
##   12     11.42934  0.4880340   8.975986
##   13     11.74486  0.4702584   9.189660
##   14     12.08391  0.4502299   9.380611
##   15     12.27229  0.4497430   9.560355
##   16     12.53866  0.4429663   9.685105
##   17     12.57991  0.4380633   9.684354
##   18     12.60036  0.4371346   9.779651
##   19     12.65043  0.4323267   9.815696
##   20     12.94446  0.4207868  10.028232
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 8.
getTrainPerf(plsTune)
##   TrainRMSE TrainRsquared TrainMAE method
## 1  11.13762     0.5051883 8.558643    pls

d) Predict the response for the test set. What is the test set estimate of R2?

Answer:

  • Predicted the response for the test set using predict() function, ncomp = 6.

  • The test set estimate of R2 using PLS model is 0.5357105.

fp_predict <- predict(plsTune, test_fp)

postResample(fp_predict, test_pm)
##       RMSE   Rsquared        MAE 
## 11.0223004  0.5357105  7.9250542

e) Try building other models discussed in this chapter. Do any have better predictive performance?

Answer:

  • Elastic Net model computation took a bit more time than PLS model build.

  • The test set estimate of R2 using elasticnet model is 0.460516 with lambda = 0.01 and .fraction = 0.05.

  • Lasso Regression model computation is L1 regularization.

  • The test set estimate of R2 using Lasso model is 0.4774312 with .fraction = 0.05.

  • Based on the R2 comparison, the PLS model offers better predictive performance as it has the highest R2 .

# grid of penalties
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

# tuning penalized regression(elastic net ) model using metric = "Rsquared"
enetTune <- train(train_fp, train_pm, method = "enet",metric = "Rsquared",
                  tuneGrid = enetGrid, trControl = ctrl, preProc = c("center", "scale"))
#validation plot
plot(enetTune)

#summary
enetTune
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 120, 119, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.05       15.92454  0.4149977  11.250097
##   0.00    0.10       20.23916  0.4268401  13.146906
##   0.00    0.15       25.27825  0.4014007  16.074842
##   0.00    0.20       30.66808  0.3862955  19.391415
##   0.00    0.25       36.61766  0.3743805  22.867382
##   0.00    0.30       42.29704  0.3624048  26.285371
##   0.00    0.35       48.74081  0.3454254  29.391911
##   0.00    0.40       54.99337  0.3369527  32.547410
##   0.00    0.45       61.01564  0.3311644  35.657877
##   0.00    0.50       66.98059  0.3255013  38.748701
##   0.00    0.55       72.89343  0.3234535  41.651433
##   0.00    0.60       78.80292  0.3183460  44.476682
##   0.00    0.65       84.64774  0.3142939  47.294215
##   0.00    0.70       90.49634  0.3110215  50.100267
##   0.00    0.75       96.50695  0.3064360  52.955658
##   0.00    0.80      102.54072  0.3011334  55.830134
##   0.00    0.85      108.69852  0.2986829  58.762840
##   0.00    0.90      114.88765  0.2963885  61.697929
##   0.00    0.95      121.08212  0.2946134  64.655689
##   0.00    1.00      127.14467  0.2925550  67.711128
##   0.01    0.05       11.13390  0.4875269   7.808608
##   0.01    0.10       11.02644  0.4697844   7.760532
##   0.01    0.15       11.19436  0.4587664   8.171019
##   0.01    0.20       11.51798  0.4473346   8.425328
##   0.01    0.25       11.72125  0.4451314   8.496069
##   0.01    0.30       12.02011  0.4390133   8.693026
##   0.01    0.35       12.32496  0.4276333   8.869635
##   0.01    0.40       12.51186  0.4209352   8.995068
##   0.01    0.45       12.68564  0.4120608   9.145245
##   0.01    0.50       12.87752  0.3987716   9.264984
##   0.01    0.55       13.14639  0.3806198   9.428356
##   0.01    0.60       13.42127  0.3645401   9.623630
##   0.01    0.65       13.65940  0.3520916   9.846036
##   0.01    0.70       13.90300  0.3397872  10.073486
##   0.01    0.75       14.20370  0.3285202  10.302569
##   0.01    0.80       14.48125  0.3193193  10.529721
##   0.01    0.85       14.72357  0.3118963  10.749190
##   0.01    0.90       14.93236  0.3058452  10.926728
##   0.01    0.95       15.11693  0.3012204  11.082565
##   0.01    1.00       15.28554  0.2970509  11.218829
##   0.10    0.05       12.03362  0.4599280   9.057243
##   0.10    0.10       11.11587  0.4835982   7.742628
##   0.10    0.15       11.01906  0.4809139   7.555460
##   0.10    0.20       11.06195  0.4728057   7.794452
##   0.10    0.25       11.13881  0.4669137   8.028810
##   0.10    0.30       11.30545  0.4577672   8.294670
##   0.10    0.35       11.49460  0.4486985   8.503234
##   0.10    0.40       11.62638  0.4436208   8.615526
##   0.10    0.45       11.77095  0.4368198   8.689516
##   0.10    0.50       11.88887  0.4309562   8.728833
##   0.10    0.55       12.00117  0.4257441   8.789869
##   0.10    0.60       12.10288  0.4213316   8.881068
##   0.10    0.65       12.17475  0.4189686   8.938232
##   0.10    0.70       12.23948  0.4165820   8.989967
##   0.10    0.75       12.29727  0.4145380   9.045190
##   0.10    0.80       12.35420  0.4125055   9.090148
##   0.10    0.85       12.40335  0.4108638   9.120956
##   0.10    0.90       12.44991  0.4092878   9.143724
##   0.10    0.95       12.51563  0.4067511   9.178773
##   0.10    1.00       12.60301  0.4029965   9.220182
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.01.
#predict the permeability response for the test data
enet_predict <- predict(enetTune, test_fp)

#compare the variance of the predicted values to the test values.
postResample(enet_predict, test_pm)
##      RMSE  Rsquared       MAE 
## 11.925564  0.460516  8.762171
# Lasso Regression model 

larsTune <- train(train_fp, train_pm, method = "lars", metric = "Rsquared",
                    tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))

plot(larsTune)

larsTune
## Least Angle Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 117, 120, 121, 119, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.05      10.78089  0.4884988   7.621235
##   0.10      11.22069  0.4658293   8.155298
##   0.15      11.85998  0.4389142   8.710538
##   0.20      12.56456  0.4081787   9.161977
##   0.25      13.28352  0.3782736   9.579537
##   0.30      14.03239  0.3496227  10.049710
##   0.35      14.74708  0.3177346  10.571696
##   0.40      15.24085  0.2967624  11.001037
##   0.45      15.90103  0.2775492  11.479329
##   0.50      16.83813  0.2546089  11.973205
##   0.55      17.94281  0.2350213  12.707837
##   0.60      19.08541  0.2186174  13.394901
##   0.65      20.33394  0.2064060  14.106612
##   0.70      21.52771  0.1948684  14.795779
##   0.75      22.59915  0.1870641  15.447077
##   0.80      23.54321  0.1822464  16.042508
##   0.85      24.34344  0.1776840  16.557470
##   0.90      25.43092  0.1757257  17.277698
##   0.95      26.48296  0.1728901  17.913372
##   1.00      27.51909  0.1708402  18.541743
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
lars_predict <- predict(larsTune, test_fp)

postResample(lars_predict, test_pm)
##       RMSE   Rsquared        MAE 
## 11.7132875  0.4774312  8.3278765

f) Would you recommend any of your models to replace the permeability laboratory experiment?

Answer:

As the R2 value is low around 0.50,it only explains 50% of the variability in the response variable. I do not recommend any of the models to replace the permeability laboratory experiment. Perhaps we need to look into other models for better accuracy.

6.3 Chemical manufacturing process for a pharmaceutical product

a) Read ChemicalManufacturingProcess data

Answer: The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)

cmp <- data.frame(ChemicalManufacturingProcess)

b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

# summary of missing values
paste0(" Missing values: ",sum(is.na(cmp)))
## [1] " Missing values: 106"
#data imputation
cmp_impute <- preProcess(cmp, method = "knnImpute")
cmp_impute  <- predict(cmp_impute,cmp)


# summary post data imputation
paste0(" Missing values post data kNN Impute:",sum(is.na(cmp_impute)))
## [1] " Missing values post data kNN Impute:0"

c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

  • For evaluation i used Rsquared as the performance metric.
  • Filtered predictors with low frequencies.
  • Checked for skewness revealed BiologicalMaterial01 -05 and Yield have skewness.Hence we need to apply Box-cox transformation.
  • The PLS model evaluated the optimal components ncomp = 1 . As evident in the validation plot, beyond the 1 PLS component1 the Rsquared value is increasing.Resampling results across tuning parameters shows R2 = 0.4780728.
# pre-process to filter predictor with low frequencies
cmp_transform <- cmp_impute[, -nearZeroVar(cmp_impute)]

# Check for skewness
library(e1071)
skewVals <- apply(cmp_transform, 2, skewness)
head(skewVals)
##                Yield BiologicalMaterial01 BiologicalMaterial02 
##           0.31095956           0.27331650           0.24412685 
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05 
##           0.02851075           1.73231530           0.30400532
#data transformation 
cmp_impute <- preProcess(cmp, method = c("BoxCox","knnImpute"))
cmp_impute  <- predict(cmp_impute,cmp)

set.seed(100)

# sample data for training
index <- createDataPartition(cmp_transform$Yield, p = .8, list = FALSE)

# train and test data
train_cmp <- cmp_transform[index, ]
test_cmp <- cmp_transform[-index, ]

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

plsTune <- train(Yield ~ ., train_cmp , method = "pls", metric = "RMSE",
             tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))

#Validation plot
plot(plsTune) 

#optimal PLS components
plsTune$bestTune
##   ncomp
## 1     1
#performance
getTrainPerf(plsTune)
##   TrainRMSE TrainRsquared TrainMAE method
## 1  0.755647     0.4850805 0.618129    pls

d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Answer: R2 of the test set is 0.1807132 which is lower than the train set R2 which was 0.4780728. The diagnostic plots shows us visually to assess how close the predicted values are to the actual values.

cmp_predict <- predict(plsTune, test_cmp[ ,-1])


postResample(cmp_predict, test_cmp[ ,1])
##      RMSE  Rsquared       MAE 
## 1.2109611 0.1856010 0.7573744
xyplot( predict(plsTune) ~ train_cmp$Yield , type = c("p","g"))

xyplot( predict(plsTune) ~ resid(plsTune) , type = c("p","g"))

e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Answer:

Of the top 20 predictors, there is a mix of both Manufacturing Process and Biological Material. However, the most of the predictors in this are Manufacturing Processes predictors.

varImp(plsTune)
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess36   84.21
## ManufacturingProcess13   82.93
## ManufacturingProcess09   80.75
## BiologicalMaterial02     80.34
## BiologicalMaterial06     79.06
## BiologicalMaterial03     75.65
## BiologicalMaterial04     71.71
## ManufacturingProcess17   70.37
## ManufacturingProcess33   70.19
## BiologicalMaterial08     62.74
## ManufacturingProcess06   62.29
## ManufacturingProcess12   59.52
## BiologicalMaterial01     59.15
## BiologicalMaterial11     58.35
## BiologicalMaterial12     57.64
## ManufacturingProcess11   55.73
## ManufacturingProcess28   42.73
## ManufacturingProcess04   41.49
## ManufacturingProcess15   41.48

f)Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Answer:

According to the correlation plot, ManufacturingProcess32 has the highest positive correlation with Yield, followed by ManufacturingProcess09 and BiologicalMaterial02. ManufacturingProcess13,ManufacturingProcess36 and ManufacturingProcess17 which were in the Top5 are negatively correlated with Yield. This information can be helpful improving yield in future runs of the manufacturing process, as these were the predictors that affected the yield. If they want to maximize or improve their yield, they way want to increase the measurements of these manufacturing process and these biological measurements of the raw materials which have positive correlation and decrease the measurements which have negative correlation.

top_predictors <- varImp(plsTune)$importance |>
  arrange(-Overall) |>
  head(10)


cmp_transform |>
  select(c("Yield", row.names(top_predictors))) |>
  cor() |>
  corrplot()