In Kuhn and Johnson do problems 6.2 and 6.3.

library(caret)
library(tidyverse)
library(AppliedPredictiveModeling)
library(corrplot)
library(e1071)

Question 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
 data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
filter_predictors <- nearZeroVar(fingerprints)
fingerprints_filter<- fingerprints[,-filter_predictors]

Predictors are left for modeling:

(ncol(fingerprints)-length(filter_predictors))
## [1] 388
  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?
# Combine the data
permeability_2 <- cbind(permeability, fingerprints_filter)
set.seed(100)

train_index <- createDataPartition(permeability_2[, "permeability"], p = 0.80, list = FALSE)

train_permeability <- permeability_2[train_index, ]
test_permeability <- permeability_2[-train_index, ]

ctrl <- trainControl(method = "cv", number = 10)

plsTune <- train(train_permeability[,-1],  # Target variables in the training set
                 train_permeability[, "permeability"],   # Predictors in the training set   
                 method = "pls", 
                 tuneLength = 20, 
                 trControl = ctrl, 
                 preProc = c("center", "scale"))

(plsTune)
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 119, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.89902  0.3366671   9.723079
##    2     11.57073  0.4528188   8.008647
##    3     11.74004  0.4540994   8.638716
##    4     11.93784  0.4435949   8.785540
##    5     11.72550  0.4550101   8.696446
##    6     11.50509  0.4693658   8.546633
##    7     11.39297  0.4871664   8.653850
##    8     11.13762  0.5051883   8.558643
##    9     11.18859  0.5013221   8.642897
##   10     11.29239  0.4985107   8.727017
##   11     11.47258  0.4836285   8.824869
##   12     11.42934  0.4880340   8.975986
##   13     11.74486  0.4702584   9.189660
##   14     12.08391  0.4502299   9.380611
##   15     12.27229  0.4497430   9.560355
##   16     12.53866  0.4429663   9.685105
##   17     12.57991  0.4380633   9.684354
##   18     12.60036  0.4371346   9.779651
##   19     12.65043  0.4323267   9.815696
##   20     12.94446  0.4207868  10.028232
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(plsTune)



The latent variables that are optimal are 8 and the Rsquared corresponding resampled estimate is 0.5051883.

  1. Predict the response for the test set. What is the test set estimate of R^2?
pls_predict <- predict(plsTune, newdata=test_permeability)
postResample(pred = pls_predict, obs = test_permeability[, "permeability"])
##       RMSE   Rsquared        MAE 
## 11.0223004  0.5357105  7.9250542

Rsquared estimate is 0.5357105 for test.

The closeness of these values implies that the performance of the cross-validation reliably predicts performance on the test set.

  1. Try building other models discussed in this chapter. Do any have better predictive performance?

Principal Component Regression (PCR)

set.seed(1001)
pcrTune <- train(train_permeability[,-1],  # Target variables in the training set
                 train_permeability[, "permeability"],   # Predictors in the training set   
                 method = "pcr",
                 tuneLength = 20, 
                 trControl = ctrl, 
                 preProc = c("center", "scale"))

pcr_predict <- predict(pcrTune, newdata=test_permeability)

postResample(pred = pcr_predict, obs = test_permeability[, "permeability"])
##       RMSE   Rsquared        MAE 
## 12.3088307  0.4232345  8.7732255

Elastic Net model

set.seed(1002)
enetGrid <- expand.grid(alpha=seq(0,1,by=0.05),
                        lambda=seq(0,1,by=0.05))

enetTune <- train(train_permeability[,-1],  
                  train_permeability[, "permeability"],  
                  method = 'glmnet', 
                  tuneGrid = enetGrid, 
                  trControl = ctrl, 
                  preProc = c('center','scale'))

enet_predict <- predict(enetTune, newdata=test_permeability)
postResample(pred = enet_predict, obs = test_permeability[, "permeability"])
##       RMSE   Rsquared        MAE 
## 11.7079433  0.4763978  8.2892487
  1. Would you recommend any of your models to replace the permeability laboratory experiment?

The PCR model and the elastic net model did not yield R-squares close to the training set, unlike the PLS model. Therefore, I would not recommend the other models that were tested.

Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Impute missing with KNN

yield_index <- which(names(ChemicalManufacturingProcess) == "Yield")

preProcess <- preProcess(ChemicalManufacturingProcess[, -yield_index], method = c("BoxCox", "center", "scale", "knnImpute"))
  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(2424)

train_index2 <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.80, list = FALSE)

train_cmp <- ChemicalManufacturingProcess[train_index2, ]
test_cmp <- ChemicalManufacturingProcess[-train_index2, ]

train_cmp_pp <- predict(preProcess, train_cmp)
test_cmp_pp <- predict(preProcess, test_cmp)
plsTune2 <- train(train_cmp_pp [,-1],  # Target variables in the training set
                 train_cmp_pp$Yield,   # Predictors in the training set   
                 method = "pls", 
                 tuneLength = 20, 
                 trControl = ctrl)
plsTune2
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 128, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.415875  0.4263115  1.1548771
##    2     1.282157  0.5527842  1.0258157
##    3     1.239256  0.5815855  1.0078627
##    4     1.225719  0.5879883  0.9994953
##    5     1.226519  0.5907152  1.0027987
##    6     1.214903  0.5961783  0.9961279
##    7     1.202741  0.6017300  0.9735884
##    8     1.203708  0.6011846  0.9842309
##    9     1.224111  0.5948779  1.0036944
##   10     1.243275  0.5857078  1.0167769
##   11     1.263144  0.5774234  1.0310615
##   12     1.289407  0.5715636  1.0548138
##   13     1.292238  0.5726646  1.0484980
##   14     1.305527  0.5682292  1.0484471
##   15     1.306554  0.5687594  1.0450451
##   16     1.302413  0.5708695  1.0481671
##   17     1.294022  0.5733026  1.0457114
##   18     1.291551  0.5741425  1.0436250
##   19     1.275105  0.5823252  1.0326767
##   20     1.276671  0.5819906  1.0313540
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
plot(plsTune2)



The optimal value of the performance metric is 7 and the Rsquared corresponding is 0.5730703.

  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
pls_predict2 <- predict(plsTune2, newdata=test_cmp_pp)

postResample(pred = pls_predict2, obs = test_cmp_pp$Yield)
##      RMSE  Rsquared       MAE 
## 1.2218527 0.6355319 0.9978895

Since the R-squared value in the testing set is slightly higher than that in the training set, it suggests that the model may need adjustments to better capture the underlying patterns in the data.

  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
predictors_coef <- as.data.frame(coef(plsTune2$finalModel))

colnames(predictors_coef) <- c("Coefficient") 

predictors_coef %>% 
  arrange(Coefficient)
##                         Coefficient
## ManufacturingProcess37 -0.339943911
## ManufacturingProcess13 -0.318147344
## ManufacturingProcess28 -0.281032201
## ManufacturingProcess36 -0.239320234
## ManufacturingProcess17 -0.189456044
## ManufacturingProcess03 -0.188234168
## ManufacturingProcess08 -0.175381163
## BiologicalMaterial10   -0.139358684
## ManufacturingProcess24 -0.134104323
## BiologicalMaterial09   -0.130380144
## ManufacturingProcess38 -0.102103724
## BiologicalMaterial11   -0.085887954
## ManufacturingProcess07 -0.065131266
## ManufacturingProcess22 -0.060692551
## BiologicalMaterial07   -0.059015589
## ManufacturingProcess16 -0.039522449
## ManufacturingProcess44 -0.021524679
## ManufacturingProcess05 -0.017032033
## ManufacturingProcess40 -0.008673674
## BiologicalMaterial04   -0.003384967
## BiologicalMaterial02   -0.001934235
## ManufacturingProcess27  0.011005484
## ManufacturingProcess01  0.011450507
## ManufacturingProcess41  0.021774937
## ManufacturingProcess23  0.022290934
## ManufacturingProcess33  0.024000872
## ManufacturingProcess30  0.024206963
## ManufacturingProcess31  0.025043980
## ManufacturingProcess25  0.025617727
## ManufacturingProcess42  0.031183756
## ManufacturingProcess10  0.040223294
## ManufacturingProcess20  0.046034515
## ManufacturingProcess26  0.050805761
## ManufacturingProcess21  0.064536573
## BiologicalMaterial08    0.074954225
## ManufacturingProcess11  0.081107294
## ManufacturingProcess14  0.083035550
## ManufacturingProcess18  0.086012007
## BiologicalMaterial12    0.105843025
## ManufacturingProcess35  0.122673273
## BiologicalMaterial01    0.122778014
## ManufacturingProcess02  0.127827287
## BiologicalMaterial06    0.135690198
## BiologicalMaterial03    0.145837514
## ManufacturingProcess15  0.154758114
## BiologicalMaterial05    0.170563356
## ManufacturingProcess12  0.174780639
## ManufacturingProcess45  0.177761758
## ManufacturingProcess29  0.185728048
## ManufacturingProcess19  0.194792444
## ManufacturingProcess34  0.219191210
## ManufacturingProcess43  0.220512488
## ManufacturingProcess06  0.265019140
## ManufacturingProcess39  0.271867402
## ManufacturingProcess09  0.434086350
## ManufacturingProcess04  0.514524027
## ManufacturingProcess32  0.698906396

The manufacturing process predictors dominate the list.

  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
# Apply preprocessing to create processed data
processed_data <- data.frame(predict(preProcess, ChemicalManufacturingProcess))
#extract the top 5
rel_exp <- processed_data[,c("Yield",
               "ManufacturingProcess37",
               "ManufacturingProcess13",
               "ManufacturingProcess28",
               "ManufacturingProcess36",
               "ManufacturingProcess17",
               "ManufacturingProcess03",
               "ManufacturingProcess08",
               "BiologicalMaterial10",
               "ManufacturingProcess24",
               "BiologicalMaterial09")]

cor_matrix <- cor(rel_exp)

corrplot(cor_matrix, 
         method="color",
         type="upper")



Exploring the relationships between top predictors and yield helps identify critical factors influencing the manufacturing process. If these manufacturing processes can be controlled, altering these steps to have higher (or lower) values could improve the overall yield.