Question 6.2

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

  2. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

  3. Predict the response for the test set. What is the test set estimate of R2?

  4. Try building other models discussed in this chapter. Do any have better predictive performance?

  5. Would you recommend any of your models to replace the permeability laboratory experiment?

Question 6.2b Answer

According to the R Documentation:

“nearZeroVar diagnoses predictors that have one unique value (i.e. are zero variance predictors) or predictors that are have both of the following characteristics: they have very few unique values relative to the number of samples and the ratio of the frequency of the most common value to the frequency of the second most common value is large. checkConditionalX looks at the distribution of the columns of x conditioned on the levels of y and identifies columns of x that are sparse within groups of y.”

high_freq_predictors <- fingerprints[,-nearZeroVar(fingerprints)]
dim(high_freq_predictors)
## [1] 165 388

After using the nearZeroVar function to parse out low frequency predictors, we went from 1,107 predictors to 388,

Question 6.2c Answer

The first thing that we should do is create the testing and training data, which is done below. The dimensions of the training and testing data are also provided.

set.seed(1)
sample <- sample.split(permeability, SplitRatio = 0.8)

train_X  <- subset(high_freq_predictors, sample == TRUE) %>% as.matrix()
test_X   <- subset(high_freq_predictors, sample == FALSE) %>% as.matrix()
train_y <- subset(permeability, sample == TRUE) %>% as.vector()
test_y <- subset(permeability, sample == FALSE) %>% as.vector()

dim(train_X)
## [1] 132 388
dim(test_X)
## [1]  33 388

Now we can fit a PLS model. Page 134 shows us an example where the variables are preprocessed using center and scale. We will be applying the same methodology to this dataset.

ctrl <- trainControl(method = "cv", number = 10)
plsTune <- caret::train(
  train_X, 
  train_y,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl,
  preProc = c("center", "scale")
  )

plot(plsTune)

plsTune
## Partial Least Squares 
## 
## 132 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 118, 120, 118, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.19418  0.3355888  10.161782
##    2     12.08697  0.4154592   8.381873
##    3     12.02124  0.4257758   8.933843
##    4     12.09577  0.4278739   9.019170
##    5     12.03456  0.4320812   9.117138
##    6     12.02757  0.4359904   8.924423
##    7     11.79439  0.4504540   8.934411
##    8     11.96022  0.4447283   9.249347
##    9     11.99048  0.4590222   9.203734
##   10     12.19598  0.4459319   9.207656
##   11     12.35679  0.4335111   9.265978
##   12     12.55873  0.4201770   9.346663
##   13     12.54423  0.4188488   9.318062
##   14     12.54874  0.4155148   9.356758
##   15     12.80671  0.4068115   9.660144
##   16     13.22448  0.3897108   9.971214
##   17     13.65284  0.3846397  10.294842
##   18     14.37353  0.3591093  10.885709
##   19     14.75681  0.3479585  11.216480
##   20     15.34589  0.3247690  11.519261
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.

Based on the output above, the number of latent variables that is optimal is 7, and the corresponding resampled estimate of \(R^2\) is 0.475.

Question 6.2d Answer

plsPred1 <- predict(plsTune, test_X)
plsPred1 %>% head()
## [1] -3.826290 -3.826290 41.249095 -3.026750  4.053989  9.324345

The output above shows us the first 6 predicted values using the test set predictors.

plsValues1 <- data.frame(obs = test_y, pred = plsPred1)
defaultSummary(plsValues1)
##      RMSE  Rsquared       MAE 
## 9.0697275 0.6245837 7.1262647

The output above shows us that our test set R-squared came out to 0.625.

Question 6.2e Answer

We can try fitting a ridge regression model to the data use the same preprocessing methodology like what we did in Question 6.2c. What we will do is create a search grid for lambda, which finds the value for lambda which minimizes the RMSE in an iterative fashion.

ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))

set.seed(100)

ridgeRegFit <- train(
  train_X, 
  train_y, 
  method = "ridge",
  tuneGrid = ridgeGrid,
  trControl = ctrl,
  preProc = c("center", "scale")
  )
## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
ridgeRegFit
## Ridge Regression 
## 
## 132 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 118, 120, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.000000000   14.54432  0.3489843   10.29389
##   0.007142857  127.85644  0.2500175   74.30955
##   0.014285714  448.90916  0.2090538  288.22806
##   0.021428571   15.89967  0.3024838   11.34149
##   0.028571429   15.41820  0.3180304   11.07465
##   0.035714286   15.04636  0.3383254   10.75635
##   0.042857143   14.73696  0.3513392   10.59260
##   0.050000000   14.54815  0.3643678   10.46277
##   0.057142857   14.47153  0.3679564   10.44045
##   0.064285714   14.20348  0.3827026   10.28230
##   0.071428571   14.07164  0.3917995   10.20378
##   0.078571429   13.96364  0.3984985   10.14318
##   0.085714286   13.86500  0.4053690   10.08946
##   0.092857143   13.79269  0.4105608   10.05292
##   0.100000000   13.72585  0.4158997   10.01949
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

We provide the RMSE based on the best performing model as shown below.

ridgeRegPred1 <- predict(ridgeRegFit, test_X)
ridgeRegPred1 %>% head()
##         4         6         7        18        20        21 
## -1.236078 -1.236078 41.019991 -5.626989  1.897921  9.382436
ridgeRegValues1 <- data.frame(obs = test_y, pred = ridgeRegPred1)
defaultSummary(ridgeRegValues1)
##       RMSE   Rsquared        MAE 
## 10.2383550  0.5607206  8.0225647

Next model that we will be trying out is the elastic net model. Again we set up a search grid which searches for a value of lambda and fraction that minimizes the RMSE.

enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                        .fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(train_X, 
                  train_y, 
                  method = "enet",
                  tuneGrid = enetGrid, 
                  trControl = ctrl,
                  preProc = c("center", "scale")
                  )
## Warning: model fit failed for Fold05: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
enetTune
## Elasticnet 
## 
## 132 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 118, 120, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      12.02047  0.5189474   9.097294
##   0.00    0.10      11.60474  0.5200844   8.059818
##   0.00    0.15      11.93991  0.5014864   8.226068
##   0.00    0.20      12.06173  0.4843267   8.421519
##   0.00    0.25      12.08863  0.4735817   8.545991
##   0.00    0.30      12.17088  0.4671724   8.751755
##   0.00    0.35      12.27840  0.4586238   8.980117
##   0.00    0.40      12.46660  0.4457948   9.209953
##   0.00    0.45      12.64958  0.4335609   9.360218
##   0.00    0.50      12.86976  0.4211769   9.488794
##   0.00    0.55      13.08509  0.4098843   9.637639
##   0.00    0.60      13.28154  0.4015285   9.760444
##   0.00    0.65      13.48040  0.3940941   9.869420
##   0.00    0.70      13.63104  0.3878474   9.913044
##   0.00    0.75      13.66381  0.3884211   9.904921
##   0.00    0.80      13.72624  0.3879019   9.945552
##   0.00    0.85      13.88461  0.3811717  10.022249
##   0.00    0.90      14.00510  0.3759211  10.070932
##   0.00    0.95      14.23060  0.3650705  10.165363
##   0.00    1.00      14.54432  0.3489843  10.293891
##   0.01    0.05      11.81562  0.4918210   8.429058
##   0.01    0.10      11.65594  0.4783427   8.302625
##   0.01    0.15      11.75026  0.4705944   8.651031
##   0.01    0.20      12.16488  0.4516813   9.083736
##   0.01    0.25      12.48859  0.4416718   9.267566
##   0.01    0.30      12.83083  0.4278192   9.399609
##   0.01    0.35      13.35513  0.4023965   9.689018
##   0.01    0.40      13.88906  0.3769515   9.993169
##   0.01    0.45      14.31740  0.3564402  10.245366
##   0.01    0.50      14.64406  0.3382519  10.431969
##   0.01    0.55      14.96190  0.3216162  10.596338
##   0.01    0.60      15.25618  0.3074271  10.793407
##   0.01    0.65      15.54814  0.2950449  11.004723
##   0.01    0.70      15.83378  0.2849387  11.234173
##   0.01    0.75      16.13159  0.2753088  11.476382
##   0.01    0.80      16.41540  0.2688916  11.683412
##   0.01    0.85      16.67132  0.2645065  11.888190
##   0.01    0.90      16.87614  0.2622441  12.055586
##   0.01    0.95      17.01145  0.2616895  12.156351
##   0.01    1.00      17.13338  0.2610388  12.243228
##   0.10    0.05      12.24076  0.4993314   9.329858
##   0.10    0.10      11.65844  0.5049855   8.301089
##   0.10    0.15      11.64699  0.4992936   8.254943
##   0.10    0.20      11.57167  0.4954713   8.281789
##   0.10    0.25      11.58661  0.4926245   8.391096
##   0.10    0.30      11.74659  0.4860902   8.584354
##   0.10    0.35      11.86992  0.4817331   8.677382
##   0.10    0.40      12.00345  0.4792977   8.746524
##   0.10    0.45      12.13955  0.4762629   8.806233
##   0.10    0.50      12.27736  0.4738214   8.878685
##   0.10    0.55      12.42115  0.4695951   8.982025
##   0.10    0.60      12.55246  0.4656409   9.077755
##   0.10    0.65      12.71429  0.4586517   9.182696
##   0.10    0.70      12.88703  0.4504408   9.313726
##   0.10    0.75      13.02956  0.4438379   9.433124
##   0.10    0.80      13.16818  0.4382233   9.553909
##   0.10    0.85      13.32382  0.4319150   9.692010
##   0.10    0.90      13.46948  0.4261461   9.816405
##   0.10    0.95      13.60155  0.4208544   9.924682
##   0.10    1.00      13.72585  0.4158997  10.019491
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.1.

We provide the RMSE based on the best performing model as shown below.

enetTunePred1 <- predict(enetTune, test_X)
enetTunePred1 %>% head()
##          4          6          7         18         20         21 
## -3.4935058 -3.4935058 37.9875690 -0.7016277  2.2593745  9.6945005
enetTuneValues1 <- data.frame(obs = test_y, pred = enetTunePred1)
defaultSummary(enetTuneValues1)
##       RMSE   Rsquared        MAE 
## 10.4621340  0.5034596  8.1426555

The R-squared and RMSE values for the PLS model are 9.0697275 and 0.6245837 respectively. The other two models that we have generated in this question are unable to beat these metrics, which indicates that we do not have better predictive performance across these two models.

Question 6.2f Answer

The PLS model performs the best out of all of the models that were tested in Question 6.2e, therefore, I would probably continue to use the PLS model for the permeability laboratory experiment.

Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

  2. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

  3. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

  4. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

  5. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Question 6.3b Answer

Using the methodology explained in Section 3.8, we impute using the code chunk below.

impute <- ChemicalManufacturingProcess %>%
  preProcess("knnImpute")

imputed <- predict(impute, ChemicalManufacturingProcess)

Question 6.3c Answer

Let’s use that nearzerovar function that we used in the previous problem. Let’s split the data into X and y.

X <- imputed %>% select(-Yield)
y = imputed$Yield
high_freq_predictors <- X[,-nearZeroVar(X)]
dim(high_freq_predictors)
## [1] 176  56

We create the training and test set in the code chunk below.

set.seed(1)
sample <- sample.split(y, SplitRatio = 0.8)

train_X  <- subset(high_freq_predictors, sample == TRUE) %>% as.matrix()
test_X   <- subset(high_freq_predictors, sample == FALSE) %>% as.matrix()
train_y <- subset(y, sample == TRUE) %>% as.vector()
test_y <- subset(y, sample == FALSE) %>% as.vector()

dim(train_X)
## [1] 140  56
dim(test_X)
## [1] 36 56
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- caret::train(
  train_X, 
  train_y,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl,
  preProc = c("center", "scale")
  )

plot(plsTune)

plsTune
## Partial Least Squares 
## 
## 140 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 125, 125, 125, 126, 127, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8815812  0.4115193  0.6873771
##    2     1.2940157  0.4451745  0.7666037
##    3     0.8434375  0.5089979  0.6138395
##    4     0.7695028  0.5882731  0.5817554
##    5     0.8551909  0.5714961  0.6158262
##    6     0.9170698  0.5524309  0.6363787
##    7     0.9830327  0.5561862  0.6583479
##    8     1.0931773  0.5293409  0.7067873
##    9     1.3519114  0.4881377  0.7895779
##   10     1.5499833  0.4747700  0.8514642
##   11     1.7507091  0.4785277  0.9063816
##   12     1.7846913  0.4849381  0.9095140
##   13     1.7866934  0.4827298  0.9164390
##   14     1.7872606  0.4926111  0.9123233
##   15     1.8224340  0.4956915  0.9212411
##   16     1.8833359  0.4896128  0.9426463
##   17     1.9042100  0.4866516  0.9497834
##   18     1.9484455  0.4873739  0.9646124
##   19     2.0652901  0.4834424  0.9979863
##   20     2.1162865  0.4825031  1.0117810
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.

The output above shows us that the iteration with the lowest RMSE was at 4, it is at this iteration that the RMSE is at its lowest at 0.77. The R-squared and MAE are 0.59 and 0.58 respectively.

Question 6.3d Answer

plsPred1 <- predict(plsTune, test_X)
plsValues1 <- data.frame(obs = test_y, pred = plsPred1)
defaultSummary(plsValues1)
##      RMSE  Rsquared       MAE 
## 0.7171429 0.4864144 0.6090410

The RMSE for the test set is not that far off from the RMSE from cross-validation, indicating that the model is doing a decent job at predicting data that it has never seen before.

Question 6.3e Answer

The caret package includes a function for calculating feature importance, varImp, which is what we use to calculate and plot feature importance below.

ggplot(varImp(plsTune), top = 10)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:corrplot':
## 
##     corrplot
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings

We have the top 10 predictors that are most important in the PLS model that was created. It would seem that the manufacturing processes dominate the list, since the top 6 predictors are manufacturing processes.

Question 6.3f Answer

top_10_predictors_row_names <- varImp(plsTune)$importance %>%
  arrange(desc(Overall)) %>%
  head(10) %>%
  row.names() %>%
  as.vector()

top_10_predictors_df = subset(high_freq_predictors) %>%
  select(top_10_predictors_row_names)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(top_10_predictors_row_names)
## 
##   # Now:
##   data %>% select(all_of(top_10_predictors_row_names))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
top_10_predictors_df$Yield = y

We can explore the relationships between the predictors with a correlation plot.

corrplot::corrplot(cor(dplyr::select_if(top_10_predictors_df, is.numeric), use = "na.or.complete"),
         method = 'number',
         type = 'lower',
         diag = FALSE,
         number.cex = 0.75,
         tl.cex = 0.75)

Some of the processes above have a negative correlation with yield, while others have a positive yield. As we can see from the output above, Manufacturing Processes 13, 17, and 36 have a negative correlation. This means that yield might increase if these manufacturing processes would decrease, while we would have in increase in all of the other processes, so we would probably have to reallocate resources accordingly. The inverse might also be true too, However, with that being said, there could be another model type that we didn’t explore in this question that gives us a totally different plot. So we should be cognizant of that including the fact that we also have to worry about budgetary constraints, manufacturing hiccups, etc. But based on this plot, we could investigate those three variables that have a negative correlation with yield.