library(AppliedPredictiveModeling)
library(tidyverse)
library(MASS)
library(caret)
library(pls)
library(Amelia)

6.2

Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

  2. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
nzv <- nearZeroVar(fingerprints)  # indices of near-zero-variance predictors
not_nzv <- fingerprints[, -nzv]   # drop them
ncol(not_nzv)
## [1] 388

There are 388 predictors remaining for modeling.
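
As a quick sanity check (a sketch; by construction this should return 0, since every flagged column was dropped):

length(nearZeroVar(not_nzv))  # expected: 0 near-zero-variance predictors remain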

  3. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(456)
split <- permeability %>%
  createDataPartition(p = 0.8, list = FALSE, times = 1)

Xtrain.data <- not_nzv[split, ]       # fingerprints train
xtest.data  <- not_nzv[-split, ]      # fingerprints test
Ytrain.data <- permeability[split, ]  # permeability train
ytest.data  <- permeability[-split, ] # permeability test
ctrl <- trainControl(method = "cv", number = 10)
plsmod <- train(x = Xtrain.data, y = Ytrain.data, method = "pls",
                tuneLength = 20, trControl = ctrl,
                preProc = c("center", "scale"))
plsmod
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 119, 120, 121, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.59309  0.3188302   9.598363
##    2     11.36710  0.4595255   7.749247
##    3     11.07388  0.4746904   8.057209
##    4     11.16464  0.4785656   8.382748
##    5     11.13931  0.4855600   8.043630
##    6     11.02535  0.5135675   7.981960
##    7     11.06765  0.5073874   8.127469
##    8     10.98450  0.5229158   8.197463
##    9     11.23015  0.5068022   8.431096
##   10     11.49945  0.4834073   8.591727
##   11     11.83014  0.4725379   8.710300
##   12     11.89082  0.4662821   8.809819
##   13     12.04049  0.4616730   8.933141
##   14     12.42299  0.4413192   9.361306
##   15     12.56859  0.4383558   9.479632
##   16     12.96051  0.4211730   9.635792
##   17     13.27607  0.4045419   9.846898
##   18     13.53374  0.4034482  10.047704
##   19     13.57828  0.4068468  10.111798
##   20     13.62695  0.4053035  10.088053
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(plsmod)

plsmod$bestTune
##   ncomp
## 8     8

The best tuning parameter is ncomp = 8, which minimizes the cross-validation RMSE, our resampled estimate of the test error. The corresponding resampled estimate of R2 is about 0.52.
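
The resampled metrics at the chosen setting can also be pulled directly from the train object (a small sketch using caret's results table):

subset(plsmod$results, ncomp == 8, select = c(ncomp, RMSE, Rsquared))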

summary(plsmod$finalModel)
## Data:    X dimension: 133 388 
##  Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 8
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           22.26    34.05    39.74    44.27    51.07    57.89    60.95
## .outcome    34.32    53.35    59.86    65.94    70.01    73.21    76.35
##           8 comps
## X           63.58
## .outcome    78.65

The optimal number of latent variables (PLS components) included in the model is 8. These components capture 63.58% of the variation in the predictors and 78.65% of the variation in the outcome variable (permeability).
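
The same per-component figures can be obtained programmatically: the pls package's explvar function returns the percentage of X variance explained by each component of the fitted model (a sketch; the cumulative sum should match the 63.58% reported above):

cumsum(explvar(plsmod$finalModel))  # cumulative % of predictor variance per component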

  4. Predict the response for the test set. What is the test set estimate of R2?
predictions <- plsmod %>% predict(xtest.data)

cbind(
  RMSE = RMSE(predictions, ytest.data),
  R_squared = caret::R2(predictions, ytest.data)
)
##          RMSE R_squared
## [1,] 11.84771 0.5312741
The test set estimate of R2 is about 0.53, with an RMSE of 11.85.

plot(ytest.data, predictions, col = "darkgreen",
     main = "Observed vs. Predicted (Permeability)",
     xlab = "Observed", ylab = "Predicted")
abline(0, 1, col = "orange")

  5. Try building other models discussed in this chapter. Do any have better predictive performance?
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15)) 
set.seed(100) 
ridgeRegFit <- train(Xtrain.data, Ytrain.data, method = "ridge",
                     tuneGrid = ridgeGrid, trControl = ctrl,
                     preProc = c("center", "scale", "knnImpute"))
## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold08: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold10: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
ridgeRegFit 
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388), nearest neighbor imputation (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE       
##   0.000000000   17.18761  0.3028216   11.682151
##   0.007142857  175.87079  0.2366887  115.675441
##   0.014285714   14.15846  0.3609925   10.354749
##   0.021428571   13.57247  0.3852815    9.934347
##   0.028571429   38.59403  0.3648867   28.001622
##   0.035714286   12.90435  0.4153151    9.387562
##   0.042857143   12.75958  0.4207740    9.236959
##   0.050000000   12.65226  0.4262812    9.150038
##   0.057142857   12.62771  0.4296830    9.178311
##   0.064285714   12.43073  0.4368335    8.976391
##   0.071428571   12.40746  0.4396622    8.968032
##   0.078571429   12.49930  0.4391171    9.081732
##   0.085714286   12.26692  0.4469522    8.847182
##   0.092857143   12.20159  0.4506149    8.799834
##   0.100000000   12.17273  0.4531507    8.777437
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
plot(ridgeRegFit)

The model failed to fit in several resampling folds at lambda = 0. This is expected: lambda = 0 corresponds to an unpenalized least-squares fit, and with far more predictors (388) than training samples (133) that fit is ill-conditioned.
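
One way to avoid those failures is to keep lambda strictly positive. A minimal sketch (ridgeGridPos and ridgeRegFitPos are hypothetical names; the knnImpute step is also dropped on the assumption that the binary fingerprint matrix has no missing values):

ridgeGridPos <- data.frame(.lambda = seq(0.01, 0.1, length = 15))
set.seed(100)
ridgeRegFitPos <- train(Xtrain.data, Ytrain.data, method = "ridge",
                        tuneGrid = ridgeGridPos, trControl = ctrl,
                        preProc = c("center", "scale"))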

predictions2 <- ridgeRegFit %>% predict(xtest.data)

cbind(
  RMSE = RMSE(predictions2, ytest.data),
  R_squared = caret::R2(predictions2, ytest.data)
)
##          RMSE R_squared
## [1,] 12.65672  0.464324
plot(ytest.data, predictions2, ylab = "Predictions", xlab = "Observed", main = "Observed vs Predicted")
abline(0, 1, col = 'red')

  6. Would you recommend any of your models to replace the permeability laboratory experiment?

The PLS model performed better on this data: its test-set RMSE (11.85) is lower and its R2 (0.53) higher than the ridge model's (12.66 and 0.46). Even so, with only about half of the test-set variance explained, I would recommend PLS as a pre-screening tool to prioritize compounds rather than as a replacement for the laboratory experiment.
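
For reference, the two test-set results can be tabulated side by side, reusing the predictions computed above:

rbind(
  PLS   = c(RMSE = RMSE(predictions, ytest.data),
            R2 = caret::R2(predictions, ytest.data)),
  Ridge = c(RMSE = RMSE(predictions2, ytest.data),
            R2 = caret::R2(predictions2, ytest.data))
)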

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
 data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  2. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).
missmap(ChemicalManufacturingProcess, col = c("yellow", "navy"))

The missingness map shows that several predictors contain missing values.
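
To quantify this (a sketch; counts the missing cells overall and per column):

sum(is.na(ChemicalManufacturingProcess))  # total missing cells
head(sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE))  # worst-affected predictors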

Below I will preprocess the data. This includes:

  1. centering and scaling the data
  2. using knn imputation to replace missing values
  3. using corr to filter out highly correlated predictors
  4. using nzv to filter out near-zero-variance predictors that could cause trouble
#preprocess the data, excluding the yield column
preprocessing <- preProcess(ChemicalManufacturingProcess[,-1],
                            method = c("center", "scale", "knnImpute", "corr", "nzv"))

Xpreprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])
missmap(Xpreprocess, col = c("yellow", "navy"))

As seen in this second plot, the missing values were replaced and the data is now complete.
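
A programmatic check alongside the map (sketch):

sum(is.na(Xpreprocess))  # expected: 0 after imputation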

  3. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
yield <- as.matrix(ChemicalManufacturingProcess$Yield)

set.seed(789)
split2 <- yield %>%
  createDataPartition(p = 0.8, list = FALSE, times = 1)

Xtrain.data2  <- Xpreprocess[split2, ] #chem train
xtest.data2 <- Xpreprocess[-split2, ] #chem test
Ytrain.data2  <- yield[split2, ] #yield train
ytest.data2 <- yield[-split2, ] #yield test
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15)) 
set.seed(101) 
# reuse the 10-fold CV control from 6.2; the data were already pre-processed above
ridgeRegFit2 <- train(Xtrain.data2, Ytrain.data2, method = "ridge",
                      tuneGrid = ridgeGrid, trControl = ctrl)
ridgeRegFit2
## Ridge Regression 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 129, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE     
##   0.000000000  5.576669  0.4660474  2.378677
##   0.007142857  2.741742  0.5320202  1.542394
##   0.014285714  2.653634  0.5305896  1.518070
##   0.021428571  2.596637  0.5317464  1.500523
##   0.028571429  2.547937  0.5333929  1.485373
##   0.035714286  2.504615  0.5352293  1.471001
##   0.042857143  2.465688  0.5371422  1.457550
##   0.050000000  2.430505  0.5390638  1.445037
##   0.057142857  2.398538  0.5409488  1.433566
##   0.064285714  2.369348  0.5427668  1.423108
##   0.071428571  2.342569  0.5444987  1.413353
##   0.078571429  2.317895  0.5461338  1.404434
##   0.085714286  2.295071  0.5476671  1.396175
##   0.092857143  2.273883  0.5490979  1.388738
##   0.100000000  2.254150  0.5504282  1.381890
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
plot(ridgeRegFit2)

Optimal value:

The lowest point on the curve indicates the optimal lambda: the penalty value that minimized the error in cross-validation. We can extract this value as:

ridgeRegFit2$bestTune
##    lambda
## 15    0.1
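
Note that the chosen lambda (0.1) sits at the upper edge of the tuning grid, so the true optimum may lie beyond it. A hedged sketch of a wider search (ridgeGridWide and ridgeRegFitWide are hypothetical names; results not shown here):

ridgeGridWide <- data.frame(.lambda = seq(0, 0.5, length = 15))
set.seed(101)
ridgeRegFitWide <- train(Xtrain.data2, Ytrain.data2, method = "ridge",
                         tuneGrid = ridgeGridWide, trControl = ctrl)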

  4. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
predictions3 <- ridgeRegFit2 %>% predict(xtest.data2)

cbind(
  RMSE = RMSE(predictions3, ytest.data2),
  R_squared = caret::R2(predictions3, ytest.data2)
)
##         RMSE R_squared
## [1,] 1.06489 0.5657402

The test-set RMSE (1.06) is well below the resampled training RMSE (2.25), and the test-set R2 (0.57) is slightly above the resampled estimate (0.55), so the model performs at least as well on the held-out data as resampling suggested.
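
The resampled figures for the final model can be pulled with caret's getTrainPerf for a direct comparison:

getTrainPerf(ridgeRegFit2)  # resampled RMSE/Rsquared/MAE at lambda = 0.1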

  5. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
varImp(ridgeRegFit2)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess32   97.82
## ManufacturingProcess17   86.84
## BiologicalMaterial06     83.64
## BiologicalMaterial03     78.13
## ManufacturingProcess09   72.37
## BiologicalMaterial12     72.20
## ManufacturingProcess36   70.51
## BiologicalMaterial02     63.10
## ManufacturingProcess06   61.88
## ManufacturingProcess31   58.39
## BiologicalMaterial11     56.85
## ManufacturingProcess33   47.06
## ManufacturingProcess11   45.94
## BiologicalMaterial04     45.43
## ManufacturingProcess29   44.71
## BiologicalMaterial08     44.36
## ManufacturingProcess12   38.22
## BiologicalMaterial01     35.56
## BiologicalMaterial09     33.79
plot(varImp(ridgeRegFit2))

Based on the plot and the values displayed, the manufacturing process predictors dominate the list: they occupy the top three positions and account for 11 of the 20 most important variables.
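
The split can also be counted directly from the importance scores (a sketch using caret's varImp output):

imp <- varImp(ridgeRegFit2)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))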

  6. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

For this question, I'll only look at the top-ranked predictor from each group: ManufacturingProcess13 for the manufacturing processes and BiologicalMaterial06 for the biological materials.

cor(yield, ChemicalManufacturingProcess$ManufacturingProcess13)
##            [,1]
## [1,] -0.5036797
cor(yield, ChemicalManufacturingProcess$BiologicalMaterial06)
##           [,1]
## [1,] 0.4781634

As stated in the intro to this question, biological materials are used to assess the quality of the raw materials before processing. If that quality is good, the yield of the product may increase. Looking at the top biological material (BiologicalMaterial06), we can see that it is positively but moderately correlated with the response (r = 0.48).

Manufacturing process predictors, on the other hand, describe steps that can be changed during production. The top process predictor (ManufacturingProcess13) is moderately negatively correlated with yield (r = -0.50), which makes sense: a poorly performing process step drags the yield down. Because process predictors are controllable while biological predictors are not, these relationships suggest concrete levers for improving yield in future runs, starting with ManufacturingProcess13.
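
A quick way to eyeball both relationships (a sketch; the side-by-side layout is my choice, not from the original analysis):

par(mfrow = c(1, 2))
plot(ChemicalManufacturingProcess$ManufacturingProcess13, yield,
     xlab = "ManufacturingProcess13", ylab = "Yield")
plot(ChemicalManufacturingProcess$BiologicalMaterial06, yield,
     xlab = "BiologicalMaterial06", ylab = "Yield")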