Assignment: Exercises 6.2 and 6.3 from the KJ textbook
library(caTools)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
##
## R2
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
fingerprints originally has 1,107 predictors, and applying nearZeroVar leaves us with 388 predictors. We also convert the fingerprints matrix into a data frame and append the permeability response variable to it.
ncol(fingerprints)
## [1] 1107
f2 <- as.data.frame(fingerprints[, -nearZeroVar(fingerprints)])
ncol(f2)
## [1] 388
f2$permeability <- permeability
We do an 80/20 split of the dataset for training and testing our model. The resampled estimate of R^2 for the optimal PLS model is 0.386.
set.seed(101)
sample2 = sample.split(f2$X1, SplitRatio = .8)
train2 = subset(f2, sample2 == TRUE)
test2 = subset(f2, sample2 == FALSE)
train2.X = subset(train2, select = -permeability)
train2.y = as.double(train2[,389])
set.seed(42)
cv_5 = trainControl(method = "cv", number = 5)
pls = train(train2.X, train2.y,
method = "pls",
trControl = cv_5,
tuneLength = 20,
preProc = c("center", "scale")
)
pls
## Partial Least Squares
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 106, 106, 105, 106, 105
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.11285 0.3023623 10.202646
## 2 12.34325 0.3857531 8.782778
## 3 12.87472 0.3257454 9.762851
## 4 13.46234 0.2861761 10.196898
## 5 13.15464 0.3269827 9.863930
## 6 12.61933 0.3674958 9.483667
## 7 12.38542 0.3880415 9.574450
## 8 12.38560 0.3918038 9.452560
## 9 13.26281 0.3313283 10.222422
## 10 13.80736 0.2930231 10.623743
## 11 13.89955 0.2912486 10.633719
## 12 14.14672 0.2775490 10.710961
## 13 14.78525 0.2409274 11.125034
## 14 14.89994 0.2336614 11.178475
## 15 15.46838 0.1951437 11.555663
## 16 15.89752 0.1847873 11.934445
## 17 16.21158 0.1724270 12.238368
## 18 16.40462 0.1655988 12.336353
## 19 16.54381 0.1525942 12.373425
## 20 16.75260 0.1498705 12.464773
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
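The cross-validation profile can also be inspected visually with caret's plot method for train objects (the plot itself is not shown here):
plot(pls)  # resampled RMSE versus number of PLS components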
The test set estimate of R^2 is 0.676, which is higher than the 0.386 resampled estimate we got when training the model.
test2.y <- as.double(test2[,389])
results2 <- predict(pls, test2)
postResample(results2, test2.y)
## RMSE Rsquared MAE
## 11.6102518 0.6761989 8.6112739
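As an optional visual check of the test-set fit, the predicted values can be plotted against the observed permeability (plot not shown here):
plot(test2.y, results2,
xlab = "Observed permeability", ylab = "Predicted permeability")
abline(0, 1, lty = 2)  # points near this line indicate close agreement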
We also fit an elastic net model. Tuning selected a model that is roughly an equal mix of ridge and lasso (alpha = 0.55), with a resampled R^2 of 0.436 compared with the 0.386 we got when training the PLS model. On the test set, however, its R^2 of 0.611 is slightly lower than the 0.676 we saw with PLS.
fenet = train(
permeability~ ., data = train2,
method = "glmnet",
trControl = cv_5
)
fenet
## glmnet
##
## 132 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 106, 107, 105, 105, 105
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.5675973 12.03826 0.4014247 8.875618
## 0.10 1.7949004 11.58749 0.4232663 8.490708
## 0.10 5.6759734 11.45569 0.4372767 8.023007
## 0.55 0.5675973 12.04758 0.3842069 8.992382
## 0.55 1.7949004 11.42972 0.4356371 7.962093
## 0.55 5.6759734 11.90777 0.4390952 8.632116
## 1.00 0.5675973 11.69898 0.4088869 8.557344
## 1.00 1.7949004 11.43925 0.4393444 7.900447
## 1.00 5.6759734 13.09008 0.4259580 10.027339
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.55 and lambda = 1.7949.
results3 <- predict(fenet, test2)
postResample(results3, test2.y)
## RMSE Rsquared MAE
## 12.0358969 0.6105601 8.6806142
Either of these models would be a reasonable recommendation, but a deeper evaluation would be needed to decide which is the better fit, as sketched below.
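One hedged starting point for that comparison is caret's resamples(), which collects the cross-validation results of both models side by side; this is most meaningful when both models share the same fold assignments (for example, by fixing the resampling index in trainControl), which we did not enforce above:
model_comp <- resamples(list(PLS = pls, ElasticNet = fenet))
summary(model_comp)  # side-by-side RMSE, Rsquared, and MAE distributions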
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
data("ChemicalManufacturingProcess")
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
We use KNN imputation, which fills in a missing value with a weighted average of the values from the most similar (nearest-neighbor) observations. We impute missing values in the predictors only and leave the response variable, Yield, untouched.
cmp <- ChemicalManufacturingProcess
# Impute the predictors only; Yield (the first column) has no missing values and is left as-is.
cmp[, -1] <- knnImputation(cmp[, !names(cmp) %in% "Yield"])
anyNA(cmp)
## [1] FALSE
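For reference, the extent of missingness in the original data can be summarized as follows; this is only a quick check and does not affect the model:
sum(is.na(ChemicalManufacturingProcess))  # total number of missing values
names(which(colSums(is.na(ChemicalManufacturingProcess)) > 0))  # columns with missing values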
First, we split our data, using 80% of it to train the model and holding out the rest for our test set.
set.seed(101)
sample = sample.split(cmp$BiologicalMaterial01, SplitRatio = .8)
train = subset(cmp, sample == TRUE)
test = subset(cmp, sample == FALSE)
Next, we use 5-fold cross-validation to tune an elastic net model on the training set. Alpha controls the mix between the ridge and lasso penalties, and lambda controls the amount of penalization. The optimal model minimizes the cross-validated RMSE at 1.24, with an R^2 of 0.59, at alpha = 1 and lambda = 0.22. Since alpha = 1, this optimal model is actually a lasso regression.
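Concretely, glmnet minimizes a scaled residual sum of squares plus the penalty lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|)), so alpha = 0 gives pure ridge and alpha = 1 gives pure lasso. By default caret searches only a small grid of alpha and lambda values; a finer grid could be supplied explicitly, as in the illustrative sketch below (enet_grid and its values are assumptions, not what is used for the model that follows):
enet_grid <- expand.grid(alpha = seq(0, 1, by = 0.25),
lambda = 10^seq(-3, 0, length.out = 10))
# elnet_fine <- train(Yield ~ ., data = train, method = "glmnet",
#                     trControl = cv_5, tuneGrid = enet_grid)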
set.seed(42)
cv_5 = trainControl(method = "cv", number = 5)
elnet = train(
Yield ~ ., data = train,
method = "glmnet",
trControl = cv_5
)
elnet
## glmnet
##
## 140 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 112, 112, 112, 112, 112
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.002214191 6.090344 0.3463700 2.029516
## 0.10 0.022141914 4.294448 0.4891488 1.571158
## 0.10 0.221419144 2.388345 0.5062283 1.174329
## 0.55 0.002214191 5.253761 0.3646390 1.846019
## 0.55 0.022141914 4.013140 0.4989420 1.483398
## 0.55 0.221419144 1.492739 0.5313145 1.045691
## 1.00 0.002214191 4.829122 0.3904685 1.745341
## 1.00 0.022141914 3.872670 0.5016430 1.446879
## 1.00 0.221419144 1.243313 0.5961836 1.015103
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2214191.
The test set RMSE is 1.319, only slightly higher than the 1.24 resampled RMSE from training. The test set R^2 is 0.397, which is lower than the 0.59 we got from training the model.
test.y <- test[,1]
results <- predict(elnet, test)  # caret predicts with the tuned alpha and lambda
postResample(results, test.y)
## RMSE Rsquared MAE
## 1.319225 0.397440 1.047190
There are 3 biological and 7 process predictors in our model. The strongest predictors are manufacturing processes, specifically 9, 17 and 32.
coef(elnet$finalModel, elnet$finalModel$lambdaOpt)
## 58 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.846801e+01
## BiologicalMaterial01 .
## BiologicalMaterial02 1.389335e-03
## BiologicalMaterial03 .
## BiologicalMaterial04 .
## BiologicalMaterial05 2.122102e-03
## BiologicalMaterial06 .
## BiologicalMaterial07 .
## BiologicalMaterial08 4.478227e-02
## BiologicalMaterial09 .
## BiologicalMaterial10 .
## BiologicalMaterial11 .
## BiologicalMaterial12 .
## ManufacturingProcess01 .
## ManufacturingProcess02 .
## ManufacturingProcess03 .
## ManufacturingProcess04 .
## ManufacturingProcess05 .
## ManufacturingProcess06 1.960422e-02
## ManufacturingProcess07 .
## ManufacturingProcess08 .
## ManufacturingProcess09 3.001602e-01
## ManufacturingProcess10 .
## ManufacturingProcess11 .
## ManufacturingProcess12 .
## ManufacturingProcess13 -9.270081e-02
## ManufacturingProcess14 .
## ManufacturingProcess15 .
## ManufacturingProcess16 .
## ManufacturingProcess17 -2.200624e-01
## ManufacturingProcess18 .
## ManufacturingProcess19 .
## ManufacturingProcess20 .
## ManufacturingProcess21 .
## ManufacturingProcess22 .
## ManufacturingProcess23 .
## ManufacturingProcess24 -3.283492e-03
## ManufacturingProcess25 .
## ManufacturingProcess26 .
## ManufacturingProcess27 .
## ManufacturingProcess28 .
## ManufacturingProcess29 .
## ManufacturingProcess30 .
## ManufacturingProcess31 .
## ManufacturingProcess32 1.160302e-01
## ManufacturingProcess33 .
## ManufacturingProcess34 .
## ManufacturingProcess35 .
## ManufacturingProcess36 -2.307180e+02
## ManufacturingProcess37 .
## ManufacturingProcess38 .
## ManufacturingProcess39 .
## ManufacturingProcess40 .
## ManufacturingProcess41 .
## ManufacturingProcess42 .
## ManufacturingProcess43 .
## ManufacturingProcess44 .
## ManufacturingProcess45 .
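Since the elastic net above was fit without centering and scaling (see "No pre-processing" in its summary), the raw coefficient magnitudes reflect each predictor's original units and are only a rough guide to relative importance. As an optional, illustrative check (a sketch, not the model used above), the same elastic net could be refit with centered and scaled predictors and its standardized coefficients compared:
# set.seed(42)
# elnet_std <- train(Yield ~ ., data = train, method = "glmnet",
#                    trControl = cv_5, preProc = c("center", "scale"))
# coef(elnet_std$finalModel, elnet_std$finalModel$lambdaOpt)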
As previously mentioned, the strongest predictors of yield are manufacturing processes 9, 17, and 32, with process 17 having a negative effect on yield. Emphasizing process steps like 9 and 32, and limiting ones like 17, could help improve the yield of this pharmaceutical product. Some biological materials also had relatively strong effects on yield, but those predictors cannot be changed.