Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.
Start R and use these commands to load the data. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
## [1] 165 1107
## [1] 165 1
## [1] 165 388
After filtering with the nearZeroVar function, 388 predictors remain for modeling.
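A minimal sketch of the filtering step (mirroring the code in the appendix):
library(caret)
nzfinger <- nearZeroVar(fingerprints)       # indices of near-zero-variance columns
fingerprints <- fingerprints[, -nzfinger]   # drop them
dim(fingerprints)                           # 165 x 388 after filtering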
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
## Data: X dimension: 115 388
## Y dimension: 115 1
## Fit method: kernelpls
## Number of components considered: 114
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 28.66 43.95 50.20 54.22 63.43 66.99 69.60
## permeability 29.08 48.84 57.08 64.71 67.70 72.88 75.88
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 71.93 74.37 76.68 79.06 80.68 82.37
## permeability 78.26 80.64 82.24 83.36 84.99 86.35
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## X 83.84 85.98 87.28 88.11 89.59 90.16
## permeability 87.53 88.18 89.04 89.91 90.28 91.01
## 20 comps 21 comps 22 comps 23 comps 24 comps 25 comps
## X 90.90 91.55 92.41 92.90 93.41 94.03
## permeability 91.42 91.82 92.06 92.41 92.70 92.96
## 26 comps 27 comps 28 comps 29 comps 30 comps 31 comps
## X 94.43 94.80 95.02 95.30 95.52 95.78
## permeability 93.30 93.54 93.76 93.88 93.99 94.07
## 32 comps 33 comps 34 comps 35 comps 36 comps 37 comps
## X 96.03 96.27 96.43 96.61 96.89 97.07
## permeability 94.16 94.25 94.34 94.41 94.45 94.50
## 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## X 97.26 97.42 97.60 97.74 97.85 98.01
## permeability 94.54 94.61 94.66 94.69 94.72 94.73
## 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps
## X 98.20 98.33 98.44 98.52 98.61 98.68
## permeability 94.74 94.76 94.78 94.79 94.80 94.82
## 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## X 98.78 98.85 98.93 99.05 99.10 99.16
## permeability 94.82 94.84 94.85 94.86 94.87 94.88
## 56 comps 57 comps 58 comps 59 comps 60 comps 61 comps
## X 99.23 99.29 99.35 99.41 99.44 99.47
## permeability 94.89 94.89 94.90 94.90 94.91 94.91
## 62 comps 63 comps 64 comps 65 comps 66 comps 67 comps
## X 99.52 99.56 99.61 99.64 99.67 99.70
## permeability 94.91 94.92 94.92 94.92 94.92 94.92
## 68 comps 69 comps 70 comps 71 comps 72 comps 73 comps
## X 99.73 99.75 99.78 99.80 99.83 99.84
## permeability 94.92 94.92 94.92 94.92 94.92 94.92
## 74 comps 75 comps 76 comps 77 comps 78 comps 79 comps
## X 99.86 99.88 99.89 99.90 99.92 99.93
## permeability 94.92 94.92 94.92 94.92 94.92 94.92
## 80 comps 81 comps 82 comps 83 comps 84 comps 85 comps
## X 99.95 99.96 99.97 99.97 99.98 99.99
## permeability 94.92 94.92 94.92 94.92 94.92 94.92
## 86 comps 87 comps 88 comps 89 comps 90 comps 91 comps
## X 99.99 100.00 100.00 100.29 100.59 100.88
## permeability 94.92 94.92 94.92 94.92 94.92 94.92
## 92 comps 93 comps 94 comps 95 comps 96 comps 97 comps
## X 101.18 101.47 101.76 102.06 102.35 102.65
## permeability 94.92 94.92 94.92 94.92 94.92 94.92
## 98 comps 99 comps 100 comps 101 comps 102 comps 103 comps
## X 102.94 103.23 103.53 103.82 104.12 104.41
## permeability 94.92 94.92 94.92 94.92 94.92 94.92
## 104 comps 105 comps 106 comps 107 comps 108 comps 109 comps
## X 104.71 105.00 105.29 105.59 105.88 106.18
## permeability 94.92 94.92 94.92 94.92 94.92 94.92
## 110 comps 111 comps 112 comps 113 comps 114 comps
## X 106.47 106.77 107.06 107.36 107.65
## permeability 94.92 94.92 94.92 94.92 94.92
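The training summary above does not report resampled performance. A sketch of how the optimal number of latent variables and its cross-validated R2 could be read off the caret fit (assuming the plsTune object defined in the appendix):
plsTune$bestTune   # ncomp chosen by smallest resampled RMSE
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)[, c("ncomp", "RMSE", "Rsquared")]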
Predict the response for the test set. What is the test set estimate of R2?
## [1] 25.12240 25.48017 26.87275 25.47519 26.53963 26.86371
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.178e+09 1.000e+00 1.000e+01 1.910e+06 3.100e+01 2.481e+09
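The summary above pools predictions from every number of components, including the unstable high-component solutions, which is why the range is so extreme. A sketch of the test-set R2 at the tuned number of components (assuming the plsfit, plsTune, and testf objects from the appendix):
plsPred <- predict(plsfit, testf, ncomp = plsTune$bestTune$ncomp)
caret::postResample(pred = as.vector(plsPred), obs = testf$permeability)   # "Rsquared" is the test-set R2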
Try building other models discussed in this chapter. Do any have better predictive performance?
## Length Class Mode
## call 4 -none- call
## actions 338 -none- list
## allset 389 -none- numeric
## beta.pure 131482 -none- numeric
## vn 389 -none- character
## mu 1 -none- numeric
## normx 389 -none- numeric
## meanx 389 -none- numeric
## lambda 1 -none- numeric
## L1norm 338 -none- numeric
## penalty 338 -none- numeric
## df 338 -none- numeric
## Cp 338 -none- numeric
## sigma2 1 -none- numeric
## Length Class Mode
## s 1 -none- numeric
## fraction 1 -none- numeric
## mode 1 -none- character
## fit 50 -none- numeric
## 116 117 118 119 120 121
## 52.11614 51.44953 51.18513 29.97141 22.01058 28.79457
## Ridge Regression
##
## 115 samples
## 389 predictors
##
## Pre-processing: centered (389), scaled (389)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 104, 103, 103, 104, 104, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 4.663100e-14 1.0000000 3.544018e-14
## 0.007142857 2.313521e+03 0.6283895 1.213060e+03
## 0.014285714 1.760807e+01 0.8038951 1.400285e+01
## 0.021428571 7.068826e+00 0.7907931 4.969317e+00
## 0.028571429 9.999212e+00 0.8701261 7.474803e+00
## 0.035714286 3.516263e+00 0.9547176 2.586893e+00
## 0.042857143 3.658132e+00 0.9509928 2.706185e+00
## 0.050000000 3.772068e+00 0.9465127 2.789631e+00
## 0.057142857 3.967192e+00 0.9404803 2.940685e+00
## 0.064285714 4.119357e+00 0.9350103 3.052796e+00
## 0.071428571 4.268597e+00 0.9300001 3.172374e+00
## 0.078571429 4.438219e+00 0.9239049 3.306660e+00
## 0.085714286 4.568636e+00 0.9189027 3.413702e+00
## 0.092857143 4.685828e+00 0.9147105 3.512301e+00
## 0.100000000 4.807197e+00 0.9100387 3.611950e+00
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.
## Length Class Mode
## call 4 -none- call
## actions 88 -none- list
## allset 124 -none- numeric
## beta.pure 10912 -none- numeric
## vn 389 -none- character
## mu 1 -none- numeric
## normx 124 -none- numeric
## meanx 124 -none- numeric
## lambda 1 -none- numeric
## L1norm 88 -none- numeric
## penalty 88 -none- numeric
## df 88 -none- numeric
## Cp 88 -none- numeric
## sigma2 1 -none- numeric
## xNames 389 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
## 68 129 162 43 14 51 85 21 106 74
## 28.1000 8.5850 0.5250 2.4600 5.5600 1.7650 18.9150 3.8000 1.7050 5.3550
## 7 73 79 37 105 110 34 157 126 89
## 25.3650 47.0050 1.8350 40.8550 1.1950 4.4950 24.0150 1.1950 5.7650 2.0200
## 33 84 70 156 42 111 20 44 121 87
## 3.9200 18.5550 1.6950 4.1350 45.6800 42.0600 5.5100 4.0700 30.5700 0.1050
## 143 137 40 25 119 122 39 160 141 6
## 26.3000 0.8200 5.5650 0.8050 31.7075 48.5100 2.3700 0.7450 50.5100 0.5100
## 24 32 161 2 45 18 22 78 102 65
## 6.1200 8.2700 0.7050 1.1200 1.2250 1.6800 47.6650 13.2950 17.6100 2.7400
## 115 104 135 136 108 113 114 103 75 81
## 31.6700 1.7750 0.7750 0.2350 2.7300 42.0600 35.4300 4.3800 8.3100 40.8550
## 100 13 133 146 48 117 23 144 29 109
## 8.6200 0.0600 1.7400 0.5600 1.2100 51.4150 32.3750 37.8450 0.6400 4.0450
## 131 93 28 101 145 134 158 31 17 154
## 5.5250 8.9000 4.1950 0.4800 0.5600 0.5400 1.0850 13.6750 0.7250 0.3200
## 83 92 64 60 128 149 10 1 163 59
## 9.5100 49.7650 20.3400 1.7100 0.3750 1.7000 4.9100 12.5200 1.5450 8.3850
## 26 15 58 97 125 127 98 164 71 53
## 14.1500 3.8500 2.4900 11.3125 47.1050 0.8650 1.0800 39.5550 6.4275 8.9800
## 140 91 19 107 35 77 118 72 123 95
## 0.7550 2.5500 5.2750 0.9775 13.5325 1.1250 50.2100 1.9150 47.1050 12.4900
## 147 94 52 41 152
## 0.4550 5.7500 5.9850 8.1800 0.5300
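A sketch of how the elastic net and ridge fits could be compared on the held-out compounds (assuming the enetTune, ridgeRegFit, and testf objects from the appendix):
xtest <- testf[, names(testf) != "permeability"]
caret::postResample(predict(enetTune, newdata = xtest), testf$permeability)
caret::postResample(predict(ridgeRegFit, newdata = xtest), testf$permeability)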
Would you recommend any of your models to replace the permeability laboratory experiment?
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
Start R and use these commands to load the data. The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
## [1] 176 58
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
## [1] FALSE
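The analysis in the appendix simply drops the incomplete rows. A sketch of an actual imputation via caret::preProcess (medianImpute keeps the original scale; knnImpute is another option but also centers and scales):
pp <- preProcess(ChemicalManufacturingProcess, method = "medianImpute")
cmImputed <- predict(pp, ChemicalManufacturingProcess)
anyNA(cmImputed)   # FALSE once every cell is filled in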
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
## Data: X dimension: 152 57
## Y dimension: 152 1
## Fit method: kernelpls
## Number of components considered: 57
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 86.49 92.29 99.73 99.82 99.91 99.97 99.98 99.99
## Yield 15.34 23.87 30.52 41.94 45.03 46.59 56.13 59.58
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 99.99 99.99 99.99 99.99 100.0 100 100.00
## Yield 62.91 64.99 67.19 67.95 68.8 71 73.71
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Yield 75.04 75.38 75.59 75.93 76.29 76.68 76.88
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Yield 77.03 77.43 77.77 77.98 78.08 78.27 78.45
## 30 comps 31 comps 32 comps 33 comps 34 comps 35 comps 36 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Yield 78.57 78.69 78.81 78.85 78.88 78.94 78.97
## 37 comps 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Yield 79.06 79.08 79.11 79.15 79.16 79.18 79.19
## 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps 50 comps
## X 100.00 100.00 100.0 100.00 100.00 100.00 100.00
## Yield 79.22 79.27 79.3 79.35 79.36 79.38 79.39
## 51 comps 52 comps 53 comps 54 comps 55 comps 56 comps 57 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Yield 79.39 79.39 79.39 79.39 79.39 79.46 79.46
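As with the permeability data, the plsr training summary gives no resampled metric. A sketch of a tuned fit whose optimal value could be read off directly (chemTune is a hypothetical object; trainf is the training split from the appendix):
set.seed(1)
chemTune <- caret::train(Yield ~ ., data = trainf, method = "pls",
                         tuneLength = 20,
                         trControl = trainControl(method = "cv", number = 10),
                         preProc = c("center", "scale"))
chemTune$results[chemTune$results$ncomp == chemTune$bestTune$ncomp, ]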
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
## Support Vector Machines with Linear Kernel
##
## 152 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 137, 137, 136, 138, 137, 136, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.277014 0.4930476 1.342484
##
## Tuning parameter 'C' was held constant at a value of 1
## 124 125 126 127 128 129 130 131
## 42.17826 41.63769 41.04629 41.14536 39.47639 41.76125 40.98895 41.08199
## 132 133 135 136 137 138 140 141
## 41.00749 41.15955 38.04850 39.12700 37.93575 38.51842 38.85692 39.63958
## 142 143 144 145 146 147 148 149
## 40.61385 40.63043 40.60831 38.78862 37.94179 40.05694 39.00819 38.48481
## 150 151 152 153 154 155 156 157
## 38.95087 38.39020 39.87769 39.48349 39.88806 38.83271 37.18061 38.54268
## 158 159 160 161 162 163 164 165
## 38.76366 38.21768 38.16305 38.32175 40.04264 39.55040 40.57412 38.60783
## 166 167 168 169 170 171
## 37.43753 37.64187 37.22554 38.71159 39.43196 40.83218
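A sketch of the held-out performance for the linear SVM, to set against the resampled RMSE and R2 reported above (assuming the test_pred predictions and testf split from the appendix):
caret::postResample(pred = test_pred, obs = testf$Yield)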
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
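A sketch of how importance could be examined (varImp works on a caret train object; for an SVM it falls back to a model-free, filter-based measure):
imp <- varImp(svm_lin)
plot(imp, top = 20)   # check whether ManufacturingProcess or BiologicalMaterial predictors dominate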
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
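A sketch of how the top predictors could be plotted against the response (topVars is derived from the hypothetical imp object above; featurePlot is from caret):
topVars <- rownames(imp$importance)[order(imp$importance$Overall, decreasing = TRUE)][1:4]
featurePlot(x = cm[, topVars], y = cm$Yield, plot = "scatter")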
Code used in analysis
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(AppliedPredictiveModeling)
data("permeability")
str(fingerprints)
#describe(fingerprints)
dim(fingerprints)
dim(permeability)
nzfinger <- nearZeroVar(fingerprints)       # indices of near-zero-variance columns
fingerprints <- fingerprints[, -nzfinger]   # drop them before modeling
dim(fingerprints)
require(pls)
library(AppliedPredictiveModeling)
ctrl <- trainControl(method = "cv", number = 10)
finger.df <- as.data.frame(fingerprints)
finger.df$permeability <- as.vector(permeability)
#trainp<- createDataPartition(finger.df, p=0.70)
set.seed(1)
trainp <- sample(1:nrow(finger.df), floor(0.7 * nrow(finger.df)))
trainf <- finger.df[trainp, ]
testf <- finger.df[-trainp, ]   # hold out the rows not sampled for training
plsfit <- plsr(permeability ~ ., data = trainf)
summary(plsfit)
plot(plsfit)
plsTune <- caret::train(x = trainf[, names(trainf) != "permeability"],   # predictors only
                        y = trainf$permeability,
                        method = "pls",
                        tuneLength = 20,
                        trControl = ctrl,
                        preProc = c("center", "scale"))
plot(plsTune)
pt <- predict(plsfit, testf)   # without ncomp =, predictions for every component are returned
head(pt)
summary(pt)
require(MASS)
require(elasticnet)
require(caret)
require(lars)
xtrain <- as.matrix(trainf[, names(trainf) != "permeability"])   # predictors only
enetfit <- enet(xtrain, trainf$permeability, lambda = .01)
summary(enetfit)
plot(enetfit)
xtest <- as.matrix(testf[, names(testf) != "permeability"])
enetpred <- predict(enetfit, newx = xtest, s = 1, mode = "fraction", type = "fit")
summary(enetpred)
head(enetpred$fit)
enetGrid <- expand.grid(lambda = c(0, .01, 1), fraction = seq(.05, 1, length = 20))
enetTune <- caret::train(x = trainf[, names(trainf) != "permeability"],
                         y = trainf$permeability,
                         method = "enet",
                         tuneGrid = enetGrid,
                         trControl = ctrl,
                         preProc = c("center", "scale"))
plot(enetTune)
ridgeGrid <- data.frame(lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeRegFit <- caret::train(x = trainf[, names(trainf) != "permeability"],
                            y = trainf$permeability,
                            method = "ridge",
                            tuneGrid = ridgeGrid,
                            trControl = ctrl,
                            preProc = c("center", "scale"))
ridgeRegFit
summary(ridgeRegFit)
plot(ridgeRegFit)
ridgePred <- predict(ridgeRegFit, newdata = testf[, names(testf) != "permeability"])   # predict on the held-out predictors
ridgePred
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
cm<-ChemicalManufacturingProcess
dim(cm)
cm <- cm[complete.cases(cm), ]   # drops incomplete rows; an imputation alternative is sketched above
anyNA(cm)
set.seed(1)
trainp <- sample(1:nrow(cm), floor(0.7 * nrow(cm)))
trainf <- cm[trainp, ]
testf <- cm[-trainp, ]   # held-out manufacturing runs
plsfit <- plsr(Yield ~ ., data = trainf)   # fit on the training split only
summary(plsfit)
plot(plsfit)
library(caret)
trctrl<- trainControl(method="repeatedcv", number=10,repeats=3)
svm_lin <- caret::train(Yield ~ ., data = trainf, method = "svmLinear",
                        trControl = trctrl, preProcess = c("center", "scale"),
                        tuneLength = 10)
svm_lin
test_pred<- predict(svm_lin, newdata=testf)
test_pred
plot(test_pred)