Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
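The fingerprints and permeability objects are not created anywhere in this section, so here is a minimal loading step, assuming the data ship with the AppliedPredictiveModeling package (the same package used for the chemical manufacturing data later on):
library(AppliedPredictiveModeling)
data(permeability)  # loads both fingerprints and permeability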
Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
nzv <- nearZeroVar(fingerprints)
not.nzv <- fingerprints[, -nzv]
ncol(not.nzv)
## [1] 388
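As an aside (not part of the original output), nearZeroVar can also return the diagnostics it uses, which makes it easy to see why 719 of the 1,107 fingerprints were removed:
nzv.metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)  # frequency ratio, % unique, nzv flags
head(nzv.metrics[nzv.metrics$nzv, ])                          # a few of the filtered predictors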
set.seed(123)
split <- createDataPartition(permeability, p = 0.8, list = FALSE, times = 1)
Xtrain.data <- not.nzv[split, ] #fingerprints train
xtest.data <- not.nzv[-split, ] #fingerprints test
Ytrain.data <- permeability[split, ] #permeability train
ytest.data <- permeability[-split, ] #permeability test
ctrl <- trainControl(method = "cv", number = 10)
pls.mod <- train(x = Xtrain.data,
                 y = Ytrain.data,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center", "scale"))
pls.mod
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.31894 0.3442124 10.254018
## 2 11.78898 0.4830504 8.534741
## 3 11.98818 0.4792649 9.219285
## 4 12.04349 0.4923322 9.448926
## 5 11.79823 0.5193195 9.049121
## 6 11.53275 0.5335956 8.658301
## 7 11.64053 0.5229621 8.878265
## 8 11.86459 0.5144801 9.265252
## 9 11.98385 0.5188205 9.218594
## 10 12.55634 0.4808614 9.610747
## 11 12.69674 0.4758068 9.702325
## 12 13.01534 0.4538906 9.956623
## 13 13.12637 0.4367362 9.878017
## 14 13.44865 0.4140715 10.065088
## 15 13.60135 0.4034269 10.188150
## 16 13.79361 0.3943904 10.247160
## 17 14.00756 0.3845119 10.412776
## 18 14.18113 0.3711378 10.587027
## 19 14.25674 0.3703610 10.575726
## 20 14.33121 0.3723176 10.679764
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
plot(pls.mod)
pls.mod$bestTune
## ncomp
## 6 6
summary(pls.mod$finalModel)
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 22.98 34.61 40.51 46.13 53.69 58.12
## .outcome 33.73 55.03 61.84 67.77 71.65 75.73
Six latent components give the lowest cross-validation RMSE (about 11.5, with R^2 of roughly 0.53); at that point the PLS model captures about 58% of the predictor variance and 76% of the variance in the permeability response.
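To pull the resampling metrics for the chosen number of components directly (a small convenience sketch, not in the original analysis):
subset(pls.mod$results, ncomp == pls.mod$bestTune$ncomp)  # CV RMSE, Rsquared, MAE at ncomp = 6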
predictions <- predict(pls.mod, xtest.data)
cbind(RMSE = RMSE(predictions, ytest.data), R.squared = caret::R2(predictions, ytest.data))
## RMSE R.squared
## [1,] 12.34869 0.3244542
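The same test-set summary can be produced in one call with caret's postResample, which also reports MAE (shown here as an equivalent alternative, not as the original code):
caret::postResample(pred = predictions, obs = ytest.data)  # RMSE, Rsquared, MAE on the hold-out set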
plot(ytest.data, predictions,
     col = "darkblue",
     main = "Observed vs. Predicted - Partial Least Squares Regression Model",
     xlab = "Observed",
     ylab = "Predicted")
abline(0, 1, col = "green")  # points on this line would be perfectly predicted
We can perform a similar prediction with both ridge and lasso regression models.
set.seed(123)
ridge.mod <- train(x = Xtrain.data,
                   y = Ytrain.data,
                   method = 'ridge',
                   metric = 'Rsquared',
                   tuneGrid = data.frame(.lambda = seq(0, 1, by = 0.1)),
                   trControl = trainControl(method = 'cv'),
                   preProcess = c('center', 'scale'))
## Warning: model fit failed for Fold02: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
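The fold failures occur only at lambda = 0, where the elasticnet backend is effectively solving an unpenalized least-squares problem with 388 predictors and roughly 120 training rows. One way to sidestep this (my own suggestion, not part of the original analysis) is to fit the ridge path with glmnet through caret, fixing alpha = 0 and keeping lambda strictly positive:
# Hypothetical alternative: ridge regression via glmnet (alpha = 0 gives a pure L2 penalty)
glmnet.grid <- expand.grid(alpha = 0, lambda = seq(0.01, 1, by = 0.05))
ridge.glmnet <- train(x = Xtrain.data,
                      y = Ytrain.data,
                      method = 'glmnet',
                      metric = 'Rsquared',
                      tuneGrid = glmnet.grid,
                      trControl = trainControl(method = 'cv'),
                      preProcess = c('center', 'scale'))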
ridge.mod
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 16.20635 0.3601753 11.220501
## 0.1 12.06264 0.5453869 9.100052
## 0.2 12.10944 0.5571793 9.168267
## 0.3 12.37240 0.5642431 9.388516
## 0.4 12.76763 0.5687514 9.760973
## 0.5 13.26055 0.5717972 10.174632
## 0.6 13.82796 0.5740629 10.616896
## 0.7 14.45460 0.5755140 11.137567
## 0.8 15.12637 0.5768246 11.742376
## 0.9 15.83746 0.5773256 12.371193
## 1.0 16.57763 0.5778968 13.011360
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
plot(ridge.mod)
ridge.mod$bestTune
## lambda
## 11 1
summary(ridge.mod$finalModel)
## Length Class Mode
## call 4 -none- call
## actions 153 -none- list
## allset 388 -none- numeric
## beta.pure 59364 -none- numeric
## vn 388 -none- character
## mu 1 -none- numeric
## normx 388 -none- numeric
## meanx 388 -none- numeric
## lambda 1 -none- numeric
## L1norm 153 -none- numeric
## penalty 153 -none- numeric
## df 153 -none- numeric
## Cp 153 -none- numeric
## sigma2 1 -none- numeric
## xNames 388 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
predictions <- predict(ridge.mod, xtest.data)
cbind(RMSE = RMSE(predictions, ytest.data), R.squared = caret::R2(predictions, ytest.data))
## RMSE R.squared
## [1,] 19.9078 0.3695863
plot(ytest.data, predictions,
     col = "darkblue",
     main = "Observed vs. Predicted - Ridge Regression Model",
     xlab = "Observed",
     ylab = "Predicted")
abline(0, 1, col = 'green')  # 45-degree line of perfect agreement
set.seed(123)
lasso.mod <- train(x = Xtrain.data,
                   y = Ytrain.data,
                   method = 'lasso',
                   metric = 'Rsquared',
                   tuneGrid = data.frame(.fraction = seq(0, 0.5, by = 0.05)),
                   trControl = trainControl(method = 'cv'),
                   preProcess = c('center', 'scale'))
## Warning: model fit failed for Fold02: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = Xtrain.data, y = Ytrain.data, method = "lasso", :
## missing values found in aggregated results
lasso.mod
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 16.00688 NaN 12.917824
## 0.05 12.97787 0.4301959 9.877733
## 0.10 11.80785 0.4931852 9.107986
## 0.15 11.66393 0.4930871 8.897928
## 0.20 11.86820 0.4793179 9.030145
## 0.25 12.10947 0.4728912 9.253825
## 0.30 12.34485 0.4658690 9.393513
## 0.35 12.61952 0.4590835 9.612879
## 0.40 12.75663 0.4601371 9.686790
## 0.45 12.96171 0.4539654 9.774599
## 0.50 13.22572 0.4468568 9.900789
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.1.
plot(lasso.mod)
As with the ridge model, the lasso fit failed in two of the cross-validation folds (see the warnings above), and at fraction = 0 the model has no non-zero coefficients, which is why its resampled R^2 is reported as NaN.
lasso.mod$bestTune
## fraction
## 3 0.1
summary(lasso.mod$finalModel)
## Length Class Mode
## call 4 -none- call
## actions 193 -none- list
## allset 162 -none- numeric
## beta.pure 31266 -none- numeric
## vn 388 -none- character
## mu 1 -none- numeric
## normx 162 -none- numeric
## meanx 162 -none- numeric
## lambda 1 -none- numeric
## L1norm 193 -none- numeric
## penalty 193 -none- numeric
## df 193 -none- numeric
## Cp 193 -none- numeric
## sigma2 1 -none- numeric
## xNames 388 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
predictions <- predict(lasso.mod, xtest.data)
cbind(RMSE = RMSE(predictions, ytest.data), R.squared = caret::R2(predictions, ytest.data))
## RMSE R.squared
## [1,] 10.68287 0.3661212
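One attraction of the lasso is that it zeroes out many of the 388 fingerprints. A sketch (not in the original write-up) of how the surviving coefficients could be inspected from the elasticnet final model at the chosen fraction:
lasso.coefs <- predict(lasso.mod$finalModel,
                       s = lasso.mod$bestTune$fraction,
                       mode = "fraction",
                       type = "coefficients")$coefficients
sum(lasso.coefs != 0)                            # number of fingerprints retained at fraction = 0.1
head(sort(abs(lasso.coefs), decreasing = TRUE))  # largest-magnitude coefficients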
plot(ytest.data, predictions,
     col = "darkblue",
     main = "Observed vs. Predicted - Lasso Regression Model",
     xlab = "Observed",
     ylab = "Predicted")
abline(0, 1, col = 'green')  # 45-degree line of perfect agreement
While the three models had similar test-set R^2 values (roughly 0.32-0.37), the ridge model's test RMSE (19.9) was much larger than that of the PLS (12.3) and lasso (10.7) models, so the lasso would be the preferred choice of the three for predicting permeability.
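A convenient way to compare the cross-validated performance of all three fits side by side is caret's resamples helper (a sketch; for a strictly fair comparison the same seed should be set immediately before each train call so the folds match):
model.compare <- resamples(list(PLS = pls.mod, Ridge = ridge.mod, Lasso = lasso.mod))
summary(model.compare)  # resampled RMSE, Rsquared, MAE for each model
bwplot(model.compare)   # side-by-side boxplots of the resampling distributions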
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
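The process measurements contain missing values, which is why the preprocessing below includes knnImpute (and why RANN is loaded). A quick check of how much is missing (a sketch added here; the counts are not shown in the original):
sum(is.na(ChemicalManufacturingProcess))                                     # total missing cells
sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE)[1:5]   # most affected columns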
library(RANN)
preprocessing <- preProcess(ChemicalManufacturingProcess[, -1],
                            method = c("center", "scale", "knnImpute", "corr", "nzv"))
preprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])
yield <- as.matrix(ChemicalManufacturingProcess$Yield)
set.seed(123)
split2 <- createDataPartition(yield, p = 0.8, list = FALSE, times = 1)
Xtrain.data2 <- preprocess[split2, ]
xtest.data2 <- preprocess[-split2, ]
Ytrain.data2 <- yield[split2, ]
ytest.data2 <- yield[-split2, ]
ridge.grid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(123)
ridge.mod2 <- train(x = Xtrain.data2,
                    y = Ytrain.data2,
                    method = "ridge",
                    tuneGrid = ridge.grid,
                    trControl = ctrl)
ridge.mod2
## Ridge Regression
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 8.231140 0.5001829 3.365686
## 0.007142857 5.177328 0.4945675 2.430293
## 0.014285714 4.329076 0.5069033 2.138368
## 0.021428571 3.853764 0.5158292 1.977340
## 0.028571429 3.533281 0.5225199 1.872177
## 0.035714286 3.296381 0.5278183 1.792915
## 0.042857143 3.111303 0.5321913 1.729989
## 0.050000000 2.961284 0.5359104 1.678308
## 0.057142857 2.836451 0.5391442 1.634722
## 0.064285714 2.730522 0.5420033 1.597395
## 0.071428571 2.639265 0.5445641 1.565126
## 0.078571429 2.559696 0.5468811 1.536722
## 0.085714286 2.489636 0.5489948 1.511494
## 0.092857143 2.427447 0.5509361 1.488916
## 0.100000000 2.371869 0.5527288 1.468582
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
plot(ridge.mod2)
Optimal value:
The lowest point on the curve marks the optimal lambda, the value that minimizes the cross-validation error. Note that the selected lambda = 0.1 sits at the upper boundary of the tuning grid and the RMSE is still decreasing there, so a wider grid might improve the fit further (see the sketch after the output below). We can extract the chosen value as follows:
ridge.mod2$bestTune
## lambda
## 15 0.1
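Since the best lambda landed on the boundary of the grid, one hypothetical follow-up (not run in the original) is simply to extend the search range and retune:
# Hypothetical wider grid; the upper limit of 0.3 is an arbitrary choice for illustration
ridge.grid.wide <- data.frame(.lambda = seq(0, 0.3, length = 15))
set.seed(123)
ridge.mod2.wide <- train(x = Xtrain.data2,
                         y = Ytrain.data2,
                         method = "ridge",
                         tuneGrid = ridge.grid.wide,
                         trControl = ctrl)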
predictions <- predict(ridge.mod2, xtest.data2)
cbind(RMSE = RMSE(predictions, ytest.data2), R.squared = caret::R2(predictions, ytest.data2))
## RMSE R.squared
## [1,] 1.38534 0.4873811
The hold-out RMSE of about 1.39 is noticeably better than the resampled RMSE of 2.37 at lambda = 0.1, although the test-set R^2 (0.49) is a little lower than the resampled R^2 (0.55).
varImp(ridge.mod2)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 94.06
## BiologicalMaterial03 81.27
## ManufacturingProcess13 80.63
## ManufacturingProcess36 78.46
## BiologicalMaterial02 76.04
## ManufacturingProcess17 75.92
## ManufacturingProcess31 75.30
## ManufacturingProcess09 73.04
## BiologicalMaterial12 69.48
## ManufacturingProcess06 66.28
## BiologicalMaterial11 59.72
## ManufacturingProcess33 58.31
## ManufacturingProcess29 54.48
## BiologicalMaterial04 53.93
## ManufacturingProcess11 47.22
## BiologicalMaterial01 45.62
## BiologicalMaterial08 44.93
## BiologicalMaterial09 40.88
## ManufacturingProcess30 40.40
plot(varImp(ridge.mod2))
cor(yield, ChemicalManufacturingProcess$ManufacturingProcess13)
## [,1]
## [1,] -0.5036797
cor(yield, ChemicalManufacturingProcess$BiologicalMaterial06)
## [,1]
## [1,] 0.4781634
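From the importance listing above, process variables make up 12 of the top 20 predictors, with ManufacturingProcess32 at the top, and the two correlations just computed show that ManufacturingProcess13 moves against yield while BiologicalMaterial06 moves with it. A small sketch (added here, not in the original) that tallies the two predictor types among the top 20:
imp <- varImp(ridge.mod2)$importance                                  # Overall importance scores
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]   # top 20 predictor names
table(ifelse(grepl("^Biological", top20), "Biological", "Process"))   # biological vs. process count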