library(AppliedPredictiveModeling)
library(tidyverse)
library(MASS)
library(caret)
library(pls)
library(Amelia)
Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
The matrix fingerprints
contains the 1,107 binary molecular predictors for the 165 compounds, while permeability
contains permeability response.
nearZeroVar
function from the caret package. How many predictors are left for modeling?## [1] 388
There are 388
predictors remaining for modeling.
set.seed(456)
split <- permeability %>%
createDataPartition(p = 0.8, list = FALSE, times = 1)
Xtrain.data <- not_nzv[split, ] #fingerprints train
xtest.data <- not_nzv[-split, ] #fingerprints test
Ytrain.data <- permeability[split, ] #permability train
ytest.data <- permeability[-split, ] #permability test
ctrl <- trainControl(method = "cv", number = 10)
plsmod <- train(x = Xtrain.data, y = Ytrain.data, method = "pls", tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 121, 119, 120, 121, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.59309 0.3188302 9.598363
## 2 11.36710 0.4595255 7.749247
## 3 11.07388 0.4746904 8.057209
## 4 11.16464 0.4785656 8.382748
## 5 11.13931 0.4855600 8.043630
## 6 11.02535 0.5135675 7.981960
## 7 11.06765 0.5073874 8.127469
## 8 10.98450 0.5229158 8.197463
## 9 11.23015 0.5068022 8.431096
## 10 11.49945 0.4834073 8.591727
## 11 11.83014 0.4725379 8.710300
## 12 11.89082 0.4662821 8.809819
## 13 12.04049 0.4616730 8.933141
## 14 12.42299 0.4413192 9.361306
## 15 12.56859 0.4383558 9.479632
## 16 12.96051 0.4211730 9.635792
## 17 13.27607 0.4045419 9.846898
## 18 13.53374 0.4034482 10.047704
## 19 13.57828 0.4068468 10.111798
## 20 13.62695 0.4053035 10.088053
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
## ncomp
## 8 8
The best tuning parameter is 8 which minimizes the cross validation error, that is, the best estimate for the test error of model.
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 8
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 22.26 34.05 39.74 44.27 51.07 57.89 60.95
## .outcome 34.32 53.35 59.86 65.94 70.01 73.21 76.35
## 8 comps
## X 63.58
## .outcome 78.65
The optimal number of principal components included in the PLS model is 8. This captures 63.58% of the variation in the predictors and 78.65% of the variation in the outcome variable (permability).
predictions <- plsmod %>% predict(xtest.data)
cbind(
RMSE = RMSE(predictions, ytest.data),
R_squared = caret::R2(predictions, ytest.data)
)
## RMSE R_squared
## [1,] 11.84771 0.5312741
plot(predictions, col = "darkgreen", main = "Observed (Permability) vs. Predicted", xlab = "", ylab = "Predictions")
par(new = TRUE)
plot(ytest.data, col = "blue", axes=F, ylab = "", xlab="Observed")
abline(0, 1, col='orange')
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeRegFit <- train(Xtrain.data, Ytrain.data, method = "ridge", tuneGrid = ridgeGrid, trControl = ctrl, preProc = c("center", "scale", "knnImpute"))
## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold08: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold10: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388), nearest neighbor imputation (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 17.18761 0.3028216 11.682151
## 0.007142857 175.87079 0.2366887 115.675441
## 0.014285714 14.15846 0.3609925 10.354749
## 0.021428571 13.57247 0.3852815 9.934347
## 0.028571429 38.59403 0.3648867 28.001622
## 0.035714286 12.90435 0.4153151 9.387562
## 0.042857143 12.75958 0.4207740 9.236959
## 0.050000000 12.65226 0.4262812 9.150038
## 0.057142857 12.62771 0.4296830 9.178311
## 0.064285714 12.43073 0.4368335 8.976391
## 0.071428571 12.40746 0.4396622 8.968032
## 0.078571429 12.49930 0.4391171 9.081732
## 0.085714286 12.26692 0.4469522 8.847182
## 0.092857143 12.20159 0.4506149 8.799834
## 0.100000000 12.17273 0.4531507 8.777437
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
Model had some issue when fitting in some validation folds.
predictions2 <- ridgeRegFit %>% predict(xtest.data)
cbind(
RMSE = RMSE(predictions2, ytest.data),
R_squared = caret::R2(predictions2, ytest.data)
)
## RMSE R_squared
## [1,] 12.65672 0.464324
plot(ytest.data, predictions2, ylab = "Predictions", xlab = "Observed", main = "Observed vs Predicted")
abline(abline(0, 1, col='red'))
The PLS model worked better on this data due to the lower accuracy scores revealed.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors
contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield
contains the percent yield for each run.
We can see some predictors do have missing values.
Below I will preprocess the data. This includes:
centering
and scaling
the dataknn
imputation method to replace missing valuescorr
to filter out highly correlated predictorsnzv
to filter near zero variance predictors that could cause trouble.#preprocess data excluding the yeild column
preprocessing <- preProcess(ChemicalManufacturingProcess[,-1], method = c("center", "scale", "knnImpute", "corr", "nzv"))
Xpreprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])
missmap(Xpreprocess, col = c("yellow", "navy"))
As seen in this second plot, the missing values were replaced and the data is now complete.
yield <- as.matrix(ChemicalManufacturingProcess$Yield)
set.seed(789)
split2 <- yield %>%
createDataPartition(p = 0.8, list = FALSE, times = 1)
Xtrain.data2 <- Xpreprocess[split2, ] #chem train
xtest.data2 <- Xpreprocess[-split2, ] #chem test
Ytrain.data2 <- yield[split2, ] #yield train
ytest.data2 <- yield[-split2, ] #yield test
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(101)
ridgeRegFit2 <- train(Xtrain.data2, Ytrain.data2, method = "ridge", tuneGrid = ridgeGrid, trControl = ctrl)
ridgeRegFit2
## Ridge Regression
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 129, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 5.576669 0.4660474 2.378677
## 0.007142857 2.741742 0.5320202 1.542394
## 0.014285714 2.653634 0.5305896 1.518070
## 0.021428571 2.596637 0.5317464 1.500523
## 0.028571429 2.547937 0.5333929 1.485373
## 0.035714286 2.504615 0.5352293 1.471001
## 0.042857143 2.465688 0.5371422 1.457550
## 0.050000000 2.430505 0.5390638 1.445037
## 0.057142857 2.398538 0.5409488 1.433566
## 0.064285714 2.369348 0.5427668 1.423108
## 0.071428571 2.342569 0.5444987 1.413353
## 0.078571429 2.317895 0.5461338 1.404434
## 0.085714286 2.295071 0.5476671 1.396175
## 0.092857143 2.273883 0.5490979 1.388738
## 0.100000000 2.254150 0.5504282 1.381890
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
Optimal value:
The lowest point in the curve indicates the optimal lambda: the log value of lambda that best minimised the error in cross-validation. We can extract this values as:
## lambda
## 15 0.1
predictions3 <- ridgeRegFit2 %>% predict(xtest.data2)
cbind(
RMSE = RMSE(predictions3, ytest.data2),
R_squared = caret::R2(predictions3, ytest.data2)
)
## RMSE R_squared
## [1,] 1.06489 0.5657402
Better than the resampled metrics.
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess13 100.00
## ManufacturingProcess32 97.82
## ManufacturingProcess17 86.84
## BiologicalMaterial06 83.64
## BiologicalMaterial03 78.13
## ManufacturingProcess09 72.37
## BiologicalMaterial12 72.20
## ManufacturingProcess36 70.51
## BiologicalMaterial02 63.10
## ManufacturingProcess06 61.88
## ManufacturingProcess31 58.39
## BiologicalMaterial11 56.85
## ManufacturingProcess33 47.06
## ManufacturingProcess11 45.94
## BiologicalMaterial04 45.43
## ManufacturingProcess29 44.71
## BiologicalMaterial08 44.36
## ManufacturingProcess12 38.22
## BiologicalMaterial01 35.56
## BiologicalMaterial09 33.79
Based on the plot and values displayed, seems as though the processing predictors dominate the list.
For this question, I’ll only look at the top predictor recorded for the manufacturing processes and the biological materials.
## [,1]
## [1,] -0.5036797
## [,1]
## [1,] 0.4781634
As stated in the intro for this question, Biological materials are used to asses the quality of raw materials before processing. If the results are good then the yield of the product may increase. Looking at the top Biological material, we can see that its positively but moderately correlated to the response variable.
On the other hand, manufacturing processes are possibly the steps taken to create the end product graded by a rate. We can see there is a negative correlation here which make sense. If the process is not great then the product will not come out great.