Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.
#Load packages
library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(elasticnet)
library(glmnet)
library(kableExtra)
library(corrplot)
library(VIM)
library(RANN)
Start R and use these commands to load the data:
data(permeability)
summary(permeability)
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
The matrix fingerprints contain the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
predictors = nearZeroVar(fingerprints)
fingerprints1 = fingerprints[,-predictors]
dim(fingerprints1)[2]
## [1] 388
719 columns have been removed using nearZeroVar function. 388 predictors are left for modeling. Let us look at the pairwise correlations.
correlations <- cor(fingerprints1)
plot(correlations)
We will remove highly correlated predictors with threshold 0.90.
highCorr <- findCorrelation(correlations, cutoff = .9)
fingerprints1 <- fingerprints1[, -highCorr]
correlations <- cor(fingerprints1)
corrplot(correlations, order = "hclust")
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
#adding permeability data to last column
df <- as.data.frame(fingerprints1) %>% mutate(permeability = permeability)
#Split the data
set.seed(425)
in_train <- createDataPartition(df$permeability, times = 1, p = 0.7, list = FALSE)
train <- df[in_train, ]
test <- df[-in_train, ]
#run PLS model
set.seed(525)
pls_model <- train(permeability ~ ., data = train, method = "pls",
center = TRUE,trControl = trainControl("cv", number = 10),
tuneLength = 25, preProcess= c('center','scale'))
pls_model
## Partial Least Squares
##
## 117 samples
## 110 predictors
##
## Pre-processing: centered (110), scaled (110)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.49375 0.4597224 9.091507
## 2 12.24800 0.4945470 9.112055
## 3 12.36040 0.4797827 9.406771
## 4 12.16871 0.4952475 9.389245
## 5 11.89010 0.5057603 9.013300
## 6 11.94220 0.5014062 9.128401
## 7 12.16506 0.5011089 9.317626
## 8 12.47311 0.4897341 9.493437
## 9 12.79494 0.4728488 9.731745
## 10 12.96546 0.4680538 9.982773
## 11 13.31205 0.4411153 10.232183
## 12 13.79447 0.4090751 10.538288
## 13 14.20229 0.3927278 10.950167
## 14 14.41850 0.3926943 11.122089
## 15 14.56554 0.3893528 11.310679
## 16 14.76587 0.3792650 11.414448
## 17 14.95644 0.3712629 11.480098
## 18 15.15261 0.3643800 11.625230
## 19 15.33920 0.3616213 11.803363
## 20 15.25886 0.3631634 11.815923
## 21 15.32512 0.3658402 11.871329
## 22 15.43380 0.3662152 12.169136
## 23 15.56990 0.3650867 12.217720
## 24 15.69537 0.3635686 12.282632
## 25 15.98231 0.3573109 12.512090
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
plot(pls_model, main = "R-squared Error of PLS Model")
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp)
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 11.8901 0.5057603 9.0133 2.380438 0.2206955 1.604525
The number of components resulting in the smallest root mean squared error is 2. It has RMSE = 12,30, R2 = 0.49, and MAE = 8.74. Therefore 2 variables account for the largest portion of the variability in the data than all other latent variables.
Predict the response for the test set. What is the test set estimate of \(R^2\)?
set.seed(526)
#Predictions
predictions <- predict(pls_model, test)
#Model performance
results <- data.frame(Model = "PLS",
RMSE = caret::RMSE(predictions, test$permeability),
Rsquared = caret::R2(predictions, test$permeability),
MAE = caret::MAE(predictions, test$permeability))
results
## Model RMSE Rsquared MAE
## permeability PLS 9.931766 0.4813456 7.404271
The test set estimate of \(R^2\) is 0.31.
Try building other models discussed in this chapter. Do any have better predictive performance?
PCR Model
set.seed(527)
pcr_model <- train(permeability ~ ., data = train, method = "pcr",
center = TRUE,trControl = trainControl("cv", number = 10),
tuneLength = 25, preProcess= c('center','scale'))
pcr_model
## Principal Component Analysis
##
## 117 samples
## 110 predictors
##
## Pre-processing: centered (110), scaled (110)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 105, 105, 106, 105, 105, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 14.53036 0.2719165 11.569587
## 2 13.77096 0.3868836 10.527510
## 3 12.77983 0.5293771 9.659629
## 4 12.67719 0.5235212 9.492291
## 5 12.57255 0.5308346 9.358080
## 6 12.80383 0.5159124 9.586236
## 7 12.73028 0.5194885 9.680261
## 8 12.61531 0.5272950 9.455674
## 9 12.84418 0.5002749 9.528027
## 10 12.94629 0.4904117 9.729650
## 11 13.00556 0.4871528 9.803775
## 12 13.23303 0.4547214 10.021566
## 13 13.49666 0.4450683 10.265017
## 14 13.20284 0.4698343 9.678929
## 15 12.97044 0.4760409 9.979021
## 16 12.75276 0.4904039 9.763029
## 17 12.92765 0.4702478 9.935250
## 18 12.88617 0.4718028 9.811889
## 19 12.93372 0.4762997 9.843803
## 20 12.70200 0.4859912 9.732408
## 21 12.53706 0.4996257 9.540718
## 22 12.38637 0.4998647 9.356427
## 23 12.09613 0.5351635 9.098953
## 24 12.07304 0.5213616 9.049872
## 25 12.43117 0.4889082 9.351105
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 24.
plot(pcr_model, main = "R-squared Error of PCR Model")
set.seed(528)
#Predictions
predictions_pcr <- predict(pcr_model, test)
#Model performance
results_pcr <- data.frame(Model = "PCR",
RMSE = caret::RMSE(predictions_pcr, test$permeability),
Rsquared = caret::R2(predictions_pcr, test$permeability),
MAE = caret::MAE(predictions_pcr, test$permeability))
results_pcr
## Model RMSE Rsquared MAE
## permeability PCR 9.78416 0.4838728 7.612136
Ridge Regression
ridgeGrid<-data.frame(.lambda=seq(0,1, length=15))
set.seed(529)
#Ridge Method Fit
ridge_model <- train(permeability ~ ., data = train, method='ridge', metric='Rsquared',
tuneGrid = ridgeGrid,
trControl = trainControl(method = 'cv'), preProcess = c('center','scale'))
ridge_model
## Ridge Regression
##
## 117 samples
## 110 predictors
##
## Pre-processing: centered (110), scaled (110)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 105, 105, 105, 106, 105, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 4.498531e+08 0.1449095 2.376653e+08
## 0.07142857 1.232739e+01 0.5178299 9.146404e+00
## 0.14285714 1.210243e+01 0.5406364 8.984906e+00
## 0.21428571 1.212827e+01 0.5476154 8.980539e+00
## 0.28571429 1.225627e+01 0.5499758 9.071299e+00
## 0.35714286 1.244352e+01 0.5504269 9.205432e+00
## 0.42857143 1.267159e+01 0.5499507 9.352651e+00
## 0.50000000 1.293066e+01 0.5489891 9.533762e+00
## 0.57142857 1.321465e+01 0.5477679 9.749646e+00
## 0.64285714 1.351938e+01 0.5464119 9.978588e+00
## 0.71428571 1.384169e+01 0.5449936 1.022930e+01
## 0.78571429 1.417913e+01 0.5435565 1.049217e+01
## 0.85714286 1.452964e+01 0.5421271 1.075541e+01
## 0.92857143 1.489147e+01 0.5407216 1.101782e+01
## 1.00000000 1.526313e+01 0.5393498 1.127768e+01
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.3571429.
plot(ridge_model, main = "R-squared Error of Ridge Model")
set.seed(530)
#Predictions
predictions_ridge <- predict(ridge_model, test)
#Model performance
results_ridge <- data.frame(Model = "Ridge",
RMSE = caret::RMSE(predictions_ridge, test$permeability),
Rsquared = caret::R2(predictions_ridge, test$permeability),
MAE = caret::MAE(predictions_ridge, test$permeability))
results_ridge
## Model RMSE Rsquared MAE
## permeability Ridge 11.27315 0.4742014 8.297782
Lasso Regression
set.seed(531)
lasso_model <- train(permeability ~ ., data = train, method='lasso', metric='Rsquared',
tuneGrid = data.frame(.fraction = seq(0,0.5, by=0.05)),
trControl = trainControl(method='cv'),
preProcess = c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
lasso_model
## The lasso
##
## 117 samples
## 110 predictors
##
## Pre-processing: centered (110), scaled (110)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 1.612720e+01 NaN 1.308597e+01
## 0.05 6.297561e+05 0.2256917 3.665642e+05
## 0.10 1.209421e+06 0.2118223 6.980189e+05
## 0.15 1.750901e+06 0.2115700 1.001342e+06
## 0.20 2.293185e+06 0.2019776 1.304688e+06
## 0.25 2.835730e+06 0.1876124 1.607993e+06
## 0.30 3.378434e+06 0.1765733 1.911290e+06
## 0.35 6.441781e+06 0.1518032 3.588910e+06
## 0.40 9.834419e+06 0.1509985 5.446040e+06
## 0.45 1.320198e+07 0.1507521 7.287634e+06
## 0.50 1.652143e+07 0.1494106 9.099398e+06
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
plot(lasso_model, main = "R-squared Lasso Model")
set.seed(532)
#Predictions
predictions_lasso <- predict(lasso_model, test)
#Model performance
results_lasso <- data.frame(Model = "Lasso",
RMSE = caret::RMSE(predictions_lasso, test$permeability),
Rsquared = caret::R2(predictions_lasso, test$permeability),
MAE = caret::MAE(predictions_lasso, test$permeability))
results_lasso
## Model RMSE Rsquared MAE
## permeability Lasso 35991793 0.007920709 5194982
Elastic Net
set.seed(533)
elasticnet_model <- train(permeability ~ ., data = train, method ='enet', metric='Rsquared',
tuneGrid = expand.grid(.fraction=seq(0,1,by=0.1),
.lambda=seq(0,1,by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
elasticnet_model
## Elasticnet
##
## 117 samples
## 110 predictors
##
## Pre-processing: centered (110), scaled (110)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 1.605289e+01 NaN 1.311468e+01
## 0.0 0.1 1.068953e+16 0.2185678 6.047534e+15
## 0.0 0.2 2.137905e+16 0.1855136 1.209506e+16
## 0.0 0.3 3.206857e+16 0.1751223 1.814259e+16
## 0.0 0.4 4.275808e+16 0.1600843 2.419012e+16
## 0.0 0.5 5.344760e+16 0.1484753 3.023765e+16
## 0.0 0.6 6.413712e+16 0.1397991 3.628518e+16
## 0.0 0.7 7.482664e+16 0.1347267 4.233272e+16
## 0.0 0.8 8.551615e+16 0.1318993 4.838025e+16
## 0.0 0.9 9.620567e+16 0.1302927 5.442778e+16
## 0.0 1.0 1.068952e+17 0.1293711 6.047531e+16
## 0.1 0.0 1.605289e+01 NaN 1.311468e+01
## 0.1 0.1 1.209484e+01 0.5003136 9.232768e+00
## 0.1 0.2 1.167021e+01 0.5207194 8.424887e+00
## 0.1 0.3 1.172327e+01 0.5214684 8.532537e+00
## 0.1 0.4 1.174074e+01 0.5220290 8.565115e+00
## 0.1 0.5 1.189062e+01 0.5137933 8.596010e+00
## 0.1 0.6 1.203865e+01 0.5067960 8.728549e+00
## 0.1 0.7 1.213169e+01 0.5047378 8.822719e+00
## 0.1 0.8 1.228958e+01 0.5018736 8.985004e+00
## 0.1 0.9 1.247119e+01 0.4980152 9.124576e+00
## 0.1 1.0 1.267160e+01 0.4923753 9.284064e+00
## 0.2 0.0 1.605289e+01 NaN 1.311468e+01
## 0.2 0.1 1.221857e+01 0.5010274 9.533200e+00
## 0.2 0.2 1.171156e+01 0.5162873 8.369793e+00
## 0.2 0.3 1.169990e+01 0.5247079 8.370508e+00
## 0.2 0.4 1.170722e+01 0.5314202 8.412648e+00
## 0.2 0.5 1.178959e+01 0.5294089 8.497834e+00
## 0.2 0.6 1.195086e+01 0.5232267 8.610123e+00
## 0.2 0.7 1.205052e+01 0.5225170 8.715108e+00
## 0.2 0.8 1.217582e+01 0.5214250 8.836916e+00
## 0.2 0.9 1.233128e+01 0.5190818 9.000577e+00
## 0.2 1.0 1.247769e+01 0.5164552 9.169102e+00
## 0.3 0.0 1.605289e+01 NaN 1.311468e+01
## 0.3 0.1 1.234262e+01 0.5012940 9.724087e+00
## 0.3 0.2 1.171478e+01 0.5162028 8.331104e+00
## 0.3 0.3 1.170155e+01 0.5282976 8.241005e+00
## 0.3 0.4 1.180700e+01 0.5330533 8.340881e+00
## 0.3 0.5 1.189075e+01 0.5339338 8.452902e+00
## 0.3 0.6 1.204760e+01 0.5308615 8.587209e+00
## 0.3 0.7 1.216317e+01 0.5309007 8.702374e+00
## 0.3 0.8 1.228967e+01 0.5310289 8.836379e+00
## 0.3 0.9 1.243380e+01 0.5299548 9.027190e+00
## 0.3 1.0 1.257521e+01 0.5284496 9.200748e+00
## 0.4 0.0 1.605289e+01 NaN 1.311468e+01
## 0.4 0.1 1.240368e+01 0.5009499 9.816886e+00
## 0.4 0.2 1.172083e+01 0.5153425 8.278039e+00
## 0.4 0.3 1.174592e+01 0.5296434 8.161662e+00
## 0.4 0.4 1.194355e+01 0.5333826 8.325030e+00
## 0.4 0.5 1.208066e+01 0.5346198 8.470189e+00
## 0.4 0.6 1.223528e+01 0.5342476 8.613265e+00
## 0.4 0.7 1.237649e+01 0.5355645 8.741814e+00
## 0.4 0.8 1.250892e+01 0.5365687 8.923495e+00
## 0.4 0.9 1.265996e+01 0.5362266 9.127898e+00
## 0.4 1.0 1.281559e+01 0.5351700 9.316409e+00
## 0.5 0.0 1.605289e+01 NaN 1.311468e+01
## 0.5 0.1 1.243088e+01 0.5005166 9.858802e+00
## 0.5 0.2 1.175785e+01 0.5131168 8.240715e+00
## 0.5 0.3 1.183430e+01 0.5293797 8.110441e+00
## 0.5 0.4 1.210811e+01 0.5333377 8.360932e+00
## 0.5 0.5 1.232607e+01 0.5340980 8.555698e+00
## 0.5 0.6 1.248934e+01 0.5357311 8.686468e+00
## 0.5 0.7 1.265149e+01 0.5380442 8.858571e+00
## 0.5 0.8 1.280805e+01 0.5394694 9.104317e+00
## 0.5 0.9 1.296726e+01 0.5402103 9.329111e+00
## 0.5 1.0 1.314822e+01 0.5390668 9.571011e+00
## 0.6 0.0 1.605289e+01 NaN 1.311468e+01
## 0.6 0.1 1.243886e+01 0.5000112 9.872817e+00
## 0.6 0.2 1.179800e+01 0.5112173 8.207406e+00
## 0.6 0.3 1.194946e+01 0.5285314 8.087585e+00
## 0.6 0.4 1.230161e+01 0.5329802 8.412493e+00
## 0.6 0.5 1.260468e+01 0.5331262 8.658001e+00
## 0.6 0.6 1.278347e+01 0.5363459 8.795060e+00
## 0.6 0.7 1.297368e+01 0.5393004 9.047542e+00
## 0.6 0.8 1.316466e+01 0.5410254 9.369195e+00
## 0.6 0.9 1.334632e+01 0.5422274 9.637833e+00
## 0.6 1.0 1.354741e+01 0.5412955 9.897148e+00
## 0.7 0.0 1.605289e+01 NaN 1.311468e+01
## 0.7 0.1 1.243609e+01 0.4995259 9.869835e+00
## 0.7 0.2 1.183484e+01 0.5097874 8.162437e+00
## 0.7 0.3 1.208311e+01 0.5276243 8.088323e+00
## 0.7 0.4 1.251388e+01 0.5323363 8.476564e+00
## 0.7 0.5 1.289290e+01 0.5326997 8.757181e+00
## 0.7 0.6 1.310979e+01 0.5364246 8.943365e+00
## 0.7 0.7 1.333937e+01 0.5396325 9.277415e+00
## 0.7 0.8 1.357027e+01 0.5415626 9.661536e+00
## 0.7 0.9 1.377747e+01 0.5429873 9.964952e+00
## 0.7 1.0 1.399695e+01 0.5424790 1.024426e+01
## 0.8 0.0 1.605289e+01 NaN 1.311468e+01
## 0.8 0.1 1.242505e+01 0.4990176 9.856433e+00
## 0.8 0.2 1.187427e+01 0.5084516 8.114485e+00
## 0.8 0.3 1.223352e+01 0.5267246 8.099047e+00
## 0.8 0.4 1.274079e+01 0.5317895 8.552355e+00
## 0.8 0.5 1.319213e+01 0.5323682 8.899139e+00
## 0.8 0.6 1.347109e+01 0.5359483 9.143150e+00
## 0.8 0.7 1.374226e+01 0.5391840 9.547237e+00
## 0.8 0.8 1.401282e+01 0.5413733 9.976749e+00
## 0.8 0.9 1.424731e+01 0.5429638 1.033446e+01
## 0.8 1.0 1.448535e+01 0.5429824 1.064620e+01
## 0.9 0.0 1.605289e+01 NaN 1.311468e+01
## 0.9 0.1 1.241084e+01 0.4984631 9.837369e+00
## 0.9 0.2 1.191265e+01 0.5076839 8.066954e+00
## 0.9 0.3 1.239418e+01 0.5257997 8.125098e+00
## 0.9 0.4 1.297862e+01 0.5310715 8.654536e+00
## 0.9 0.5 1.350613e+01 0.5317670 9.105192e+00
## 0.9 0.6 1.386112e+01 0.5350888 9.454870e+00
## 0.9 0.7 1.417284e+01 0.5384760 9.873698e+00
## 0.9 0.8 1.448001e+01 0.5408636 1.032868e+01
## 0.9 0.9 1.474556e+01 0.5425441 1.072039e+01
## 0.9 1.0 1.500394e+01 0.5430317 1.104976e+01
## 1.0 0.0 1.605289e+01 NaN 1.311468e+01
## 1.0 0.1 1.239543e+01 0.4979113 9.813566e+00
## 1.0 0.2 1.195463e+01 0.5072525 8.019172e+00
## 1.0 0.3 1.256500e+01 0.5249396 8.175630e+00
## 1.0 0.4 1.322902e+01 0.5302971 8.782117e+00
## 1.0 0.5 1.384047e+01 0.5309961 9.369314e+00
## 1.0 0.6 1.427143e+01 0.5340739 9.791762e+00
## 1.0 0.7 1.462591e+01 0.5375252 1.025180e+01
## 1.0 0.8 1.496755e+01 0.5401210 1.072109e+01
## 1.0 0.9 1.526609e+01 0.5419639 1.113243e+01
## 1.0 1.0 1.554593e+01 0.5427731 1.146808e+01
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 1 and lambda = 0.9.
plot(elasticnet_model, main = "R-square Elastic Net Model")
set.seed(534)
#Predictions
predictions_elasticnet <- predict(elasticnet_model, test)
#Model performance
results_elasticnet <- data.frame(Model = "Elastic Net",
RMSE = caret::RMSE(predictions_elasticnet, test$permeability),
Rsquared = caret::R2(predictions_elasticnet, test$permeability),
MAE = caret::MAE(predictions_elasticnet, test$permeability))
results_elasticnet
## Model RMSE Rsquared MAE
## permeability Elastic Net 13.60161 0.473866 10.31296
Comparison of the models
compmod<-rbind(results, results_pcr, results_ridge, results_lasso, results_elasticnet)
#compmod<-comp[-1]
compmod%>%kable(row.names = F)%>% kable_styling(full_width = FALSE)
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| PLS | 9.931766e+00 | 0.4813456 | 7.404271e+00 |
| PCR | 9.784160e+00 | 0.4838728 | 7.612136e+00 |
| Ridge | 1.127315e+01 | 0.4742014 | 8.297782e+00 |
| Lasso | 3.599179e+07 | 0.0079207 | 5.194982e+06 |
| Elastic Net | 1.360161e+01 | 0.4738660 | 1.031296e+01 |
From the table, PLS Model is doing better if we see the values of RMSE and MAE compared to other models. Notice that \(R^2\) is lower in PLS but \(R^2\) it can be increased with adding even an insignificant predictor. Therefore, PLS is a good model to choose from here.
Would you recommend any of your models to replace the permeability laboratory experiment?
The \(R^2\) value seems to be low for a physical process so I would not use this.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, the manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
The matrix process Predictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
library(Amelia)
missmap(ChemicalManufacturingProcess)
The missing values are insignificant as a % and seem to be missing at random. I will be using kNN to impute the missing values.
pre_process <-preProcess(ChemicalManufacturingProcess[, -c(1)], method = "knnImpute")
chemical_imp <- predict(pre_process, ChemicalManufacturingProcess[, -c(1)])
print('No of missing values is')
## [1] "No of missing values is"
sum(is.na(chemical_imp))
## [1] 0
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
It is better to filter out pairs with high correlation first, we use a threshold of 0.90.
correlations <- cor(chemical_imp)
highCorr <- findCorrelation(correlations, cutoff = .9)
chemical_imp <- chemical_imp[, -highCorr]
We will center and scale the data
(pre.process = preProcess(chemical_imp, method = c("BoxCox", "center", "scale")))
## Created from 176 samples and 47 variables
##
## Pre-processing:
## - centered (47)
## - ignored (0)
## - scaled (47)
Split the data
set.seed(675)
#Splitting data into training and test datasets
splitt <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.7, list=FALSE)
X_train <- chemical_imp[splitt, ]
y_train <- ChemicalManufacturingProcess$Yield[splitt]
X_test <- chemical_imp[-splitt, ]
y_test <- ChemicalManufacturingProcess$Yield[-splitt]
We will use the PLS model here.
model_pls <- train(X_train, y_train, method='pls', metric='RMSE',
tuneLength=20, trControl = trainControl(method='cv'))
model_pls
## Partial Least Squares
##
## 124 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 112, 111, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.287871 0.5025434 1.0625987
## 2 1.199313 0.5513777 0.9759255
## 3 1.188637 0.5731375 0.9588816
## 4 1.186537 0.5800785 0.9644993
## 5 1.200505 0.5753360 0.9697254
## 6 1.222620 0.5638443 0.9804865
## 7 1.247016 0.5588769 1.0110444
## 8 1.290894 0.5437929 1.0572265
## 9 1.333730 0.5307537 1.0891686
## 10 1.378266 0.5198481 1.1076451
## 11 1.412787 0.5103675 1.1317620
## 12 1.450614 0.4995396 1.1558610
## 13 1.475706 0.5014033 1.1639757
## 14 1.505213 0.4957886 1.1723211
## 15 1.531410 0.4934102 1.1786340
## 16 1.554586 0.4913442 1.1898845
## 17 1.587931 0.4859518 1.2041464
## 18 1.635204 0.4805996 1.2266924
## 19 1.703957 0.4761138 1.2556180
## 20 1.814303 0.4688571 1.3096496
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
plot(model_pls)
The optimal value of ncomp chosen is 4 and that gives the lowest RMSE at 1.186537 and \(R^2\) of 0.58.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
pls_predict <- predict(model_pls, X_test)
postResample(pls_predict, y_test)
## RMSE Rsquared MAE
## 1.2094706 0.5892557 0.9925435
The RMSE has gone up slightly to 1.209. \(R^2\) remains almost the same.
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
plot(varImp(model_pls), top =10)
The Manufacturing Processes dominate the list for top 20 in terms of variable importance, with a few biological materials showing up in the list. The top 2 are ManufacturingProcess32 and ManufacturingProcess13.
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
model_pls$finalModel$coefficients
## , , 1 comps
##
## .outcome
## BiologicalMaterial01 0.100044110
## BiologicalMaterial03 0.117074112
## BiologicalMaterial05 0.040799770
## BiologicalMaterial06 0.128834635
## BiologicalMaterial07 0.000000000
## BiologicalMaterial08 0.114337069
## BiologicalMaterial09 0.030432658
## BiologicalMaterial10 0.061667893
## BiologicalMaterial11 0.107852468
## ManufacturingProcess01 -0.036455446
## ManufacturingProcess02 -0.053011666
## ManufacturingProcess03 -0.023009466
## ManufacturingProcess04 -0.087867874
## ManufacturingProcess05 0.028525895
## ManufacturingProcess06 0.101885833
## ManufacturingProcess07 0.002408217
## ManufacturingProcess08 0.018957463
## ManufacturingProcess09 0.121702177
## ManufacturingProcess10 0.071796874
## ManufacturingProcess11 0.080990064
## ManufacturingProcess12 0.103662554
## ManufacturingProcess13 -0.138025156
## ManufacturingProcess14 -0.028408217
## ManufacturingProcess15 0.028122045
## ManufacturingProcess16 -0.015900173
## ManufacturingProcess17 -0.096513371
## ManufacturingProcess19 0.037533864
## ManufacturingProcess20 -0.021440596
## ManufacturingProcess21 0.025267641
## ManufacturingProcess22 0.001851531
## ManufacturingProcess23 -0.037334823
## ManufacturingProcess24 -0.061518814
## ManufacturingProcess26 0.011409920
## ManufacturingProcess28 0.076658387
## ManufacturingProcess30 0.050915236
## ManufacturingProcess32 0.152264786
## ManufacturingProcess33 0.097762636
## ManufacturingProcess34 0.065201061
## ManufacturingProcess35 -0.027222280
## ManufacturingProcess36 -0.119421577
## ManufacturingProcess37 -0.025731448
## ManufacturingProcess38 -0.016182959
## ManufacturingProcess39 0.002845507
## ManufacturingProcess41 -0.014353698
## ManufacturingProcess43 0.049582863
## ManufacturingProcess44 0.006956481
## ManufacturingProcess45 0.006259375
##
## , , 2 comps
##
## .outcome
## BiologicalMaterial01 0.047351099
## BiologicalMaterial03 0.124763334
## BiologicalMaterial05 0.019550329
## BiologicalMaterial06 0.117951499
## BiologicalMaterial07 0.000000000
## BiologicalMaterial08 0.073754046
## BiologicalMaterial09 -0.003844060
## BiologicalMaterial10 -0.007022295
## BiologicalMaterial11 0.074751085
## ManufacturingProcess01 0.006098761
## ManufacturingProcess02 0.019246222
## ManufacturingProcess03 -0.039744769
## ManufacturingProcess04 -0.059907697
## ManufacturingProcess05 0.003770578
## ManufacturingProcess06 0.137628637
## ManufacturingProcess07 0.012898871
## ManufacturingProcess08 0.064643104
## ManufacturingProcess09 0.219879504
## ManufacturingProcess10 0.098863980
## ManufacturingProcess11 0.116494989
## ManufacturingProcess12 0.167363514
## ManufacturingProcess13 -0.264810622
## ManufacturingProcess14 -0.061211261
## ManufacturingProcess15 0.015817166
## ManufacturingProcess16 -0.014537542
## ManufacturingProcess17 -0.231120929
## ManufacturingProcess19 0.011323635
## ManufacturingProcess20 0.012665854
## ManufacturingProcess21 -0.025221306
## ManufacturingProcess22 0.041494556
## ManufacturingProcess23 -0.019856992
## ManufacturingProcess24 -0.066237188
## ManufacturingProcess26 -0.006392833
## ManufacturingProcess28 0.020537751
## ManufacturingProcess30 0.066578397
## ManufacturingProcess32 0.219260350
## ManufacturingProcess33 0.100903841
## ManufacturingProcess34 0.161186184
## ManufacturingProcess35 -0.022006379
## ManufacturingProcess36 -0.152897693
## ManufacturingProcess37 -0.091013809
## ManufacturingProcess38 -0.013093473
## ManufacturingProcess39 0.035821571
## ManufacturingProcess41 -0.038435744
## ManufacturingProcess43 0.057609380
## ManufacturingProcess44 0.036834531
## ManufacturingProcess45 0.046146908
##
## , , 3 comps
##
## .outcome
## BiologicalMaterial01 0.016572795
## BiologicalMaterial03 0.150083196
## BiologicalMaterial05 0.043454468
## BiologicalMaterial06 0.140454724
## BiologicalMaterial07 0.000000000
## BiologicalMaterial08 0.061453931
## BiologicalMaterial09 -0.057811504
## BiologicalMaterial10 -0.048016749
## BiologicalMaterial11 0.079335601
## ManufacturingProcess01 0.043354791
## ManufacturingProcess02 0.048686569
## ManufacturingProcess03 -0.015350798
## ManufacturingProcess04 0.024607847
## ManufacturingProcess05 0.005939617
## ManufacturingProcess06 0.101086717
## ManufacturingProcess07 -0.015277141
## ManufacturingProcess08 0.101060679
## ManufacturingProcess09 0.247814706
## ManufacturingProcess10 0.043881434
## ManufacturingProcess11 0.059989215
## ManufacturingProcess12 0.133617942
## ManufacturingProcess13 -0.300641487
## ManufacturingProcess14 0.009633348
## ManufacturingProcess15 0.101229928
## ManufacturingProcess16 -0.017147063
## ManufacturingProcess17 -0.289032251
## ManufacturingProcess19 0.110816434
## ManufacturingProcess20 0.084072234
## ManufacturingProcess21 -0.071356748
## ManufacturingProcess22 0.098172284
## ManufacturingProcess23 -0.008934295
## ManufacturingProcess24 -0.067210642
## ManufacturingProcess26 -0.008322115
## ManufacturingProcess28 -0.058122914
## ManufacturingProcess30 0.023832260
## ManufacturingProcess32 0.363607345
## ManufacturingProcess33 0.160931887
## ManufacturingProcess34 0.254342163
## ManufacturingProcess35 -0.024596007
## ManufacturingProcess36 -0.229737554
## ManufacturingProcess37 -0.180972255
## ManufacturingProcess38 -0.026622142
## ManufacturingProcess39 0.101741214
## ManufacturingProcess41 -0.054764475
## ManufacturingProcess43 0.051824888
## ManufacturingProcess44 0.081864819
## ManufacturingProcess45 0.129008785
##
## , , 4 comps
##
## .outcome
## BiologicalMaterial01 0.004936713
## BiologicalMaterial03 0.143246348
## BiologicalMaterial05 0.062738627
## BiologicalMaterial06 0.152371830
## BiologicalMaterial07 0.000000000
## BiologicalMaterial08 0.063700161
## BiologicalMaterial09 -0.128720687
## BiologicalMaterial10 -0.076435291
## BiologicalMaterial11 0.056360030
## ManufacturingProcess01 0.085843807
## ManufacturingProcess02 0.042783747
## ManufacturingProcess03 0.021176812
## ManufacturingProcess04 0.131433333
## ManufacturingProcess05 0.048267317
## ManufacturingProcess06 0.059392645
## ManufacturingProcess07 -0.079238893
## ManufacturingProcess08 0.091052719
## ManufacturingProcess09 0.309875891
## ManufacturingProcess10 0.086401272
## ManufacturingProcess11 0.052338856
## ManufacturingProcess12 0.060362044
## ManufacturingProcess13 -0.352444434
## ManufacturingProcess14 -0.007647458
## ManufacturingProcess15 0.126463685
## ManufacturingProcess16 -0.050849753
## ManufacturingProcess17 -0.299693048
## ManufacturingProcess19 0.212194069
## ManufacturingProcess20 0.099737541
## ManufacturingProcess21 -0.020877521
## ManufacturingProcess22 0.081721654
## ManufacturingProcess23 -0.050317947
## ManufacturingProcess24 -0.045885155
## ManufacturingProcess26 0.022289091
## ManufacturingProcess28 -0.131814818
## ManufacturingProcess30 0.021207184
## ManufacturingProcess32 0.444560183
## ManufacturingProcess33 0.160418879
## ManufacturingProcess34 0.325246102
## ManufacturingProcess35 -0.030904398
## ManufacturingProcess36 -0.245350049
## ManufacturingProcess37 -0.209494711
## ManufacturingProcess38 -0.091421304
## ManufacturingProcess39 0.137449887
## ManufacturingProcess41 -0.066613707
## ManufacturingProcess43 0.064631344
## ManufacturingProcess44 0.064103457
## ManufacturingProcess45 0.157706627
According to the coefficients above, ManufacturingProcess32 has the highest positive relationship followed by ManufacturingProcess36 (negative impact).
ManufacturingProcess32 improved the yield tremendously, and it has the highest, positive correlation than the other variables in the model. The ManufacturingProcess32 coefficient in the regression equation is 0.445 which shows the units the yield increases for every additional unit of ManufacturingProcess32.
cor(ChemicalManufacturingProcess$Yield,
ChemicalManufacturingProcess$ManufacturingProcess32)
## [1] 0.6083321
From the negative coefficients, ManufacturingProcess13 affected the yield tremendously, and it has a negative correlation. The ManufacturingProcess13 coefficient in the regression equation is -0.35 which shows the yield decrease for every additional unit of ManufacturingProcess13.
cor(ChemicalManufacturingProcess$Yield,
ChemicalManufacturingProcess$ManufacturingProcess13)
## [1] -0.5036797