6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
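library(caret)     # nearZeroVar(), createDataPartition(), train(), postResample()
library(magrittr)  # provides the %>% pipe
remove.cols <- nearZeroVar(fingerprints)  # column indices of near-zero-variance predictors
X <- fingerprints[, -remove.cols]
length(remove.cols) %>% paste("predictors removed from fingerprint matrix")
## [1] "719 predictors removed from fingerprint matrix"
ncol(X) %>% paste("predictors remain from fingerprint matrix")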
## [1] "388 predictors remain from fingerprint matrix"
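To make the filter less of a black box, here is a minimal base-R sketch of the rule that nearZeroVar applies with its default cutoffs (freqCut = 95/5, uniqueCut = 10). The matrix `fp` and the helpers `freq_ratio` and `percent_unique` are made up for illustration and are not caret's internal code; in particular, the handling of a constant column is simplified here.

```r
# Toy binary fingerprint matrix: 100 made-up "compounds" x 3 "substructures".
fp <- cbind(
  common = rep(c(0, 1), times = c(60, 40)),  # 60/40 split
  rare   = rep(c(0, 1), times = c(98, 2)),   # present in only 2 rows
  absent = rep(0, 100)                       # constant column
)

# Ratio of the most common value's count to the second most common value's count.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # simplification for constant columns
  as.numeric(counts[1] / counts[2])
}

# Number of distinct values as a percentage of the number of rows.
percent_unique <- function(x) 100 * length(unique(x)) / length(x)

fr <- apply(fp, 2, freq_ratio)
pu <- apply(fp, 2, percent_unique)

# A column is flagged when it is both very unbalanced and has few distinct values.
flagged <- (fr > 95 / 5) & (pu <= 10)
print(data.frame(freqRatio = fr, percentUnique = pu, flagged = flagged))
```

Under these defaults, a binary column over 165 compounds would be flagged when its rarer value occurs in 8 or fewer compounds, since (165 - m) / m > 19 only when m < 8.25.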
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?
set.seed(1)
trainRow <- createDataPartition(permeability, p=0.8, list=FALSE)
X.train <- X[trainRow, ]
y.train <- permeability[trainRow, ]
X.test <- X[-trainRow, ]
y.test <- permeability[-trainRow, ]
set.seed(1)
plsFit <- train(x=X.train,
y=y.train,
method='pls',
metric='Rsquared',
tuneLength=20,
trControl=trainControl(method='cv'),
preProcess=c('center', 'scale')
)
plsResult <- plsFit$results
plsFit
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.43441 0.3057483 10.026064
## 2 12.25160 0.4403062 8.723873
## 3 11.73659 0.4724559 8.888187
## 4 11.62749 0.4754973 8.803811
## 5 11.48531 0.4719510 8.551965
## 6 11.32229 0.4847234 8.355657
## 7 11.24192 0.4945511 8.449710
## 8 11.11175 0.5094459 8.540873
## 9 10.99359 0.5256837 8.441846
## 10 11.08075 0.5276792 8.420112
## 11 11.28732 0.5235339 8.570830
## 12 11.22365 0.5307880 8.617399
## 13 11.45069 0.5207469 8.850330
## 14 11.70270 0.5161730 9.115517
## 15 11.96240 0.4954513 9.269173
## 16 12.22588 0.4803732 9.467217
## 17 12.64388 0.4616823 9.853870
## 18 12.97678 0.4481705 10.176509
## 19 13.16823 0.4433375 10.285782
## 20 13.51347 0.4262324 10.524854
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 12.
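Using R^2 as the selection metric, 10-fold CV picks ncomp = 12 as optimal, with a resampled R^2 of 0.5307880. The resampled RMSE, by contrast, is lowest at ncomp = 9 (10.99359), so the choice of tuning metric matters here.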
plot(plsFit)
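The selection can be reproduced from the resampling table itself. The sketch below copies five rows (ncomp 9 to 13) from the printed output into a hypothetical data frame `pls_results`; in a live session the same lookup could be run on plsFit$results, and plsFit$bestTune holds the value caret chose.

```r
# Rows for ncomp = 9..13 copied by hand from the resampling table above.
pls_results <- data.frame(
  ncomp    = 9:13,
  RMSE     = c(10.99359, 11.08075, 11.28732, 11.22365, 11.45069),
  Rsquared = c(0.5256837, 0.5276792, 0.5235339, 0.5307880, 0.5207469)
)

# Row that maximizes R^2 (the metric used in the train() call).
best_r2 <- pls_results[which.max(pls_results$Rsquared), ]
# Row that minimizes RMSE (what metric = "RMSE" would have picked).
best_rmse <- pls_results[which.min(pls_results$RMSE), ]

print(best_r2)
print(best_rmse)
```

The two criteria disagree on this table, which is worth keeping in mind when comparing models tuned on different metrics.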
(d) Predict the response for the test set. What is the test set estimate of R^2?
plsPred <- predict(plsFit, newdata=X.test)
postResample(pred=plsPred, obs=y.test)
## RMSE Rsquared MAE
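## 12.4780591 0.3536607 8.7204595
The test set estimate of R^2 is 0.3536607, noticeably lower than the resampled estimate of 0.5307880.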
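For reference, the three numbers reported by postResample can be computed by hand. This is a base-R sketch on made-up vectors `obs` and `pred` (not the permeability data); it assumes, as caret does by default, that R^2 is the squared correlation between predictions and observations, which is why it can never be negative.

```r
# Made-up observed and predicted values, for illustration only.
obs  <- c(1.2, 0.8, 10.5, 25.0, 3.3)
pred <- c(2.0, 1.5,  8.0, 20.0, 5.0)

metrics <- c(
  RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
  Rsquared = cor(pred, obs)^2,            # squared Pearson correlation
  MAE      = mean(abs(pred - obs))        # mean absolute error
)
print(round(metrics, 4))
```

Because R^2 here measures correlation and not agreement, a model can have a respectable R^2 and still be badly calibrated, so RMSE and MAE are worth reading alongside it.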
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
Three additional models are attempted:
– Ridge regression, parameter tuned: lambda (from 0 to 1 by 0.1)
– Lasso, parameter tuned: fraction (from 0 to 0.5 by 0.05)
– Elastic net, parameters tuned: fraction and lambda (2-D grid, each dimension from 0 to 1 by 0.1)
I set the same seed before fitting each model so that their CV folds are identical. This way, I can use the resamples function to compare all 4 models at once. R^2 is the selection metric in all cases.
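The reason a shared seed gives shared folds is that the fold assignment is drawn from R's random number generator. The sketch below illustrates the idea with a hypothetical helper `assign_folds`; it is not caret's internal code (caret's createFolds also stratifies on the outcome), only a demonstration that resetting the seed reproduces the same partition.

```r
# Hypothetical helper: randomly assign n rows to k folds of near-equal size.
assign_folds <- function(n, k = 10, seed = 1) {
  set.seed(seed)                          # reset the RNG before drawing
  sample(rep(seq_len(k), length.out = n)) # shuffled fold labels 1..k
}

folds_a <- assign_folds(133)  # 133 = size of the training set above
folds_b <- assign_folds(133)  # same seed, so the same assignment

print(identical(folds_a, folds_b))
print(table(folds_a))
```

Identical folds matter because resamples compares models fold by fold, so differences between models are not confounded with differences between partitions.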
set.seed(1)
ridgeFit <- train(x=X.train,
y=y.train,
method='ridge',
metric='Rsquared',
tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold01: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold07: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold10: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
ridgeFit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 5356.11572 0.2486749 2880.252316
## 0.1 11.77578 0.5105327 9.055483
## 0.2 11.58842 0.5332129 8.856245
## 0.3 11.79544 0.5390658 9.011344
## 0.4 12.15804 0.5403632 9.281896
## 0.5 12.62564 0.5397839 9.662863
## 0.6 13.16666 0.5383912 10.126622
## 0.7 13.76370 0.5366428 10.603431
## 0.8 14.40455 0.5347592 11.112877
## 0.9 15.08030 0.5328606 11.635311
## 1.0 15.78236 0.5309934 12.222635
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.4.
plot(ridgeFit)
set.seed(1)
lassoFit <- train(x=X.train,
y=y.train,
method='lasso',
metric='Rsquared',
tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold01: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold07: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold10: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = X.train, y = y.train, method = "lasso", metric =
## "Rsquared", : missing values found in aggregated results
lassoFit
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 15.14497 NaN 12.18338
## 0.05 282.05852 0.2771002 159.77875
## 0.10 547.13547 0.2949129 302.64198
## 0.15 813.24271 0.3118860 445.37620
## 0.20 1079.83629 0.3166240 588.46142
## 0.25 1346.59535 0.3133758 731.73650
## 0.30 1613.53589 0.3168971 874.88495
## 0.35 1880.48829 0.3164859 1018.02344
## 0.40 2147.61922 0.3111799 1161.26767
## 0.45 2414.94636 0.3036532 1304.60396
## 0.50 2682.39241 0.2976325 1447.90368
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.3.
set.seed(1)
enetFit <- train(x=X.train,
y=y.train,
method='enet',
metric='Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold01: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold07: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold10: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = X.train, y = y.train, method = "enet", metric =
## "Rsquared", : missing values found in aggregated results
enetFit
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 15.14497 NaN 12.183383
## 0.0 0.1 547.13547 0.2949129 302.641985
## 0.0 0.2 1079.83629 0.3166240 588.461421
## 0.0 0.3 1613.53589 0.3168971 874.884947
## 0.0 0.4 2147.61922 0.3111799 1161.267665
## 0.0 0.5 2682.39241 0.2976325 1447.903684
## 0.0 0.6 3217.32188 0.2800790 1734.533604
## 0.0 0.7 3752.19169 0.2671443 2021.045306
## 0.0 0.8 4287.02976 0.2577913 2307.562509
## 0.0 0.9 4821.82852 0.2515728 2594.010564
## 0.0 1.0 5356.11572 0.2486749 2880.252316
## 0.1 0.0 15.69625 NaN 12.538724
## 0.1 0.1 12.20748 0.4440476 8.590043
## 0.1 0.2 11.62592 0.4685590 8.404613
## 0.1 0.3 11.05009 0.5101069 8.133017
## 0.1 0.4 10.86508 0.5280515 8.038032
## 0.1 0.5 10.79665 0.5426627 8.038752
## 0.1 0.6 10.91529 0.5415457 8.202280
## 0.1 0.7 11.13653 0.5336798 8.434152
## 0.1 0.8 11.35453 0.5257101 8.664111
## 0.1 0.9 11.56772 0.5180970 8.868485
## 0.1 1.0 11.77578 0.5105327 9.055483
## 0.2 0.0 15.69625 NaN 12.538724
## 0.2 0.1 12.36784 0.4390558 8.736203
## 0.2 0.2 11.88382 0.4577598 8.498157
## 0.2 0.3 11.34436 0.4951866 8.302594
## 0.2 0.4 11.10231 0.5168465 8.177826
## 0.2 0.5 11.03692 0.5296733 8.198425
## 0.2 0.6 10.98737 0.5429416 8.221419
## 0.2 0.7 11.09910 0.5445789 8.382248
## 0.2 0.8 11.25742 0.5418776 8.557132
## 0.2 0.9 11.41327 0.5382843 8.713038
## 0.2 1.0 11.58842 0.5332129 8.856245
## 0.3 0.0 15.69625 NaN 12.538724
## 0.3 0.1 12.44302 0.4353294 8.806184
## 0.3 0.2 12.03368 0.4570275 8.548660
## 0.3 0.3 11.61572 0.4872069 8.445786
## 0.3 0.4 11.37440 0.5090081 8.329739
## 0.3 0.5 11.28840 0.5244474 8.348912
## 0.3 0.6 11.24486 0.5386504 8.422953
## 0.3 0.7 11.31903 0.5448835 8.551257
## 0.3 0.8 11.47106 0.5444677 8.709536
## 0.3 0.9 11.63535 0.5423500 8.873215
## 0.3 1.0 11.79544 0.5390658 9.011344
## 0.4 0.0 15.69625 NaN 12.538724
## 0.4 0.1 12.46191 0.4358930 8.815372
## 0.4 0.2 12.17693 0.4582466 8.554661
## 0.4 0.3 11.88896 0.4821079 8.608029
## 0.4 0.4 11.67254 0.5037734 8.517484
## 0.4 0.5 11.59686 0.5198163 8.554244
## 0.4 0.6 11.60051 0.5328710 8.665629
## 0.4 0.7 11.67020 0.5414568 8.803825
## 0.4 0.8 11.81356 0.5437613 8.940722
## 0.4 0.9 11.98601 0.5429672 9.103568
## 0.4 1.0 12.15804 0.5403632 9.281896
## 0.5 0.0 15.69625 NaN 12.538724
## 0.5 0.1 12.47802 0.4353104 8.808893
## 0.5 0.2 12.29626 0.4608017 8.532296
## 0.5 0.3 12.18054 0.4789290 8.782376
## 0.5 0.4 12.00209 0.5003660 8.762006
## 0.5 0.5 11.96956 0.5156565 8.818677
## 0.5 0.6 12.02218 0.5274183 8.947892
## 0.5 0.7 12.10691 0.5370474 9.115591
## 0.5 0.8 12.25483 0.5412436 9.281079
## 0.5 0.9 12.43242 0.5420301 9.467562
## 0.5 1.0 12.62564 0.5397839 9.662863
## 0.6 0.0 15.69625 NaN 12.538724
## 0.6 0.1 12.49756 0.4340608 8.797122
## 0.6 0.2 12.44274 0.4616598 8.530958
## 0.6 0.3 12.48650 0.4770240 8.960343
## 0.6 0.4 12.36182 0.4980510 9.034142
## 0.6 0.5 12.39563 0.5115920 9.110625
## 0.6 0.6 12.48444 0.5232870 9.265083
## 0.6 0.7 12.59923 0.5328637 9.467803
## 0.6 0.8 12.76209 0.5381950 9.677472
## 0.6 0.9 12.95750 0.5396472 9.908992
## 0.6 1.0 13.16666 0.5383912 10.126622
## 0.7 0.0 15.69625 NaN 12.538724
## 0.7 0.1 12.51610 0.4331791 8.774374
## 0.7 0.2 12.60467 0.4618883 8.545572
## 0.7 0.3 12.81294 0.4757621 9.138647
## 0.7 0.4 12.75434 0.4961645 9.305793
## 0.7 0.5 12.84907 0.5083508 9.419925
## 0.7 0.6 12.97825 0.5197860 9.615271
## 0.7 0.7 13.13597 0.5290128 9.863176
## 0.7 0.8 13.32407 0.5347350 10.118876
## 0.7 0.9 13.53749 0.5368695 10.365385
## 0.7 1.0 13.76370 0.5366428 10.603431
## 0.8 0.0 15.69625 NaN 12.538724
## 0.8 0.1 12.53490 0.4326035 8.745463
## 0.8 0.2 12.78927 0.4614059 8.587022
## 0.8 0.3 13.14645 0.4756061 9.310523
## 0.8 0.4 13.17713 0.4945085 9.581096
## 0.8 0.5 13.33199 0.5057091 9.744770
## 0.8 0.6 13.50336 0.5168459 10.009297
## 0.8 0.7 13.70513 0.5257158 10.301729
## 0.8 0.8 13.92383 0.5314463 10.587227
## 0.8 0.9 14.15477 0.5343439 10.849712
## 0.8 1.0 14.40455 0.5347592 11.112877
## 0.9 0.0 15.69625 NaN 12.538724
## 0.9 0.1 12.54943 0.4324329 8.704999
## 0.9 0.2 12.99197 0.4608802 8.653472
## 0.9 0.3 13.49650 0.4760353 9.524746
## 0.9 0.4 13.62774 0.4930213 9.891082
## 0.9 0.5 13.84359 0.5033817 10.134488
## 0.9 0.6 14.06161 0.5140795 10.465220
## 0.9 0.7 14.30670 0.5226521 10.792990
## 0.9 0.8 14.55448 0.5283852 11.079886
## 0.9 0.9 14.80839 0.5318659 11.350021
## 0.9 1.0 15.08030 0.5328606 11.635311
## 1.0 0.0 15.69625 NaN 12.538724
## 1.0 0.1 12.56901 0.4322952 8.664635
## 1.0 0.2 13.21208 0.4601537 8.737332
## 1.0 0.3 13.86592 0.4764227 9.821570
## 1.0 0.4 14.10547 0.4915907 10.276441
## 1.0 0.5 14.37680 0.5013566 10.578398
## 1.0 0.6 14.64452 0.5116315 10.955812
## 1.0 0.7 14.93333 0.5198521 11.315343
## 1.0 0.8 15.20941 0.5256051 11.606150
## 1.0 0.9 15.48734 0.5295564 11.912368
## 1.0 1.0 15.78236 0.5309934 12.222635
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.7 and lambda = 0.3.
plot(enetFit)
resamp <- resamples(list(PLS=plsFit, Ridge=ridgeFit, Lasso=lassoFit, enet=enetFit))
(resamp.s <- summary(resamp))
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, Ridge, Lasso, enet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 5.178501 7.188415 8.333728 8.617399 10.302540 11.69573 0
## Ridge 6.795657 7.680192 9.047882 9.281896 11.027027 11.95240 0
## Lasso 6.386738 9.984898 27.172988 874.884947 58.298420 5954.06827 3
## enet 6.504255 7.274734 8.430117 8.551257 9.611989 11.46330 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 6.535488 9.638799 11.53551 11.22365 13.31952 14.06551 0
## Ridge 9.902231 10.449327 11.48375 12.15804 13.75999 15.48795 0
## Lasso 8.064446 13.248283 36.79488 1613.53589 141.47516 10940.44499 3
## enet 9.348709 9.943900 10.86659 11.31903 12.37915 14.11738 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 1.578591e-01 0.45401817 0.5665093 0.5307880 0.6136715 0.9158369 0
## Ridge 1.831412e-01 0.40193440 0.5981921 0.5403632 0.6504959 0.8220435 0
## Lasso 1.303612e-05 0.08585287 0.3422656 0.3168971 0.4176661 0.8689629 3
## enet 2.178972e-01 0.39405040 0.5593544 0.5448835 0.6903328 0.8208629 0
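Inference: the model with the highest mean resampled R^2 is the elastic net, with R^2 = 0.5448835, narrowly ahead of ridge (0.5403632) and PLS (0.5307880). The lasso is far behind (0.3168971) and has missing values in 3 of its 10 resamples.
Also evaluating the models using the test set: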
plsPred <- predict(plsFit, newdata=X.test)
postResample(pred=plsPred, obs=y.test)
## RMSE Rsquared MAE
## 12.4780591 0.3536607 8.7204595
# Evaluate each fitted caret model on the same hold-out data.
multiResample <- function(models, newdata, obs) {
  res <- lapply(models, function(model) {
    postResample(pred = predict(model, newdata = newdata), obs = obs)
  })
  # Name each result by the caret method string, e.g. "pls" or "ridge".
  names(res) <- vapply(models, function(model) model$method, character(1))
  res
}
models <- list(plsFit, ridgeFit, lassoFit, enetFit)
(resampleResult <- multiResample(models, X.test, y.test))
## $pls
## RMSE Rsquared MAE
## 12.4780591 0.3536607 8.7204595
##
## $ridge
## RMSE Rsquared MAE
## 12.3734550 0.4254119 8.8972712
##
## $lasso
## RMSE Rsquared MAE
## 13.4125805 0.2963818 9.3458962
##
## $enet
## RMSE Rsquared MAE
## 11.7077013 0.4250689 8.3796258
Evaluation on the test set puts ridge and the elastic net in a near tie on R^2 (0.4254119 versus 0.4250689), while the elastic net has the lowest RMSE and MAE. Here we seem to have a mild dilemma: 10-fold cross-validation favors the elastic net, while the test set R^2 marginally favors ridge. I would choose to trust the cross-validation result, because it averages over 10 held-out folds, whereas the test set is a single hold-out sample of 32 compounds and is therefore a noisier estimate.
On the test set, ridge and the elastic net both score a higher R^2 than PLS (0.3536607), but the lasso scores lower (0.2963818), so only those two models improve on PLS.
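Putting the four test-set results side by side makes the comparison easier to read. The sketch below re-enters the printed values by hand into a matrix named `test_metrics` (a name chosen here for illustration); in a live session, do.call(rbind, resampleResult) should give the same layout.

```r
# Test-set metrics copied from the output above, one row per model.
test_metrics <- rbind(
  pls   = c(RMSE = 12.4780591, Rsquared = 0.3536607, MAE = 8.7204595),
  ridge = c(RMSE = 12.3734550, Rsquared = 0.4254119, MAE = 8.8972712),
  lasso = c(RMSE = 13.4125805, Rsquared = 0.2963818, MAE = 9.3458962),
  enet  = c(RMSE = 11.7077013, Rsquared = 0.4250689, MAE = 8.3796258)
)

# Best model per metric: lowest error, highest R^2.
best <- c(
  RMSE     = rownames(test_metrics)[which.min(test_metrics[, "RMSE"])],
  Rsquared = rownames(test_metrics)[which.max(test_metrics[, "Rsquared"])],
  MAE      = rownames(test_metrics)[which.min(test_metrics[, "MAE"])]
)
print(best)

# Size of the R^2 gap between ridge and the elastic net.
print(test_metrics["ridge", "Rsquared"] - test_metrics["enet", "Rsquared"])
```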
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
I would not recommend any of the models to replace the permeability laboratory experiment. The test set MAE of the models ranges from about 8.4 to 9.3, meaning that predictions are off by roughly 8 to 9 permeability units on average. Looking at the histogram of the target variable permeability:
hist(permeability)
Inference: most permeability values are under 10, so the typical prediction error is about as large as the typical response value. The models are not accurate enough to replace the lab test.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data.
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
(cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute')))
## Created from 152 samples and 57 variables
##
## Pre-processing:
## - bagged tree imputation (57)
## - ignored (0)
cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])
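Before and after imputing, it helps to quantify how much is actually missing. The following is a small base-R sketch on a made-up data frame `toy`; the same three calls could be pointed at the predictor columns of ChemicalManufacturingProcess and then at cmp to confirm that no NA values remain.

```r
# Made-up data frame with a few missing cells, for illustration only.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 2, 5),
  c = c(7, 8, 9, 10)
)

pct_missing   <- 100 * mean(is.na(toy))   # share of all cells that are NA
na_per_column <- colSums(is.na(toy))      # NA count for each column
complete_rows <- sum(complete.cases(toy)) # rows with no NA at all

print(pct_missing)
print(na_per_column)
print(complete_rows)
```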
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X.train <- cmp[trainRow, ]
y.train <- ChemicalManufacturingProcess$Yield[trainRow]
X.test <- cmp[-trainRow, ]
y.test <- ChemicalManufacturingProcess$Yield[-trainRow]
The elastic net model is tuned using 10-fold cross validation with parameters lambda ranging from 0 to 1, and fraction ranging from 0 to 1. The metric used to decide is the RMSE.
set.seed(1)
enetFit <- train(x=X.train,
y=y.train,
method='enet',
metric='RMSE',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enetFit)
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
enetPred <- predict(enetFit, newdata=X.test)
(predResult <- postResample(pred=enetPred, obs=y.test))
## RMSE Rsquared MAE
## 1.0264650 0.6979678 0.7817494
The test set RMSE is 1.0264650. This is lower than the resampled (cross-validated) RMSE on the training set, so the model performs better on the test set than resampling suggested. With a test set this small, a gap in this direction can easily be due to chance.
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The coefficients of the best-tuned elastic net model are shown below. The elastic net zeroes out some of the predictors because of its lasso (L1) penalty.
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
(coeffs <- predict.enet(enetFit$finalModel, s=enetFit$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients)
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 0.00000000 0.00000000 0.28496893
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 0.00000000 0.21331810 0.01512520
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## -0.04492429 0.07932604 -0.08810680
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## -0.15290025 0.00000000 0.00000000
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 0.31049470 -0.01165359 0.14112246
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## -0.14265019 0.00000000 0.54475730
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 0.00000000 0.08506511 0.05785315
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## -0.12522821 0.00000000 0.11697980
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 0.00000000 -0.17591409 0.03478354
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 0.11832431 0.00000000 0.00000000
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## -0.02215018 -0.03485616 0.00000000
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## -0.21966146 0.04417853 0.00000000
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 0.00000000 1.29604227 -0.38548635
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 0.01872111 -0.10998748 0.00000000
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## -0.31058395 0.00000000 0.09365101
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## -0.01665097 0.00000000 0.09031709
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 0.19136651 0.00000000 0.05809911
We can compare the non-zero coefficients by taking their absolute value, and then sorting them:
coeffs.sorted <- abs(coeffs)
coeffs.sorted <- coeffs.sorted[coeffs.sorted>0]
(coeffs.sorted <- sort(coeffs.sorted, decreasing = T))
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess33
## 1.29604227 0.54475730 0.38548635
## ManufacturingProcess37 ManufacturingProcess04 BiologicalMaterial03
## 0.31058395 0.31049470 0.28496893
## ManufacturingProcess28 BiologicalMaterial05 ManufacturingProcess43
## 0.21966146 0.21331810 0.19136651
## ManufacturingProcess17 BiologicalMaterial10 ManufacturingProcess07
## 0.17591409 0.15290025 0.14265019
## ManufacturingProcess06 ManufacturingProcess13 ManufacturingProcess19
## 0.14112246 0.12522821 0.11832431
## ManufacturingProcess15 ManufacturingProcess35 ManufacturingProcess39
## 0.11697980 0.10998748 0.09365101
## ManufacturingProcess42 BiologicalMaterial09 ManufacturingProcess11
## 0.09031709 0.08810680 0.08506511
## BiologicalMaterial08 ManufacturingProcess45 ManufacturingProcess12
## 0.07932604 0.05809911 0.05785315
## BiologicalMaterial07 ManufacturingProcess29 ManufacturingProcess23
## 0.04492429 0.04417853 0.03485616
## ManufacturingProcess18 ManufacturingProcess22 ManufacturingProcess34
## 0.03478354 0.02215018 0.01872111
## ManufacturingProcess40 BiologicalMaterial06 ManufacturingProcess05
## 0.01665097 0.01512520 0.01165359
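To see which group dominates, the coefficients can be tallied by predictor group. This sketch re-enters the 57 printed coefficients by hand into a vector named `coefs` (in a live session the `coeffs` object above could be used directly); the grouping by name prefix is an illustrative choice.

```r
# Coefficients copied from the printed output, in predictor order.
bio <- c(0, 0, 0.28496893, 0, 0.21331810, 0.01512520,
         -0.04492429, 0.07932604, -0.08810680, -0.15290025, 0, 0)
mp <- c(0, 0, 0,
        0.31049470, -0.01165359, 0.14112246,
        -0.14265019, 0, 0.54475730,
        0, 0.08506511, 0.05785315,
        -0.12522821, 0, 0.11697980,
        0, -0.17591409, 0.03478354,
        0.11832431, 0, 0,
        -0.02215018, -0.03485616, 0,
        0, 0, 0,
        -0.21966146, 0.04417853, 0,
        0, 1.29604227, -0.38548635,
        0.01872111, -0.10998748, 0,
        -0.31058395, 0, 0.09365101,
        -0.01665097, 0, 0.09031709,
        0.19136651, 0, 0.05809911)
coefs <- c(bio, mp)
names(coefs) <- c(sprintf("BiologicalMaterial%02d", 1:12),
                  sprintf("ManufacturingProcess%02d", 1:45))

# Cross-tabulate predictor group against whether the coefficient survived.
group  <- ifelse(grepl("^Biological", names(coefs)), "Biological", "Process")
status <- ifelse(coefs == 0, "zeroed", "kept")
counts <- table(group, status)
print(counts)

# Share of each group that was kept, and the five largest coefficients.
print(round(prop.table(counts, margin = 1), 3))
top5 <- names(sort(abs(coefs), decreasing = TRUE))[1:5]
print(top5)
```

Comparing the kept share within each group, and not only the raw counts, guards against the process predictors looking dominant simply because there are 45 of them against 12.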
We can conclude the following:
– 19 of the 45 ManufacturingProcess predictors are zeroed out (26 retained), while 5 of the 12 BiologicalMaterial predictors are zeroed out (7 retained).
– Of the 33 remaining predictors, 26 are ManufacturingProcess predictors and 7 are BiologicalMaterial predictors. Both groups keep about 58% of their members, so the count alone does not favor either group.
– The five largest absolute coefficients all belong to ManufacturingProcess predictors; the sixth is BiologicalMaterial03.
It appears that the ManufacturingProcess predictors are more important, mainly through the size of their coefficients.
Alternatively, the varImp function can be used to rank the importance of predictors:
varImp(enetFit)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 84.76
## ManufacturingProcess36 73.52
## BiologicalMaterial03 70.69
## ManufacturingProcess13 68.69
## BiologicalMaterial02 64.16
## ManufacturingProcess31 60.12
## BiologicalMaterial12 60.11
## ManufacturingProcess17 57.19
## ManufacturingProcess09 55.42
## ManufacturingProcess33 55.09
## BiologicalMaterial04 51.35
## ManufacturingProcess06 46.88
## BiologicalMaterial11 46.75
## ManufacturingProcess29 46.36
## BiologicalMaterial01 41.20
## BiologicalMaterial08 38.75
## ManufacturingProcess26 28.05
## BiologicalMaterial09 26.02
## ManufacturingProcess11 25.51
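Here, 11 of the 20 predictors listed are ManufacturingProcess predictors, a slim majority, and ManufacturingProcess32 again ranks first. Note that this ranking is the loess R^2 measure named in the output header, computed one predictor at a time and not from the elastic net coefficients, so it ranks some predictors highly (for example ManufacturingProcess36 and ManufacturingProcess31) even though the elastic net set their coefficients to zero. By this measure the two groups are more evenly matched: 9 of the 12 BiologicalMaterial predictors make the top 20.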
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Elastic net is a linear regression model, so the coefficients directly describe how the predictors relate to the predicted yield: raising a predictor with a positive coefficient raises the predicted yield, while raising one with a negative coefficient lowers it.
For the ManufacturingProcess predictors with positive coefficients, I would alter the process so that the predictor value increases. These predictors are:
coeffs.mp <- coeffs.sorted[grep('ManufacturingProcess', names(coeffs.sorted))] %>% names() %>% coeffs[.]
coeffs.mp[coeffs.mp>0]
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess04
## 1.29604227 0.54475730 0.31049470
## ManufacturingProcess43 ManufacturingProcess06 ManufacturingProcess19
## 0.19136651 0.14112246 0.11832431
## ManufacturingProcess15 ManufacturingProcess39 ManufacturingProcess42
## 0.11697980 0.09365101 0.09031709
## ManufacturingProcess11 ManufacturingProcess45 ManufacturingProcess12
## 0.08506511 0.05809911 0.05785315
## ManufacturingProcess29 ManufacturingProcess18 ManufacturingProcess34
## 0.04417853 0.03478354 0.01872111
For the ManufacturingProcess predictors with negative coefficients, I would alter the process so that the predictor value decreases. These predictors are:
coeffs.mp[coeffs.mp<0]
## ManufacturingProcess33 ManufacturingProcess37 ManufacturingProcess28
## -0.38548635 -0.31058395 -0.21966146
## ManufacturingProcess17 ManufacturingProcess07 ManufacturingProcess13
## -0.17591409 -0.14265019 -0.12522821
## ManufacturingProcess35 ManufacturingProcess23 ManufacturingProcess22
## -0.10998748 -0.03485616 -0.02215018
## ManufacturingProcess40 ManufacturingProcess05
## -0.01665097 -0.01165359
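To turn these coefficients into concrete process changes, it helps to express them in the original measurement units. The sketch below assumes that, because the model was fit on centered and scaled predictors, each coefficient is the change in predicted yield per one standard deviation of its predictor. The standard deviations in `sd_orig` are HYPOTHETICAL placeholders, not values from the data; in a live session they could be replaced with apply(X.train[, names(beta_per_sd)], 2, sd).

```r
# Two coefficients copied from the output above (assumed to be per standard deviation).
beta_per_sd <- c(ManufacturingProcess32 = 1.29604227,
                 ManufacturingProcess33 = -0.38548635)

# HYPOTHETICAL standard deviations in original units; replace with real values.
sd_orig <- c(ManufacturingProcess32 = 5.0,
             ManufacturingProcess33 = 2.5)

# Change in predicted yield per one original unit of each predictor.
beta_per_unit <- beta_per_sd / sd_orig
print(beta_per_unit)
```

With the real standard deviations in place, this gives a rough estimate of how far a process setting would need to move to change the predicted yield by 1%, within the range of settings seen in the training runs.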