library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(dplyr)
library(elasticnet)
library(ggplot2)
library(lattice)
Linear Regression
Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.
Start R and use these commands to load the data:
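The loading commands are omitted from the rendered output; a minimal sketch that reproduces the summaries below, assuming the standard AppliedPredictiveModeling data call:
#Loads the `fingerprints` and `permeability` objects
data(permeability)
summary(permeability)
str(fingerprints)
str(permeability)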
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
## num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr "permeability"
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
#Filter out the predictors that have low frequencies using nearZeroVar
fingerprints %>% nearZeroVar() %>% length()
## [1] 719
X <- fingerprints[, -nearZeroVar(fingerprints)]
ncol(X)
## [1] 388
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
The optimal ncomp is 11, with a maximum resampled R2 of 0.5271005.
#Split the data into training and test sets using createDataPartition
set.seed(1)
trainRow <- createDataPartition(permeability, p=0.8, list=FALSE)
X.train <- X[trainRow, ]
y.train <- permeability[trainRow, ]
X.test <- X[-trainRow, ]
y.test <- permeability[-trainRow, ]#Pre-process training set then cross validation of ncomp parameter for PLS model from 1 to 20.
set.seed(1)
plsFit <- train(x=X.train, y=y.train, method='pls', metric='Rsquared',
                tuneLength=20, trControl=trainControl(method='cv'),
                preProcess=c('center', 'scale'))
plsResult <- plsFit$results
plsFit
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.66375 0.3461521 9.603300
## 2 11.66557 0.4884962 8.364131
## 3 11.86090 0.4462082 9.081715
## 4 12.15438 0.4254096 9.389547
## 5 12.00549 0.4492658 9.076973
## 6 11.70875 0.4618761 8.826006
## 7 11.42799 0.4775314 8.747634
## 8 11.25481 0.4925856 8.533191
## 9 11.03884 0.5066896 8.344461
## 10 10.88506 0.5187689 8.028416
## 11 10.80216 0.5271005 8.012505
## 12 10.81509 0.5247495 8.034195
## 13 10.74326 0.5245241 8.003891
## 14 10.70988 0.5238880 7.892343
## 15 10.87078 0.5198265 8.032440
## 16 11.25383 0.4999003 8.375209
## 17 11.52949 0.4886010 8.619614
## 18 11.46969 0.4907743 8.541789
## 19 11.46414 0.4906053 8.591222
## 20 11.38025 0.5003197 8.484932
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 11.
Predict the response for the test set. What is the test set estimate of R2?
The test set estimate of R2 is 0.2899935.
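The prediction step is not shown in the output; a minimal sketch of the code behind the metrics below, using caret's postResample:
#Predict the test set with the tuned PLS model and compute RMSE/Rsquared/MAE
plsPred <- predict(plsFit, newdata=X.test)
postResample(pred=plsPred, obs=y.test)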
## RMSE Rsquared MAE
## 15.4341377 0.2899935 11.6639869
Try building other models discussed in this chapter. Do any have better predictive performance?
The test set results suggest that the Lasso model performs best, with an R2 of 0.3954258. The Ridge, Lasso, and Enet models all outperform PLS.
#Ridge Regression
set.seed(1)
ridgeFit <- train(x=X.train, y=y.train, method='ridge', metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale'))
## Warning: model fit failed for Fold08: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 12.31893 0.4528045 9.294510
## 0.1 10.79390 0.5292391 7.935830
## 0.2 11.15821 0.5289694 8.241192
## 0.3 11.56911 0.5286210 8.612171
## 0.4 12.09487 0.5259461 9.039822
## 0.5 12.67474 0.5230682 9.494363
## 0.6 13.29786 0.5205254 10.029131
## 0.7 13.95945 0.5179546 10.622090
## 0.8 14.64851 0.5156990 11.217376
## 0.9 15.36147 0.5136360 11.828589
## 1.0 16.09410 0.5117506 12.434283
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.1.
#Lasso
set.seed(1)
lassoFit <- train(x=X.train, y=y.train, method='lasso', metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale'))
## Warning: model fit failed for Fold08: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = X.train, y = y.train, method = "lasso", metric =
## "Rsquared", : missing values found in aggregated results
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 14.62719 NaN 11.901845
## 0.05 13.29247 0.4552939 10.359126
## 0.10 12.80621 0.4554363 9.552588
## 0.15 12.90315 0.4235837 9.618665
## 0.20 12.87604 0.4068843 9.600865
## 0.25 12.75649 0.4114738 9.556420
## 0.30 12.66593 0.4169995 9.558130
## 0.35 12.58089 0.4227670 9.518502
## 0.40 12.50366 0.4283152 9.458058
## 0.45 12.44435 0.4366066 9.404528
## 0.50 12.35249 0.4422982 9.333097
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.1.
# Elasticnet
set.seed(1)
enetFit <- train(x=X.train, y=y.train, method='enet', metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale'))
## Warning: model fit failed for Fold08: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = X.train, y = y.train, method = "enet", metric =
## "Rsquared", : missing values found in aggregated results
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 14.62719 NaN 11.901845
## 0.0 0.1 12.80621 0.4554363 9.552588
## 0.0 0.2 12.87604 0.4068843 9.600865
## 0.0 0.3 12.66593 0.4169995 9.558130
## 0.0 0.4 12.50366 0.4283152 9.458058
## 0.0 0.5 12.35249 0.4422982 9.333097
## 0.0 0.6 12.25156 0.4459605 9.303113
## 0.0 0.7 12.19984 0.4489988 9.311995
## 0.0 0.8 12.21014 0.4503081 9.280251
## 0.0 0.9 12.24057 0.4526927 9.271295
## 0.0 1.0 12.31893 0.4528045 9.294510
## 0.1 0.0 15.17471 NaN 12.242520
## 0.1 0.1 11.55434 0.4874903 8.229407
## 0.1 0.2 11.63397 0.4714981 8.459077
## 0.1 0.3 11.17731 0.5023828 8.228974
## 0.1 0.4 10.93531 0.5152918 8.045882
## 0.1 0.5 10.80908 0.5216687 7.920525
## 0.1 0.6 10.72893 0.5273809 7.828944
## 0.1 0.7 10.69309 0.5308887 7.777732
## 0.1 0.8 10.72480 0.5308325 7.832275
## 0.1 0.9 10.75986 0.5298811 7.908604
## 0.1 1.0 10.79390 0.5292391 7.935830
## 0.2 0.0 15.17471 NaN 12.242520
## 0.2 0.1 11.48301 0.4945241 8.170117
## 0.2 0.2 11.84209 0.4696152 8.505668
## 0.2 0.3 11.43389 0.4958256 8.375297
## 0.2 0.4 11.19698 0.5114505 8.243848
## 0.2 0.5 11.09985 0.5182971 8.147914
## 0.2 0.6 11.00732 0.5263811 8.075977
## 0.2 0.7 10.99994 0.5305898 8.069433
## 0.2 0.8 11.05283 0.5299608 8.106566
## 0.2 0.9 11.10436 0.5299953 8.173625
## 0.2 1.0 11.15821 0.5289694 8.241192
## 0.3 0.0 15.17471 NaN 12.242520
## 0.3 0.1 11.43015 0.4972166 8.101725
## 0.3 0.2 11.96864 0.4730337 8.473644
## 0.3 0.3 11.75988 0.4899883 8.585111
## 0.3 0.4 11.56666 0.5064758 8.533954
## 0.3 0.5 11.46977 0.5140181 8.420096
## 0.3 0.6 11.41659 0.5217383 8.390925
## 0.3 0.7 11.38872 0.5284379 8.402842
## 0.3 0.8 11.43282 0.5293766 8.443219
## 0.3 0.9 11.50977 0.5285789 8.537666
## 0.3 1.0 11.56911 0.5286210 8.612171
## 0.4 0.0 15.17471 NaN 12.242520
## 0.4 0.1 11.41234 0.4970136 8.041166
## 0.4 0.2 12.07118 0.4773956 8.417483
## 0.4 0.3 12.09220 0.4855603 8.772254
## 0.4 0.4 11.96135 0.5008874 8.794705
## 0.4 0.5 11.91102 0.5097316 8.694116
## 0.4 0.6 11.89320 0.5169703 8.720530
## 0.4 0.7 11.87520 0.5239754 8.757094
## 0.4 0.8 11.92508 0.5256967 8.832045
## 0.4 0.9 12.00938 0.5256961 8.936506
## 0.4 1.0 12.09487 0.5259461 9.039822
## 0.5 0.0 15.17471 NaN 12.242520
## 0.5 0.1 11.40954 0.4963705 7.986894
## 0.5 0.2 12.20206 0.4801158 8.392077
## 0.5 0.3 12.46719 0.4806466 8.951887
## 0.5 0.4 12.40354 0.4958890 9.045601
## 0.5 0.5 12.39951 0.5050895 8.992832
## 0.5 0.6 12.42175 0.5118355 9.100795
## 0.5 0.7 12.42012 0.5187276 9.170141
## 0.5 0.8 12.47215 0.5220491 9.262259
## 0.5 0.9 12.57012 0.5224055 9.371348
## 0.5 1.0 12.67474 0.5230682 9.494363
## 0.6 0.0 15.17471 NaN 12.242520
## 0.6 0.1 11.43256 0.4939374 7.946523
## 0.6 0.2 12.36966 0.4817767 8.395529
## 0.6 0.3 12.85821 0.4782145 9.124133
## 0.6 0.4 12.87990 0.4922499 9.308087
## 0.6 0.5 12.92052 0.5013594 9.353646
## 0.6 0.6 12.97712 0.5077939 9.498849
## 0.6 0.7 13.00404 0.5142719 9.613319
## 0.6 0.8 13.06977 0.5182672 9.741350
## 0.6 0.9 13.17859 0.5193988 9.891632
## 0.6 1.0 13.29786 0.5205254 10.029131
## 0.7 0.0 15.17471 NaN 12.242520
## 0.7 0.1 11.46619 0.4912343 7.912957
## 0.7 0.2 12.55303 0.4833110 8.390999
## 0.7 0.3 13.25892 0.4767863 9.301066
## 0.7 0.4 13.37747 0.4891867 9.594389
## 0.7 0.5 13.47327 0.4977598 9.738364
## 0.7 0.6 13.56088 0.5039024 9.927475
## 0.7 0.7 13.62075 0.5100817 10.119179
## 0.7 0.8 13.70839 0.5144473 10.288818
## 0.7 0.9 13.82941 0.5164615 10.468116
## 0.7 1.0 13.95945 0.5179546 10.622090
## 0.8 0.0 15.17471 NaN 12.242520
## 0.8 0.1 11.50692 0.4886514 7.879098
## 0.8 0.2 12.76479 0.4839868 8.409995
## 0.8 0.3 13.66419 0.4762875 9.519340
## 0.8 0.4 13.89244 0.4866892 9.940965
## 0.8 0.5 14.04197 0.4946135 10.165559
## 0.8 0.6 14.16619 0.5006537 10.436159
## 0.8 0.7 14.25971 0.5068011 10.645189
## 0.8 0.8 14.36988 0.5114073 10.843979
## 0.8 0.9 14.50376 0.5140374 11.043532
## 0.8 1.0 14.64851 0.5156990 11.217376
## 0.9 0.0 15.17471 NaN 12.242520
## 0.9 0.1 11.55121 0.4862199 7.848229
## 0.9 0.2 12.99728 0.4843262 8.472094
## 0.9 0.3 14.07288 0.4758987 9.760572
## 0.9 0.4 14.42043 0.4845527 10.309539
## 0.9 0.5 14.62738 0.4918566 10.622684
## 0.9 0.6 14.78920 0.4978107 10.969734
## 0.9 0.7 14.91416 0.5040442 11.192740
## 0.9 0.8 15.05381 0.5086436 11.412504
## 0.9 0.9 15.19917 0.5118594 11.638340
## 0.9 1.0 15.36147 0.5136360 11.828589
## 1.0 0.0 15.17471 NaN 12.242520
## 1.0 0.1 11.60369 0.4838849 7.823994
## 1.0 0.2 13.24717 0.4844231 8.565415
## 1.0 0.3 14.49421 0.4757489 10.001137
## 1.0 0.4 14.95867 0.4828266 10.733226
## 1.0 0.5 15.22359 0.4895516 11.135659
## 1.0 0.6 15.42999 0.4952250 11.523292
## 1.0 0.7 15.58606 0.5014637 11.753596
## 1.0 0.8 15.75800 0.5060329 11.990365
## 1.0 0.9 15.91616 0.5098193 12.235461
## 1.0 1.0 16.09410 0.5117506 12.434283
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.7 and lambda = 0.1.
resamp <- resamples(list(PLS=plsFit, Ridge=ridgeFit, Lasso=lassoFit, enet=enetFit))
(resamp.s <- summary(resamp))
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, Ridge, Lasso, enet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 3.431086 6.279454 8.683473 8.012505 9.946990 10.48475 0
## Ridge 2.932725 6.407812 8.665880 7.935830 9.924056 10.45745 0
## Lasso 5.791257 7.693374 7.903535 9.552588 9.632303 21.21770 1
## enet 2.993593 6.181506 8.157837 7.777732 9.796487 10.79813 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 4.127860 7.883648 12.72946 10.80216 13.06351 14.98578 0
## Ridge 3.517458 8.293763 12.22699 10.79390 13.11415 15.70450 0
## Lasso 7.657579 8.839116 12.01826 12.80621 14.38416 25.37189 1
## enet 3.616262 8.054933 11.81838 10.69309 13.31569 14.94121 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.0007971331 0.3795375 0.5019221 0.5271005 0.7601007 0.9084493 0
## Ridge 0.0047218766 0.3487174 0.5283737 0.5292391 0.7709032 0.9297472 0
## Lasso 0.0235890619 0.1246479 0.5805556 0.4554363 0.7421118 0.8656899 1
## enet 0.0042021862 0.3407759 0.5243361 0.5308887 0.7766468 0.9259750 0
#Evaluation of models using the test set
multiResample <- function(models, newdata, obs){
  # Collect test-set performance (RMSE, Rsquared, MAE) for each caret model
  res <- list()
  methods <- c()
  i <- 1
  for (model in models){
    pred <- predict(model, newdata=newdata)
    metrics <- postResample(pred=pred, obs=obs)
    res[[i]] <- metrics
    methods[[i]] <- model$method
    i <- i + 1
  }
  names(res) <- methods
  return(res)
}
models <- list(plsFit, ridgeFit, lassoFit, enetFit)
(resampleResult <- multiResample(models, X.test, y.test))
## $pls
## RMSE Rsquared MAE
## 15.4341377 0.2899935 11.6639869
##
## $ridge
## RMSE Rsquared MAE
## 15.0822363 0.3047575 11.4460367
##
## $lasso
## RMSE Rsquared MAE
## 12.3663838 0.3954258 9.1764050
##
## $enet
## RMSE Rsquared MAE
## 14.1795387 0.3415212 10.7570758
Would you recommend any of your models to replace the permeability laboratory experiment?
No, I would not recommend any of the models to replace the permeability laboratory experiment. The MAEs of all the models fall roughly between 9 and 12, meaning the predictions are off by about 9 to 12 units on average. Since most permeability values are below 12 (the median is 4.91), errors of that magnitude are too large for a model to substitute for the experiment.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
Start R and use these commands to load the data:
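The loading command is missing from the rendered output; presumably the standard data call from AppliedPredictiveModeling:
#Load the chemical manufacturing data (creates ChemicalManufacturingProcess)
data(ChemicalManufacturingProcess)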
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
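The imputation code is also missing; a sketch consistent with the preProcess printout below, assuming bagged-tree imputation over the predictor columns with the imputed matrix stored as `cmp` (the name used in the later train/test split):
#Impute missing predictor values with bagged trees via caret::preProcess
predictorsRaw <- ChemicalManufacturingProcess %>% select(-Yield)
(imputeModel <- preProcess(predictorsRaw, method='bagImpute'))
cmp <- predict(imputeModel, predictorsRaw)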
## Created from 152 samples and 57 variables
##
## Pre-processing:
## - bagged tree imputation (57)
## - ignored (0)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
#Train and Test
set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X.train <- cmp[trainRow, ]
y.train <- ChemicalManufacturingProcess$Yield[trainRow]
X.test <- cmp[-trainRow, ]
y.test <- ChemicalManufacturingProcess$Yield[-trainRow]
The elastic net model is tuned using cross-validation, with lambda ranging from 0 to 1 and fraction ranging from 0 to 1.
set.seed(1)
enetFit <- train(x=X.train, y=y.train, method='enet', metric='RMSE',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
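The printout of enetFit is not included above, but the optimal tuning values and the corresponding resampled RMSE can be inspected directly; a sketch, assuming the fitted object from the previous step:
enetFit$bestTune                      #optimal fraction and lambda
min(enetFit$results$RMSE, na.rm=TRUE) #best resampled RMSE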
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
The test set RMSE is lower than the resampled RMSE on the training set, so the test set performance appears slightly better than the training set performance.
enetPred <- predict(enetFit, newdata=X.test)
(predResult <- postResample(pred=enetPred, obs=y.test))
## RMSE Rsquared MAE
## 1.0611572 0.6019381 0.8536509
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
In a linear model like the elastic net, the most important predictors can be identified by the magnitude of their coefficients. The elastic net zeroes out many of the predictors due to its lasso penalty: of the 15 predictors that retain non-zero coefficients below, 14 are ManufacturingProcess predictors and only one (BiologicalMaterial06) is a BiologicalMaterial predictor.
(coeffs <- predict.enet(enetFit$finalModel, s=enetFit$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients)
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 0.00000000 0.00000000 0.00000000
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 0.00000000 0.00000000 0.08872051
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 0.00000000 0.00000000 0.00000000
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 0.04275827 0.00000000 0.04748137
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## -0.04828912 0.00000000 0.45351080
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## -0.21934545 0.00000000 0.06315242
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 0.00000000 -0.22082836 0.00000000
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess22
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 0.80779340 0.00000000 0.12023091
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 0.00000000 -0.19025738 -0.12077693
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 0.00000000 0.10800365 0.00000000
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 0.00000000 0.02043653 0.00000000
## ManufacturingProcess44 ManufacturingProcess45
## 0.00000000 0.05192492
#Keep the non-zero coefficients and sort by absolute magnitude
coeffs.sorted <- abs(coeffs)
coeffs.sorted <- coeffs.sorted[coeffs.sorted>0]
(coeffs.sorted <- sort(coeffs.sorted, decreasing = T))
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess17
## 0.80779340 0.45351080 0.22082836
## ManufacturingProcess13 ManufacturingProcess36 ManufacturingProcess37
## 0.21934545 0.19025738 0.12077693
## ManufacturingProcess34 ManufacturingProcess39 BiologicalMaterial06
## 0.12023091 0.10800365 0.08872051
## ManufacturingProcess15 ManufacturingProcess45 ManufacturingProcess07
## 0.06315242 0.05192492 0.04828912
## ManufacturingProcess06 ManufacturingProcess04 ManufacturingProcess42
## 0.04748137 0.04275827 0.02043653
It appears that the ManufacturingProcess predictors are more important. The varImp function can be used to rank variable importance.
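The call that produced the ranking below is not shown; presumably varImp on the fitted model, which for enet falls back to a model-free loess R-squared measure:
#Rank predictors by loess R-squared importance
varImp(enetFit)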
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 82.21
## ManufacturingProcess36 79.47
## BiologicalMaterial06 75.61
## BiologicalMaterial03 71.87
## ManufacturingProcess17 70.62
## BiologicalMaterial12 66.86
## ManufacturingProcess09 62.20
## ManufacturingProcess06 55.45
## BiologicalMaterial02 53.61
## ManufacturingProcess31 46.70
## ManufacturingProcess33 45.66
## BiologicalMaterial11 42.39
## BiologicalMaterial04 39.70
## ManufacturingProcess29 36.88
## ManufacturingProcess11 36.39
## ManufacturingProcess12 35.50
## BiologicalMaterial08 31.86
## BiologicalMaterial09 30.98
## BiologicalMaterial01 29.67
Eleven of the top 20 predictors in the list are ManufacturingProcess predictors, again suggesting that the process measurements matter more than the BiologicalMaterial measurements.
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Elastic net is a linear regression model, so the coefficients describe how each predictor affects the response: predictors with positive coefficients increase the yield, while those with negative coefficients decrease it.
For the ManufacturingProcess predictors with positive coefficients, I would alter the process to increase their values:
#Positive coefficients
#Select the ManufacturingProcess coefficients, keeping the magnitude-sorted order
mp.names <- grep('ManufacturingProcess', names(coeffs.sorted), value=TRUE)
coeffs.mp <- coeffs[mp.names]
coeffs.mp[coeffs.mp>0]
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess34
## 0.80779340 0.45351080 0.12023091
## ManufacturingProcess39 ManufacturingProcess15 ManufacturingProcess45
## 0.10800365 0.06315242 0.05192492
## ManufacturingProcess06 ManufacturingProcess04 ManufacturingProcess42
## 0.04748137 0.04275827 0.02043653
For the ManufacturingProcess predictors with negative coefficients, I would alter the process to decrease their values:
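The code line for this output is missing; presumably the complement of the previous filter:
#Negative coefficients
coeffs.mp[coeffs.mp<0]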
## ManufacturingProcess17 ManufacturingProcess13 ManufacturingProcess36
## -0.22082836 -0.21934545 -0.19025738
## ManufacturingProcess37 ManufacturingProcess07
## -0.12077693 -0.04828912