Data 624 HW7: Linear Regression and Its Cousins
library(tidyverse)
library(fpp2)
library(urca)
library(rio)
library(gridExtra)
#library(AppliedPredictiveModeling)
library(caret)
library(glmnet)
library(elasticnet)
library(RANN)
seed <- 123
1 HW7: Linear Regression and Its Cousins
In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
1.1 Ex. 6.2
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a.) Start R and use these commands to load the data:
- library(AppliedPredictiveModeling)
- data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
Predict the response for the test set. What is the test set estimate of \(R^2\)?
Try building other models discussed in this chapter. Do any have better predictive performance?
Would you recommend any of your models to replace the permeability laboratory experiment?
1.1.1 Part a
Start R and use these commands to load the data:
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
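The loading chunk itself is not echoed in this document; a minimal sketch, assuming the standard commands quoted in the problem statement:
library(AppliedPredictiveModeling)
data(permeability)  # loads `fingerprints` (165 x 1107 binary matrix) and `permeability` (response)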
1.1.2 Part b
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Answer:
719 of the 1,107 binary molecular predictors were removed, leaving 388 predictors for modeling.
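The filtering chunk is not echoed; presumably something like the following produced the output below, and defines the filtered matrix fp used in the rest of this exercise:
nzv <- nearZeroVar(fingerprints)  # column indices of near-zero-variance predictors
str(nzv)                          # 719 predictors flagged for removal
ncol(fingerprints)                # 1107 predictors before filtering
fp <- fingerprints[, -nzv]        # drop the sparse predictors
ncol(fp)                          # 388 predictors remain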
## int [1:719] 7 8 9 10 13 14 17 18 19 22 ...
## [1] 1107
## [1] 388
1.1.3 Part c
Split the data into a training and a test set, pre-process the data, and tune a PLS (Partial Least Square) model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
Answer:
- Train-test split at 75%
set.seed(seed)
trainingRows <- createDataPartition(permeability, p=0.75, list=FALSE) #caret, textbook sec4.9
train_X <- fp[trainingRows, ]
train_Y <- permeability[trainingRows,]
test_X <- fp[-trainingRows, ]
test_Y <- permeability[-trainingRows,]
- Create a PLS model
The best PLS model is selected as the one with the lowest cross-validated RMSE.
set.seed(seed)
pls_1 <- train(x=train_X, y=train_Y, method="pls", tuneLength=20,
preProcess=c("center", "scale"),
trControl=trainControl(method="cv"))
pls_1
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.41334 0.3623773 10.259274
## 2 11.86219 0.4618152 8.397173
## 3 11.71304 0.4659634 8.861406
## 4 11.66161 0.4877544 8.696145
## 5 11.26338 0.5311563 8.021885
## 6 11.43810 0.5298935 8.179149
## 7 11.66505 0.5257160 8.629869
## 8 11.77213 0.5317200 8.877641
## 9 11.95947 0.5208151 9.076055
## 10 12.44031 0.4879913 9.394500
## 11 12.91286 0.4605701 9.468711
## 12 12.88221 0.4686068 9.510687
## 13 13.03911 0.4552614 9.558798
## 14 12.98039 0.4501933 9.396228
## 15 13.06457 0.4507424 9.361827
## 16 13.00332 0.4502875 9.585219
## 17 13.18292 0.4480355 9.661004
## 18 13.23331 0.4462273 9.675576
## 19 13.29171 0.4415160 9.760474
## 20 13.44435 0.4363275 9.801495
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
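The optimal PLS model therefore uses 5 latent variables, with a corresponding resampled \(R^2\) of about 0.531.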
1.1.4 Part d
Predict the response for the test set. What is the test set estimate of \(R^2\)?
Answer:
pls_predict <- predict(pls_1, test_X)
plot(pls_predict, test_Y, main="Observed vs Predicted Permeability of PLS Model",
xlab="Predicted Permeability", ylab="Observed Permeability")
abline(0,1,col="royalblue")
postResample(pls_predict, test_Y)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 12.1654338 0.3593983 8.2578548
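The test-set estimate of \(R^2\) is approximately 0.359, noticeably lower than the resampled estimate of 0.531.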
1.1.5 Part e
Try building other models discussed in this chapter. Do any have better predictive performance?
Answer:
Chapter 6 covers three types of penalized regression models: ridge regression, the lasso, and the elastic net; their penalties are summarized below.
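All three minimize the least-squares loss plus a penalty on the coefficients:
\[
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{P}\beta_j^2
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{P}\left|\beta_j\right|
\]
The elastic net combines the two penalties, \(\lambda_1 \sum_j \beta_j^2 + \lambda_2 \sum_j |\beta_j|\). In caret's enet method, lambda controls the ridge (L2) penalty and fraction specifies the lasso constraint as a fraction of the L1 norm of the full least-squares solution.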
1.1.5.1 Ridge Regression
- Ridge regression model
set.seed(seed)
ridge_lambda <- data.frame(lambda = seq(0, 0.3, length=30))  # 30 candidate L2 penalties
ridge_1 <- train(x=train_X, y=train_Y, method="ridge",
tuneGrid=ridge_lambda,  # fixed: pass the grid directly rather than re-wrapping it in expand.grid
preProcess=c("center", "scale"),
trControl=trainControl(method="cv", number=10))
ridge_1
## Ridge Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 13.31355 0.4257943 9.489948
## 0.01034483 62.15255 0.3291853 34.807919
## 0.02068966 13.42988 0.4398304 9.526945
## 0.03103448 13.06314 0.4599117 9.273919
## 0.04137931 12.84222 0.4720473 9.179058
## 0.05172414 12.70529 0.4801400 9.113387
## 0.06206897 12.60931 0.4861717 9.055218
## 0.07241379 12.53670 0.4910593 9.004515
## 0.08275862 12.48777 0.4949549 8.963595
## 0.09310345 12.44965 0.4983657 8.931068
## 0.10344828 12.42250 0.5013294 8.914315
## 0.11379310 12.40210 0.5038764 8.901359
## 0.12413793 12.38961 0.5064060 8.896276
## 0.13448276 12.38060 0.5085803 8.890395
## 0.14482759 12.38435 0.5102651 8.893560
## 0.15517241 12.37985 0.5123803 8.890398
## 0.16551724 12.38608 0.5138800 8.896312
## 0.17586207 12.39146 0.5155138 8.902158
## 0.18620690 12.40188 0.5169707 8.918062
## 0.19655172 12.41694 0.5182106 8.934497
## 0.20689655 12.43309 0.5194247 8.950699
## 0.21724138 12.45263 0.5205351 8.968571
## 0.22758621 12.47204 0.5216033 8.984214
## 0.23793103 12.49450 0.5225826 9.000665
## 0.24827586 12.51887 0.5235025 9.018637
## 0.25862069 12.54486 0.5243477 9.041171
## 0.26896552 12.57344 0.5251166 9.063755
## 0.27931034 12.60197 0.5258823 9.085035
## 0.28965517 12.63676 0.5264487 9.109773
## 0.30000000 12.66622 0.5272410 9.127675
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1551724.
The best ridge regression model is selected as the one with the lowest cross-validated RMSE.
# results row at the optimal lambda (lowest RMSE)
ridge_1$results[which(ridge_1$results$lambda==ridge_1$bestTune$lambda),]
By predicting the response for the test set, the test set estimate of \(R^2\) is shown below.
set.seed(seed)
ridge_predict <- predict(ridge_1, test_X)
plot(ridge_predict, test_Y, main="Observed vs Predicted Permeability of Ridge Regression Model",
xlab="Predicted Permeability", ylab="Observed Permeability")
abline(0,1,col="royalblue")
postResample(ridge_predict, test_Y)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 12.8768125 0.3982622 8.9082363
1.1.5.2 Lasso Regression
- Lasso regression model
set.seed(seed)
lasso_1 <- train(x=train_X, y=train_Y, method="lasso",
tuneGrid=data.frame(fraction = seq(0, 0.5, length=50)),  # fraction of the full L1 solution path
preProcess=c("center", "scale"), metric="RMSE",
trControl=trainControl(method="cv", number=10))
lasso_1
## The lasso
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00000000 15.57692 NaN 12.728995
## 0.01020408 14.47419 0.4634966 11.835283
## 0.02040816 13.52155 0.4689504 10.972438
## 0.03061224 12.75502 0.4880627 10.160320
## 0.04081633 12.22596 0.4853324 9.493436
## 0.05102041 11.83632 0.4804238 9.042418
## 0.06122449 11.57451 0.4835550 8.717351
## 0.07142857 11.34355 0.4829280 8.438200
## 0.08163265 11.21724 0.4844352 8.237427
## 0.09183673 11.16631 0.4863175 8.132458
## 0.10204082 11.17368 0.4858203 8.098263
## 0.11224490 11.21966 0.4849602 8.119512
## 0.12244898 11.29267 0.4829318 8.142800
## 0.13265306 11.35585 0.4815348 8.146106
## 0.14285714 11.39282 0.4808331 8.123918
## 0.15306122 11.43691 0.4803752 8.125886
## 0.16326531 11.48366 0.4800437 8.125808
## 0.17346939 11.51432 0.4800548 8.112645
## 0.18367347 11.55157 0.4789385 8.115222
## 0.19387755 11.59262 0.4768212 8.118618
## 0.20408163 11.63138 0.4746560 8.119309
## 0.21428571 11.67602 0.4723134 8.130778
## 0.22448980 11.71293 0.4710988 8.134070
## 0.23469388 11.74852 0.4701882 8.133650
## 0.24489796 11.77903 0.4693071 8.146913
## 0.25510204 11.80821 0.4680399 8.172026
## 0.26530612 11.84255 0.4665413 8.197618
## 0.27551020 11.87817 0.4652673 8.230074
## 0.28571429 11.92102 0.4635390 8.266816
## 0.29591837 11.96073 0.4621253 8.299207
## 0.30612245 11.98789 0.4613869 8.321313
## 0.31632653 11.99858 0.4611815 8.334260
## 0.32653061 12.00539 0.4607196 8.340874
## 0.33673469 12.02526 0.4594931 8.365919
## 0.34693878 12.03500 0.4596560 8.386498
## 0.35714286 12.04123 0.4606912 8.397463
## 0.36734694 12.05451 0.4608652 8.413506
## 0.37755102 12.06834 0.4609730 8.425471
## 0.38775510 12.08131 0.4613339 8.437526
## 0.39795918 12.09310 0.4614313 8.446080
## 0.40816327 12.09651 0.4615999 8.448989
## 0.41836735 12.09470 0.4620095 8.447201
## 0.42857143 12.09465 0.4622211 8.450695
## 0.43877551 12.09751 0.4624315 8.459603
## 0.44897959 12.10322 0.4625048 8.471047
## 0.45918367 12.11214 0.4624240 8.487719
## 0.46938776 12.12142 0.4623135 8.502984
## 0.47959184 12.13137 0.4622029 8.518438
## 0.48979592 12.14500 0.4619836 8.537519
## 0.50000000 12.16177 0.4616643 8.554631
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.09183673.
The best lasso regression model is selected as the one with the lowest cross-validated RMSE.
By predicting the response for the test set, the test set estimate of \(R^2\) is shown below.
set.seed(seed)
lasso_predict <- predict(lasso_1, test_X)
plot(lasso_predict, test_Y, main="Observed vs Predicted Permeability of Lasso Regression Model",
xlab="Predicted Permeability", ylab="Observed Permeability")
abline(0,1,col="royalblue")
postResample(lasso_predict, test_Y)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 11.7485995 0.3021474 8.2396727
1.1.5.3 Elastic Net Regression
- Elastic Net Regression model
set.seed(seed)
# Note: data.frame pairs each lambda with a single fraction (20 combinations);
# a full 20 x 20 grid search would use expand.grid instead.
elastic_1 <- train(x=train_X, y=train_Y, method="enet",
tuneGrid=data.frame(lambda = seq(0,0.3,length=20), fraction=seq(0.05,0.5,length=20)),
preProcess=c("center", "scale"),
trControl=trainControl(method="cv", number=10))
elastic_1
## Elasticnet
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00000000 0.05000000 11.87107 0.4805047 9.081472
## 0.01578947 0.07368421 11.44758 0.4855473 8.094838
## 0.03157895 0.09736842 11.45778 0.4860160 8.034581
## 0.04736842 0.12105263 11.50092 0.4859597 8.060801
## 0.06315789 0.14473684 11.54106 0.4868502 8.085338
## 0.07894737 0.16842105 11.58295 0.4881920 8.131431
## 0.09473684 0.19210526 11.62286 0.4905041 8.182796
## 0.11052632 0.21578947 11.67162 0.4919044 8.242982
## 0.12631579 0.23947368 11.72141 0.4928231 8.292377
## 0.14210526 0.26315789 11.79440 0.4924376 8.346969
## 0.15789474 0.28684211 11.85231 0.4925342 8.387146
## 0.17368421 0.31052632 11.92964 0.4921074 8.449229
## 0.18947368 0.33421053 11.99055 0.4927981 8.499881
## 0.20526316 0.35789474 12.04181 0.4934928 8.537313
## 0.22105263 0.38157895 12.09259 0.4946214 8.568601
## 0.23684211 0.40526316 12.12059 0.4967339 8.583769
## 0.25263158 0.42894737 12.14993 0.4991258 8.600702
## 0.26842105 0.45263158 12.18961 0.5010938 8.628437
## 0.28421053 0.47631579 12.23650 0.5028885 8.664567
## 0.30000000 0.50000000 12.28733 0.5047685 8.701255
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.07368421 and
## lambda = 0.01578947.
The best elastic net regression model is selected as the one with the lowest cross-validated RMSE.
By predicting the response for the test set, the test set estimate of \(R^2\) is shown below.
set.seed(seed)
elastic_predict <- predict(elastic_1, test_X)
plot(elastic_predict, test_Y, main="Observed vs Predicted Permeability of Elastic Net Regression Model",
xlab="Predicted Permeability", ylab="Observed Permeability")
abline(0,1,col="royalblue")
postResample(elastic_predict, test_Y)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 11.4592402 0.3454041 7.6291889
1.1.6 Part f
Would you recommend any of your models to replace the permeability laboratory experiment?
Answer:
- Among the models tried, I would recommend the elastic net regression model, as it has the lowest test-set RMSE and MAE. Even so, its test \(R^2\) is only about 0.35, so it is better suited to screening candidate molecules than to fully replacing the permeability laboratory experiment.
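The chunk that produced the comparison below is not echoed; presumably the test-set metrics were collected with caret's postResample, e.g.:
print("PLS:"); postResample(pls_predict, test_Y)
print("Ridge Regression:"); postResample(ridge_predict, test_Y)
print("Lasso Regression:"); postResample(lasso_predict, test_Y)
print("Elastic Net Regression:"); postResample(elastic_predict, test_Y)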
## [1] "PLS:"
## RMSE Rsquared MAE
## 12.1654338 0.3593983 8.2578548
## [1] "Ridge Regression:"
## RMSE Rsquared MAE
## 12.8768125 0.3982622 8.9082363
## [1] "Lasso Regression:"
## RMSE Rsquared MAE
## 11.7485995 0.3021474 8.2396727
## [1] "Elastic Net Regression:"
## RMSE Rsquared MAE
## 11.4592402 0.3454041 7.6291889
1.2 Ex. 6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield.
Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a.) Start R and use these commands to load the data:
- library(AppliedPredictiveModeling)
- data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
1.2.1 Part a
Start R and use these commands to load the data:
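The loading chunk is not echoed; presumably:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)     # 176 runs: Yield plus 57 predictors
summary(ChemicalManufacturingProcess)  # produces the summary below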
## Yield BiologicalMaterial01 BiologicalMaterial02
## Min. :35.25 Min. :4.580 Min. :46.87
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68
## Median :39.97 Median :6.305 Median :55.09
## Mean :40.18 Mean :6.411 Mean :55.69
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74
## Max. :46.34 Max. :8.810 Max. :64.75
##
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## Min. :56.97 Min. : 9.38 Min. :13.24
## 1st Qu.:64.98 1st Qu.:11.24 1st Qu.:17.23
## Median :67.22 Median :12.10 Median :18.49
## Mean :67.70 Mean :12.35 Mean :18.60
## 3rd Qu.:70.43 3rd Qu.:13.22 3rd Qu.:19.90
## Max. :78.25 Max. :23.09 Max. :24.85
##
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## Min. :40.60 Min. :100.0 Min. :15.88
## 1st Qu.:46.05 1st Qu.:100.0 1st Qu.:17.06
## Median :48.46 Median :100.0 Median :17.51
## Mean :48.91 Mean :100.0 Mean :17.49
## 3rd Qu.:51.34 3rd Qu.:100.0 3rd Qu.:17.88
## Max. :59.38 Max. :100.8 Max. :19.14
##
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## Min. :11.44 Min. :1.770 Min. :135.8
## 1st Qu.:12.60 1st Qu.:2.460 1st Qu.:143.8
## Median :12.84 Median :2.710 Median :146.1
## Mean :12.85 Mean :2.801 Mean :147.0
## 3rd Qu.:13.13 3rd Qu.:2.990 3rd Qu.:149.6
## Max. :14.08 Max. :6.870 Max. :158.7
##
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## Min. :18.35 Min. : 0.00 Min. : 0.00
## 1st Qu.:19.73 1st Qu.:10.80 1st Qu.:19.30
## Median :20.12 Median :11.40 Median :21.00
## Mean :20.20 Mean :11.21 Mean :16.68
## 3rd Qu.:20.75 3rd Qu.:12.15 3rd Qu.:21.50
## Max. :22.21 Max. :14.10 Max. :22.50
## NA's :1 NA's :3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## Min. :1.47 Min. :911.0 Min. : 923.0
## 1st Qu.:1.53 1st Qu.:928.0 1st Qu.: 986.8
## Median :1.54 Median :934.0 Median : 999.2
## Mean :1.54 Mean :931.9 Mean :1001.7
## 3rd Qu.:1.55 3rd Qu.:936.0 3rd Qu.:1008.9
## Max. :1.60 Max. :946.0 Max. :1175.3
## NA's :15 NA's :1 NA's :1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## Min. :203.0 Min. :177.0 Min. :177.0
## 1st Qu.:205.7 1st Qu.:177.0 1st Qu.:177.0
## Median :206.8 Median :177.0 Median :178.0
## Mean :207.4 Mean :177.5 Mean :177.6
## 3rd Qu.:208.7 3rd Qu.:178.0 3rd Qu.:178.0
## Max. :227.4 Max. :178.0 Max. :178.0
## NA's :2 NA's :1 NA's :1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## Min. :38.89 Min. : 7.500 Min. : 7.500
## 1st Qu.:44.89 1st Qu.: 8.700 1st Qu.: 9.000
## Median :45.73 Median : 9.100 Median : 9.400
## Mean :45.66 Mean : 9.179 Mean : 9.386
## 3rd Qu.:46.52 3rd Qu.: 9.550 3rd Qu.: 9.900
## Max. :49.36 Max. :11.600 Max. :11.500
## NA's :9 NA's :10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## Min. : 0.0 Min. :32.10 Min. :4701
## 1st Qu.: 0.0 1st Qu.:33.90 1st Qu.:4828
## Median : 0.0 Median :34.60 Median :4856
## Mean : 857.8 Mean :34.51 Mean :4854
## 3rd Qu.: 0.0 3rd Qu.:35.20 3rd Qu.:4882
## Max. :4549.0 Max. :38.60 Max. :5055
## NA's :1 NA's :1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## Min. :5904 Min. : 0 Min. :31.30
## 1st Qu.:6010 1st Qu.:4561 1st Qu.:33.50
## Median :6032 Median :4588 Median :34.40
## Mean :6039 Mean :4566 Mean :34.34
## 3rd Qu.:6061 3rd Qu.:4619 3rd Qu.:35.10
## Max. :6233 Max. :4852 Max. :40.00
##
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## Min. : 0 Min. :5890 Min. : 0
## 1st Qu.:4813 1st Qu.:6001 1st Qu.:4553
## Median :4835 Median :6022 Median :4582
## Mean :4810 Mean :6028 Mean :4556
## 3rd Qu.:4862 3rd Qu.:6050 3rd Qu.:4610
## Max. :4971 Max. :6146 Max. :4759
##
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## Min. :-1.8000 Min. : 0.000 Min. :0.000
## 1st Qu.:-0.6000 1st Qu.: 3.000 1st Qu.:2.000
## Median :-0.3000 Median : 5.000 Median :3.000
## Mean :-0.1642 Mean : 5.406 Mean :3.017
## 3rd Qu.: 0.0000 3rd Qu.: 8.000 3rd Qu.:4.000
## Max. : 3.6000 Max. :12.000 Max. :6.000
## NA's :1 NA's :1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## Min. : 0.000 Min. : 0 Min. : 0
## 1st Qu.: 4.000 1st Qu.:4832 1st Qu.:6020
## Median : 8.000 Median :4855 Median :6047
## Mean : 8.834 Mean :4828 Mean :6016
## 3rd Qu.:14.000 3rd Qu.:4877 3rd Qu.:6070
## Max. :23.000 Max. :4990 Max. :6161
## NA's :1 NA's :5 NA's :5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.:4560 1st Qu.: 0.000 1st Qu.:19.70
## Median :4587 Median :10.400 Median :19.90
## Mean :4563 Mean : 6.592 Mean :20.01
## 3rd Qu.:4609 3rd Qu.:10.750 3rd Qu.:20.40
## Max. :4710 Max. :11.500 Max. :22.00
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## Min. : 0.000 Min. : 0.00 Min. :143.0
## 1st Qu.: 8.800 1st Qu.:70.10 1st Qu.:155.0
## Median : 9.100 Median :70.80 Median :158.0
## Mean : 9.161 Mean :70.18 Mean :158.5
## 3rd Qu.: 9.700 3rd Qu.:71.40 3rd Qu.:162.0
## Max. :11.200 Max. :72.50 Max. :173.0
## NA's :5 NA's :5
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## Min. :56.00 Min. :2.300 Min. :463.0
## 1st Qu.:62.00 1st Qu.:2.500 1st Qu.:490.0
## Median :64.00 Median :2.500 Median :495.0
## Mean :63.54 Mean :2.494 Mean :495.6
## 3rd Qu.:65.00 3rd Qu.:2.500 3rd Qu.:501.5
## Max. :70.00 Max. :2.600 Max. :522.0
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## Min. :0.01700 Min. :0.000 Min. :0.000
## 1st Qu.:0.01900 1st Qu.:0.700 1st Qu.:2.000
## Median :0.02000 Median :1.000 Median :3.000
## Mean :0.01957 Mean :1.014 Mean :2.534
## 3rd Qu.:0.02000 3rd Qu.:1.300 3rd Qu.:3.000
## Max. :0.02200 Max. :2.300 Max. :3.000
## NA's :5
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## Min. :0.000 Min. :0.00000 Min. :0.00000
## 1st Qu.:7.100 1st Qu.:0.00000 1st Qu.:0.00000
## Median :7.200 Median :0.00000 Median :0.00000
## Mean :6.851 Mean :0.01771 Mean :0.02371
## 3rd Qu.:7.300 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7.500 Max. :0.10000 Max. :0.20000
## NA's :1 NA's :1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## Min. : 0.00 Min. : 0.0000 Min. :0.000
## 1st Qu.:11.40 1st Qu.: 0.6000 1st Qu.:1.800
## Median :11.60 Median : 0.8000 Median :1.900
## Mean :11.21 Mean : 0.9119 Mean :1.805
## 3rd Qu.:11.70 3rd Qu.: 1.0250 3rd Qu.:1.900
## Max. :12.10 Max. :11.0000 Max. :2.100
##
## ManufacturingProcess45
## Min. :0.000
## 1st Qu.:2.100
## Median :2.200
## Mean :2.138
## 3rd Qu.:2.300
## Max. :2.600
##
The data frame ChemicalManufacturingProcess contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs, plus the variable Yield, which contains the percent yield for each run.
1.2.2 Part b
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Answer:
- The caret class preProcess has the ability to transform, center, scale, or impute values, as well as apply the spatial sign transformation and feature extraction.
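The imputation chunk is not echoed. A sketch using k-nearest-neighbor imputation via preProcess (the RANN package loaded at the top supports knnImpute), which also defines the cmp_predictors object used below; the exact imputation method is an assumption:
cmp_predictors <- ChemicalManufacturingProcess %>% select(-Yield)
cmp_impute <- preProcess(cmp_predictors, method="knnImpute")  # assumption: kNN imputation (also centers/scales)
cmp_predictors <- predict(cmp_impute, cmp_predictors)
sum(is.na(cmp_predictors))  # confirm no missing values remain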
1.2.3 Part c
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Answer:
- Pre-process the data with centering and scaling.
cmp_pre <- preProcess(cmp_predictors, method=c("center", "scale"))
cmp_predictors <- predict(cmp_pre, cmp_predictors)
- Train-test split at 70%
set.seed(0)
trainingRows <- createDataPartition(ChemicalManufacturingProcess$Yield,
p=0.70, list=FALSE) #caret, textbook sec4.9
train_X2 <- cmp_predictors[trainingRows, ]
train_Y2 <- ChemicalManufacturingProcess$Yield[trainingRows]
test_X2 <- cmp_predictors[-trainingRows, ]
test_Y2 <- ChemicalManufacturingProcess$Yield[-trainingRows]
- Create an elastic net regression model
set.seed(seed)
# As in 6.2(e), data.frame pairs lambda with fraction (50 combinations along the
# diagonal of the grid); expand.grid would search the full 50 x 50 grid.
elastic_2 <- train(x=train_X2, y=train_Y2, method="enet",
tuneGrid=data.frame(lambda = seq(0,0.5,length=50), fraction=seq(0,0.5,length=50)),
preProcess=c("center", "scale"),
trControl=trainControl(method="cv", number=10))
elastic_2
## Elasticnet
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 110, 112, 112, 112, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00000000 0.00000000 1.901440 NaN 1.550630
## 0.01020408 0.01020408 1.827446 0.4715302 1.489765
## 0.02040816 0.02040816 1.781927 0.5154421 1.452926
## 0.03061224 0.03061224 1.740373 0.5501750 1.419992
## 0.04081633 0.04081633 1.701189 0.5760593 1.389037
## 0.05102041 0.05102041 1.665255 0.5897328 1.360337
## 0.06122449 0.06122449 1.631330 0.5991104 1.333962
## 0.07142857 0.07142857 1.598567 0.6055289 1.308824
## 0.08163265 0.08163265 1.567849 0.6094772 1.285673
## 0.09183673 0.09183673 1.538174 0.6130028 1.263410
## 0.10204082 0.10204082 1.510038 0.6152897 1.242095
## 0.11224490 0.11224490 1.484245 0.6164227 1.221695
## 0.12244898 0.12244898 1.459033 0.6176095 1.201066
## 0.13265306 0.13265306 1.435393 0.6179064 1.181772
## 0.14285714 0.14285714 1.413077 0.6175098 1.164097
## 0.15306122 0.15306122 1.391639 0.6169066 1.146605
## 0.16326531 0.16326531 1.370669 0.6167188 1.130151
## 0.17346939 0.17346939 1.349648 0.6178490 1.114648
## 0.18367347 0.18367347 1.328960 0.6196777 1.099119
## 0.19387755 0.19387755 1.309101 0.6221056 1.085584
## 0.20408163 0.20408163 1.296127 0.6197820 1.081108
## 0.21428571 0.21428571 1.291100 0.6112501 1.076820
## 0.22448980 0.22448980 1.292687 0.6015908 1.073770
## 0.23469388 0.23469388 1.298798 0.5933851 1.071336
## 0.24489796 0.24489796 1.296381 0.5918706 1.065447
## 0.25510204 0.25510204 1.295319 0.5898961 1.061638
## 0.26530612 0.26530612 1.302230 0.5847361 1.061831
## 0.27551020 0.27551020 1.309779 0.5808628 1.060294
## 0.28571429 0.28571429 1.319787 0.5773648 1.060404
## 0.29591837 0.29591837 1.330923 0.5745433 1.061366
## 0.30612245 0.30612245 1.343585 0.5716829 1.062866
## 0.31632653 0.31632653 1.357646 0.5690163 1.065103
## 0.32653061 0.32653061 1.366258 0.5686029 1.065763
## 0.33673469 0.33673469 1.374310 0.5687929 1.066584
## 0.34693878 0.34693878 1.382495 0.5695444 1.067658
## 0.35714286 0.35714286 1.390983 0.5704198 1.069274
## 0.36734694 0.36734694 1.401315 0.5714089 1.071955
## 0.37755102 0.37755102 1.415014 0.5722355 1.075890
## 0.38775510 0.38775510 1.427006 0.5734824 1.079853
## 0.39795918 0.39795918 1.436859 0.5752575 1.083323
## 0.40816327 0.40816327 1.447524 0.5768117 1.087171
## 0.41836735 0.41836735 1.458528 0.5784628 1.091176
## 0.42857143 0.42857143 1.486616 0.5746711 1.100797
## 0.43877551 0.43877551 1.508477 0.5729361 1.108207
## 0.44897959 0.44897959 1.514037 0.5757535 1.110933
## 0.45918367 0.45918367 1.520137 0.5785080 1.114653
## 0.46938776 0.46938776 1.525742 0.5810269 1.117965
## 0.47959184 0.47959184 1.532399 0.5831547 1.121583
## 0.48979592 0.48979592 1.539876 0.5851052 1.125881
## 0.50000000 0.50000000 1.557432 0.5861309 1.133355
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2142857 and lambda
## = 0.2142857.
The best elastic net regression model is selected as the one with the lowest cross-validated RMSE.
1.2.4 Part d
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Answer:
The resampled performance on the training set, at the optimal fraction = lambda = 0.2142857, is \(R^2 = 0.6112501\) and \(RMSE = 1.2911\).
Predicting the response for the test set gives \(R^2 = 0.5955402\) and \(RMSE = 1.1216353\).
The test-set RMSE is lower than the resampled RMSE, so the model appears to perform slightly better on the test set than in resampling.
set.seed(0)
elastic_predict <- predict(elastic_2, test_X2)
plot(elastic_predict, test_Y2, main="Observed vs Predicted Yield of Elastic Net Regression Model",
xlab="Predicted Yield", ylab="Observed Yield")
abline(0,1,col="royalblue")
postResample(elastic_predict, test_Y2)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 1.1216353 0.5955402 0.9313962
1.2.5 Part e
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Answer:
Using the varImp function from the caret package to rank the predictors, the top 20 most important predictors are shown below.
The most important predictor is ManufacturingProcess32, followed by ManufacturingProcess13, BiologicalMaterial03, BiologicalMaterial06, and ManufacturingProcess17, etc.
Among the 20 most important variables there are 12 process predictors and 8 biological predictors, and 6 of the top 10 are process predictors. Thus, the process predictors appear to dominate the list.
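The chunk behind the importance ranking is not echoed; presumably something like:
plot(varImp(elastic_2), top=20)  # top 20 predictors by importance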
1.2.6 Part f
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Answer:
- According to the correlation plot of the top 20 important predictors, I would focus on modifying manufacturing processes 13, 17, and 36, because they are strongly negatively correlated with Yield; their correlation coefficients with Yield are -0.50, -0.43, and -0.52, respectively. Adjusting these process steps to weaken their negative effect should help improve yield in future runs.
imp <- varImp(elastic_2)$importance  # note: dplyr::arrange() drops row names, so rank with order() instead
rn <- rownames(imp)[order(imp$Overall, decreasing=TRUE)][1:20]
m <- cmp_predictors %>% select(all_of(rn)) %>% cbind(Yield = ChemicalManufacturingProcess$Yield)
library(corrplot)
corrplot(cor(m), type="lower")
cor(m$ManufacturingProcess13, m$Yield)  # presumably the calls producing the values below
cor(m$ManufacturingProcess17, m$Yield)
cor(m$ManufacturingProcess36, m$Yield)
## [1] -0.5036797
## [1] -0.4258069
## [1] -0.5237389