In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
library(MASS)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(AppliedPredictiveModeling)
library(DataExplorer)
library(RANN)
library(ResourceSelection)
## ResourceSelection 0.3-6 2023-06-27
6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
dim(fingerprints)
## [1] 165 1107
dim(fingerprints[,-caret::nearZeroVar(fingerprints)])
## [1] 165 388
val_fp = fingerprints[,-caret::nearZeroVar(fingerprints)]
After cutting out the near-zero-variance (sparse) columns, 388 predictors are left for the model to consider.
set.seed(12)
# 80/20 train/test split stratified on the permeability response
trainingRows <- createDataPartition(permeability, p = .80, list= FALSE)
train_finger <- val_fp[trainingRows,]
test_finger <- val_fp[-trainingRows,]
train_permY <- permeability[trainingRows,]
test_permY <- permeability[-trainingRows,]
set.seed(49)
tr_ctrl <- trainControl(method = "cv", number = 10)
plsFinger <- train(x=train_finger, y=train_permY,
method = "pls",tuneLength = 20,
trControl = tr_ctrl,preProcess = c("center", "scale"))
plsFinger
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 121, 120, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.84296 0.3645242 10.076404
## 2 11.35927 0.5255717 8.098447
## 3 11.30260 0.5221696 8.491675
## 4 11.61965 0.5171470 8.692696
## 5 11.47106 0.5136132 8.394059
## 6 11.29323 0.5178519 8.293291
## 7 11.32815 0.5189196 8.284300
## 8 11.37310 0.5317987 8.677746
## 9 11.50948 0.5189617 8.901416
## 10 11.72691 0.4993286 9.092009
## 11 11.82241 0.4897281 9.211926
## 12 11.95728 0.4883831 9.396215
## 13 12.14737 0.4784274 9.662778
## 14 12.46239 0.4629380 9.857318
## 15 12.76723 0.4464990 10.124604
## 16 13.12807 0.4260383 10.209452
## 17 13.34671 0.4168155 10.408683
## 18 13.45350 0.4121383 10.339464
## 19 13.75097 0.3955095 10.523665
## 20 13.85979 0.3931836 10.530734
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
summary(plsFinger)
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 22.78 34.31 39.98 45.08 52.85 59.13
## .outcome 35.93 54.74 61.60 67.27 70.61 73.39
Based on the summarized output of the 10-fold cross-validation, the PLS model found that the optimal number of components is 6, with an \(R^2\) of roughly 51.8% and a minimized RMSE of 11.293. While this number of components doesn't have the highest \(R^2\), RMSE is probably a better criterion for judging these model variants.
The optimization can be further demonstrated with a standard graphic:
plot(plsFinger)
xyplot(train_permY ~ predict(plsFinger),
type = c("p", "g"), xlab = "Predicted", ylab = "Observed")
The observed vs. predicted plot does show a roughly linear relationship, although the spread of the points fans out for predicted values between 15 and 40 and then tightens again above 40.
The one-standard-error rule could feasibly be applied here, since Kuhn and Johnson note that a simpler model within one standard error of the numerically best one may perform comparably.
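As a sketch, caret can apply that rule directly by swapping the selection function used to pick the number of components; the call below mirrors the earlier fit, and the objects oneSE_ctrl and plsFinger_oneSE are hypothetical (this refit is not shown above).
set.seed(49)
oneSE_ctrl <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
plsFinger_oneSE <- train(x = train_finger, y = train_permY,
                         method = "pls", tuneLength = 20,
                         trControl = oneSE_ctrl,
                         preProcess = c("center", "scale"))
plsFinger_oneSE$bestTune   # smallest ncomp within one SE of the best resampled RMSE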
pred_finger <- predict(plsFinger,test_finger)
pls_test <- caret::defaultSummary(data.frame(obs=test_permY,pred=pred_finger))
pls_test
## RMSE Rsquared MAE
## 13.7878339 0.3262867 9.6809449
The test set \(R^2\) is 32.63%, which is substantially worse than the resampled training performance. The RMSE is also considerably larger on the unseen data, although to some extent that is to be expected.
Let’s try using penalized regression models as an alternative means of predicting the permeability:
set.seed(15)
# ridgeGS <- data.frame(.lambda = seq(0, 1, length = 15))
ridgeRegFit <- train(x=train_finger,
y=train_permY,
method='ridge',
tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold05: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
ridgeRegFit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 121, 119, 120, 119, 121, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 16.67673 0.3892225 12.039832
## 0.1 13.18946 0.4723426 9.542513
## 0.2 13.03888 0.4947783 9.375433
## 0.3 13.23126 0.5036644 9.450973
## 0.4 13.58231 0.5082904 9.699912
## 0.5 14.03206 0.5108524 10.049187
## 0.6 14.56629 0.5124485 10.544434
## 0.7 15.15949 0.5135351 11.081939
## 0.8 15.79820 0.5143557 11.655795
## 0.9 16.48145 0.5148649 12.295152
## 1.0 17.19775 0.5152141 12.952609
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.2.
After 10-fold cross-validation, the ridge model identified an optimal lambda penalty of 0.2, which minimized the RMSE but did not maximize \(R^2\) (the \(R^2\) keeps creeping upward as lambda grows toward 1).
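The chosen lambda and its resampled performance can also be pulled straight from the fit object, as a quick sketch (output not shown above):
ridgeRegFit$bestTune        # the lambda caret selected (0.2)
getTrainPerf(ridgeRegFit)   # resampled RMSE, Rsquared, and MAE at that lambda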
ridge_var_imp <- varImp(ridgeRegFit,scale=FALSE)
top_ridge_df <- head(ridge_var_imp$importance |> arrange(-Overall),25)
ggplot(data=top_ridge_df,aes(x=Overall,y=reorder(rownames(top_ridge_df),Overall))) +
geom_bar(stat='identity') +
labs(title='Most important variables in Ridge Regression',y='Predictors')
lfpFit <- train(x=train_finger,
y=train_permY,
method='lasso',
tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.01)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold02: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold08: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
lfpFit
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 118, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 15.84899 NaN 12.557052
## 0.01 14.41453 0.4663622 11.335468
## 0.02 13.42165 0.4776610 10.376542
## 0.03 12.75480 0.4829840 9.600487
## 0.04 12.31003 0.4986082 9.067208
## 0.05 11.92013 0.5237388 8.642053
## 0.06 11.73624 0.5292693 8.386778
## 0.07 11.70822 0.5226766 8.313568
## 0.08 11.66058 0.5188943 8.211338
## 0.09 11.62523 0.5156880 8.153741
## 0.10 11.59529 0.5134110 8.061772
## 0.11 11.54799 0.5124494 7.961476
## 0.12 11.52212 0.5128684 7.976015
## 0.13 11.47629 0.5145714 7.970087
## 0.14 11.45083 0.5138270 7.949818
## 0.15 11.44164 0.5114545 7.924643
## 0.16 11.46520 0.5076419 7.955653
## 0.17 11.51218 0.5034053 7.997677
## 0.18 11.55970 0.4995324 8.039333
## 0.19 11.59722 0.4970285 8.071813
## 0.20 11.62118 0.4956033 8.088565
## 0.21 11.64660 0.4943499 8.108581
## 0.22 11.68438 0.4917432 8.148406
## 0.23 11.73113 0.4884401 8.187316
## 0.24 11.76452 0.4858565 8.199227
## 0.25 11.80173 0.4834556 8.212295
## 0.26 11.85367 0.4798908 8.237946
## 0.27 11.90546 0.4762125 8.274100
## 0.28 11.94583 0.4732587 8.305997
## 0.29 11.98965 0.4701541 8.339552
## 0.30 12.03446 0.4670109 8.371814
## 0.31 12.06864 0.4648629 8.399720
## 0.32 12.09809 0.4632811 8.429887
## 0.33 12.12225 0.4622335 8.456906
## 0.34 12.14739 0.4612992 8.481985
## 0.35 12.16981 0.4606670 8.503759
## 0.36 12.16694 0.4614412 8.510481
## 0.37 12.17808 0.4614279 8.523784
## 0.38 12.19186 0.4614223 8.542521
## 0.39 12.20947 0.4608577 8.564572
## 0.40 12.23086 0.4598441 8.588747
## 0.41 12.25881 0.4584283 8.617467
## 0.42 12.29273 0.4570994 8.651993
## 0.43 12.33188 0.4554681 8.687168
## 0.44 12.37083 0.4536491 8.720676
## 0.45 12.40461 0.4522291 8.759340
## 0.46 12.44470 0.4507220 8.800022
## 0.47 12.50724 0.4483127 8.851814
## 0.48 12.57106 0.4459671 8.903393
## 0.49 12.63669 0.4434637 8.956676
## 0.50 12.70035 0.4409328 9.010415
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.15.
Let’s see what terms remain important in the model:
lasso_var_imp <- varImp(lfpFit,scale=FALSE)
top_lasso_df <- head(lasso_var_imp$importance |> arrange(-Overall),25)
ggplot(data=top_lasso_df,aes(x=Overall,y=reorder(rownames(top_lasso_df),Overall))) +
geom_bar(stat='identity') +
labs(title='Most important variables in Lasso Regression',y='Predictors')
head(lasso_var_imp$importance,20)
## Overall
## X1 0.0423781376
## X2 0.0560929401
## X3 0.0461494364
## X4 0.0461494364
## X5 0.0461494364
## X6 0.2558763551
## X11 0.0020932720
## X12 0.0229244226
## X15 0.1168511490
## X16 0.2613520881
## X20 0.0461494364
## X21 0.0461494364
## X25 0.0158237514
## X26 0.0461494364
## X27 0.0461494364
## X28 0.0461494364
## X29 0.0461494364
## X35 0.0007154195
## X36 0.0021717551
## X37 0.0158237514
set.seed(6)
# Grid search for optimized penalty lambdas
enet_gs <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
# tuning penalized regression model
enet_fit <- train(train_finger, train_permY, method = "enet",
tuneGrid = enet_gs, trControl = tr_ctrl, preProc = c("center", "scale"))
## Warning: model fit failed for Fold01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet_fit)
enet_var_imp <- varImp(enet_fit,scale=FALSE)
top_enet_df <- head(enet_var_imp$importance |> arrange(-Overall),25)
ggplot(data=top_enet_df,aes(x=Overall,y=reorder(rownames(top_enet_df),Overall))) +
geom_bar(stat='identity') +
labs(title='Most important variables in ElasticNet Regression',y='Predictors')
ridge_preds <- predict(ridgeRegFit,test_finger)
lasso_preds <- predict(lfpFit,test_finger)
enet_preds <- predict(enet_fit,test_finger)
ridge_test <- caret::defaultSummary(data.frame(pred=ridge_preds,obs=test_permY))
lasso_test <- caret::defaultSummary(data.frame(pred=lasso_preds,obs=test_permY))
enet_test <- caret::defaultSummary(data.frame(pred=enet_preds,obs=test_permY))
The models show a mix of metrics on the test set. The RMSE values are all in the same ballpark, although the lasso model has the best value for this criterion. The lasso model's \(R^2\) isn't as high as the ridge or PLS models', but its mean absolute error is also the best. Therefore, if any model would help to identify potential improvements it would be the lasso model. It could also help identify the most important predictors and allow the researchers to ignore ones that are not statistically predictive for their problem.
test_evals <-rbind(pls_test,ridge_test,lasso_test,enet_test)
rownames(test_evals) <- c('PLS','Ridge','Lasso','Elastic Net')
knitr::kable(test_evals)
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| PLS | 13.78783 | 0.3262867 | 9.680945 |
| Ridge | 13.14550 | 0.4220319 | 9.065979 |
| Lasso | 12.89992 | 0.3217248 | 8.834154 |
| Elastic Net | 13.43265 | 0.2534215 | 9.084110 |
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
data("ChemicalManufacturingProcess")
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
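A quick dimension check lines up with that description (one Yield column plus the 57 predictors):
dim(ChemicalManufacturingProcess)   # 176 runs, 58 columns (Yield + 57 predictors)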
summary(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
## Median :39.97 Median :6.305 Median :55.09 Median :67.22
## Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
## Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
##
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## Min. : 9.38 Min. :13.24 Min. :40.60
## 1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
## Median :12.10 Median :18.49 Median :48.46
## Mean :12.35 Mean :18.60 Mean :48.91
## 3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
## Max. :23.09 Max. :24.85 Max. :59.38
##
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## Min. :100.0 Min. :15.88 Min. :11.44
## 1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
## Median :100.0 Median :17.51 Median :12.84
## Mean :100.0 Mean :17.49 Mean :12.85
## 3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
## Max. :100.8 Max. :19.14 Max. :14.08
##
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :2.710 Median :146.1 Median :20.12
## Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :6.870 Max. :158.7 Max. :22.21
##
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## NA's :1 NA's :3 NA's :15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.2 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## NA's :1 NA's :1 NA's :2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## NA's :1 NA's :1
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.400 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## NA's :1
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
##
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
##
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
## Median :4855 Median :6047 Median :4587
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :19.90 Median : 9.100
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.80 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## NA's :5 NA's :5
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.0 Median :0.02000
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
##
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## NA's :1 NA's :1
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
##
x_chem <- ChemicalManufacturingProcess |> select(-Yield)
y_chem <- ChemicalManufacturingProcess |> select(Yield)
# Count the missing values in each predictor column
x_chem |>
  pivot_longer(cols = colnames(x_chem)) |>
  filter(is.na(value)) |>
  group_by(name) |>
  summarise(cnt = n()) |>
  arrange(-cnt)
## # A tibble: 28 × 2
## name cnt
## <chr> <int>
## 1 ManufacturingProcess03 15
## 2 ManufacturingProcess11 10
## 3 ManufacturingProcess10 9
## 4 ManufacturingProcess25 5
## 5 ManufacturingProcess26 5
## 6 ManufacturingProcess27 5
## 7 ManufacturingProcess28 5
## 8 ManufacturingProcess29 5
## 9 ManufacturingProcess30 5
## 10 ManufacturingProcess31 5
## # ℹ 18 more rows
This graphic shows the percentage of missing values in each column, which is helpful for viewing them all in one place rather than scanning the counts from the code above.
plot_missing(x_chem, missing_only = TRUE,
ggtheme = theme_classic())
There are a couple of easy methods within preProcess that can be used to apply transformations and imputation. Let's apply KNN to find the nearest neighbors for imputation:
imputed <- preProcess(x_chem,method=c('center','scale','knnImpute'),k=5)
x_chem_imp <- predict(imputed,x_chem)
Let's confirm there aren't any missing values left:
sum(is.na(x_chem_imp))
## [1] 0
set.seed(18)
# 70/30 train/test split on the yield response
chem_train <- caret::createDataPartition(y_chem$Yield, p = .70, list= FALSE)
x_train_chem <- x_chem_imp[chem_train,]
x_test_chem <- x_chem_imp[-chem_train,]
y_train_chem <- y_chem[chem_train,]
y_test_chem <- y_chem[-chem_train,]
Let's compare ridge, lasso, and elastic net (a blend of the two) to see which penalization approach minimizes the RMSE:
set.seed(12)
rgs <- data.frame(.lambda = seq(0,0.1,length=15))
ridgeFit_chem <- train(x=x_train_chem,y=y_train_chem,
method='ridge',
tuneGrid=rgs,
trControl=tr_ctrl,
preProc = c('center','scale')
)
ridgeFit_chem
## Ridge Regression
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 110, 112, 112, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 3.604689 0.4259973 2.121548
## 0.007142857 2.730452 0.5060252 1.636248
## 0.014285714 2.500808 0.5204816 1.527147
## 0.021428571 2.401962 0.5260009 1.483482
## 0.028571429 2.347087 0.5281048 1.460613
## 0.035714286 2.311949 0.5286619 1.446628
## 0.042857143 2.287149 0.5284755 1.436774
## 0.050000000 2.268308 0.5279328 1.430753
## 0.057142857 2.253142 0.5272322 1.426133
## 0.064285714 2.240366 0.5264792 1.422379
## 0.071428571 2.229220 0.5257302 1.419430
## 0.078571429 2.219235 0.5250147 1.417069
## 0.085714286 2.210112 0.5243477 1.414907
## 0.092857143 2.201655 0.5237355 1.412765
## 0.100000000 2.193734 0.5231794 1.410581
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
The best individual lambda is 0.1, which will be interesting to compare against the elastic net model; its in-sample (cross-validated) RMSE is 2.19.
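Since that lambda sits at the upper edge of the grid that was searched, a wider grid could be tried to confirm the choice. A sketch of such a refit follows (rgs_wide and ridgeFit_chem_wide are hypothetical and the results are not shown here):
rgs_wide <- data.frame(.lambda = seq(0, 0.5, length = 26))
ridgeFit_chem_wide <- train(x = x_train_chem, y = y_train_chem,
                            method = 'ridge', tuneGrid = rgs_wide,
                            trControl = tr_ctrl, preProc = c('center', 'scale'))
ridgeFit_chem_wide$bestTune   # check whether a lambda above 0.1 does better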
Let’s run the lasso regression model now as well:
set.seed(12)
frac_gs <- data.frame(.fraction = seq(0,1,by=0.05))
lassoFit_chem <- train(x=x_train_chem,y=y_train_chem,
method='lasso',
tuneGrid=frac_gs,
trControl=tr_ctrl,
preProc = c('center','scale')
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
The Lasso Model Results:
lassoFit_chem
## The lasso
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 110, 112, 112, 112, 112, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 1.774324 NaN 1.4558734
## 0.05 1.185460 0.5949685 0.9761787
## 0.10 1.178212 0.5885738 0.9510623
## 0.15 1.217401 0.5817917 0.9623385
## 0.20 1.178073 0.6114951 0.9425087
## 0.25 1.271762 0.5674444 1.0151536
## 0.30 1.474895 0.5216791 1.1301455
## 0.35 1.577076 0.4987345 1.1936457
## 0.40 1.673373 0.4850159 1.2395459
## 0.45 1.786916 0.4712343 1.2884060
## 0.50 2.054639 0.4573306 1.4112541
## 0.55 2.423125 0.4482429 1.5634704
## 0.60 2.428828 0.4434366 1.5912224
## 0.65 2.460219 0.4404302 1.6118995
## 0.70 2.528606 0.4376021 1.6470107
## 0.75 2.638364 0.4349723 1.6946468
## 0.80 2.752816 0.4331351 1.7399360
## 0.85 2.878973 0.4311473 1.7886424
## 0.90 3.026342 0.4291307 1.8440035
## 0.95 3.365382 0.4275135 2.0106954
## 1.00 3.604689 0.4259973 2.1215479
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.2.
The optimal value for the fraction parameter is 0.2, which yields an in-sample (cross-validated) RMSE of 1.178 and outperforms the ridge regression.
Lastly, let's run the elastic net model, which combines the ridge and lasso penalties and tunes both parameters:
set.seed(12)
# Grid search for optimized penalty lambdas
enet_gs <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
# tuning penalized regression model
enet_fit_chem <- train(x_train_chem, y_train_chem, method = "enet",
tuneGrid = enet_gs, trControl = tr_ctrl, preProc = c("center", "scale"))
plot(enet_fit_chem)
enet_fit_chem
## Elasticnet
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 110, 112, 112, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 1.185460 0.5949685 0.9761787
## 0.00 0.10 1.178212 0.5885738 0.9510623
## 0.00 0.15 1.217401 0.5817917 0.9623385
## 0.00 0.20 1.178073 0.6114951 0.9425087
## 0.00 0.25 1.271762 0.5674444 1.0151536
## 0.00 0.30 1.474895 0.5216791 1.1301455
## 0.00 0.35 1.577076 0.4987345 1.1936457
## 0.00 0.40 1.673373 0.4850159 1.2395459
## 0.00 0.45 1.786916 0.4712343 1.2884060
## 0.00 0.50 2.054639 0.4573306 1.4112541
## 0.00 0.55 2.423125 0.4482429 1.5634704
## 0.00 0.60 2.428828 0.4434366 1.5912224
## 0.00 0.65 2.460219 0.4404302 1.6118995
## 0.00 0.70 2.528606 0.4376021 1.6470107
## 0.00 0.75 2.638364 0.4349723 1.6946468
## 0.00 0.80 2.752816 0.4331351 1.7399360
## 0.00 0.85 2.878973 0.4311473 1.7886424
## 0.00 0.90 3.026342 0.4291307 1.8440035
## 0.00 0.95 3.365382 0.4275135 2.0106954
## 0.00 1.00 3.604689 0.4259973 2.1215479
## 0.01 0.05 1.416959 0.5625946 1.1439554
## 0.01 0.10 1.215877 0.5913592 1.0024114
## 0.01 0.15 1.179914 0.5849740 0.9769658
## 0.01 0.20 1.193521 0.5749623 0.9739957
## 0.01 0.25 1.238737 0.5599391 0.9826093
## 0.01 0.30 1.170353 0.5933902 0.9494856
## 0.01 0.35 1.165372 0.5990682 0.9442126
## 0.01 0.40 1.253421 0.5690645 0.9978396
## 0.01 0.45 1.395531 0.5461331 1.0699375
## 0.01 0.50 1.519609 0.5420232 1.1229987
## 0.01 0.55 1.570763 0.5448129 1.1451582
## 0.01 0.60 1.632989 0.5444553 1.1725678
## 0.01 0.65 1.719087 0.5426914 1.2094253
## 0.01 0.70 1.856697 0.5388151 1.2685455
## 0.01 0.75 2.047185 0.5341109 1.3463414
## 0.01 0.80 2.245804 0.5296128 1.4266243
## 0.01 0.85 2.421063 0.5249558 1.4964331
## 0.01 0.90 2.492555 0.5204042 1.5257318
## 0.01 0.95 2.553484 0.5168639 1.5515391
## 0.01 1.00 2.609982 0.5136214 1.5779534
## 0.10 0.05 1.556425 0.4794030 1.2649904
## 0.10 0.10 1.377568 0.5673686 1.1120479
## 0.10 0.15 1.259038 0.5801671 1.0327710
## 0.10 0.20 1.202020 0.5798737 0.9945356
## 0.10 0.25 1.210505 0.5612809 1.0014380
## 0.10 0.30 1.223439 0.5524961 1.0008888
## 0.10 0.35 1.264786 0.5392222 1.0080186
## 0.10 0.40 1.280128 0.5395270 1.0076313
## 0.10 0.45 1.255157 0.5490661 0.9992590
## 0.10 0.50 1.232135 0.5594726 0.9937856
## 0.10 0.55 1.206974 0.5744677 0.9769703
## 0.10 0.60 1.282965 0.5587305 1.0183883
## 0.10 0.65 1.405600 0.5470514 1.0767923
## 0.10 0.70 1.520317 0.5397956 1.1281169
## 0.10 0.75 1.650197 0.5352548 1.1842849
## 0.10 0.80 1.781200 0.5319931 1.2400027
## 0.10 0.85 1.904189 0.5292173 1.2915373
## 0.10 0.90 2.034644 0.5271368 1.3458364
## 0.10 0.95 2.124133 0.5250614 1.3828653
## 0.10 1.00 2.193734 0.5231794 1.4105805
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.35 and lambda = 0.01.
The output of the elastic net model shows that a lambda of 0.01 (the ridge penalty) and a fraction of 0.35 (the lasso constraint) provide the optimal cross-validated RMSE of 1.165. This is the lowest training-set RMSE of all the methods attempted so far for the chemical manufacturing data.
x_ridge_pred_chem <- predict(ridgeFit_chem,x_test_chem)
x_lasso_pred_chem <- predict(lassoFit_chem,x_test_chem)
x_enet_pred_chem <- predict(enet_fit_chem,x_test_chem)
caret::defaultSummary(data.frame(pred=x_ridge_pred_chem,obs=y_test_chem))
## RMSE Rsquared MAE
## 2.0067806 0.3769701 1.2991752
caret::defaultSummary(data.frame(pred=x_lasso_pred_chem,obs=y_test_chem))
## RMSE Rsquared MAE
## 1.5022300 0.5067301 1.1811992
caret::defaultSummary(data.frame(pred=x_enet_pred_chem,obs=y_test_chem))
## RMSE Rsquared MAE
## 1.1323445 0.6639588 0.9507090
In line with the results from the training data, the elastic net model outperformed the ridge and lasso models that each relied on a single penalty. The ridge model's test RMSE was actually better than its resampled training RMSE, but it was still higher than the other models tested. The lasso regularization had a higher RMSE in the test set compared to the training data, while the elastic net model's test-set RMSE was slightly lower.
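For a side-by-side view, the three test-set summaries above can be collected into one table, mirroring the comparison built for 6.2 (chem_test_evals is a new helper object; no new numbers are computed):
chem_test_evals <- rbind(
  caret::defaultSummary(data.frame(pred = x_ridge_pred_chem, obs = y_test_chem)),
  caret::defaultSummary(data.frame(pred = x_lasso_pred_chem, obs = y_test_chem)),
  caret::defaultSummary(data.frame(pred = x_enet_pred_chem, obs = y_test_chem)))
rownames(chem_test_evals) <- c('Ridge', 'Lasso', 'Elastic Net')
knitr::kable(chem_test_evals)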
plot(varImp(enet_fit_chem,scale=TRUE))
Manufacturing process predictors appear to be the most important, with ManufacturingProcess32 and ManufacturingProcess36 far exceeding the importance of the other available predictors. Note that several manufacturing processes are also shrunk to zero by the regularization, so not all processes are created equal for this chemical manufacturing problem.
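To back up the point about regularization zeroing out predictors, here is a sketch of counting the coefficients that are exactly zero in the final elastic net fit. This assumes the elasticnet predict method: the selected fraction of 0.35 is supplied through s, lambda = 0.01 is already part of finalModel, and enet_coefs is a hypothetical helper.
enet_coefs <- predict(enet_fit_chem$finalModel, s = 0.35,
                      mode = "fraction", type = "coefficients")$coefficients
sum(enet_coefs == 0)   # predictors shrunk exactly to zero
sum(enet_coefs != 0)   # predictors retained by the model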
imp_chem_preds <- c('ManufacturingProcess32','ManufacturingProcess36','BiologicalMaterial06','ManufacturingProcess13','BiologicalMaterial03','ManufacturingProcess31','Yield')
par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))
plot(x=x_chem_imp$ManufacturingProcess32,y=y_chem$Yield,main='Manu_Process32 vs Yield')
plot(x=x_chem_imp$ManufacturingProcess36,y=y_chem$Yield,main='Manu_Process36 vs Yield')
plot(x=x_chem_imp$ManufacturingProcess31,y=y_chem$Yield,main='Manu_Process31 vs Yield')
plot(x=x_chem_imp$ManufacturingProcess13,y=y_chem$Yield,main='Manu_Process13 vs Yield')
plot(x=x_chem_imp$BiologicalMaterial03,y=y_chem$Yield,main='Bio_Process03 vs Yield')
plot(x=x_chem_imp$BiologicalMaterial06,y=y_chem$Yield,main='Bio_Process06 vs Yield')
Most of these predictors appear to have strong positive or negative linear relationships with yield, which explains why they would be considered important in a regression model. It is surprising that ManufacturingProcess31 ranks as important, since there doesn't seem to be much variation in its values that corresponds to changes in yield. ManufacturingProcess36 is also concentrated at a few specific values, although it still shows a fairly strong negative relationship with yield. Ultimately, this analysis would indicate to the manufacturer which processes to invest in for future R&D or improvement, since those processes have the greatest likelihood of impacting yield, and it would help them focus on the factors that consistently raise yield or keep it from dropping.
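As a rough numeric check on those visual impressions, the correlations between the top predictors and yield could be computed directly. This is a sketch using the imputed predictors (top_preds is a hypothetical helper; the centering/scaling applied during imputation does not affect correlation):
top_preds <- c('ManufacturingProcess32', 'ManufacturingProcess36', 'BiologicalMaterial06',
               'ManufacturingProcess13', 'BiologicalMaterial03', 'ManufacturingProcess31')
sapply(top_preds, function(p) cor(x_chem_imp[[p]], y_chem$Yield))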