We load the permeability data, which includes two datasets relating to 165 cases:
- fingerprints: the predictor data for 1,107 variables; each variable is binary and takes on the value 0 or 1
- permeability: the response data, which ranges from 0.06 to 55.6.

data("permeability")
str(fingerprints)
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
str(permeability)
## num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr "permeability"
cat("Fingerprints range of values: ", range(fingerprints))
## Fingerprints range of values: 0 1
summary(permeability)
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
The fingerprints matrix is sparse, as only about 15% of the entries are non-zero. Using the nearZeroVar function, we see that:

- 719 predictors are flagged as having near-zero variance
- 38 of these have zero variance, i.e. only a single unique value
- removing the near-zero variance predictors leaves 388 predictors.

The filtered dataset is less sparse than the raw dataset, but still only ~31% of the entries are non-zero.
cat("Proportion of non-zero entries in raw data: ", sum(fingerprints != 0) / nrow(fingerprints) / ncol(fingerprints))
## Proportion of non-zero entries in raw data: 0.1547672
# variables flagged as near-zero variance
nzv_idx <- nearZeroVar(fingerprints)
# variables having only 1 unique value
zv_idx <- nearZeroVar(fingerprints, uniqueCut = 1.5 / nrow(fingerprints))
cat("Number of near-zero variance predictors: ", length(nzv_idx))
## Number of near-zero variance predictors: 719
cat("Number of zero variance predictors: ", length(zv_idx))
## Number of zero variance predictors: 38
cat("Number of remaining variables excluding nzv predictors: ", ncol(fingerprints) - length(nzv_idx))
## Number of remaining variables excluding nzv predictors: 388
# filter out near-zero variance predictors
finger <- fingerprints[ , -nzv_idx]
cat("Proportion of non-zero entries in filtered data: ", sum(finger != 0) / nrow(finger) / ncol(finger))
## Proportion of non-zero entries in filtered data: 0.309294
First we partition the data into training and test sets using the createDataPartition function. We choose a 75/25 split, and confirm that the distribution of the response is similar for the training and test sets, and that they match the distribution for permeability seen above.
# training/test split
set.seed(314)
train_idx <- createDataPartition(permeability, p = 0.75, list = FALSE)
finger_train <- finger[train_idx, ]
finger_test <- finger[-train_idx, ]
perm_train <- permeability[train_idx]
perm_test <- permeability[-train_idx]
summary(perm_train)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06 1.55 4.91 12.09 14.98 51.41
summary(perm_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.320 1.508 4.857 12.691 16.454 55.600
Next we decide on pre-processing and tuning methodology, which we will specify within the train function. For pre-processing, we choose:

- centering and scaling of the predictors.

For tuning, we need to specify the tuning parameter and the tuning methodology:

- ncomp, the number of components, is the tuning parameter for partial least squares regression; we will tune the model over a range from 1 to 20 components
- for re-sampling, we use 10-fold cross-validation.

From the model summary below, the optimal number of latent variables (components) is 2, for which the re-sampled results on the training set are RMSE = 11.27, \(R^2\) = 0.51, and MAE = 8.08.
These results are somewhat disappointing, since the RMSE is almost as large as the mean value of the response, and the \(R^2\) indicates that the model explains only half of the total variability in the response.
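For additional context, here is a minimal sketch (not part of the original analysis) of a naive baseline: the RMSE of a model that always predicts the training-set mean, which any useful model should beat comfortably.
# naive baseline (sketch): RMSE obtained by always predicting the training mean
baseline_rmse <- sqrt(mean((perm_train - mean(perm_train))^2))
baseline_rmse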
# 10-fold cross validation
set.seed(1012)
ctrl <- trainControl(method = "cv", number = 10)
# fit pls model with tuning on 1 to 20 components
plsfit <- train(x = finger_train, y = perm_train, method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = ctrl)
plsfit
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 113, 112, 113, 113, 112, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.76768 0.3286721 9.865718
## 2 11.26782 0.5053179 8.084910
## 3 11.36466 0.5117740 8.697612
## 4 11.74135 0.4966289 8.955769
## 5 11.54523 0.5088994 8.459935
## 6 11.28930 0.5392750 8.615870
## 7 11.50133 0.5143293 8.656224
## 8 11.63637 0.5078302 8.744130
## 9 11.79615 0.5062645 8.831212
## 10 12.23511 0.4785289 9.067635
## 11 12.36024 0.4734019 9.156830
## 12 12.66178 0.4605127 9.480229
## 13 12.93557 0.4509476 9.703446
## 14 13.58501 0.4269536 10.131409
## 15 13.90140 0.4132301 10.316673
## 16 14.21406 0.4004785 10.495583
## 17 14.24828 0.4034536 10.412802
## 18 14.34146 0.4021287 10.212898
## 19 14.48917 0.3941000 10.228304
## 20 14.54602 0.3936051 10.223234
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
summary(plsfit)
## Data: X dimension: 125 388
## Y dimension: 125 1
## Fit method: oscorespls
## Number of components considered: 2
## TRAINING: % variance explained
## 1 comps 2 comps
## X 22.48 34.70
## .outcome 32.30 51.89
Applying the fitted PLS model to the test set, we see the performance metrics below:
As expected, the performance metrics on the test set are worse than on the training set, since the model was fitted to the training data.
pred_test <- predict(plsfit, newdata = finger_test)
data.frame(obs = perm_test, pred = pred_test) %>% defaultSummary()
## RMSE Rsquared MAE
## 12.8363398 0.4017059 8.2097155
Next we try fitting ridge regression and lasso models to the training data. As before, we use the same pre-processing (centering & scaling) and re-sampling method (10-fold cross-validation). The tuning parameters are \(\lambda\) for the ridge regression model and fraction for the lasso model.
From the summary of the ridge regression model, we see that the optimal value of \(\lambda\) is 0.17, for which the re-sampled performance metrics on the training set are RMSE = 12.54, \(R^2\) = 0.49, and MAE = 9.25.
# fit ridge regression model with tuning on lambda from 0.1 to 0.25
set.seed(1012)
ridgeGrid <- data.frame(lambda = seq(0.1, 0.25, length = 10))
ridgefit <- train(x = finger_train, y = perm_train, method = "ridge",
preProcess = c("center", "scale"),
tuneGrid = ridgeGrid,
trControl = ctrl)
ridgefit
## Ridge Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 113, 112, 113, 113, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.1000000 12.70560 0.4758954 9.212872
## 0.1166667 12.64887 0.4807533 9.217095
## 0.1333333 12.58940 0.4863382 9.218976
## 0.1500000 12.56207 0.4899984 9.228963
## 0.1666667 12.53829 0.4938164 9.251851
## 0.1833333 12.54760 0.4959327 9.300625
## 0.2000000 12.54444 0.4999351 9.346377
## 0.2166667 12.54067 0.5026797 9.387724
## 0.2333333 12.56126 0.5048426 9.442846
## 0.2500000 12.58124 0.5072145 9.493727
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1666667.
From the summary of the lasso model, the optimal value of fraction is 0.2, for which the re-sampled performance metrics on the training set are RMSE = 10.72, \(R^2\) = 0.57, and MAE = 7.66.
# fit lasso model with tuning on fraction from 0.05 to 1
set.seed(1012)
lassoGrid <- data.frame(lambda = 0, fraction = seq(0.05, 1, length = 20))
lassofit <- train(x = finger_train, y = perm_train, method = "enet",
preProcess = c("center", "scale"),
tuneGrid = lassoGrid,
trControl = ctrl)
lassofit
## Elasticnet
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 113, 112, 113, 113, 112, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.05 11.55552 0.5796203 8.900508
## 0.10 10.85161 0.6112838 7.847782
## 0.15 10.76157 0.5967853 7.663987
## 0.20 10.72099 0.5733757 7.663274
## 0.25 10.76875 0.5472910 7.755845
## 0.30 10.91337 0.5259227 7.864847
## 0.35 11.11683 0.5077210 8.016408
## 0.40 11.26524 0.4958540 8.136810
## 0.45 11.38578 0.4842928 8.197103
## 0.50 11.49917 0.4746144 8.240137
## 0.55 11.58142 0.4699934 8.256802
## 0.60 11.60609 0.4689958 8.221204
## 0.65 11.66647 0.4660211 8.210972
## 0.70 11.71553 0.4648591 8.222448
## 0.75 11.80665 0.4622965 8.271713
## 0.80 11.89865 0.4591582 8.303971
## 0.85 11.99548 0.4552830 8.325700
## 0.90 12.07548 0.4521983 8.353757
## 0.95 12.13595 0.4508933 8.369571
## 1.00 12.22527 0.4486670 8.418347
##
## Tuning parameter 'lambda' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.
Let’s find out how many predictors remain in the fitted ridge and lasso models. We do this by extracting the coefficients from the final fitted models using the predict.enet method. We see that the ridge regression model keeps all 388 predictors in the model, while the lasso model keeps only 49 predictors. This illustrates that the lasso model has performed feature selection in setting many coefficients to zero.
ridge_coeff <- predict(ridgefit$finalModel, s = 1, mode = "fraction", type = "coefficients")$coefficients
lasso_coeff <- predict(lassofit$finalModel, s = 0.2, mode = "fraction", type = "coefficients")$coefficients
cat("Number of non-zero coefficients in ridge model: ", sum(ridge_coeff != 0))
## Number of non-zero coefficients in ridge model: 388
cat("Number of non-zero coefficients in lasso model: ", sum(lasso_coeff != 0))
## Number of non-zero coefficients in lasso model: 49
Finally we summarize the performance metrics on the training and test sets, for all the models considered. Note that the training RMSE and \(R^2\) measures below differ from the re-sampled training measures seen above, since these performance metrics are computed on the training set without re-sampling.
Overall we see that the ridge regression model performs best on the test set, with a test RMSE of 12.76 and a test \(R^2\) of 0.43.
However, performance is still disappointing in that the RMSE is roughly equivalent to the mean value of the response, and the model explains less than half of the variability in the response.
pls_train <- data.frame(obs = perm_train, pred = predict(plsfit)) %>% defaultSummary()
pls_test <- data.frame(obs = perm_test, pred = predict(plsfit, finger_test)) %>% defaultSummary()
ridge_train <- data.frame(obs = perm_train, pred = predict(ridgefit)) %>% defaultSummary()
ridge_test <- data.frame(obs = perm_test, pred = predict(ridgefit, finger_test)) %>% defaultSummary()
lasso_train <- data.frame(obs = perm_train, pred = predict(lassofit)) %>% defaultSummary()
lasso_test <- data.frame(obs = perm_test, pred = predict(lassofit, finger_test)) %>% defaultSummary()
perf_df <- data.frame(PLS = c(pls_train[1:2], pls_test[1:2]),
Ridge = c(ridge_train[1:2], ridge_test[1:2]),
Lasso = c(lasso_train[1:2], lasso_test[1:2]))
rownames(perf_df) <- c("Training RMSE", "Training Rsq", "Test RMSE", "Test Rsq")
perf_df %>% kable(caption = "Comparison of performance metrics")
| | PLS | Ridge | Lasso |
|---|---|---|---|
| Training RMSE | 10.5667565 | 6.1283975 | 8.0722802 |
| Training Rsq | 0.5189300 | 0.8431869 | 0.7295394 |
| Test RMSE | 12.8363398 | 12.7596971 | 12.8611534 |
| Test Rsq | 0.4017059 | 0.4272418 | 0.3864340 |
I would recommend further development work and research into model building. Based on the RMSE and \(R^2\) metrics, it is clear that none of the models considered (partial least squares, ridge, and lasso) predicts well for this dataset. For example, on the test (holdout) set, the RMSE is roughly equal to the mean value of the response, and the \(R^2\) indicates that the model explains less than half of the response variation.
Although a cost-benefit analysis should weigh the high cost / high certainty of laboratory experiments against the low cost / low certainty of model prediction, I suspect that laboratory experiments would win this comparison. Hence I would recommend that none of these models be used to replace the permeability laboratory experiments.
We load the ChemicalManufacturingProcess data, which includes 176 observations of 58 variables, all of which are numeric. The 176 observations relate to manufacturing runs with variations in the biological materials and manufacturing process. The variables include:
- Yield, which is the product yield for the manufacturing run and which ranges from 35 to 46
- BiologicalMaterial__, which relate to measurements of the raw materials
- ManufacturingProcess__, which relate to measurements of the manufacturing process.

For ease of manipulation, we split the dataset into the response data yield and the predictor data cmp and relabel the biological variables as B__ and the manufacturing variables as M__. We see from the summary that many of the manufacturing process predictors have a handful of missing values, which we handle in the next part.
data("ChemicalManufacturingProcess")
str(ChemicalManufacturingProcess)
## 'data.frame': 176 obs. of 58 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.6 14.6 14.6 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
## $ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
## $ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
## $ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num NA 210 207 213 206 ...
## $ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
## $ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
## $ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
## $ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
## $ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
## $ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
## $ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
# for ease of manipulation, split response and predictors & relabel columns
yield <- ChemicalManufacturingProcess[1]
cmp <- ChemicalManufacturingProcess[-1]
colnames(cmp) <- c(paste0("B", 1:12), paste0("M", 1:45))
summary(yield)
## Yield
## Min. :35.25
## 1st Qu.:38.75
## Median :39.97
## Mean :40.18
## 3rd Qu.:41.48
## Max. :46.34
summary(cmp)
## B1 B2 B3 B4
## Min. :4.580 Min. :46.87 Min. :56.97 Min. : 9.38
## 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98 1st Qu.:11.24
## Median :6.305 Median :55.09 Median :67.22 Median :12.10
## Mean :6.411 Mean :55.69 Mean :67.70 Mean :12.35
## 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43 3rd Qu.:13.22
## Max. :8.810 Max. :64.75 Max. :78.25 Max. :23.09
##
## B5 B6 B7 B8
## Min. :13.24 Min. :40.60 Min. :100.0 Min. :15.88
## 1st Qu.:17.23 1st Qu.:46.05 1st Qu.:100.0 1st Qu.:17.06
## Median :18.49 Median :48.46 Median :100.0 Median :17.51
## Mean :18.60 Mean :48.91 Mean :100.0 Mean :17.49
## 3rd Qu.:19.90 3rd Qu.:51.34 3rd Qu.:100.0 3rd Qu.:17.88
## Max. :24.85 Max. :59.38 Max. :100.8 Max. :19.14
##
## B9 B10 B11 B12
## Min. :11.44 Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:12.60 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :12.84 Median :2.710 Median :146.1 Median :20.12
## Mean :12.85 Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:13.13 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :14.08 Max. :6.870 Max. :158.7 Max. :22.21
##
## M1 M2 M3 M4
## Min. : 0.00 Min. : 0.00 Min. :1.47 Min. :911.0
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53 1st Qu.:928.0
## Median :11.40 Median :21.00 Median :1.54 Median :934.0
## Mean :11.21 Mean :16.68 Mean :1.54 Mean :931.9
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55 3rd Qu.:936.0
## Max. :14.10 Max. :22.50 Max. :1.60 Max. :946.0
## NA's :1 NA's :3 NA's :15 NA's :1
## M5 M6 M7 M8
## Min. : 923.0 Min. :203.0 Min. :177.0 Min. :177.0
## 1st Qu.: 986.8 1st Qu.:205.7 1st Qu.:177.0 1st Qu.:177.0
## Median : 999.2 Median :206.8 Median :177.0 Median :178.0
## Mean :1001.7 Mean :207.4 Mean :177.5 Mean :177.6
## 3rd Qu.:1008.9 3rd Qu.:208.7 3rd Qu.:178.0 3rd Qu.:178.0
## Max. :1175.3 Max. :227.4 Max. :178.0 Max. :178.0
## NA's :1 NA's :2 NA's :1 NA's :1
## M9 M10 M11 M12
## Min. :38.89 Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.:44.89 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median :45.73 Median : 9.100 Median : 9.400 Median : 0.0
## Mean :45.66 Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.:46.52 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :49.36 Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## M13 M14 M15 M16
## Min. :32.10 Min. :4701 Min. :5904 Min. : 0
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010 1st Qu.:4561
## Median :34.60 Median :4856 Median :6032 Median :4588
## Mean :34.51 Mean :4854 Mean :6039 Mean :4566
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061 3rd Qu.:4619
## Max. :38.60 Max. :5055 Max. :6233 Max. :4852
## NA's :1
## M17 M18 M19 M20
## Min. :31.30 Min. : 0 Min. :5890 Min. : 0
## 1st Qu.:33.50 1st Qu.:4813 1st Qu.:6001 1st Qu.:4553
## Median :34.40 Median :4835 Median :6022 Median :4582
## Mean :34.34 Mean :4810 Mean :6028 Mean :4556
## 3rd Qu.:35.10 3rd Qu.:4862 3rd Qu.:6050 3rd Qu.:4610
## Max. :40.00 Max. :4971 Max. :6146 Max. :4759
##
## M21 M22 M23 M24
## Min. :-1.8000 Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.:-0.6000 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median :-0.3000 Median : 5.000 Median :3.000 Median : 8.000
## Mean :-0.1642 Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 0.0000 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. : 3.6000 Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## M25 M26 M27 M28
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.000
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560 1st Qu.: 0.000
## Median :4855 Median :6047 Median :4587 Median :10.400
## Mean :4828 Mean :6016 Mean :4563 Mean : 6.592
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609 3rd Qu.:10.750
## Max. :4990 Max. :6161 Max. :4710 Max. :11.500
## NA's :5 NA's :5 NA's :5 NA's :5
## M29 M30 M31 M32
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. :143.0
## 1st Qu.:19.70 1st Qu.: 8.800 1st Qu.:70.10 1st Qu.:155.0
## Median :19.90 Median : 9.100 Median :70.80 Median :158.0
## Mean :20.01 Mean : 9.161 Mean :70.18 Mean :158.5
## 3rd Qu.:20.40 3rd Qu.: 9.700 3rd Qu.:71.40 3rd Qu.:162.0
## Max. :22.00 Max. :11.200 Max. :72.50 Max. :173.0
## NA's :5 NA's :5 NA's :5
## M33 M34 M35 M36
## Min. :56.00 Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:62.00 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :64.00 Median :2.500 Median :495.0 Median :0.02000
## Mean :63.54 Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:65.00 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :70.00 Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5 NA's :5
## M37 M38 M39 M40
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.00000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100 1st Qu.:0.00000
## Median :1.000 Median :3.000 Median :7.200 Median :0.00000
## Mean :1.014 Mean :2.534 Mean :6.851 Mean :0.01771
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300 3rd Qu.:0.00000
## Max. :2.300 Max. :3.000 Max. :7.500 Max. :0.10000
## NA's :1
## M41 M42 M43 M44
## Min. :0.00000 Min. : 0.00 Min. : 0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:11.40 1st Qu.: 0.6000 1st Qu.:1.800
## Median :0.00000 Median :11.60 Median : 0.8000 Median :1.900
## Mean :0.02371 Mean :11.21 Mean : 0.9119 Mean :1.805
## 3rd Qu.:0.00000 3rd Qu.:11.70 3rd Qu.: 1.0250 3rd Qu.:1.900
## Max. :0.20000 Max. :12.10 Max. :11.0000 Max. :2.100
## NA's :1
## M45
## Min. :0.000
## 1st Qu.:2.100
## Median :2.200
## Mean :2.138
## 3rd Qu.:2.300
## Max. :2.600
##
yield <- yield[[1]] # numeric
cmp <- as.matrix(cmp) # matrix
We see below that 28 predictors (all manufacturing process variables) have missing data. The proportion of cases with missing data is well under 5% for every variable except the M3, M10, and M11 manufacturing predictors, and even for these three it remains under 10%.
numb_na <- function(x) sum(is.na(x))
missing <- apply(cmp, MARGIN = 2, FUN = numb_na)
cat("Number of predictors with missing values: ", sum(missing > 0))
## Number of predictors with missing values: 28
tibble(Variable = names(missing),
Numb_NA = missing,
Frac_NA = missing / nrow(cmp)) %>%
filter(Numb_NA > 0) %>%
kable(caption = "Predictors with missing values")
| Variable | Numb_NA | Frac_NA |
|---|---|---|
| M1 | 1 | 0.0056818 |
| M2 | 3 | 0.0170455 |
| M3 | 15 | 0.0852273 |
| M4 | 1 | 0.0056818 |
| M5 | 1 | 0.0056818 |
| M6 | 2 | 0.0113636 |
| M7 | 1 | 0.0056818 |
| M8 | 1 | 0.0056818 |
| M10 | 9 | 0.0511364 |
| M11 | 10 | 0.0568182 |
| M12 | 1 | 0.0056818 |
| M14 | 1 | 0.0056818 |
| M22 | 1 | 0.0056818 |
| M23 | 1 | 0.0056818 |
| M24 | 1 | 0.0056818 |
| M25 | 5 | 0.0284091 |
| M26 | 5 | 0.0284091 |
| M27 | 5 | 0.0284091 |
| M28 | 5 | 0.0284091 |
| M29 | 5 | 0.0284091 |
| M30 | 5 | 0.0284091 |
| M31 | 5 | 0.0284091 |
| M33 | 5 | 0.0284091 |
| M34 | 5 | 0.0284091 |
| M35 | 5 | 0.0284091 |
| M36 | 5 | 0.0284091 |
| M40 | 1 | 0.0056818 |
| M41 | 1 | 0.0056818 |
Given that missing data are not a prevalent feature of this dataset, we will impute values for the missing data. We don’t expect the choice of imputation method to have a material impact or bias on the results given the low incidence of missing data. In this case, we choose the bagged trees method for imputation, which we execute on the dataset using the preProcess function. The preProcess function has the ability to impute missing values using either the k-nearest neighbors or bagged trees approach; however the k-nearest neighbors approach automatically centers and scales the entire dataset, which we prefer not to do until we consider pre-processing in general (in the next part).
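For reference, the k-nearest-neighbors alternative would look roughly as follows (a sketch only, stored under separate object names so it does not affect the analysis below); as noted above, the imputed result comes back centered and scaled as a side effect.
# alternative imputation (sketch, not used below): k-nearest neighbors, which
# also centers and scales the predictors as a side effect
impute_knn <- preProcess(cmp, method = "knnImpute")
cmp_knn <- predict(impute_knn, cmp)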
impute <- preProcess(cmp, method = "bagImpute")
cmp <- predict(impute, cmp)
As before, we use the createDataPartition function to partition the data into a 75/25 split of training and test sets. We confirm that the distribution of the response is similar for the training and test sets, and that they match the distribution of the response for the full dataset seen above.
# training/test split
set.seed(314)
train_idx <- createDataPartition(yield, p = 0.75, list = FALSE)
yield_train <- yield[train_idx]
yield_test <- yield[-train_idx]
cmp_train <- cmp[train_idx, ]
cmp_test <- cmp[-train_idx, ]
summary(yield_train)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.25 38.75 39.97 40.21 41.48 46.34
summary(yield_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.77 38.78 39.92 40.07 41.05 43.62
Next we decide on pre-processing and tuning methodology, which we will specify within the train function. For pre-processing, we choose:

- centering and scaling of the predictors
- removal of near-zero variance predictors (one predictor is flagged, as shown below).

In this exercise, we decide to fit an elastic net model. We specify the tuning parameters and the tuning methodology as follows:

- fraction (the fraction of the full ordinary least squares solution) and \(\lambda\) (the weight decay, i.e. the ridge penalty parameter) are the tuning parameters for the elastic net model; we will tune the model over a range of fraction from 0.01 to 1 and a range of \(\lambda\) from 0 to 0.5. Note that for fraction = 0 the fitting procedure throws errors, and for \(\lambda\) above 0.5 the performance metrics deteriorate dramatically.
- for re-sampling, we use 10-fold cross-validation repeated 5 times.

From the model summary below, the optimal tuning parameters are fraction = 0.22 and \(\lambda\) = 0.005, for which the re-sampled results on the training set are RMSE = 1.19, \(R^2\) = 0.64, and MAE = 0.97.

These results appear reasonable; to put the RMSE in context, the yield response ranges from 35 to 46 (a spread of about 11 from min to max) with a mean/median value of ~40. So on the training set, the model predicts response values to within roughly 10% of the total range of variation, and it explains about 64% of the total response variability. However, the real test is how the model performs on the test set, which we evaluate in the next part.
Note that in tuning the elastic net model, the Box-Cox pre-processing step produced infinite values and caused the fit to fail, so we have removed it from the pre-processing specification.
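If a skewness transformation were still desired, one possible alternative (a sketch only, not used in this analysis) is the Yeo-Johnson transformation, which, unlike Box-Cox, is defined for zero and negative values and can be requested from preProcess in the same way.
# possible alternative to Box-Cox (sketch, not used here): Yeo-Johnson handles
# zero and negative predictor values
yj <- preProcess(cmp_train, method = c("YeoJohnson", "center", "scale", "nzv"))
cmp_train_yj <- predict(yj, cmp_train)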
# check skewness of predictors
data.frame(predictor_skewness = apply(cmp, MARGIN = 2, FUN = skewness)) %>% summary()
## predictor_skewness
## Min. :-12.85882
## 1st Qu.: -1.68180
## Median : 0.07891
## Mean : -1.61681
## 3rd Qu.: 0.37836
## Max. : 9.05487
# check near-zero variance predictors
nzv_idx <- nearZeroVar(cmp)
cat("Number of near-zero variance predictors: ", length(nzv_idx))
## Number of near-zero variance predictors: 1
# 10-fold cross validation
set.seed(1012)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# tuning grid with lambda & fraction
enetGrid <- expand.grid(lambda = c(0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5),
fraction = seq(0.01, 1, length = 20))
# fit elastic net model with tuning on lambda & fraction
enetfit <- train(x = cmp_train, y = yield_train, method = "enet",
preProcess = c("center", "scale", "nzv"),
# box-cox causes Inf values <<< DON'T USE
#preProcess = c("BoxCox", "center", "scale", "nzv"),
tuneGrid = enetGrid,
trControl = ctrl)
enetfit
## Elasticnet
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (56), scaled (56), remove (1)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.000 0.01000000 1.736467 0.5040668 1.4312419
## 0.000 0.06210526 1.258565 0.6076427 1.0395236
## 0.000 0.11421053 1.194780 0.6371438 0.9741011
## 0.000 0.16631579 1.266905 0.6351839 0.9862056
## 0.000 0.21842105 1.272135 0.6383147 0.9862901
## 0.000 0.27052632 1.427319 0.6011421 1.0501023
## 0.000 0.32263158 1.608945 0.5713870 1.1246547
## 0.000 0.37473684 1.745380 0.5511252 1.1884449
## 0.000 0.42684211 2.015506 0.5107628 1.2866805
## 0.000 0.47894737 2.244690 0.4866337 1.3771573
## 0.000 0.53105263 2.503468 0.4707712 1.4698052
## 0.000 0.58315789 2.731987 0.4603900 1.5537597
## 0.000 0.63526316 2.926356 0.4544760 1.6213958
## 0.000 0.68736842 3.128947 0.4496168 1.6788984
## 0.000 0.73947368 3.348314 0.4449675 1.7549604
## 0.000 0.79157895 3.546628 0.4391620 1.8216199
## 0.000 0.84368421 3.735245 0.4347966 1.8834692
## 0.000 0.89578947 3.891915 0.4328652 1.9339763
## 0.000 0.94789474 4.047174 0.4304942 1.9884199
## 0.000 1.00000000 4.183803 0.4295225 2.0389320
## 0.001 0.01000000 1.779540 0.4678851 1.4670757
## 0.001 0.06210526 1.328511 0.6065805 1.0989256
## 0.001 0.11421053 1.224201 0.6159998 1.0053929
## 0.001 0.16631579 1.200389 0.6359026 0.9716690
## 0.001 0.21842105 1.292994 0.6224379 0.9988664
## 0.001 0.27052632 1.271171 0.6347491 0.9830247
## 0.001 0.32263158 1.299487 0.6310666 0.9913723
## 0.001 0.37473684 1.383132 0.6087752 1.0311777
## 0.001 0.42684211 1.470427 0.5838725 1.0798286
## 0.001 0.47894737 1.700263 0.5400556 1.1681947
## 0.001 0.53105263 1.912758 0.5167119 1.2506599
## 0.001 0.58315789 2.101680 0.4993265 1.3213567
## 0.001 0.63526316 2.281688 0.4865754 1.3865954
## 0.001 0.68736842 2.474250 0.4770858 1.4533066
## 0.001 0.73947368 2.655756 0.4700678 1.5181198
## 0.001 0.79157895 2.840843 0.4621230 1.5807925
## 0.001 0.84368421 3.001003 0.4564578 1.6301575
## 0.001 0.89578947 3.161216 0.4518966 1.6717954
## 0.001 0.94789474 3.327116 0.4481326 1.7305826
## 0.001 1.00000000 3.447791 0.4451791 1.7792119
## 0.005 0.01000000 1.804707 0.4396747 1.4879960
## 0.005 0.06210526 1.427757 0.6017997 1.1815678
## 0.005 0.11421053 1.247714 0.6084465 1.0359356
## 0.005 0.16631579 1.209921 0.6229697 0.9909711
## 0.005 0.21842105 1.192347 0.6371211 0.9696260
## 0.005 0.27052632 1.281703 0.6208076 1.0003583
## 0.005 0.32263158 1.317226 0.6236268 1.0041006
## 0.005 0.37473684 1.300681 0.6277740 0.9979750
## 0.005 0.42684211 1.305347 0.6284459 1.0012184
## 0.005 0.47894737 1.324725 0.6101099 1.0228222
## 0.005 0.53105263 1.383722 0.5863219 1.0563114
## 0.005 0.58315789 1.500800 0.5649152 1.1044945
## 0.005 0.63526316 1.688534 0.5365980 1.1795175
## 0.005 0.68736842 1.874413 0.5176590 1.2476391
## 0.005 0.73947368 2.038881 0.5049591 1.3059105
## 0.005 0.79157895 2.179310 0.4962562 1.3545710
## 0.005 0.84368421 2.314388 0.4869293 1.3996855
## 0.005 0.89578947 2.403928 0.4803696 1.4337058
## 0.005 0.94789474 2.486192 0.4756790 1.4640236
## 0.005 1.00000000 2.554515 0.4722272 1.4881801
## 0.010 0.01000000 1.815880 0.4268923 1.4973775
## 0.010 0.06210526 1.484554 0.5961302 1.2273616
## 0.010 0.11421053 1.276949 0.6081311 1.0590029
## 0.010 0.16631579 1.226506 0.6136116 1.0096217
## 0.010 0.21842105 1.199834 0.6290966 0.9789000
## 0.010 0.27052632 1.210219 0.6332910 0.9763518
## 0.010 0.32263158 1.286948 0.6190626 1.0041673
## 0.010 0.37473684 1.328973 0.6182781 1.0134400
## 0.010 0.42684211 1.343121 0.6178307 1.0180046
## 0.010 0.47894737 1.325714 0.6212738 1.0180478
## 0.010 0.53105263 1.337502 0.6075076 1.0334120
## 0.010 0.58315789 1.365185 0.5921380 1.0519158
## 0.010 0.63526316 1.415386 0.5767205 1.0776398
## 0.010 0.68736842 1.513299 0.5566889 1.1220926
## 0.010 0.73947368 1.659256 0.5362644 1.1796739
## 0.010 0.79157895 1.791392 0.5237786 1.2292186
## 0.010 0.84368421 1.907535 0.5135694 1.2704568
## 0.010 0.89578947 2.006479 0.5031506 1.3036017
## 0.010 0.94789474 2.090785 0.4957541 1.3310655
## 0.010 1.00000000 2.150175 0.4908884 1.3539790
## 0.050 0.01000000 1.838591 0.4002276 1.5165287
## 0.050 0.06210526 1.614260 0.5683502 1.3305409
## 0.050 0.11421053 1.424119 0.6017456 1.1794132
## 0.050 0.16631579 1.293204 0.6078216 1.0732303
## 0.050 0.21842105 1.238394 0.6095817 1.0264938
## 0.050 0.27052632 1.220194 0.6157972 0.9999275
## 0.050 0.32263158 1.203893 0.6262489 0.9806030
## 0.050 0.37473684 1.195740 0.6333176 0.9716941
## 0.050 0.42684211 1.224497 0.6286574 0.9825978
## 0.050 0.47894737 1.281397 0.6156957 1.0092439
## 0.050 0.53105263 1.333665 0.6032951 1.0331924
## 0.050 0.58315789 1.370771 0.5968760 1.0513548
## 0.050 0.63526316 1.404729 0.5924819 1.0707358
## 0.050 0.68736842 1.444085 0.5835277 1.0933193
## 0.050 0.73947368 1.449316 0.5765945 1.1037012
## 0.050 0.79157895 1.459555 0.5706832 1.1131993
## 0.050 0.84368421 1.475600 0.5656583 1.1223688
## 0.050 0.89578947 1.474763 0.5620255 1.1213909
## 0.050 0.94789474 1.469699 0.5599514 1.1172719
## 0.050 1.00000000 1.473668 0.5583026 1.1229630
## 0.100 0.01000000 1.845186 0.3918316 1.5220872
## 0.100 0.06210526 1.653363 0.5558789 1.3635510
## 0.100 0.11421053 1.482152 0.5963832 1.2262791
## 0.100 0.16631579 1.351039 0.6045076 1.1195513
## 0.100 0.21842105 1.262597 0.6094722 1.0485408
## 0.100 0.27052632 1.230810 0.6107972 1.0177067
## 0.100 0.32263158 1.218386 0.6158150 0.9960746
## 0.100 0.37473684 1.206552 0.6241205 0.9813348
## 0.100 0.42684211 1.197271 0.6316033 0.9715892
## 0.100 0.47894737 1.208224 0.6314534 0.9759107
## 0.100 0.53105263 1.258877 0.6183305 1.0005760
## 0.100 0.58315789 1.312486 0.6054563 1.0260946
## 0.100 0.63526316 1.361901 0.5954298 1.0517861
## 0.100 0.68736842 1.402028 0.5885313 1.0767386
## 0.100 0.73947368 1.438469 0.5796934 1.0988467
## 0.100 0.79157895 1.475553 0.5709462 1.1182016
## 0.100 0.84368421 1.504668 0.5638137 1.1336099
## 0.100 0.89578947 1.519124 0.5585773 1.1434008
## 0.100 0.94789474 1.536827 0.5542069 1.1525856
## 0.100 1.00000000 1.531571 0.5526342 1.1528342
## 0.250 0.01000000 1.849836 0.3873639 1.5260358
## 0.250 0.06210526 1.682264 0.5467503 1.3876273
## 0.250 0.11421053 1.529939 0.5892613 1.2639234
## 0.250 0.16631579 1.404573 0.6006261 1.1628720
## 0.250 0.21842105 1.303139 0.6080765 1.0792046
## 0.250 0.27052632 1.244027 0.6092051 1.0347210
## 0.250 0.32263158 1.227354 0.6081337 1.0143919
## 0.250 0.37473684 1.220844 0.6137254 0.9962820
## 0.250 0.42684211 1.225623 0.6151381 0.9933042
## 0.250 0.47894737 1.223170 0.6202509 0.9911498
## 0.250 0.53105263 1.234414 0.6203622 1.0001427
## 0.250 0.58315789 1.272353 0.6108608 1.0200701
## 0.250 0.63526316 1.322729 0.6007288 1.0427398
## 0.250 0.68736842 1.367561 0.5934973 1.0636409
## 0.250 0.73947368 1.410923 0.5872054 1.0883719
## 0.250 0.79157895 1.459169 0.5789318 1.1155490
## 0.250 0.84368421 1.510674 0.5700149 1.1408541
## 0.250 0.89578947 1.561640 0.5619365 1.1640816
## 0.250 0.94789474 1.610619 0.5552433 1.1856907
## 0.250 1.00000000 1.661260 0.5491755 1.2069126
## 0.500 0.01000000 1.849795 0.3883011 1.5260028
## 0.500 0.06210526 1.684274 0.5475989 1.3889549
## 0.500 0.11421053 1.535326 0.5866074 1.2675716
## 0.500 0.16631579 1.407711 0.6005078 1.1643840
## 0.500 0.21842105 1.305347 0.6077125 1.0788035
## 0.500 0.27052632 1.243473 0.6090790 1.0323786
## 0.500 0.32263158 1.231404 0.6032952 1.0163096
## 0.500 0.37473684 1.243059 0.6008711 1.0126947
## 0.500 0.42684211 1.258589 0.6030290 1.0180820
## 0.500 0.47894737 1.279723 0.6031634 1.0336243
## 0.500 0.53105263 1.301485 0.6031041 1.0495563
## 0.500 0.58315789 1.335667 0.5990496 1.0683601
## 0.500 0.63526316 1.380988 0.5927207 1.0908532
## 0.500 0.68736842 1.433206 0.5872483 1.1158231
## 0.500 0.73947368 1.500263 0.5804188 1.1488722
## 0.500 0.79157895 1.571920 0.5730067 1.1846068
## 0.500 0.84368421 1.642447 0.5653335 1.2184644
## 0.500 0.89578947 1.702806 0.5583992 1.2466055
## 0.500 0.94789474 1.756548 0.5531248 1.2715247
## 0.500 1.00000000 1.807982 0.5489102 1.2946772
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2184211 and lambda
## = 0.005.
We plot the cross-validation RMSE profile for the various values of fraction and \(\lambda\) considered. We also show diagnostic plots of observations versus model predictions and model residuals versus model predictions for the training set. The model diagnostics look satisfactory.
# plot performance profile
plot(enetfit, main = "Performance profile: elastic net model")
# diagnostic plots
pred_train <- predict(enetfit)
resid_train <- yield_train - pred_train
plot_range <- extendrange(c(pred_train, yield_train))
ggplot() + geom_point(aes(x = pred_train, y = yield_train)) +
geom_abline(slope = 1, lty = 2) +
labs(title = "Observed response vs. predicted response: training set",
x = "Predicted response", y = "Observed response") +
coord_cartesian(xlim = plot_range, ylim = plot_range)
ggplot() + geom_point(aes(x = pred_train, y = resid_train)) +
geom_hline(yintercept = 0, lty = 2) +
labs(title = "Residuals vs. predicted response: training set",
x = "Predicted response", y = "Residual")
Next we evaluate model performance on the test (holdout) set. We observe that, as expected, the test set performance metrics are weaker compared to either the training set metrics or the re-sampled training set metrics.
pred_test <- predict(enetfit, cmp_test)
perf_train <- data.frame(obs = yield_train, pred = pred_train) %>% defaultSummary()
perf_test <- data.frame(obs = yield_test, pred = pred_test) %>% defaultSummary()
perf_df <- data.frame(elasticnet = c(perf_train[1:2], perf_test[1:2]))
rownames(perf_df) <- c("Training RMSE", "Training Rsq", "Test RMSE", "Test Rsq")
perf_df %>% kable(caption = "Comparison of performance metrics: elastic net model")
| | elasticnet |
|---|---|
| Training RMSE | 1.0166599 |
| Training Rsq | 0.7213022 |
| Test RMSE | 1.3554437 |
| Test Rsq | 0.4269860 |
First let’s review the model coefficients to see how many variables remain in the elastic net model. It turns out that the elastic net fitting procedure has set 36 coefficients to zero (in addition to the near-zero variance predictor removed in pre-processing), leaving just 20 predictors in the final tuned model. This illustrates that the elastic net has performed feature selection.
We use the varImp function to compute the variable importance scores, and plot the results below. We see that the top 3 scores are all manufacturing variables, while the top 10 scores are almost evenly split between manufacturing (6) and biological (4) variables. As we go down the list, say to the top 20 scores, there are more manufacturing (13) than biological (7) variables, but this is to be expected since the dataset has a greater number of manufacturing (45) than biological (12) variables to begin with. On balance, both biological and manufacturing variables are well represented at the top of the list, although the manufacturing variables appear to be more important.
# model coefficients
enet_coeff <- predict(enetfit$finalModel, s = (enetfit$finalModel)$tuneValue[[1]],
mode = "fraction", type = "coefficients")$coefficients
cat("Number of non-zero coefficients in elastic net model: ", sum(enet_coeff != 0))
## Number of non-zero coefficients in elastic net model: 20
# variable importance scores
(enetImp <- varImp(enetfit, scale = FALSE))
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## M32 0.4353
## M13 0.3912
## M17 0.3597
## B6 0.3085
## B3 0.2782
## M9 0.2737
## M36 0.2707
## M6 0.2620
## B12 0.2594
## B2 0.2230
## M31 0.1988
## M33 0.1879
## B11 0.1747
## B4 0.1684
## M30 0.1646
## M11 0.1585
## M12 0.1505
## B8 0.1500
## M29 0.1462
## M2 0.1451
# plot
plot(enetImp, top = 20, main = "Variable importance scores: elastic net model")
Finally we plot the coefficient paths for the elastic net fitting procedure, which are consistent with the list of variable importance scores above. The dotted vertical line marks the final tuned model, for which the fraction parameter is approximately 0.22.
plot(enetfit$finalModel, xvar = "fraction",
main = "Coefficient paths for elastic net model")
abline(v = (enetfit$finalModel)$tuneValue[[1]], lty = 2, col = 2)
Next we explore the relationships between the top 5 predictors and the response. The 5 predictors include three manufacturing variables and two biological variables: M32, M13, M17, B6, B3.
We generate a pairs plot to review the distributions and correlations of the selected variables.
# pairs plot
topvars <- c('M32', 'M13', 'M17', 'B6', 'B3')
topidx <- match(topvars, colnames(cmp))
top_df <- data.frame(yield, cmp[ , topidx])
ggpairs(top_df)
From the pairs plot it is evident that the top 5 predictors have the following relationships with the response:
- M32: positive relationship (0.61 correlation)
- M13: negative relationship (-0.50 correlation)
- M17: negative relationship (-0.43 correlation)
- B6: positive relationship (0.48 correlation)
- B3: positive relationship (0.45 correlation).

We take a closer look below at the scatter plots of the yield response versus each predictor. Note that these plots show the bivariate relationship between each predictor and the response without holding the other predictors constant, so they include the noise, correlation, and confounding effects of the other predictors. Despite this, they illustrate that the top 5 predictors have discernible relationships with the yield response.
# scatter plots
ggplot(top_df, aes(x = M32, y = yield)) + geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "Yield vs. M32")
ggplot(top_df, aes(x = M13, y = yield)) + geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "Yield vs. M13")
ggplot(top_df, aes(x = M17, y = yield)) + geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "Yield vs. M17")
ggplot(top_df, aes(x = B6, y = yield)) + geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "Yield vs. B6")
ggplot(top_df, aes(x = B3, y = yield)) + geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "Yield vs. B3")
Based on this information, further analysis could assess the impact on yield of adjusting these variables. As stated in the text, the biological predictors can’t be adjusted on their own, but they indicate the quality of the raw materials used; the manufacturing predictors, on the other hand, can be adjusted during the manufacturing process. With this in mind, we could investigate the costs and benefits of:
- increasing M32, decreasing M13, and decreasing M17 in the manufacturing process while holding all other variables constant
- sourcing raw materials with higher measurements of B6 and B3.

One way to get a back-of-the-envelope impact of these adjustments is to estimate the slope coefficients from a simple OLS regression of yield onto each predictor. This is a rough guess since, as indicated above, it ignores the effect of the other predictors and so may be misleading. However, throwing caution to the wind, we see that in the context of an OLS regression for each manufacturing predictor:

- a unit increase in M32 is associated with a 0.21% increase in the yield
- a unit decrease in M13 is associated with a 0.92% increase in the yield
- a unit decrease in M17 is associated with a 0.63% increase in the yield.

# simplistic linear models
lm1 <- lm(yield ~ top_df$M32)
lm2 <- lm(yield ~ top_df$M13)
lm3 <- lm(yield ~ top_df$M17)
coef_df <- data.frame(Manuf_Variable = c("M32", "M13", "M17"),
Linear_Coef = c(coef(lm1)[2], coef(lm2)[2], coef(lm3)[2]))
rownames(coef_df) <- NULL
kable(coef_df,
col.names = c("Manufacturing Predictor", "Slope Coefficient"),
digits = 3,
caption = "Coefficients from simple OLS regression onto yield")
| Manufacturing Predictor | Slope Coefficient |
|---|---|
| M32 | 0.208 |
| M13 | -0.916 |
| M17 | -0.630 |