Exercise 6.2

(a)

We load the permeability data, which includes two datasets relating to 165 cases:

  • fingerprints: the predictor data for 1107 variables; each variable is binary and takes on the value 0 or 1
  • permeability: the response data, which varies from 0.06 to 55.6.
data("permeability")
str(fingerprints)
##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
str(permeability)
##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"
cat("Fingerprints range of values: ", range(fingerprints))
## Fingerprints range of values:  0 1
summary(permeability)
##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60

(b)

The fingerprints matrix is sparse, as only 15% of the entries are non-zero. Using the nearZeroVar function, we see that:

  • 719 variables have near-zero variance, i.e., the ratio of the frequency of the most common value to that of the second most common value exceeds 95/5 (the function's default freqCut), which includes:
    • 38 variables that have no variance (i.e., they have the same value for all cases)
    • 681 variables that have minimal (non-zero) variance
  • 388 variables remain after filtering out the near-zero variance predictors.

The filtered dataset is less sparse than the raw dataset, but still only ~31% of its entries are non-zero. (A quick cross-check of these counts, using nearZeroVar's saveMetrics output, appears after the code below.)

cat("Proportion of non-zero entries in raw data: ", sum(fingerprints != 0) / nrow(fingerprints) / ncol(fingerprints))
## Proportion of non-zero entries in raw data:  0.1547672
# variables flagged as near-zero variance
nzv_idx <- nearZeroVar(fingerprints)
# variables having only 1 unique value: nearZeroVar always flags zero-variance
# columns, so shrinking uniqueCut below the smallest attainable percentUnique
# leaves only those columns in the result
zv_idx <- nearZeroVar(fingerprints, uniqueCut = 1.5 / nrow(fingerprints))

cat("Number of near-zero variance predictors: ", length(nzv_idx))
## Number of near-zero variance predictors:  719
cat("Number of zero variance predictors: ", length(zv_idx))
## Number of zero variance predictors:  38
cat("Number of remaining variables excluding nzv predictors: ", ncol(fingerprints) - length(nzv_idx))
## Number of remaining variables excluding nzv predictors:  388
# filter out near-zero variance predictors
finger <- fingerprints[ , -nzv_idx]

cat("Proportion of non-zero entries in filtered data: ", sum(finger != 0) / nrow(finger) / ncol(finger))
## Proportion of non-zero entries in filtered data:  0.309294
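
As a cross-check of the counts above, nearZeroVar can also return its per-variable diagnostics rather than column indices; a minimal sketch (not run above):

# per-variable metrics: freqRatio, percentUnique, and the zeroVar / nzv flags;
# summing the logical flags should reproduce the 38 zero-variance and
# 719 near-zero-variance counts reported above
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
colSums(nzv_metrics[, c("zeroVar", "nzv")])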

(c)

First we partition the data into training and test sets using the createDataPartition function. We choose a 75/25 split, and confirm that the distribution of the response is similar for the training and test sets, and that they match the distribution for permeability seen above.

# training/test split
set.seed(314)
train_idx <- createDataPartition(permeability, p = 0.75, list = FALSE)
finger_train <- finger[train_idx, ]
finger_test <- finger[-train_idx, ]
perm_train <- permeability[train_idx]
perm_test <- permeability[-train_idx]
summary(perm_train)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.06    1.55    4.91   12.09   14.98   51.41
summary(perm_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.320   1.508   4.857  12.691  16.454  55.600

Next we decide on pre-processing and tuning methodology, which we will specify within the train function. For pre-processing, we choose:

  • No Box-Cox transformation, since all predictor data is binary and skewness is not an issue
  • Center & scale all predictors; this is not likely to be material, since all predictor data is binary, but still we center & scale the data since we are using partial least squares regression.

For tuning, we need to specify the tuning parameter and the tuning methodology:

  • ncomp, the number of components, is the tuning parameter for partial least squares regression; we will tune the model over a range from 1 to 20 components
  • 10-fold cross-validation will be our re-sampling approach, which we apply to the training set; the tuning parameter will be chosen to minimize the RMSE metric.

From the model summary below, the optimal number of latent variables (components) is 2, for which the re-sampled results on the training set are:

  • RMSE = 11.3
  • \(R^2\) = 0.505

These results are somewhat disappointing, since the RMSE is almost as large as the mean value of the response, and the \(R^2\) indicates that the model explains only half of the total variability in the response.

# 10-fold cross validation
set.seed(1012)
ctrl <- trainControl(method = "cv", number = 10)
# fit pls model with tuning on 1 to 20 components
plsfit <- train(x = finger_train, y = perm_train, method = "pls", 
                preProcess = c("center", "scale"), 
                tuneLength = 20,
                trControl = ctrl)
plsfit
## Partial Least Squares 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 113, 112, 113, 113, 112, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.76768  0.3286721   9.865718
##    2     11.26782  0.5053179   8.084910
##    3     11.36466  0.5117740   8.697612
##    4     11.74135  0.4966289   8.955769
##    5     11.54523  0.5088994   8.459935
##    6     11.28930  0.5392750   8.615870
##    7     11.50133  0.5143293   8.656224
##    8     11.63637  0.5078302   8.744130
##    9     11.79615  0.5062645   8.831212
##   10     12.23511  0.4785289   9.067635
##   11     12.36024  0.4734019   9.156830
##   12     12.66178  0.4605127   9.480229
##   13     12.93557  0.4509476   9.703446
##   14     13.58501  0.4269536  10.131409
##   15     13.90140  0.4132301  10.316673
##   16     14.21406  0.4004785  10.495583
##   17     14.24828  0.4034536  10.412802
##   18     14.34146  0.4021287  10.212898
##   19     14.48917  0.3941000  10.228304
##   20     14.54602  0.3936051  10.223234
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
summary(plsfit)                
## Data:    X dimension: 125 388 
##  Y dimension: 125 1
## Fit method: oscorespls
## Number of components considered: 2
## TRAINING: % variance explained
##           1 comps  2 comps
## X           22.48    34.70
## .outcome    32.30    51.89
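
For reference, the re-sampled metrics quoted above for the selected model can be pulled directly from the fit object; a minimal sketch:

# re-sampled RMSE / Rsquared / MAE for the chosen number of components (ncomp = 2)
merge(plsfit$bestTune, plsfit$results)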

(d)

Applying the fitted PLS model to the test set, we see the performance metrics below:

  • RMSE = 12.8
  • \(R^2\) = 0.402

As expected, the performance metrics on the test set are worse than on the training set, since the model was tuned and fitted on the training data.

pred_test <- predict(plsfit, newdata = finger_test) 
data.frame(obs = perm_test, pred = pred_test) %>% defaultSummary()
##       RMSE   Rsquared        MAE 
## 12.8363398  0.4017059  8.2097155

(e)

Next we try fitting ridge regression and lasso models to the training data. As before, we use the same pre-processing (centering & scaling) and re-sampling method (10-fold cross-validation). The tuning parameters are \(\lambda\) for the ridge regression model and fraction for the lasso model.

From the summary of the ridge regression model, we see that the optimal value of \(\lambda\) is 0.17, for which re-sampled performance metrics on the training set are:

  • RMSE = 12.5
  • \(R^2\) = 0.494
# fit ridge regression model with tuning on lambda from 0.1 to 0.25
set.seed(1012)
ridgeGrid <- data.frame(lambda = seq(0.1, 0.25, length = 10))
ridgefit <- train(x = finger_train, y = perm_train, method = "ridge", 
                  preProcess = c("center", "scale"), 
                  tuneGrid = ridgeGrid, 
                  trControl = ctrl)
ridgefit
## Ridge Regression 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 113, 112, 113, 113, 112, ... 
## Resampling results across tuning parameters:
## 
##   lambda     RMSE      Rsquared   MAE     
##   0.1000000  12.70560  0.4758954  9.212872
##   0.1166667  12.64887  0.4807533  9.217095
##   0.1333333  12.58940  0.4863382  9.218976
##   0.1500000  12.56207  0.4899984  9.228963
##   0.1666667  12.53829  0.4938164  9.251851
##   0.1833333  12.54760  0.4959327  9.300625
##   0.2000000  12.54444  0.4999351  9.346377
##   0.2166667  12.54067  0.5026797  9.387724
##   0.2333333  12.56126  0.5048426  9.442846
##   0.2500000  12.58124  0.5072145  9.493727
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1666667.

From the summary of the lasso model, the optimal value of fraction is 0.2, for which the re-sampled performance metrics on the training set are:

  • RMSE = 10.7
  • \(R^2\) = 0.573
# fit lasso model with tuning on fraction from 0.05 to 1
set.seed(1012)
lassoGrid <- data.frame(lambda = 0, fraction = seq(0.05, 1, length = 20))
lassofit <- train(x = finger_train, y = perm_train, method = "enet", 
                  preProcess = c("center", "scale"), 
                  tuneGrid = lassoGrid, 
                  trControl = ctrl)
lassofit
## Elasticnet 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 113, 112, 113, 113, 112, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE     
##   0.05      11.55552  0.5796203  8.900508
##   0.10      10.85161  0.6112838  7.847782
##   0.15      10.76157  0.5967853  7.663987
##   0.20      10.72099  0.5733757  7.663274
##   0.25      10.76875  0.5472910  7.755845
##   0.30      10.91337  0.5259227  7.864847
##   0.35      11.11683  0.5077210  8.016408
##   0.40      11.26524  0.4958540  8.136810
##   0.45      11.38578  0.4842928  8.197103
##   0.50      11.49917  0.4746144  8.240137
##   0.55      11.58142  0.4699934  8.256802
##   0.60      11.60609  0.4689958  8.221204
##   0.65      11.66647  0.4660211  8.210972
##   0.70      11.71553  0.4648591  8.222448
##   0.75      11.80665  0.4622965  8.271713
##   0.80      11.89865  0.4591582  8.303971
##   0.85      11.99548  0.4552830  8.325700
##   0.90      12.07548  0.4521983  8.353757
##   0.95      12.13595  0.4508933  8.369571
##   1.00      12.22527  0.4486670  8.418347
## 
## Tuning parameter 'lambda' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.

Let’s find out how many predictors remain in the fitted ridge and lasso models. We do this by extracting the coefficients from the final fitted models using the predict.enet method. We see that the ridge regression model keeps all 388 predictors in the model, while the lasso model keeps only 49 predictors. This illustrates that the lasso model has performed feature selection in setting many coefficients to zero.

ridge_coeff <- predict(ridgefit$finalModel, s = 1, mode = "fraction", type = "coefficients")$coefficients
lasso_coeff <- predict(lassofit$finalModel, s = 0.2, mode = "fraction", type = "coefficients")$coefficients
cat("Number of non-zero coefficients in ridge model: ", sum(ridge_coeff != 0))
## Number of non-zero coefficients in ridge model:  388
cat("Number of non-zero coefficients in lasso model: ", sum(lasso_coeff != 0))
## Number of non-zero coefficients in lasso model:  49
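
As an aside, the predictors retained by the lasso can be listed by name from the coefficient vector extracted above; a minimal sketch:

# names of the 49 predictors with non-zero lasso coefficients
names(lasso_coeff)[lasso_coeff != 0]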

Finally we summarize the performance metrics on the training and test sets, for all the models considered. Note that the training RMSE and \(R^2\) measures below differ from the re-sampled training measures seen above, since these performance metrics are computed on the training set without re-sampling.

Overall we see that the ridge regression model performs best on the test set, with:

  • RMSE = 12.8
  • \(R^2\) = 0.427

However, performance is still disappointing in that the RMSE is roughly equivalent to the mean value of the response, and the model explains less than half of the variability in the response.

pls_train <- data.frame(obs = perm_train, pred = predict(plsfit)) %>% defaultSummary()
pls_test <- data.frame(obs = perm_test, pred = predict(plsfit, finger_test)) %>% defaultSummary()
ridge_train <- data.frame(obs = perm_train, pred = predict(ridgefit)) %>% defaultSummary()
ridge_test <- data.frame(obs = perm_test, pred = predict(ridgefit, finger_test)) %>% defaultSummary()
lasso_train <- data.frame(obs = perm_train, pred = predict(lassofit)) %>% defaultSummary()
lasso_test <- data.frame(obs = perm_test, pred = predict(lassofit, finger_test)) %>% defaultSummary()

perf_df <- data.frame(PLS = c(pls_train[1:2], pls_test[1:2]), 
                      Ridge = c(ridge_train[1:2], ridge_test[1:2]), 
                      Lasso = c(lasso_train[1:2], lasso_test[1:2]))
rownames(perf_df) <- c("Training RMSE", "Training Rsq", "Test RMSE", "Test Rsq")

perf_df %>% kable(caption = "Comparison of performance metrics")
Comparison of performance metrics

                      PLS        Ridge       Lasso
Training RMSE  10.5667565    6.1283975   8.0722802
Training Rsq    0.5189300    0.8431869   0.7295394
Test RMSE      12.8363398   12.7596971  12.8611534
Test Rsq        0.4017059    0.4272418   0.3864340

(f)

I would recommend further development work and research into model building. It is clear based on the RMSE and \(R^2\) metrics that none of the models considered (partial least squares, ridge, and lasso) does a good job of prediction for this dataset. For example, on the test (holdout) set, the RMSE is equivalent to the mean value of the response, and the \(R^2\) indicates that less than half of the response variation is explained by the model.

Although a cost-benefit analysis should be done to weigh the high cost / high certainty of laboratory experiments against the low cost / low certainty of model prediction, I suspect that laboratory experiments would win this comparison. Hence I would not recommend using any of these models to replace the permeability laboratory experiments.

Exercise 6.3

(a)

We load the ChemicalManufacturingProcess data, which includes 176 observations of 58 variables, all of which are numeric. The 176 observations relate to manufacturing runs with variations in the biological materials and manufacturing process. The variables include:

  • Response variable Yield, which is the product yield for the manufacturing run and which ranges from 35 to 46
  • 12 predictor variables labeled as BiologicalMaterial__, which relate to measurements of the raw materials
  • 45 predictor variables labeled as ManufacturingProcess__, which relate to measurements of the manufacturing process.

For ease of manipulation, we split the dataset into the response data yield and the predictor data cmp and relabel the biological variables as B__ and the manufacturing variables as M__. We see from the summary that many of the manufacturing process predictors have a handful of missing values, which we handle in the next part.

data("ChemicalManufacturingProcess")
str(ChemicalManufacturingProcess)
## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.6 14.6 14.6 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
# for ease of manipulation, split response and predictors & relabel columns
yield <- ChemicalManufacturingProcess[1]
cmp <- ChemicalManufacturingProcess[-1]
colnames(cmp) <- c(paste0("B", 1:12), paste0("M", 1:45)) 
summary(yield)
##      Yield      
##  Min.   :35.25  
##  1st Qu.:38.75  
##  Median :39.97  
##  Mean   :40.18  
##  3rd Qu.:41.48  
##  Max.   :46.34
summary(cmp)
##        B1              B2              B3              B4       
##  Min.   :4.580   Min.   :46.87   Min.   :56.97   Min.   : 9.38  
##  1st Qu.:5.978   1st Qu.:52.68   1st Qu.:64.98   1st Qu.:11.24  
##  Median :6.305   Median :55.09   Median :67.22   Median :12.10  
##  Mean   :6.411   Mean   :55.69   Mean   :67.70   Mean   :12.35  
##  3rd Qu.:6.870   3rd Qu.:58.74   3rd Qu.:70.43   3rd Qu.:13.22  
##  Max.   :8.810   Max.   :64.75   Max.   :78.25   Max.   :23.09  
##                                                                 
##        B5              B6              B7              B8       
##  Min.   :13.24   Min.   :40.60   Min.   :100.0   Min.   :15.88  
##  1st Qu.:17.23   1st Qu.:46.05   1st Qu.:100.0   1st Qu.:17.06  
##  Median :18.49   Median :48.46   Median :100.0   Median :17.51  
##  Mean   :18.60   Mean   :48.91   Mean   :100.0   Mean   :17.49  
##  3rd Qu.:19.90   3rd Qu.:51.34   3rd Qu.:100.0   3rd Qu.:17.88  
##  Max.   :24.85   Max.   :59.38   Max.   :100.8   Max.   :19.14  
##                                                                 
##        B9             B10             B11             B12       
##  Min.   :11.44   Min.   :1.770   Min.   :135.8   Min.   :18.35  
##  1st Qu.:12.60   1st Qu.:2.460   1st Qu.:143.8   1st Qu.:19.73  
##  Median :12.84   Median :2.710   Median :146.1   Median :20.12  
##  Mean   :12.85   Mean   :2.801   Mean   :147.0   Mean   :20.20  
##  3rd Qu.:13.13   3rd Qu.:2.990   3rd Qu.:149.6   3rd Qu.:20.75  
##  Max.   :14.08   Max.   :6.870   Max.   :158.7   Max.   :22.21  
##                                                                 
##        M1              M2              M3             M4       
##  Min.   : 0.00   Min.   : 0.00   Min.   :1.47   Min.   :911.0  
##  1st Qu.:10.80   1st Qu.:19.30   1st Qu.:1.53   1st Qu.:928.0  
##  Median :11.40   Median :21.00   Median :1.54   Median :934.0  
##  Mean   :11.21   Mean   :16.68   Mean   :1.54   Mean   :931.9  
##  3rd Qu.:12.15   3rd Qu.:21.50   3rd Qu.:1.55   3rd Qu.:936.0  
##  Max.   :14.10   Max.   :22.50   Max.   :1.60   Max.   :946.0  
##  NA's   :1       NA's   :3       NA's   :15     NA's   :1      
##        M5               M6              M7              M8       
##  Min.   : 923.0   Min.   :203.0   Min.   :177.0   Min.   :177.0  
##  1st Qu.: 986.8   1st Qu.:205.7   1st Qu.:177.0   1st Qu.:177.0  
##  Median : 999.2   Median :206.8   Median :177.0   Median :178.0  
##  Mean   :1001.7   Mean   :207.4   Mean   :177.5   Mean   :177.6  
##  3rd Qu.:1008.9   3rd Qu.:208.7   3rd Qu.:178.0   3rd Qu.:178.0  
##  Max.   :1175.3   Max.   :227.4   Max.   :178.0   Max.   :178.0  
##  NA's   :1        NA's   :2       NA's   :1       NA's   :1      
##        M9             M10              M11              M12        
##  Min.   :38.89   Min.   : 7.500   Min.   : 7.500   Min.   :   0.0  
##  1st Qu.:44.89   1st Qu.: 8.700   1st Qu.: 9.000   1st Qu.:   0.0  
##  Median :45.73   Median : 9.100   Median : 9.400   Median :   0.0  
##  Mean   :45.66   Mean   : 9.179   Mean   : 9.386   Mean   : 857.8  
##  3rd Qu.:46.52   3rd Qu.: 9.550   3rd Qu.: 9.900   3rd Qu.:   0.0  
##  Max.   :49.36   Max.   :11.600   Max.   :11.500   Max.   :4549.0  
##                  NA's   :9        NA's   :10       NA's   :1       
##       M13             M14            M15            M16      
##  Min.   :32.10   Min.   :4701   Min.   :5904   Min.   :   0  
##  1st Qu.:33.90   1st Qu.:4828   1st Qu.:6010   1st Qu.:4561  
##  Median :34.60   Median :4856   Median :6032   Median :4588  
##  Mean   :34.51   Mean   :4854   Mean   :6039   Mean   :4566  
##  3rd Qu.:35.20   3rd Qu.:4882   3rd Qu.:6061   3rd Qu.:4619  
##  Max.   :38.60   Max.   :5055   Max.   :6233   Max.   :4852  
##                  NA's   :1                                   
##       M17             M18            M19            M20      
##  Min.   :31.30   Min.   :   0   Min.   :5890   Min.   :   0  
##  1st Qu.:33.50   1st Qu.:4813   1st Qu.:6001   1st Qu.:4553  
##  Median :34.40   Median :4835   Median :6022   Median :4582  
##  Mean   :34.34   Mean   :4810   Mean   :6028   Mean   :4556  
##  3rd Qu.:35.10   3rd Qu.:4862   3rd Qu.:6050   3rd Qu.:4610  
##  Max.   :40.00   Max.   :4971   Max.   :6146   Max.   :4759  
##                                                              
##       M21               M22              M23             M24        
##  Min.   :-1.8000   Min.   : 0.000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:-0.6000   1st Qu.: 3.000   1st Qu.:2.000   1st Qu.: 4.000  
##  Median :-0.3000   Median : 5.000   Median :3.000   Median : 8.000  
##  Mean   :-0.1642   Mean   : 5.406   Mean   :3.017   Mean   : 8.834  
##  3rd Qu.: 0.0000   3rd Qu.: 8.000   3rd Qu.:4.000   3rd Qu.:14.000  
##  Max.   : 3.6000   Max.   :12.000   Max.   :6.000   Max.   :23.000  
##                    NA's   :1        NA's   :1       NA's   :1       
##       M25            M26            M27            M28        
##  Min.   :   0   Min.   :   0   Min.   :   0   Min.   : 0.000  
##  1st Qu.:4832   1st Qu.:6020   1st Qu.:4560   1st Qu.: 0.000  
##  Median :4855   Median :6047   Median :4587   Median :10.400  
##  Mean   :4828   Mean   :6016   Mean   :4563   Mean   : 6.592  
##  3rd Qu.:4877   3rd Qu.:6070   3rd Qu.:4609   3rd Qu.:10.750  
##  Max.   :4990   Max.   :6161   Max.   :4710   Max.   :11.500  
##  NA's   :5      NA's   :5      NA's   :5      NA's   :5       
##       M29             M30              M31             M32       
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.00   Min.   :143.0  
##  1st Qu.:19.70   1st Qu.: 8.800   1st Qu.:70.10   1st Qu.:155.0  
##  Median :19.90   Median : 9.100   Median :70.80   Median :158.0  
##  Mean   :20.01   Mean   : 9.161   Mean   :70.18   Mean   :158.5  
##  3rd Qu.:20.40   3rd Qu.: 9.700   3rd Qu.:71.40   3rd Qu.:162.0  
##  Max.   :22.00   Max.   :11.200   Max.   :72.50   Max.   :173.0  
##  NA's   :5       NA's   :5        NA's   :5                      
##       M33             M34             M35             M36         
##  Min.   :56.00   Min.   :2.300   Min.   :463.0   Min.   :0.01700  
##  1st Qu.:62.00   1st Qu.:2.500   1st Qu.:490.0   1st Qu.:0.01900  
##  Median :64.00   Median :2.500   Median :495.0   Median :0.02000  
##  Mean   :63.54   Mean   :2.494   Mean   :495.6   Mean   :0.01957  
##  3rd Qu.:65.00   3rd Qu.:2.500   3rd Qu.:501.5   3rd Qu.:0.02000  
##  Max.   :70.00   Max.   :2.600   Max.   :522.0   Max.   :0.02200  
##  NA's   :5       NA's   :5       NA's   :5       NA's   :5        
##       M37             M38             M39             M40         
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00000  
##  1st Qu.:0.700   1st Qu.:2.000   1st Qu.:7.100   1st Qu.:0.00000  
##  Median :1.000   Median :3.000   Median :7.200   Median :0.00000  
##  Mean   :1.014   Mean   :2.534   Mean   :6.851   Mean   :0.01771  
##  3rd Qu.:1.300   3rd Qu.:3.000   3rd Qu.:7.300   3rd Qu.:0.00000  
##  Max.   :2.300   Max.   :3.000   Max.   :7.500   Max.   :0.10000  
##                                                  NA's   :1        
##       M41               M42             M43               M44       
##  Min.   :0.00000   Min.   : 0.00   Min.   : 0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:11.40   1st Qu.: 0.6000   1st Qu.:1.800  
##  Median :0.00000   Median :11.60   Median : 0.8000   Median :1.900  
##  Mean   :0.02371   Mean   :11.21   Mean   : 0.9119   Mean   :1.805  
##  3rd Qu.:0.00000   3rd Qu.:11.70   3rd Qu.: 1.0250   3rd Qu.:1.900  
##  Max.   :0.20000   Max.   :12.10   Max.   :11.0000   Max.   :2.100  
##  NA's   :1                                                          
##       M45       
##  Min.   :0.000  
##  1st Qu.:2.100  
##  Median :2.200  
##  Mean   :2.138  
##  3rd Qu.:2.300  
##  Max.   :2.600  
## 
yield <- yield[[1]]     # numeric
cmp <- as.matrix(cmp)   # matrix

(b)

We see below that 28 predictors (all manufacturing process variables) have missing data, and the proportion of cases with missing data is well under 5% for all variables except the “03”, “10”, and “11” manufacturing predictors. Even for these last 3 variables, the proportion of cases with missing data is under 10%.

numb_na <- function(x) sum(is.na(x))
missing <- apply(cmp, MARGIN = 2, FUN = numb_na)
cat("Number of predictors with missing values: ", sum(missing > 0))
## Number of predictors with missing values:  28
tibble(Variable = names(missing), 
       Numb_NA = missing, 
       Frac_NA = missing / nrow(cmp)) %>% 
  filter(Numb_NA > 0) %>%
  kable(caption = "Predictors with missing values")
Predictors with missing values

Variable   Numb_NA    Frac_NA
M1               1  0.0056818
M2               3  0.0170455
M3              15  0.0852273
M4               1  0.0056818
M5               1  0.0056818
M6               2  0.0113636
M7               1  0.0056818
M8               1  0.0056818
M10              9  0.0511364
M11             10  0.0568182
M12              1  0.0056818
M14              1  0.0056818
M22              1  0.0056818
M23              1  0.0056818
M24              1  0.0056818
M25              5  0.0284091
M26              5  0.0284091
M27              5  0.0284091
M28              5  0.0284091
M29              5  0.0284091
M30              5  0.0284091
M31              5  0.0284091
M33              5  0.0284091
M34              5  0.0284091
M35              5  0.0284091
M36              5  0.0284091
M40              1  0.0056818
M41              1  0.0056818

Given that missing data are not a prevalent feature of this dataset, we will impute values for the missing data. We don’t expect the choice of imputation method to have a material impact or bias on the results given the low incidence of missing data. In this case, we choose the bagged trees method for imputation, which we execute on the dataset using the preProcess function. The preProcess function has the ability to impute missing values using either the k-nearest neighbors or bagged trees approach; however the k-nearest neighbors approach automatically centers and scales the entire dataset, which we prefer not to do until we consider pre-processing in general (in the next part).

impute <- preProcess(cmp, method = "bagImpute")
cmp <- predict(impute, cmp)
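
As a quick sanity check (a minimal sketch, assuming the imputation step above has been run), we confirm that no missing values remain:

# after bagged-tree imputation there should be no NA values left in cmp
stopifnot(sum(is.na(cmp)) == 0)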

(c)

As before, we use the createDataPartition function to partition the data into a 75/25 split of training and test sets. We confirm that the distribution of the response is similar for the training and test sets, and that they match the distribution of the response for the full dataset seen above.

# training/test split
set.seed(314)
train_idx <- createDataPartition(yield, p = 0.75, list = FALSE)
yield_train <- yield[train_idx]
yield_test <- yield[-train_idx]
cmp_train <- cmp[train_idx, ]
cmp_test <- cmp[-train_idx, ]

summary(yield_train)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.25   38.75   39.97   40.21   41.48   46.34
summary(yield_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.77   38.78   39.92   40.07   41.05   43.62

Next we decide on pre-processing and tuning methodology, which we will specify within the train function. For pre-processing, we choose:

  • Box-Cox transformation, since some predictors exhibit significant skewness; from the summary statistics below, we see that skewness of the predictors ranges from -13 to 9 (as noted further below, this transformation ultimately had to be dropped because it caused errors during tuning).
  • Centering & scaling, since we consider using modeling algorithms that require normalized predictors (e.g., elasticnet).
  • Removing near-zero variance predictors, since non-informative predictors will not help and may only hurt the analysis; here there is only one near-zero variance predictor.

In this exercise, we decide to fit an elastic net model. We specify the tuning parameters and the tuning methodology as follows:

  • fraction (the fraction of the full, ordinary-least-squares solution, which controls the lasso penalty) and \(\lambda\) (the weight of the ridge penalty) are the tuning parameters for the elastic net model; we tune the model over fraction from 0.01 to 1 and \(\lambda\) from 0 to 0.5. Note that fraction = 0 causes the fitting procedure to throw errors, and for \(\lambda\) above 0.5 the performance metrics deteriorate dramatically.
  • Repeated 10-fold cross-validation will be our re-sampling approach, which we apply to the training set; the tuning parameters will be chosen to minimize the RMSE metric. We use 5 repeats of the cross-validation because of the limited size of the training set (only 132 observations overall, and roughly 120 observations in each training fold of the cross-validation).

From the model summary below, the optimal tuning parameters are fraction ≈ 0.22 and \(\lambda\) = 0.005, for which the re-sampled results on the training set are:

  • RMSE = 1.19
  • \(R^2\) = 0.637

These results appear reasonable; to put the RMSE in context, the yield response ranges from 35 to 46 (a spread of roughly 11 from min to max) with a mean / median value of ~40. So on the training set, the model predicts response values to within roughly 10% of the total range of variation, and it explains roughly 64% of the total response variability. (A short sketch after the model output below extracts these re-sampled metrics directly from the fit object and relates the RMSE to the response range.) However, the real test is how the model performs on the test set, which we evaluate in the next part.

Note that in tuning the elastic net model, the Box-Cox transformation caused errors (the fit produced Inf values), so we have removed it from the pre-processing specification.

# check skewness of predictors
data.frame(predictor_skewness = apply(cmp, MARGIN = 2, FUN = skewness)) %>% summary()
##  predictor_skewness 
##  Min.   :-12.85882  
##  1st Qu.: -1.68180  
##  Median :  0.07891  
##  Mean   : -1.61681  
##  3rd Qu.:  0.37836  
##  Max.   :  9.05487
# check near-zero variance predictors
nzv_idx <- nearZeroVar(cmp)
cat("Number of near-zero variance predictors: ", length(nzv_idx))
## Number of near-zero variance predictors:  1
# 10-fold cross validation
set.seed(1012)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# tuning grid with lambda & fraction
enetGrid <- expand.grid(lambda = c(0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5), 
                        fraction = seq(0.01, 1, length = 20))
# fit elastic net model with tuning on lambda & fraction
enetfit <- train(x = cmp_train, y = yield_train, method = "enet", 
                 preProcess = c("center", "scale", "nzv"), 
                 # box-cox causes Inf values <<< DON'T USE 
                 #preProcess = c("BoxCox", "center", "scale", "nzv"), 
                 tuneGrid = enetGrid, 
                 trControl = ctrl)
enetfit
## Elasticnet 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction    RMSE      Rsquared   MAE      
##   0.000   0.01000000  1.736467  0.5040668  1.4312419
##   0.000   0.06210526  1.258565  0.6076427  1.0395236
##   0.000   0.11421053  1.194780  0.6371438  0.9741011
##   0.000   0.16631579  1.266905  0.6351839  0.9862056
##   0.000   0.21842105  1.272135  0.6383147  0.9862901
##   0.000   0.27052632  1.427319  0.6011421  1.0501023
##   0.000   0.32263158  1.608945  0.5713870  1.1246547
##   0.000   0.37473684  1.745380  0.5511252  1.1884449
##   0.000   0.42684211  2.015506  0.5107628  1.2866805
##   0.000   0.47894737  2.244690  0.4866337  1.3771573
##   0.000   0.53105263  2.503468  0.4707712  1.4698052
##   0.000   0.58315789  2.731987  0.4603900  1.5537597
##   0.000   0.63526316  2.926356  0.4544760  1.6213958
##   0.000   0.68736842  3.128947  0.4496168  1.6788984
##   0.000   0.73947368  3.348314  0.4449675  1.7549604
##   0.000   0.79157895  3.546628  0.4391620  1.8216199
##   0.000   0.84368421  3.735245  0.4347966  1.8834692
##   0.000   0.89578947  3.891915  0.4328652  1.9339763
##   0.000   0.94789474  4.047174  0.4304942  1.9884199
##   0.000   1.00000000  4.183803  0.4295225  2.0389320
##   0.001   0.01000000  1.779540  0.4678851  1.4670757
##   0.001   0.06210526  1.328511  0.6065805  1.0989256
##   0.001   0.11421053  1.224201  0.6159998  1.0053929
##   0.001   0.16631579  1.200389  0.6359026  0.9716690
##   0.001   0.21842105  1.292994  0.6224379  0.9988664
##   0.001   0.27052632  1.271171  0.6347491  0.9830247
##   0.001   0.32263158  1.299487  0.6310666  0.9913723
##   0.001   0.37473684  1.383132  0.6087752  1.0311777
##   0.001   0.42684211  1.470427  0.5838725  1.0798286
##   0.001   0.47894737  1.700263  0.5400556  1.1681947
##   0.001   0.53105263  1.912758  0.5167119  1.2506599
##   0.001   0.58315789  2.101680  0.4993265  1.3213567
##   0.001   0.63526316  2.281688  0.4865754  1.3865954
##   0.001   0.68736842  2.474250  0.4770858  1.4533066
##   0.001   0.73947368  2.655756  0.4700678  1.5181198
##   0.001   0.79157895  2.840843  0.4621230  1.5807925
##   0.001   0.84368421  3.001003  0.4564578  1.6301575
##   0.001   0.89578947  3.161216  0.4518966  1.6717954
##   0.001   0.94789474  3.327116  0.4481326  1.7305826
##   0.001   1.00000000  3.447791  0.4451791  1.7792119
##   0.005   0.01000000  1.804707  0.4396747  1.4879960
##   0.005   0.06210526  1.427757  0.6017997  1.1815678
##   0.005   0.11421053  1.247714  0.6084465  1.0359356
##   0.005   0.16631579  1.209921  0.6229697  0.9909711
##   0.005   0.21842105  1.192347  0.6371211  0.9696260
##   0.005   0.27052632  1.281703  0.6208076  1.0003583
##   0.005   0.32263158  1.317226  0.6236268  1.0041006
##   0.005   0.37473684  1.300681  0.6277740  0.9979750
##   0.005   0.42684211  1.305347  0.6284459  1.0012184
##   0.005   0.47894737  1.324725  0.6101099  1.0228222
##   0.005   0.53105263  1.383722  0.5863219  1.0563114
##   0.005   0.58315789  1.500800  0.5649152  1.1044945
##   0.005   0.63526316  1.688534  0.5365980  1.1795175
##   0.005   0.68736842  1.874413  0.5176590  1.2476391
##   0.005   0.73947368  2.038881  0.5049591  1.3059105
##   0.005   0.79157895  2.179310  0.4962562  1.3545710
##   0.005   0.84368421  2.314388  0.4869293  1.3996855
##   0.005   0.89578947  2.403928  0.4803696  1.4337058
##   0.005   0.94789474  2.486192  0.4756790  1.4640236
##   0.005   1.00000000  2.554515  0.4722272  1.4881801
##   0.010   0.01000000  1.815880  0.4268923  1.4973775
##   0.010   0.06210526  1.484554  0.5961302  1.2273616
##   0.010   0.11421053  1.276949  0.6081311  1.0590029
##   0.010   0.16631579  1.226506  0.6136116  1.0096217
##   0.010   0.21842105  1.199834  0.6290966  0.9789000
##   0.010   0.27052632  1.210219  0.6332910  0.9763518
##   0.010   0.32263158  1.286948  0.6190626  1.0041673
##   0.010   0.37473684  1.328973  0.6182781  1.0134400
##   0.010   0.42684211  1.343121  0.6178307  1.0180046
##   0.010   0.47894737  1.325714  0.6212738  1.0180478
##   0.010   0.53105263  1.337502  0.6075076  1.0334120
##   0.010   0.58315789  1.365185  0.5921380  1.0519158
##   0.010   0.63526316  1.415386  0.5767205  1.0776398
##   0.010   0.68736842  1.513299  0.5566889  1.1220926
##   0.010   0.73947368  1.659256  0.5362644  1.1796739
##   0.010   0.79157895  1.791392  0.5237786  1.2292186
##   0.010   0.84368421  1.907535  0.5135694  1.2704568
##   0.010   0.89578947  2.006479  0.5031506  1.3036017
##   0.010   0.94789474  2.090785  0.4957541  1.3310655
##   0.010   1.00000000  2.150175  0.4908884  1.3539790
##   0.050   0.01000000  1.838591  0.4002276  1.5165287
##   0.050   0.06210526  1.614260  0.5683502  1.3305409
##   0.050   0.11421053  1.424119  0.6017456  1.1794132
##   0.050   0.16631579  1.293204  0.6078216  1.0732303
##   0.050   0.21842105  1.238394  0.6095817  1.0264938
##   0.050   0.27052632  1.220194  0.6157972  0.9999275
##   0.050   0.32263158  1.203893  0.6262489  0.9806030
##   0.050   0.37473684  1.195740  0.6333176  0.9716941
##   0.050   0.42684211  1.224497  0.6286574  0.9825978
##   0.050   0.47894737  1.281397  0.6156957  1.0092439
##   0.050   0.53105263  1.333665  0.6032951  1.0331924
##   0.050   0.58315789  1.370771  0.5968760  1.0513548
##   0.050   0.63526316  1.404729  0.5924819  1.0707358
##   0.050   0.68736842  1.444085  0.5835277  1.0933193
##   0.050   0.73947368  1.449316  0.5765945  1.1037012
##   0.050   0.79157895  1.459555  0.5706832  1.1131993
##   0.050   0.84368421  1.475600  0.5656583  1.1223688
##   0.050   0.89578947  1.474763  0.5620255  1.1213909
##   0.050   0.94789474  1.469699  0.5599514  1.1172719
##   0.050   1.00000000  1.473668  0.5583026  1.1229630
##   0.100   0.01000000  1.845186  0.3918316  1.5220872
##   0.100   0.06210526  1.653363  0.5558789  1.3635510
##   0.100   0.11421053  1.482152  0.5963832  1.2262791
##   0.100   0.16631579  1.351039  0.6045076  1.1195513
##   0.100   0.21842105  1.262597  0.6094722  1.0485408
##   0.100   0.27052632  1.230810  0.6107972  1.0177067
##   0.100   0.32263158  1.218386  0.6158150  0.9960746
##   0.100   0.37473684  1.206552  0.6241205  0.9813348
##   0.100   0.42684211  1.197271  0.6316033  0.9715892
##   0.100   0.47894737  1.208224  0.6314534  0.9759107
##   0.100   0.53105263  1.258877  0.6183305  1.0005760
##   0.100   0.58315789  1.312486  0.6054563  1.0260946
##   0.100   0.63526316  1.361901  0.5954298  1.0517861
##   0.100   0.68736842  1.402028  0.5885313  1.0767386
##   0.100   0.73947368  1.438469  0.5796934  1.0988467
##   0.100   0.79157895  1.475553  0.5709462  1.1182016
##   0.100   0.84368421  1.504668  0.5638137  1.1336099
##   0.100   0.89578947  1.519124  0.5585773  1.1434008
##   0.100   0.94789474  1.536827  0.5542069  1.1525856
##   0.100   1.00000000  1.531571  0.5526342  1.1528342
##   0.250   0.01000000  1.849836  0.3873639  1.5260358
##   0.250   0.06210526  1.682264  0.5467503  1.3876273
##   0.250   0.11421053  1.529939  0.5892613  1.2639234
##   0.250   0.16631579  1.404573  0.6006261  1.1628720
##   0.250   0.21842105  1.303139  0.6080765  1.0792046
##   0.250   0.27052632  1.244027  0.6092051  1.0347210
##   0.250   0.32263158  1.227354  0.6081337  1.0143919
##   0.250   0.37473684  1.220844  0.6137254  0.9962820
##   0.250   0.42684211  1.225623  0.6151381  0.9933042
##   0.250   0.47894737  1.223170  0.6202509  0.9911498
##   0.250   0.53105263  1.234414  0.6203622  1.0001427
##   0.250   0.58315789  1.272353  0.6108608  1.0200701
##   0.250   0.63526316  1.322729  0.6007288  1.0427398
##   0.250   0.68736842  1.367561  0.5934973  1.0636409
##   0.250   0.73947368  1.410923  0.5872054  1.0883719
##   0.250   0.79157895  1.459169  0.5789318  1.1155490
##   0.250   0.84368421  1.510674  0.5700149  1.1408541
##   0.250   0.89578947  1.561640  0.5619365  1.1640816
##   0.250   0.94789474  1.610619  0.5552433  1.1856907
##   0.250   1.00000000  1.661260  0.5491755  1.2069126
##   0.500   0.01000000  1.849795  0.3883011  1.5260028
##   0.500   0.06210526  1.684274  0.5475989  1.3889549
##   0.500   0.11421053  1.535326  0.5866074  1.2675716
##   0.500   0.16631579  1.407711  0.6005078  1.1643840
##   0.500   0.21842105  1.305347  0.6077125  1.0788035
##   0.500   0.27052632  1.243473  0.6090790  1.0323786
##   0.500   0.32263158  1.231404  0.6032952  1.0163096
##   0.500   0.37473684  1.243059  0.6008711  1.0126947
##   0.500   0.42684211  1.258589  0.6030290  1.0180820
##   0.500   0.47894737  1.279723  0.6031634  1.0336243
##   0.500   0.53105263  1.301485  0.6031041  1.0495563
##   0.500   0.58315789  1.335667  0.5990496  1.0683601
##   0.500   0.63526316  1.380988  0.5927207  1.0908532
##   0.500   0.68736842  1.433206  0.5872483  1.1158231
##   0.500   0.73947368  1.500263  0.5804188  1.1488722
##   0.500   0.79157895  1.571920  0.5730067  1.1846068
##   0.500   0.84368421  1.642447  0.5653335  1.2184644
##   0.500   0.89578947  1.702806  0.5583992  1.2466055
##   0.500   0.94789474  1.756548  0.5531248  1.2715247
##   0.500   1.00000000  1.807982  0.5489102  1.2946772
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2184211 and lambda
##  = 0.005.
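
As referenced above, the re-sampled metrics at the selected tuning parameters can be pulled directly from the fit object; a minimal sketch that also relates the RMSE to the spread of the response:

# re-sampled RMSE / Rsquared / MAE at the chosen (fraction, lambda) values
best <- merge(enetfit$bestTune, enetfit$results)
best
# RMSE as a fraction of the min-to-max spread of the training response
# (roughly 0.11, i.e. about 10% of the response range)
best$RMSE / diff(range(yield_train))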

We plot the cross-validation RMSE profile for the various values of fraction and \(\lambda\) considered. We also show diagnostic plots of observations versus model predictions and model residuals versus model predictions for the training set. The model diagnostics look satisfactory.

# plot performance profile
plot(enetfit, main = "Performance profile: elastic net model")

# diagnostic plots
pred_train <- predict(enetfit) 
resid_train <- yield_train - pred_train
plot_range <- extendrange(c(pred_train, yield_train))
ggplot() + geom_point(aes(x = pred_train, y = yield_train)) + 
  geom_abline(slope = 1, lty = 2) + 
  labs(title = "Observed response vs. predicted response: training set", 
       x = "Predicted response", y = "Observed response") + 
  coord_cartesian(xlim = plot_range, ylim = plot_range)

ggplot() + geom_point(aes(x = pred_train, y = resid_train)) + 
  geom_hline(yintercept = 0, lty = 2) + 
  labs(title = "Residuals vs. predicted response: training set", 
       x = "Predicted response", y = "Residual")

(d)

Next we evaluate model performance on the test (holdout) set. As expected, the test set metrics below are weaker than both the training set metrics and the re-sampled training set metrics:

  • RMSE = 1.36
  • \(R^2\) = 0.427
pred_test <- predict(enetfit, cmp_test)
perf_train <- data.frame(obs = yield_train, pred = pred_train) %>% defaultSummary()
perf_test <- data.frame(obs = yield_test, pred = pred_test) %>% defaultSummary()
perf_df <- data.frame(elasticnet = c(perf_train[1:2], perf_test[1:2]))
rownames(perf_df) <- c("Training RMSE", "Training Rsq", "Test RMSE", "Test Rsq")
perf_df %>% kable(caption = "Comparison of performance metrics: elastic net model")
Comparison of performance metrics: elastic net model

                 elasticnet
Training RMSE     1.0166599
Training Rsq      0.7213022
Test RMSE         1.3554437
Test Rsq          0.4269860

(e)

First let’s review the model coefficients to see how many variables remain in the elastic net model. It turns out that the elastic net fitting procedure has set 36 coefficients to zero (in addition to the near-zero variance predictor removed in pre-processing), leaving just 20 predictors in the final tuned model.

We use the varImp function to compute the variable importance scores and plot the results below. We see that the top 3 scores are all manufacturing variables, while the top 10 scores are almost evenly split between manufacturing (6) and biological (4) variables. As we go down the list, say to the top 20 scores, there are more manufacturing (13) than biological (7) variables, but this is to be expected since the dataset has a greater number of manufacturing (45) than biological (12) variables to begin with. On balance, both biological and manufacturing variables are well represented at the top of the list, although the manufacturing variables appear to be more important.

# model coefficients
enet_coeff <- predict(enetfit$finalModel, s = (enetfit$finalModel)$tuneValue[[1]], 
                       mode = "fraction", type = "coefficients")$coefficients
cat("Number of non-zero coefficients in elastic net model: ", sum(enet_coeff != 0))
## Number of non-zero coefficients in elastic net model:  20
# variable importance scores
(enetImp <- varImp(enetfit, scale = FALSE))
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##     Overall
## M32  0.4353
## M13  0.3912
## M17  0.3597
## B6   0.3085
## B3   0.2782
## M9   0.2737
## M36  0.2707
## M6   0.2620
## B12  0.2594
## B2   0.2230
## M31  0.1988
## M33  0.1879
## B11  0.1747
## B4   0.1684
## M30  0.1646
## M11  0.1585
## M12  0.1505
## B8   0.1500
## M29  0.1462
## M2   0.1451
# plot
plot(enetImp, top = 20, main = "Variable importance scores: elastic net model")

Finally we plot the coefficient paths for the elastic net fitting procedure, which are consistent with the list of variable importance scores above. The dotted vertical line marks the final tuned model, for which the fraction parameter is approximately 0.22.

plot(enetfit$finalModel, xvar = "fraction", 
     main = "Coefficient paths for elastic net model")
abline(v = (enetfit$finalModel)$tuneValue[[1]], lty = 2, col = 2)

(f)

Next we explore the relationships between the top 5 predictors and the response. The 5 predictors include three manufacturing variables and two biological variables: M32, M13, M17, B6, B3.

We generate a pairs plot to review the distributions and correlations of the selected variables.

# pairs plot
topvars <- c('M32', 'M13', 'M17', 'B6', 'B3')
topidx <- match(topvars, colnames(cmp))
top_df <- data.frame(yield, cmp[ , topidx])
ggpairs(top_df)

From the pairs plot it is evident that the top 5 predictors have the following relationships with the response:

  • M32: positive relationship (0.61 correlation)
  • M13: negative relationship (-0.50 correlation)
  • M17: negative relationship (-0.43 correlation)
  • B6: positive relationship (0.48 correlation)
  • B3: positive relationship (0.45 correlation).

We take a closer look below at the scatter plots of the yield response versus each predictor. Note that these plots show the bivariate relationship between each predictor and the response, including the effect of all other predictors; i.e., they do not depict the relationship while holding the other predictors constant. As such, the plots include the noise, correlation, and confounding effects of the other predictors (a joint fit sketched after the scatter plots below partially adjusts for this). Despite this, they illustrate that the top 5 predictors have discernible relationships with the yield response.

# scatter plots
ggplot(top_df, aes(x = M32, y = yield)) + geom_point() + 
  geom_smooth(method = 'lm', se = FALSE) + 
  labs(title = "Yield vs. M32")

ggplot(top_df, aes(x = M13, y = yield)) + geom_point() + 
  geom_smooth(method = 'lm', se = FALSE) + 
  labs(title = "Yield vs. M13")

ggplot(top_df, aes(x = M17, y = yield)) + geom_point() + 
  geom_smooth(method = 'lm', se = FALSE) + 
  labs(title = "Yield vs. M17")

ggplot(top_df, aes(x = B6, y = yield)) + geom_point() + 
  geom_smooth(method = 'lm', se = FALSE) + 
  labs(title = "Yield vs. B6")

ggplot(top_df, aes(x = B3, y = yield)) + geom_point() + 
  geom_smooth(method = 'lm', se = FALSE) + 
  labs(title = "Yield vs. B3")

Based on this information, analysis could be done to assess the impact on yield by adjusting these variables. As stated in the text, the biological predictors can’t be adjusted on their own, but they indicate the quality of the raw materials used; on the other hand, the manufacturing predictors can be adjusted during the manufacturing process. Given this information, we could investigate the costs and benefits of:

  • Increasing M32, decreasing M13, and decreasing M17 in the manufacturing process while holding all other variables constant
  • Potentially changing the raw materials used, in order to increase B6 and B3.

One way to get a back-of-the-envelope estimate of the impact of these adjustments is to look at the slope coefficients from a simple OLS regression of yield onto each predictor. This is a rough guess since, as indicated above, it ignores the effect of the other predictors and so may be misleading. However, throwing caution to the wind, we see that in the context of a single-predictor OLS regression for each manufacturing variable:

  • a 1-unit increase in M32 is associated with an increase of roughly 0.21 percentage points in the yield
  • a 1-unit decrease in M13 is associated with an increase of roughly 0.92 percentage points in the yield
  • a 1-unit decrease in M17 is associated with an increase of roughly 0.63 percentage points in the yield.
# simplistic linear models
lm1 <- lm(yield ~ top_df$M32)
lm2 <- lm(yield ~ top_df$M13)
lm3 <- lm(yield ~ top_df$M17)
coef_df <- data.frame(Manuf_Variable = c("M32", "M13", "M17"),  
                      Linear_Coef = c(coef(lm1)[2], coef(lm2)[2], coef(lm3)[2])) 
rownames(coef_df) <- NULL
kable(coef_df, 
      col.names = c("Manufacturing Predictor", "Slope Coefficient"), 
      digits = 3, 
      caption = "Coefficients from simple OLS regression onto yield")
Coefficients from simple OLS regression onto yield

Manufacturing Predictor   Slope Coefficient
M32                                   0.208
M13                                  -0.916
M17                                  -0.630