6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints
contains the 1,107 binary molecular predictors for the 165 compounds, while permeability
contains permeability response.
dim(fingerprints)[1] == dim(permeability)[1]
## [1] TRUE
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar
function from the caret
package. How many predictors are left for modeling?
nzv = caret::nearZeroVar(fingerprints)
noNZV = fingerprints[,-nzv]
dim(noNZV)
## [1] 165 388
Although this question is phrased as filtering out predictors with low frequencies, the goal seems to be also to filter out those with high frequencies. For example some predictors are present in every compound, as thus are discarded by this method.
388 predictors made the cut for having fewer than 19 zeroes for every 1 and fewer than 19 ones for every 0. Note that any predictor that failed that test also failed the other default NZV test, which is having at least 10% as many unique values as total values, since there are only 2 unique values for these binary predictors, and there are 165 total values per predictor.
paste("The fewest 1's in a feature in the filtered data.frame is", min(colSums(noNZV)),
"and the fewest 0's is", dim(noNZV)[1] - max(colSums(noNZV)))
## [1] "The fewest 1's in a feature in the filtered data.frame is 9 and the fewest 0's is 9"
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS
model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\) ?
set.seed(624)
trains = sample(1:165, 130) # randomly choose 130 compounds and test with the other 35
tests = setdiff(1:165, trains)
trainDF = noNZV[trains,]
testDF = noNZV[tests,]
Since the data values are all binary, I don’t think they need scaling or centering. If the goal were to do linear regression on them, the next step might be to eliminate highly correlated predictors, but we are asked instead to do PLS on them. Nevertheless, let’s see how correlated the predictors are, and at least remove any that are completely not needed.
nfeats = dim(trainDF)[2]
cormat = matrix(0, nfeats, nfeats)
for (i in 1:(nfeats-1)) {
for (j in ((i+1):nfeats)) {
cormat[i,j] = sum(noNZV[,i] == noNZV[,j]) / 165 * 2 - 1
}
}
sum(cormat) * 2 / nfeats / (nfeats-1) # mean correlation of any two different predictors in noNZV
## [1] 0.1546361
That just built an upper triangular matrix that measured the fraction of the 165 compounds whose presence/absence agreed in each pair of predictors, and then multiplied by 2 and subtracted 1, in order to make it more familiar in terms of normal correlation measures (-1 for complete disagreement, +1 for complete agreement). The mean for the matrix was 0.155 or so. This is because the mean value in the data is only 0.31, i.e. there are just over twice as many zeroes as ones. As such, a typical pair of feature columns matches 0’s in more than the 165/4 indexes that would be expected for a random and balanced distribution of 1’s and 0’s.
hist(cormat[upper.tri(cormat,F)], main = "Pairwise Correlations of Predictors", xlab = "")
ones = which(cormat == 1, arr.ind = T)
discard = setdiff(ones[,1], ones[,2]) # this is all the features that are duped, but keeps one of each dupeset
length(discard)
## [1] 66
66 features can be removed as duplicates, while keeping one that dupes them. Remove them in the test set as well, since the training set has no useful info for them.
trainDF = trainDF[,-discard]
testDF = testDF[,-discard]
dim(trainDF)
## [1] 130 322
Model with PLS, starting by tuning for the optimal number of components.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(624)
ctrl = trainControl(method = "cv", number = 5)
plsTune = train(trainDF, permeability[trains], method = "pls", tuneLength = 20, trControl = ctrl)
plot(plsTune, main = "5-fold Cross-Validated RMSE for Training")
It’s a little choppy because each fold only has 130/5 = 26 compounds in it, but it seems reasonable to use the 8 components that produced the CV-averaged RMSE of just over 11. If anything, you could probably use fewer.
noquote(paste("R-sq for 8 compontents: ", plsTune$results$Rsquared[8]))
## [1] R-sq for 8 compontents: 0.553611369833046
That’s the average for the 5 validation folds, but let’s see how much they vary.
plsTune$resample
## RMSE Rsquared MAE Resample
## 1 11.858163 0.5767516 8.072946 Fold1
## 2 7.019797 0.7855236 5.768535 Fold5
## 3 11.301891 0.6001998 8.436503 Fold2
## 4 13.228300 0.3434449 10.599440 Fold4
## 5 11.805442 0.4621369 8.539903 Fold3
There’s a lot of variance in the \(R^2\) there, with only 26 samples to score for each fold. It’s really anyone’s guess how this will do on the test set.
(d) Predict the response for the test set. What is the test set estimate of \(R^2\) ?
Let’s see how an 8-component model looks.
Tack on the response variable for modeling with formulas.
trainDF = as.data.frame(trainDF)
trainDF$perm = permeability[trains]
testDF = as.data.frame(testDF)
testDF$perm = permeability[tests]
Train on all 130 samples, without folds.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
set.seed(624)
plsFit = plsr(perm ~ . , data = trainDF, ncomp = 8)
plot(plsFit$residuals, main = "PLS Residuals from 8-Component Model")
set.seed(624)
PLSpreds = predict(plsFit, testDF, ncomp = 8)
plsResults = data.frame(obs = testDF$perm, pred = PLSpreds[,,])
defaultSummary(plsResults)
## RMSE Rsquared MAE
## 12.1644683 0.3167209 9.4239309
That \(R^2\) of .317 is even lower than the worst of the 5 cross-validation folds. Splitting 130 training samples into 5 groups and averaging the 5 models’ results on the 5 groups of 26 validation samples, 4 models trained on every sample, so there was a range of worse and better results. The one model trained on all 130 didn’t seem to get as good at look at whatever patterns lay in the 35 test samples.
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
Try Robust Linear Regression, after reducing dimensions via PCA. Use the same splits as before.
set.seed(624)
rlmPCA = train(perm ~ . , data = trainDF, method = 'rlm',
metric = "RMSE", preProcess = "pca", trControl = ctrl)
rlmPCA
## Robust Linear Model
##
## 130 samples
## 322 predictors
##
## Pre-processing: principal component signal extraction (322), centered
## (322), scaled (322)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 104, 104, 104, 105, 103
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 17.91105 0.4366678 14.303706
## FALSE psi.hampel 17.79546 0.4529139 14.261505
## FALSE psi.bisquare 18.18375 0.4239782 14.440848
## TRUE psi.huber 12.41827 0.4586823 8.789986
## TRUE psi.hampel 13.01753 0.4309632 9.212618
## TRUE psi.bisquare 13.64374 0.4544003 9.034237
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.
RLMpreds = predict(rlmPCA, testDF)
rlmResults = data.frame(obs = testDF$perm, pred = RLMpreds)
defaultSummary(rlmResults)
## RMSE Rsquared MAE
## 11.0335064 0.3878919 7.4777056
The RMSE is actually lower than in the training batch (11.03 vs. 12.42), but the \(R^2\) dropped from .459 to .388, so there must be less variance in the test responses than in the training ones.
set.seed(624)
paste("Variance of 35 test responses: ", round(var(testDF$perm)))
## [1] "Variance of 35 test responses: 200"
paste("Variance of randomly chosen 35 training responses: ", round(var(sample(trainDF$perm, 35))))
## [1] "Variance of randomly chosen 35 training responses: 285"
Still, this is better than the PLS model did on the test batch.
Let’s see how a Lasso model does, with PCA pre-processing, on the same splits.
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
set.seed(624)
lassoGrid = expand.grid(.lambda = c(0), .fraction = seq(.1, 1, length = 10))
lassoTune = train(perm ~., data = trainDF, method = 'enet',
tuneGrid = lassoGrid, trControl = ctrl,
preProcess = c("scale", "pca"))
lassoTune
## Elasticnet
##
## 130 samples
## 322 predictors
##
## Pre-processing: scaled (322), principal component signal extraction
## (322), centered (322)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 104, 104, 104, 105, 103
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.1 14.42373 0.3255332 11.575030
## 0.2 13.50423 0.3742386 10.772172
## 0.3 12.91839 0.3949580 10.131515
## 0.4 12.50504 0.4097887 9.557359
## 0.5 12.18728 0.4267397 9.204028
## 0.6 11.96239 0.4407688 8.985098
## 0.7 11.83005 0.4516625 8.901894
## 0.8 11.79533 0.4583305 8.979240
## 0.9 11.82826 0.4625073 9.080603
## 1.0 11.94604 0.4626869 9.208198
##
## Tuning parameter 'lambda' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.8 and lambda = 0.
lassoPreds = predict(lassoTune, testDF)
lassoResults = data.frame(obs = testDF$perm, pred = lassoPreds)
defaultSummary(lassoResults)
## RMSE Rsquared MAE
## 11.4986392 0.3309055 8.0728693
Despite a cross-validation \(R^2\) comparable to the Robust model, the Lasso model performed worse on the test set, although still a bit better than the PLS model did. Still, these are small sample sizes, and there is a lot of randomness affecting results.
For example, recall all the variance in \(R^2\) for the cross-validation folds of the PLS model (from .34 to .79). And recall how the decision to use 8 components was not completely obvious, and was based on the lowest RMSE for a jagged plot of small samples. If we’d chosen a different number of CV folds, we easily might’ve come up with only using 5 components, which would’ve led to this model:
set.seed(624)
plsFit = plsr(perm ~ . , data = trainDF, ncomp = 5)
PLSpreds = predict(plsFit, testDF, ncomp = 5)
plsResults = data.frame(obs = testDF$perm, pred = PLSpreds[,,])
defaultSummary(plsResults)
## RMSE Rsquared MAE
## 11.0877393 0.4019992 8.7624625
Now with just slight random fluctuations, PLS goes from being the worst of the 3 models to the best, by a significant \(R^2\) margin.
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
It was mentioned that the reason for trying to replace the lab experiments was that they were very expensive. Without knowing how expensive, these models with \(R^2\) topping out at .4 don’t seem like good candidates to replace anything. Some better version of them maybe would be a great tool to use in order to reduce the number of unsuccessful, costly experiments.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
#library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors
contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield
contains the percent yield for each run.
cmp = ChemicalManufacturingProcess
dim(cmp)
## [1] 176 58
#summary(cmp)
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
library(finalfit)
cmp %>%
missing_plot()
table(rowSums(is.na(cmp)))
##
## 0 1 3 4 11 16
## 152 10 7 1 5 1
One run is missing 16 features, 5 are missing 11 features, 1 missing 4 feats, 7 missing 3 feats, and 10 are missing just 1 feat. The other 152 runs have no missing features.
table(colSums(is.na(cmp)))
##
## 0 1 2 3 5 9 10 15
## 30 12 1 1 11 1 1 1
30 features (29 plus the response, actually), including the 12 Biological features, are present in all runs. 12 more are missing in just one run, 11 feats are missing in 5 runs, and one feature is missing from each of 2, 3, 9, 10, and 15 runs.
Here’s another angle to look at how the NA’s are clumped:
cmp %>%
missing_pattern(dependent = 'Yield', explanatory = names(cmp)) -> plot2 # to suppress matrix output
Let’s see what the feature distributions look like before imputing missing values.
par(mfrow = c(4,4))
for (n in names(cmp)) {
hist(cmp[[n]], main = paste0(substr(n, 1, 1),
substr(n, nchar(n)-1, nchar(n))), xlab = '')
}
Just one feature has obvious near-zero variance.
table(cmp$BiologicalMaterial07)
##
## 100 100.83
## 173 3
Drop B07 for NZV.
prepro = cmp %>% select(-c(BiologicalMaterial07))
There is a small number of zeroes in a few of these distributions, ruining the plots and potentially worth fixing before dealing with NA’s. At first glance, MP’s 16,18,20,25,26,27,29,30,31. It’s very possible they were keeping some value set at 0 as a sort of control for the run, but if all those zeroes are in one run, there might be cause to remove that run from training, or to give it median-type values instead of zeroes, so that it doesn’t make it harder to model the other features.
prepro[prepro$ManufacturingProcess16==0,]
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 23 40.7 6.58 58.38 67.17
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 23 12.22 18.46 51.45
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 23 18.22 12.83 2.69
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 23 148.49 21.23 11.1
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 23 0 NA 937
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 23 997.7 209.6 177
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 23 177 46.43 NA
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 23 NA 0 32.9
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 23 NA 6002 0
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 23 36.5 4902 6120
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 23 4621 3.6 7
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 23 3 17 4906
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 23 6134 4626 11.2
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 23 21 8.9 70.1
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 23 160 66 2.4
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 23 492 0.019 1.5
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 23 3 7.4 0
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 23 0 11.6 0.8
## ManufacturingProcess44 ManufacturingProcess45
## 23 1.9 2.4
Run 23 just has one of those singular zeroes. It has a normal yield, so make it MP16’s median.
prepro[23, "ManufacturingProcess16"] = median(prepro$ManufacturingProcess16, na.rm = T)
prepro[prepro$ManufacturingProcess18==0,]
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 22 41.18 7.08 61.83 70.69
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 22 13.43 17.72 53.27
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 22 19.14 13.38 3.02
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 22 153.1 21.9 10.9
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 22 0 NA 936
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 22 1024.4 218.7 178
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 22 178 47.04 NA
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 22 NA 0 33.2
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 22 4791 6026 4482
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 22 35.7 0 6111
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 22 0 2.5 6
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 22 2 16 4918
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 22 6152 4618 11.4
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 22 21.1 8.8 70.1
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 22 162 68 2.4
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 22 479 0.018 0.9
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 22 2 7.4 0
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 22 0 11.9 0.8
## ManufacturingProcess44 ManufacturingProcess45
## 22 1.9 2.1
Same story for MP18, except it also has the one 0 for MP20. Make them both medians.
prepro[22, "ManufacturingProcess18"] = median(prepro$ManufacturingProcess18, na.rm = T)
prepro[22, "ManufacturingProcess20"] = median(prepro$ManufacturingProcess20, na.rm = T)
prepro[prepro$ManufacturingProcess25==0,]
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 108 39.75 5.79 53.96 66.53
## NA NA NA NA NA
## NA.1 NA NA NA NA
## NA.2 NA NA NA NA
## NA.3 NA NA NA NA
## NA.4 NA NA NA NA
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 108 10.4 18.26 47.57
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 108 17.24 12.99 2.16
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 108 146.64 20.57 9.8
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 108 21.9 1.5 934
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 108 987.9 208.9 178
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 108 178 45.99 9.4
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 108 9.3 0 34.7
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 108 4807 5967 4536
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 108 34.7 4819 5997
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 108 4544 0 11
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 108 5 5 0
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 108 0 0 0
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 108 0 0 0
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 108 156 62 2.5
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 108 485 0.019 0.7
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 108 3 7 0
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 108 0 11.6 0.8
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## ManufacturingProcess44 ManufacturingProcess45
## 108 1.9 2.4
## NA NA NA
## NA.1 NA NA
## NA.2 NA NA
## NA.3 NA NA
## NA.4 NA NA
This #108 has all the remaining singular zeroes, at MP’s 25,26,27,29,30,31. It also has a response value near the mean. This filter on MP25 has also turned up the group of 5 runs with many NA’s. Let’s see what they are.
prepro[rowSums(is.na(prepro)) > 4,]
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 172 39.66 6.71 56.32 66.19
## 173 39.68 6.87 56.74 66.61
## 174 42.23 7.50 58.41 68.30
## 175 38.48 7.53 58.36 69.25
## 176 39.49 7.53 58.36 69.25
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 172 12.35 20.02 50.26
## 173 12.55 20.18 50.80
## 174 13.33 20.81 52.96
## 175 14.35 20.57 51.31
## 176 14.35 20.57 51.31
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1 16.66 11.44 3.46
## 172 17.54 12.50 2.82
## 173 17.48 12.41 2.82
## 174 17.23 12.04 2.83
## 175 17.87 12.77 3.55
## 176 17.87 12.77 3.55
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1 138.09 18.83 NA
## 172 143.45 20.32 12.8
## 173 143.10 20.24 12.8
## 174 141.72 19.92 13.0
## 175 145.56 20.04 14.1
## 176 145.56 20.04 14.1
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1 NA NA NA
## 172 21.5 1.54 935
## 173 21.5 1.56 933
## 174 20.4 1.55 930
## 175 21.6 1.55 935
## 176 20.8 1.55 932
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1 NA NA NA
## 172 1027.0 206.2 178
## 173 1032.0 206.6 177
## 174 1040.0 208.7 178
## 175 1044.8 208.0 177
## 176 1053.8 207.5 178
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1 NA 43.00 NA
## 172 177 46.78 9.6
## 173 178 46.51 9.5
## 174 177 48.05 10.1
## 175 177 48.11 10.2
## 176 178 48.13 9.9
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1 NA NA 35.5
## 172 9.6 0 33.9
## 173 9.9 0 34.0
## 174 10.4 0 33.1
## 175 10.3 0 32.9
## 176 10.3 0 32.9
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1 4898 6108 4682
## 172 4829 6026 4621
## 173 4833 6029 4608
## 174 4795 6000 4557
## 175 4793 6029 4600
## 176 4824 6068 4630
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1 35.5 4865 6049
## 172 33.5 4815 6011
## 173 33.5 4807 6001
## 174 32.8 4764 5977
## 175 32.4 4787 6030
## 176 32.0 4795 6050
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1 4665 0.0 NA
## 172 4612 -0.4 2
## 173 4584 -0.5 0
## 174 4538 -0.3 0
## 175 4592 -0.5 0
## 176 4607 -0.9 0
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1 NA NA 4873
## 172 2 8 NA
## 173 0 0 NA
## 174 0 0 NA
## 175 0 0 NA
## 176 0 0 NA
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1 6074 4685 10.7
## 172 NA NA NA
## 173 NA NA NA
## 174 NA NA NA
## 175 NA NA NA
## 176 NA NA NA
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1 21 9.9 69.1
## 172 NA NA NA
## 173 NA NA NA
## 174 NA NA NA
## 175 NA NA NA
## 176 NA NA NA
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1 156 66 2.4
## 172 156 NA NA
## 173 158 NA NA
## 174 167 NA NA
## 175 156 NA NA
## 176 160 NA NA
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1 486 0.019 0.5
## 172 NA NA 2.3
## 173 NA NA 1.0
## 174 NA NA 1.3
## 175 NA NA 2.3
## 176 NA NA 0.9
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1 3 7.2 NA
## 172 0 0.0 0
## 173 0 0.0 0
## 174 0 0.0 0
## 175 0 0.0 0
## 176 0 0.0 0
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1 NA 11.6 3.0
## 172 0 0.0 0.6
## 173 0 0.0 0.6
## 174 0 0.0 0.6
## 175 0 0.0 0.5
## 176 0 0.0 0.6
## ManufacturingProcess44 ManufacturingProcess45
## 1 1.8 2.4
## 172 0.0 0.0
## 173 0.0 0.0
## 174 0.0 0.0
## 175 0.0 0.0
## 176 0.0 0.0
This is where the inspection of outlier zeroes blends into the treatment of NA’s. The first thing to keep in mind is that our task is to find Manufacturing Process changes that improve Yield. Since we aren’t tasked with making predictions on data outside of this dataset, I’m less inclined to throw away troublesome rows like these. The response for all these runs is somewhat normal, so let’s just impute medians for everything missing now. And also for MP25 on run #108.
prepro[108, c(37:39,41:43)] = NA #MP's 25-27, 29-31
for (feat in names(prepro)) {
med = median(prepro[,feat], na.rm = T)
prepro[feat][is.na(prepro[feat])] = med
}
And now we should get a better look at the distributions.
par(mfrow = c(4,4))
for (n in names(prepro)) {
hist(prepro[[n]], main = paste0(substr(n, 1, 1),
substr(n, nchar(n)-1, nchar(n))), xlab = '')
}
Those look more useful for a linear model. Still there are a couple of outlier 0 values, but who’s to say if those were intentional or not.
prepro[prepro$ManufacturingProcess01==0,]
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 2 19.04 12.55 3.46
## 3 19.04 12.55 3.46
## 4 19.04 12.55 3.46
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 2 153.67 21.05 0
## 3 153.67 21.05 0
## 4 153.67 21.05 0
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 2 0 1.54 917
## 3 0 1.54 912
## 4 0 1.54 911
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 2 1032.2 210.0 177
## 3 1003.6 207.1 178
## 4 1014.6 213.3 177
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 2 178 46.57 9.1
## 3 178 45.07 9.1
## 4 177 44.92 9.1
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 2 9.4 0 34.0
## 3 9.4 0 34.8
## 4 9.4 0 34.8
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 2 4869 6095 4617
## 3 4878 6087 4617
## 4 4897 6102 4635
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 2 34.0 4867 6097
## 3 34.8 4877 6078
## 4 34.8 4872 6073
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 2 4621 0 3
## 3 4621 0 4
## 4 4611 0 5
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 2 0 3 4869
## 3 1 4 4897
## 4 2 5 4892
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 2 6107 4630 11.2
## 3 6116 4637 11.1
## 4 6111 4630 11.1
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 2 21.4 9.9 68.7
## 3 21.3 9.4 69.3
## 4 21.3 9.4 69.3
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 2 169 66 2.6
## 3 173 66 2.6
## 4 171 68 2.5
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 2 508 0.019 2.0
## 3 509 0.018 0.7
## 4 496 0.018 1.2
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 2 2 7.2 0.1
## 3 2 7.2 0.0
## 4 2 7.2 0.0
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 2 0.15 11.1 0.9
## 3 0.00 12.0 1.0
## 4 0.00 10.6 1.1
## ManufacturingProcess44 ManufacturingProcess45
## 2 1.9 2.2
## 3 1.8 2.3
## 4 1.8 2.1
This shows an interesting aspect of this data: The experimenters were using one Biological type here (all the Bio values are the same for the 3), and trying to see how tweaks to the Manufacturing features affected the Yield. I think if you tweaked any more outlier zeroes at this point, it might hide good information. So we should be able to see some clear correlations between Yield and its predictors now.
library(corrplot)
## corrplot 0.92 loaded
##
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
##
## corrplot
corrplot(cor(prepro), type = "upper", diag = F, )
MP’s 13 (neg), 32 (pos), and 36 (neg), look to have the highest Yield correlations. We’ll come back to this correlation matrix later on.
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(624)
trains = sample(1:176, 136) # randomly choose 136 and test with the other 40
tests = setdiff(1:176, trains)
trainDF = prepro[trains,]
testDF = prepro[tests,]
paste("Baseline of Test Set Variance: ", round(var(testDF$Yield),3))
## [1] "Baseline of Test Set Variance: 2.826"
testVars = c()
for (seed in 625:634) { #Bootstrap 10 sample variances
set.seed(seed)
testVars = c(testVars, var(sample(trainDF$Yield, dim(testDF)[1])))
}
paste("Training set Sample Variance, just to see how optimistic validation scores are, compared to test ones: ",
round(mean(testVars), 3))
## [1] "Training set Sample Variance, just to see how optimistic validation scores are, compared to test ones: 3.752"
It does look like predictions for the test split should do pretty well as far as RMSE, compared to the training set, based on the response variances. But the \(R^2\) will be harder to match, since there is less variance to explain in the test set, so that whatever patterns the model predicts on will probably be overfit as far as the test responses go.
First, min-max scale the values. Use the ranges of the training set, so as not to unrealistically steal info from the test set.
for (feat in names(trainDF)[2:dim(trainDF)[2]]){
hi = max(trainDF[,feat])
lo = min(trainDF[,feat])
trainDF[,feat] = (trainDF[,feat] - lo) / (hi - lo)
testDF[,feat] = (testDF[,feat] - lo) / (hi - lo)
}
Although a PLS model isn’t the best for interpreting features that are important for increasing Yield, it wouldn’t hurt to fit a PLS model anyways, just to have a sort of comparison baseline. And after tuning for the number of components, we can run a model using a method that allows the VIP
heuristic to estimate the features that were most important to the chosen number of components.
set.seed(624)
ctrl = trainControl(method = "cv", number = 8)
plsTune = train(Yield ~., data = trainDF, method = "pls", tuneLength = 12, trControl = ctrl)
plot(plsTune, main = "8-fold Cross-Validated RMSE for Training")
I chose 8-fold CV because 8 divides evenly into the training set size of 136, and because this plot was typical of all CV fold sizes I tried: It improves quickly up to 3 or 4 components, and then levels off. It’s not clear whether anything past 3 components would help on unseen data like test data, even though the RMSE continues to shrink all the way up to 5 components. I’ll go with 3.
noquote(paste("CV training R-sq for 3 components: ", round(plsTune$results$Rsquared[3], 3)))
## [1] CV training R-sq for 3 components: 0.606
And now that baseline PLS model on the test set:
PLSpreds = predict(plsTune, testDF, ncomp = 3)
plsResults = data.frame(obs = testDF$Yield, pred = PLSpreds)
defaultSummary(plsResults)
## RMSE Rsquared MAE
## 1.272187 0.465287 1.007249
Quite a dropoff in \(R^2\), as feared.
Run a separate model with method = "oscorespls"
to find important features, now that a good number of components is known.
Here are the top 15 most important features in that 3-component PLS model:
library(chillR)
## Warning: package 'chillR' was built under R version 4.0.5
##
## Attaching package: 'chillR'
## The following object is masked from 'package:pls':
##
## RMSEP
plsFit = plsr(Yield ~., data = trainDF, method = "oscorespls", ncomp = 3, trControl = ctrl)
vip = VIP(plsFit)
mostImp = vip[,order(-colSums(vip))[1:15]]
colMeans(mostImp)
## ManufacturingProcess28 ManufacturingProcess12 ManufacturingProcess32
## 2.383449 2.311615 1.879525
## BiologicalMaterial02 ManufacturingProcess13 BiologicalMaterial06
## 1.733612 1.597900 1.489294
## ManufacturingProcess31 ManufacturingProcess36 ManufacturingProcess09
## 1.488125 1.449520 1.428692
## ManufacturingProcess02 ManufacturingProcess17 BiologicalMaterial03
## 1.409458 1.397304 1.350621
## BiologicalMaterial08 ManufacturingProcess33 BiologicalMaterial11
## 1.348792 1.286223 1.214325
MP’s #28 and #12 were most important to this model’s 3 components, despite not having huge direct correlations with the Yield response. #32 was 3rd most important, and did have high correlation with the response directly.
set.seed(624)
lassoGrid = expand.grid(.lambda = c(0, 0.033, 0.1, 0.33), .fraction = seq(.1, 1, length = 10))
lassoTune = train(Yield ~., data = trainDF, method = 'enet',
tuneGrid = lassoGrid, trControl = ctrl)
lassoTune
## Elasticnet
##
## 136 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (8 fold)
## Summary of sample sizes: 119, 118, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.000 0.1 1.116059 0.6756750 0.8903974
## 0.000 0.2 1.121678 0.6733601 0.9206355
## 0.000 0.3 1.170147 0.6499737 0.9705279
## 0.000 0.4 1.251857 0.6019600 1.0208761
## 0.000 0.5 1.314627 0.5697722 1.0535342
## 0.000 0.6 1.367438 0.5458011 1.0903394
## 0.000 0.7 1.416817 0.5236316 1.1259032
## 0.000 0.8 1.454659 0.5057967 1.1518887
## 0.000 0.9 1.487780 0.4899723 1.1769317
## 0.000 1.0 1.520760 0.4759886 1.2024101
## 0.033 0.1 1.396945 0.6346758 1.1360204
## 0.033 0.2 1.148951 0.6590703 0.9083812
## 0.033 0.3 1.107264 0.6780254 0.8879987
## 0.033 0.4 1.108920 0.6771411 0.9028130
## 0.033 0.5 1.127816 0.6674918 0.9320018
## 0.033 0.6 1.146833 0.6604335 0.9559716
## 0.033 0.7 1.162844 0.6535368 0.9729133
## 0.033 0.8 1.176343 0.6467818 0.9830184
## 0.033 0.9 1.193255 0.6379106 0.9943950
## 0.033 1.0 1.215179 0.6265708 1.0099776
## 0.100 0.1 1.502108 0.6170029 1.2220311
## 0.100 0.2 1.233344 0.6494447 0.9942807
## 0.100 0.3 1.127471 0.6625898 0.8939303
## 0.100 0.4 1.116866 0.6696148 0.9021902
## 0.100 0.5 1.114611 0.6732466 0.9117684
## 0.100 0.6 1.123233 0.6706142 0.9274933
## 0.100 0.7 1.138240 0.6662002 0.9436407
## 0.100 0.8 1.158103 0.6599212 0.9662322
## 0.100 0.9 1.170746 0.6556813 0.9800475
## 0.100 1.0 1.182938 0.6509149 0.9921183
## 0.330 0.1 1.541399 0.6137824 1.2551412
## 0.330 0.2 1.292069 0.6384255 1.0484441
## 0.330 0.3 1.147888 0.6525697 0.9122188
## 0.330 0.4 1.146815 0.6516299 0.9224663
## 0.330 0.5 1.162780 0.6557183 0.9489530
## 0.330 0.6 1.171668 0.6574460 0.9611523
## 0.330 0.7 1.182928 0.6587762 0.9759807
## 0.330 0.8 1.196021 0.6600107 0.9864035
## 0.330 0.9 1.210167 0.6604534 0.9966103
## 0.330 1.0 1.226355 0.6594973 1.0113366
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.3 and lambda = 0.033.
Use those optimal hyperparameters to make predictions and see which features were important.
enetMod = enet(as.matrix(trainDF[,2:57]), trainDF$Yield[], lambda = 0.033, normalize = T)
enetPreds = predict(enetMod, as.matrix(trainDF[,2:57]), s = 0.3, mode = "fraction", type = "fit")
enetFeats = predict(enetMod, as.matrix(trainDF[,2:57]), s = 0.3, mode = "fraction", type = "coefficients")
enetFeats$coefficients[order(-abs(enetFeats$coefficients))]
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess17
## 4.44853042 3.00693486 -1.74973075
## ManufacturingProcess31 ManufacturingProcess13 BiologicalMaterial03
## -1.19568214 -1.02205261 0.94146453
## ManufacturingProcess36 ManufacturingProcess34 ManufacturingProcess06
## -0.72108365 0.60348833 0.52055986
## ManufacturingProcess37 ManufacturingProcess04 ManufacturingProcess45
## -0.47013775 0.34857981 0.23319643
## BiologicalMaterial05 ManufacturingProcess39 ManufacturingProcess28
## 0.19445567 0.17069532 -0.14372558
## ManufacturingProcess07 ManufacturingProcess40 ManufacturingProcess42
## -0.08988187 -0.07799239 0.07004882
## ManufacturingProcess29 ManufacturingProcess43 ManufacturingProcess44
## 0.04173388 0.03091762 0.02493506
## ManufacturingProcess15 BiologicalMaterial01 BiologicalMaterial02
## 0.02090755 0.00000000 0.00000000
## BiologicalMaterial04 BiologicalMaterial06 BiologicalMaterial08
## 0.00000000 0.00000000 0.00000000
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0.00000000 0.00000000 0.00000000
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess03 ManufacturingProcess05 ManufacturingProcess08
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess14 ManufacturingProcess16 ManufacturingProcess18
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess30 ManufacturingProcess33 ManufacturingProcess35
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess38 ManufacturingProcess41
## 0.00000000 0.00000000
Processes 32 and 09 were the most important ones here. Processes 28 and 12, which were important in the PLS components, didn’t matter much here. #12 was actually pushed to zero by the lasso penalty, as were over half of the features.
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
The elastic net predictions turn out to fare better than the PLS model (.514 \(R^2\) vs .465), but still not great –>
enetTest = predict(enetMod, as.matrix(testDF[,2:57]), s = 0.3, mode = "fraction", type = "fit")
enetResults = data.frame(obs = testDF$Yield, pred = enetTest$fit)
defaultSummary(enetResults)
## RMSE Rsquared MAE
## 1.2053919 0.5138496 0.9886975
Still searching for a better model, I’ll fit a linear model to all features, and back out the ones that aren’t contributing to the fit, one or two at a time. This is an imperfect way to reduce features, in the sense that it’s a (non-optimal) greedy algorithm, removing the least significant feature/s at each step. But although not necessarily optimal, it does produce decent results in general. Here’s what was left for features:
linmod = lm(Yield ~ . - ManufacturingProcess20 - ManufacturingProcess21 -
ManufacturingProcess13 - ManufacturingProcess15 - ManufacturingProcess06 -
ManufacturingProcess24 - ManufacturingProcess30 - ManufacturingProcess35 -
ManufacturingProcess36 - ManufacturingProcess05 - ManufacturingProcess38 -
ManufacturingProcess34 - ManufacturingProcess26 - ManufacturingProcess19 -
ManufacturingProcess12 - ManufacturingProcess10 - ManufacturingProcess02 -
ManufacturingProcess42 - ManufacturingProcess18 - ManufacturingProcess03 -
ManufacturingProcess44 - ManufacturingProcess16 - ManufacturingProcess14 -
ManufacturingProcess08 - ManufacturingProcess07 - ManufacturingProcess22 -
ManufacturingProcess01 - ManufacturingProcess45 - ManufacturingProcess11 -
ManufacturingProcess25 - ManufacturingProcess31 - ManufacturingProcess41 -
ManufacturingProcess40 - ManufacturingProcess23 - ManufacturingProcess39 -
BiologicalMaterial06 - BiologicalMaterial01 - BiologicalMaterial12 -
BiologicalMaterial10 - BiologicalMaterial09 - BiologicalMaterial08 -
BiologicalMaterial04 - BiologicalMaterial11, trainDF)
summary(linmod)
##
## Call:
## lm(formula = Yield ~ . - ManufacturingProcess20 - ManufacturingProcess21 -
## ManufacturingProcess13 - ManufacturingProcess15 - ManufacturingProcess06 -
## ManufacturingProcess24 - ManufacturingProcess30 - ManufacturingProcess35 -
## ManufacturingProcess36 - ManufacturingProcess05 - ManufacturingProcess38 -
## ManufacturingProcess34 - ManufacturingProcess26 - ManufacturingProcess19 -
## ManufacturingProcess12 - ManufacturingProcess10 - ManufacturingProcess02 -
## ManufacturingProcess42 - ManufacturingProcess18 - ManufacturingProcess03 -
## ManufacturingProcess44 - ManufacturingProcess16 - ManufacturingProcess14 -
## ManufacturingProcess08 - ManufacturingProcess07 - ManufacturingProcess22 -
## ManufacturingProcess01 - ManufacturingProcess45 - ManufacturingProcess11 -
## ManufacturingProcess25 - ManufacturingProcess31 - ManufacturingProcess41 -
## ManufacturingProcess40 - ManufacturingProcess23 - ManufacturingProcess39 -
## BiologicalMaterial06 - BiologicalMaterial01 - BiologicalMaterial12 -
## BiologicalMaterial10 - BiologicalMaterial09 - BiologicalMaterial08 -
## BiologicalMaterial04 - BiologicalMaterial11, data = trainDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2162 -0.5475 -0.0145 0.5594 2.5787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0184 1.4994 24.022 < 2e-16 ***
## BiologicalMaterial02 -3.6142 1.3400 -2.697 0.007981 **
## BiologicalMaterial03 3.6472 1.2506 2.916 0.004216 **
## BiologicalMaterial05 1.8028 0.7429 2.427 0.016703 *
## ManufacturingProcess04 2.5564 0.6248 4.092 7.72e-05 ***
## ManufacturingProcess09 2.4006 1.1894 2.018 0.045749 *
## ManufacturingProcess17 -3.0712 1.0573 -2.905 0.004363 **
## ManufacturingProcess27 -4.9842 1.3366 -3.729 0.000293 ***
## ManufacturingProcess28 -0.5086 0.2677 -1.900 0.059794 .
## ManufacturingProcess29 7.2043 1.3369 5.389 3.51e-07 ***
## ManufacturingProcess32 8.0879 1.0349 7.815 2.17e-12 ***
## ManufacturingProcess33 -4.2092 1.0576 -3.980 0.000118 ***
## ManufacturingProcess37 -1.0896 0.4513 -2.415 0.017241 *
## ManufacturingProcess43 2.0628 1.0273 2.008 0.046843 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9595 on 122 degrees of freedom
## Multiple R-squared: 0.7688, Adjusted R-squared: 0.7442
## F-statistic: 31.21 on 13 and 122 DF, p-value: < 2.2e-16
The Adjusted \(R^2\) is still a bit lower than the non-adjusted, but they’re both better than the other models. Let’s see how things translate to the test set.
linmodpred = predict(linmod, testDF)
lmDF = data.frame(obs = testDF$Yield, pred = linmodpred)
defaultSummary(lmDF)
## RMSE Rsquared MAE
## 1.0949590 0.5955950 0.8548047
Even after pulling out all features that weren’t statistically significant towards predicting the response, the model proves to be overfit. That is, the adjusted \(R^2\) in training drops off the table in testing, from .741 to an \(R^2\) of .596 for the testDF. But we were already ready for that, after seeing how different the test and training splits were, in terms of variance. And again, these are small splits, so it’s going to be unlikely that a model fit to one split predicts the other one well. Think of dealing a bridge hand to 4 players, and modeling outcomes on three of the hands, and then expecting the predictions on the fourth hand to be as accurate as the training fit. If the deck were 520 cards instead of 52, you’d have better predictions.
At any rate, the best I could do was this last model, OLS without penalties, just hand-reduced features.
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The most important manufacturing predictor was #32, which agrees with the correlation matrix above. #29 is also important, moreso than in the correlation matrix. It was correlated with a lot of the biological features that were positively correlated with Yield, and when those were dropped from the model, their positive coefficients seem to have been given to #29 in their place. #36 was expected to have a large negative correlation with the Yield, yet it doesn’t factor into the linear model. The likely reason, if you glance at the correlation matrix, is that #36 has a huge negative correlation with #32, such that only one of them is needed in the linear model.
The Biological features in general don’t seem as important, but when backing out features one by one, the Bio feats stayed in till the end, when they all came out in a large batch. So they are only less important to the model because it can give their importance to correlated Manufacturing features which remain in the model. I’d think it would be worth the experimenters’ time to try to understand why a handful of manufacturing processes were so highly correlated with biological features. Those processes end up being the most important ones in all models, but unless they get decoupled from the biological properties, they don’t really provide as much meaning to the question we’re trying to answer here.
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
At first it seems like increasing #32 and #29 would be the best way to increase Yield. They have the highest positive coefficients. But if you look at the correlation matrix, they also are highly correlated with the Bio features, which are all positively correlated with Yield. In effect, those two processes were highest when the Bio levels were highest, so of course the Yield was higher in those runs. And the model assigns higher coefficients to 29 and 32 when it drops out all the Bio features. If you look back at the correlation matrix, you see that it’s actually Process #09 that has a high correlation with the Yield without being correlated much with the Bio feats. #13 is similarly negative where #09 is positive, again without much correlation with the Bio feats. And those were 2 of the most important 5 features in the ElasticNet model. But 13 and 09 are very highly negatively correlated, so that 13 ended up being insignificant in the model (it was important in the PLS model components earlier, where 09 happened to not be). The point of all that is that either increasing #09 or decreasing #13 may have a stronger effect on increasing Yield than raising #29 and #32 does.