6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

dim(fingerprints)[1] == dim(permeability)[1]
## [1] TRUE

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nzv = caret::nearZeroVar(fingerprints)
noNZV = fingerprints[,-nzv]
dim(noNZV)
## [1] 165 388

Although this question is phrased as filtering out predictors with low frequencies, the goal seems to be also to filter out those with high frequencies. For example some predictors are present in every compound, as thus are discarded by this method.

388 predictors made the cut for having fewer than 19 zeroes for every 1 and fewer than 19 ones for every 0. Note that any predictor that failed that test also failed the other default NZV test, which is having at least 10% as many unique values as total values, since there are only 2 unique values for these binary predictors, and there are 165 total values per predictor.

paste("The fewest 1's in a feature in the filtered data.frame is", min(colSums(noNZV)),
      "and the fewest 0's is", dim(noNZV)[1] - max(colSums(noNZV)))
## [1] "The fewest 1's in a feature in the filtered data.frame is 9 and the fewest 0's is 9"

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\) ?

set.seed(624)
trains = sample(1:165, 130)  # randomly choose 130 compounds and test with the other 35
tests = setdiff(1:165, trains)
trainDF = noNZV[trains,]
testDF = noNZV[tests,]

Since the data values are all binary, I don’t think they need scaling or centering. If the goal were to do linear regression on them, the next step might be to eliminate highly correlated predictors, but we are asked instead to do PLS on them. Nevertheless, let’s see how correlated the predictors are, and at least remove any that are completely not needed.

nfeats = dim(trainDF)[2]
cormat = matrix(0, nfeats, nfeats)
for (i in 1:(nfeats-1)) {
  for (j in ((i+1):nfeats)) {
    cormat[i,j] = sum(noNZV[,i] == noNZV[,j]) / 165 * 2 - 1
  }
}
sum(cormat) * 2 / nfeats / (nfeats-1)  # mean correlation of any two different predictors in noNZV
## [1] 0.1546361

That just built an upper triangular matrix that measured the fraction of the 165 compounds whose presence/absence agreed in each pair of predictors, and then multiplied by 2 and subtracted 1, in order to make it more familiar in terms of normal correlation measures (-1 for complete disagreement, +1 for complete agreement). The mean for the matrix was 0.155 or so. This is because the mean value in the data is only 0.31, i.e. there are just over twice as many zeroes as ones. As such, a typical pair of feature columns matches 0’s in more than the 165/4 indexes that would be expected for a random and balanced distribution of 1’s and 0’s.

hist(cormat[upper.tri(cormat,F)], main = "Pairwise Correlations of Predictors", xlab = "")

ones = which(cormat == 1, arr.ind = T)
discard = setdiff(ones[,1], ones[,2]) # this is all the features that are duped, but keeps one of each dupeset
length(discard)
## [1] 66

66 features can be removed as duplicates, while keeping one that dupes them. Remove them in the test set as well, since the training set has no useful info for them.

trainDF = trainDF[,-discard]
testDF = testDF[,-discard]
dim(trainDF)
## [1] 130 322

Model with PLS, starting by tuning for the optimal number of components.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(624)
ctrl = trainControl(method = "cv", number = 5)
plsTune = train(trainDF, permeability[trains], method = "pls", tuneLength = 20, trControl = ctrl)
plot(plsTune, main = "5-fold Cross-Validated RMSE for Training")

It’s a little choppy because each fold only has 130/5 = 26 compounds in it, but it seems reasonable to use the 8 components that produced the CV-averaged RMSE of just over 11. If anything, you could probably use fewer.

noquote(paste("R-sq for 8 compontents: ", plsTune$results$Rsquared[8]))
## [1] R-sq for 8 compontents:  0.553611369833046

That’s the average for the 5 validation folds, but let’s see how much they vary.

plsTune$resample
##        RMSE  Rsquared       MAE Resample
## 1 11.858163 0.5767516  8.072946    Fold1
## 2  7.019797 0.7855236  5.768535    Fold5
## 3 11.301891 0.6001998  8.436503    Fold2
## 4 13.228300 0.3434449 10.599440    Fold4
## 5 11.805442 0.4621369  8.539903    Fold3

There’s a lot of variance in the \(R^2\) there, with only 26 samples to score for each fold. It’s really anyone’s guess how this will do on the test set.

(d) Predict the response for the test set. What is the test set estimate of \(R^2\) ?

Let’s see how an 8-component model looks.

Tack on the response variable for modeling with formulas.

trainDF = as.data.frame(trainDF)
trainDF$perm = permeability[trains]
testDF = as.data.frame(testDF)
testDF$perm = permeability[tests]

Train on all 130 samples, without folds.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(624)
plsFit = plsr(perm ~ . , data = trainDF, ncomp = 8)
plot(plsFit$residuals, main = "PLS Residuals from 8-Component Model")

set.seed(624)
PLSpreds = predict(plsFit, testDF, ncomp = 8)
plsResults = data.frame(obs = testDF$perm, pred = PLSpreds[,,])
defaultSummary(plsResults)
##       RMSE   Rsquared        MAE 
## 12.1644683  0.3167209  9.4239309

That \(R^2\) of .317 is even lower than the worst of the 5 cross-validation folds. Splitting 130 training samples into 5 groups and averaging the 5 models’ results on the 5 groups of 26 validation samples, 4 models trained on every sample, so there was a range of worse and better results. The one model trained on all 130 didn’t seem to get as good at look at whatever patterns lay in the 35 test samples.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Try Robust Linear Regression, after reducing dimensions via PCA. Use the same splits as before.

set.seed(624)
rlmPCA = train(perm ~ . , data = trainDF, method = 'rlm', 
               metric = "RMSE", preProcess = "pca", trControl = ctrl)
rlmPCA
## Robust Linear Model 
## 
## 130 samples
## 322 predictors
## 
## Pre-processing: principal component signal extraction (322), centered
##  (322), scaled (322) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 104, 104, 104, 105, 103 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE      Rsquared   MAE      
##   FALSE      psi.huber     17.91105  0.4366678  14.303706
##   FALSE      psi.hampel    17.79546  0.4529139  14.261505
##   FALSE      psi.bisquare  18.18375  0.4239782  14.440848
##    TRUE      psi.huber     12.41827  0.4586823   8.789986
##    TRUE      psi.hampel    13.01753  0.4309632   9.212618
##    TRUE      psi.bisquare  13.64374  0.4544003   9.034237
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.
RLMpreds = predict(rlmPCA, testDF)
rlmResults = data.frame(obs = testDF$perm, pred = RLMpreds)
defaultSummary(rlmResults)
##       RMSE   Rsquared        MAE 
## 11.0335064  0.3878919  7.4777056

The RMSE is actually lower than in the training batch (11.03 vs. 12.42), but the \(R^2\) dropped from .459 to .388, so there must be less variance in the test responses than in the training ones.

set.seed(624)
paste("Variance of 35 test responses: ", round(var(testDF$perm)))
## [1] "Variance of 35 test responses:  200"
paste("Variance of randomly chosen 35 training responses: ", round(var(sample(trainDF$perm, 35))))
## [1] "Variance of randomly chosen 35 training responses:  285"

Still, this is better than the PLS model did on the test batch.

Let’s see how a Lasso model does, with PCA pre-processing, on the same splits.

library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
set.seed(624)
lassoGrid = expand.grid(.lambda = c(0), .fraction = seq(.1, 1, length = 10))
lassoTune = train(perm ~., data = trainDF, method = 'enet',
                  tuneGrid = lassoGrid, trControl = ctrl,
                  preProcess = c("scale", "pca"))
lassoTune
## Elasticnet 
## 
## 130 samples
## 322 predictors
## 
## Pre-processing: scaled (322), principal component signal extraction
##  (322), centered (322) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 104, 104, 104, 105, 103 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.1       14.42373  0.3255332  11.575030
##   0.2       13.50423  0.3742386  10.772172
##   0.3       12.91839  0.3949580  10.131515
##   0.4       12.50504  0.4097887   9.557359
##   0.5       12.18728  0.4267397   9.204028
##   0.6       11.96239  0.4407688   8.985098
##   0.7       11.83005  0.4516625   8.901894
##   0.8       11.79533  0.4583305   8.979240
##   0.9       11.82826  0.4625073   9.080603
##   1.0       11.94604  0.4626869   9.208198
## 
## Tuning parameter 'lambda' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.8 and lambda = 0.
lassoPreds = predict(lassoTune, testDF)
lassoResults = data.frame(obs = testDF$perm, pred = lassoPreds)
defaultSummary(lassoResults)
##       RMSE   Rsquared        MAE 
## 11.4986392  0.3309055  8.0728693

Despite a cross-validation \(R^2\) comparable to the Robust model, the Lasso model performed worse on the test set, although still a bit better than the PLS model did. Still, these are small sample sizes, and there is a lot of randomness affecting results.

For example, recall all the variance in \(R^2\) for the cross-validation folds of the PLS model (from .34 to .79). And recall how the decision to use 8 components was not completely obvious, and was based on the lowest RMSE for a jagged plot of small samples. If we’d chosen a different number of CV folds, we easily might’ve come up with only using 5 components, which would’ve led to this model:

set.seed(624)
plsFit = plsr(perm ~ . , data = trainDF, ncomp = 5)
PLSpreds = predict(plsFit, testDF, ncomp = 5)
plsResults = data.frame(obs = testDF$perm, pred = PLSpreds[,,])
defaultSummary(plsResults)
##       RMSE   Rsquared        MAE 
## 11.0877393  0.4019992  8.7624625

Now with just slight random fluctuations, PLS goes from being the worst of the 3 models to the best, by a significant \(R^2\) margin.

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

It was mentioned that the reason for trying to replace the lab experiments was that they were very expensive. Without knowing how expensive, these models with \(R^2\) topping out at .4 don’t seem like good candidates to replace anything. Some better version of them maybe would be a great tool to use in order to reduce the number of unsuccessful, costly experiments.


6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

#library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

cmp = ChemicalManufacturingProcess
dim(cmp)
## [1] 176  58
#summary(cmp)

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

library(finalfit)

cmp %>%
  missing_plot()

table(rowSums(is.na(cmp)))
## 
##   0   1   3   4  11  16 
## 152  10   7   1   5   1

One run is missing 16 features, 5 are missing 11 features, 1 missing 4 feats, 7 missing 3 feats, and 10 are missing just 1 feat. The other 152 runs have no missing features.

table(colSums(is.na(cmp)))
## 
##  0  1  2  3  5  9 10 15 
## 30 12  1  1 11  1  1  1

30 features (29 plus the response, actually), including the 12 Biological features, are present in all runs. 12 more are missing in just one run, 11 feats are missing in 5 runs, and one feature is missing from each of 2, 3, 9, 10, and 15 runs.

Here’s another angle to look at how the NA’s are clumped:

cmp %>%
  missing_pattern(dependent = 'Yield', explanatory = names(cmp)) -> plot2 # to suppress matrix output

Let’s see what the feature distributions look like before imputing missing values.

par(mfrow = c(4,4))
for (n in names(cmp)) {
  hist(cmp[[n]], main = paste0(substr(n, 1, 1),
                               substr(n, nchar(n)-1, nchar(n))), xlab = '')
}

Just one feature has obvious near-zero variance.

table(cmp$BiologicalMaterial07)
## 
##    100 100.83 
##    173      3

Drop B07 for NZV.

prepro = cmp %>% select(-c(BiologicalMaterial07))

There is a small number of zeroes in a few of these distributions, ruining the plots and potentially worth fixing before dealing with NA’s. At first glance, MP’s 16,18,20,25,26,27,29,30,31. It’s very possible they were keeping some value set at 0 as a sort of control for the run, but if all those zeroes are in one run, there might be cause to remove that run from training, or to give it median-type values instead of zeroes, so that it doesn’t make it harder to model the other features.

prepro[prepro$ManufacturingProcess16==0,]
##    Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 23  40.7                 6.58                58.38                67.17
##    BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 23                12.22                18.46                51.45
##    BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 23                18.22                12.83                 2.69
##    BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 23               148.49                21.23                   11.1
##    ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 23                      0                     NA                    937
##    ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 23                  997.7                  209.6                    177
##    ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 23                    177                  46.43                     NA
##    ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 23                     NA                      0                   32.9
##    ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 23                     NA                   6002                      0
##    ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 23                   36.5                   4902                   6120
##    ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 23                   4621                    3.6                      7
##    ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 23                      3                     17                   4906
##    ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 23                   6134                   4626                   11.2
##    ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 23                     21                    8.9                   70.1
##    ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 23                    160                     66                    2.4
##    ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 23                    492                  0.019                    1.5
##    ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 23                      3                    7.4                      0
##    ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 23                      0                   11.6                    0.8
##    ManufacturingProcess44 ManufacturingProcess45
## 23                    1.9                    2.4

Run 23 just has one of those singular zeroes. It has a normal yield, so make it MP16’s median.

prepro[23, "ManufacturingProcess16"] = median(prepro$ManufacturingProcess16, na.rm = T)
prepro[prepro$ManufacturingProcess18==0,]
##    Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 22 41.18                 7.08                61.83                70.69
##    BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 22                13.43                17.72                53.27
##    BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 22                19.14                13.38                 3.02
##    BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 22                153.1                 21.9                   10.9
##    ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 22                      0                     NA                    936
##    ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 22                 1024.4                  218.7                    178
##    ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 22                    178                  47.04                     NA
##    ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 22                     NA                      0                   33.2
##    ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 22                   4791                   6026                   4482
##    ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 22                   35.7                      0                   6111
##    ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 22                      0                    2.5                      6
##    ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 22                      2                     16                   4918
##    ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 22                   6152                   4618                   11.4
##    ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 22                   21.1                    8.8                   70.1
##    ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 22                    162                     68                    2.4
##    ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 22                    479                  0.018                    0.9
##    ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 22                      2                    7.4                      0
##    ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 22                      0                   11.9                    0.8
##    ManufacturingProcess44 ManufacturingProcess45
## 22                    1.9                    2.1

Same story for MP18, except it also has the one 0 for MP20. Make them both medians.

prepro[22, "ManufacturingProcess18"] = median(prepro$ManufacturingProcess18, na.rm = T)
prepro[22, "ManufacturingProcess20"] = median(prepro$ManufacturingProcess20, na.rm = T)
prepro[prepro$ManufacturingProcess25==0,]
##      Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 108  39.75                 5.79                53.96                66.53
## NA      NA                   NA                   NA                   NA
## NA.1    NA                   NA                   NA                   NA
## NA.2    NA                   NA                   NA                   NA
## NA.3    NA                   NA                   NA                   NA
## NA.4    NA                   NA                   NA                   NA
##      BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 108                  10.4                18.26                47.57
## NA                     NA                   NA                   NA
## NA.1                   NA                   NA                   NA
## NA.2                   NA                   NA                   NA
## NA.3                   NA                   NA                   NA
## NA.4                   NA                   NA                   NA
##      BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 108                 17.24                12.99                 2.16
## NA                     NA                   NA                   NA
## NA.1                   NA                   NA                   NA
## NA.2                   NA                   NA                   NA
## NA.3                   NA                   NA                   NA
## NA.4                   NA                   NA                   NA
##      BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 108                146.64                20.57                    9.8
## NA                     NA                   NA                     NA
## NA.1                   NA                   NA                     NA
## NA.2                   NA                   NA                     NA
## NA.3                   NA                   NA                     NA
## NA.4                   NA                   NA                     NA
##      ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 108                    21.9                    1.5                    934
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 108                   987.9                  208.9                    178
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 108                     178                  45.99                    9.4
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 108                     9.3                      0                   34.7
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 108                    4807                   5967                   4536
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 108                    34.7                   4819                   5997
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 108                    4544                      0                     11
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 108                       5                      5                      0
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 108                       0                      0                      0
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 108                       0                      0                      0
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 108                     156                     62                    2.5
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 108                     485                  0.019                    0.7
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 108                       3                      7                      0
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 108                       0                   11.6                    0.8
## NA                       NA                     NA                     NA
## NA.1                     NA                     NA                     NA
## NA.2                     NA                     NA                     NA
## NA.3                     NA                     NA                     NA
## NA.4                     NA                     NA                     NA
##      ManufacturingProcess44 ManufacturingProcess45
## 108                     1.9                    2.4
## NA                       NA                     NA
## NA.1                     NA                     NA
## NA.2                     NA                     NA
## NA.3                     NA                     NA
## NA.4                     NA                     NA

This #108 has all the remaining singular zeroes, at MP’s 25,26,27,29,30,31. It also has a response value near the mean. This filter on MP25 has also turned up the group of 5 runs with many NA’s. Let’s see what they are.

prepro[rowSums(is.na(prepro)) > 4,]
##     Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1   38.00                 6.25                49.58                56.97
## 172 39.66                 6.71                56.32                66.19
## 173 39.68                 6.87                56.74                66.61
## 174 42.23                 7.50                58.41                68.30
## 175 38.48                 7.53                58.36                69.25
## 176 39.49                 7.53                58.36                69.25
##     BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                  12.74                19.51                43.73
## 172                12.35                20.02                50.26
## 173                12.55                20.18                50.80
## 174                13.33                20.81                52.96
## 175                14.35                20.57                51.31
## 176                14.35                20.57                51.31
##     BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1                  16.66                11.44                 3.46
## 172                17.54                12.50                 2.82
## 173                17.48                12.41                 2.82
## 174                17.23                12.04                 2.83
## 175                17.87                12.77                 3.55
## 176                17.87                12.77                 3.55
##     BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1                 138.09                18.83                     NA
## 172               143.45                20.32                   12.8
## 173               143.10                20.24                   12.8
## 174               141.72                19.92                   13.0
## 175               145.56                20.04                   14.1
## 176               145.56                20.04                   14.1
##     ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1                       NA                     NA                     NA
## 172                   21.5                   1.54                    935
## 173                   21.5                   1.56                    933
## 174                   20.4                   1.55                    930
## 175                   21.6                   1.55                    935
## 176                   20.8                   1.55                    932
##     ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1                       NA                     NA                     NA
## 172                 1027.0                  206.2                    178
## 173                 1032.0                  206.6                    177
## 174                 1040.0                  208.7                    178
## 175                 1044.8                  208.0                    177
## 176                 1053.8                  207.5                    178
##     ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1                       NA                  43.00                     NA
## 172                    177                  46.78                    9.6
## 173                    178                  46.51                    9.5
## 174                    177                  48.05                   10.1
## 175                    177                  48.11                   10.2
## 176                    178                  48.13                    9.9
##     ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1                       NA                     NA                   35.5
## 172                    9.6                      0                   33.9
## 173                    9.9                      0                   34.0
## 174                   10.4                      0                   33.1
## 175                   10.3                      0                   32.9
## 176                   10.3                      0                   32.9
##     ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1                     4898                   6108                   4682
## 172                   4829                   6026                   4621
## 173                   4833                   6029                   4608
## 174                   4795                   6000                   4557
## 175                   4793                   6029                   4600
## 176                   4824                   6068                   4630
##     ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1                     35.5                   4865                   6049
## 172                   33.5                   4815                   6011
## 173                   33.5                   4807                   6001
## 174                   32.8                   4764                   5977
## 175                   32.4                   4787                   6030
## 176                   32.0                   4795                   6050
##     ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1                     4665                    0.0                     NA
## 172                   4612                   -0.4                      2
## 173                   4584                   -0.5                      0
## 174                   4538                   -0.3                      0
## 175                   4592                   -0.5                      0
## 176                   4607                   -0.9                      0
##     ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1                       NA                     NA                   4873
## 172                      2                      8                     NA
## 173                      0                      0                     NA
## 174                      0                      0                     NA
## 175                      0                      0                     NA
## 176                      0                      0                     NA
##     ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1                     6074                   4685                   10.7
## 172                     NA                     NA                     NA
## 173                     NA                     NA                     NA
## 174                     NA                     NA                     NA
## 175                     NA                     NA                     NA
## 176                     NA                     NA                     NA
##     ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1                       21                    9.9                   69.1
## 172                     NA                     NA                     NA
## 173                     NA                     NA                     NA
## 174                     NA                     NA                     NA
## 175                     NA                     NA                     NA
## 176                     NA                     NA                     NA
##     ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1                      156                     66                    2.4
## 172                    156                     NA                     NA
## 173                    158                     NA                     NA
## 174                    167                     NA                     NA
## 175                    156                     NA                     NA
## 176                    160                     NA                     NA
##     ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1                      486                  0.019                    0.5
## 172                     NA                     NA                    2.3
## 173                     NA                     NA                    1.0
## 174                     NA                     NA                    1.3
## 175                     NA                     NA                    2.3
## 176                     NA                     NA                    0.9
##     ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1                        3                    7.2                     NA
## 172                      0                    0.0                      0
## 173                      0                    0.0                      0
## 174                      0                    0.0                      0
## 175                      0                    0.0                      0
## 176                      0                    0.0                      0
##     ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1                       NA                   11.6                    3.0
## 172                      0                    0.0                    0.6
## 173                      0                    0.0                    0.6
## 174                      0                    0.0                    0.6
## 175                      0                    0.0                    0.5
## 176                      0                    0.0                    0.6
##     ManufacturingProcess44 ManufacturingProcess45
## 1                      1.8                    2.4
## 172                    0.0                    0.0
## 173                    0.0                    0.0
## 174                    0.0                    0.0
## 175                    0.0                    0.0
## 176                    0.0                    0.0

This is where the inspection of outlier zeroes blends into the treatment of NA’s. The first thing to keep in mind is that our task is to find Manufacturing Process changes that improve Yield. Since we aren’t tasked with making predictions on data outside of this dataset, I’m less inclined to throw away troublesome rows like these. The response for all these runs is somewhat normal, so let’s just impute medians for everything missing now. And also for MP25 on run #108.

prepro[108, c(37:39,41:43)] = NA   #MP's 25-27, 29-31
for (feat in names(prepro)) {
  med = median(prepro[,feat], na.rm = T)
  prepro[feat][is.na(prepro[feat])] = med
}

And now we should get a better look at the distributions.

par(mfrow = c(4,4))
for (n in names(prepro)) {
  hist(prepro[[n]], main = paste0(substr(n, 1, 1),
                               substr(n, nchar(n)-1, nchar(n))), xlab = '')
}

Those look more useful for a linear model. Still there are a couple of outlier 0 values, but who’s to say if those were intentional or not.

prepro[prepro$ManufacturingProcess01==0,]
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
##   BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 2                19.04                12.55                 3.46
## 3                19.04                12.55                 3.46
## 4                19.04                12.55                 3.46
##   BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 2               153.67                21.05                      0
## 3               153.67                21.05                      0
## 4               153.67                21.05                      0
##   ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 2                      0                   1.54                    917
## 3                      0                   1.54                    912
## 4                      0                   1.54                    911
##   ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 2                 1032.2                  210.0                    177
## 3                 1003.6                  207.1                    178
## 4                 1014.6                  213.3                    177
##   ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 2                    178                  46.57                    9.1
## 3                    178                  45.07                    9.1
## 4                    177                  44.92                    9.1
##   ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 2                    9.4                      0                   34.0
## 3                    9.4                      0                   34.8
## 4                    9.4                      0                   34.8
##   ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 2                   4869                   6095                   4617
## 3                   4878                   6087                   4617
## 4                   4897                   6102                   4635
##   ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 2                   34.0                   4867                   6097
## 3                   34.8                   4877                   6078
## 4                   34.8                   4872                   6073
##   ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 2                   4621                      0                      3
## 3                   4621                      0                      4
## 4                   4611                      0                      5
##   ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 2                      0                      3                   4869
## 3                      1                      4                   4897
## 4                      2                      5                   4892
##   ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 2                   6107                   4630                   11.2
## 3                   6116                   4637                   11.1
## 4                   6111                   4630                   11.1
##   ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 2                   21.4                    9.9                   68.7
## 3                   21.3                    9.4                   69.3
## 4                   21.3                    9.4                   69.3
##   ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 2                    169                     66                    2.6
## 3                    173                     66                    2.6
## 4                    171                     68                    2.5
##   ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 2                    508                  0.019                    2.0
## 3                    509                  0.018                    0.7
## 4                    496                  0.018                    1.2
##   ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 2                      2                    7.2                    0.1
## 3                      2                    7.2                    0.0
## 4                      2                    7.2                    0.0
##   ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 2                   0.15                   11.1                    0.9
## 3                   0.00                   12.0                    1.0
## 4                   0.00                   10.6                    1.1
##   ManufacturingProcess44 ManufacturingProcess45
## 2                    1.9                    2.2
## 3                    1.8                    2.3
## 4                    1.8                    2.1

This shows an interesting aspect of this data: The experimenters were using one Biological type here (all the Bio values are the same for the 3), and trying to see how tweaks to the Manufacturing features affected the Yield. I think if you tweaked any more outlier zeroes at this point, it might hide good information. So we should be able to see some clear correlations between Yield and its predictors now.

library(corrplot)
## corrplot 0.92 loaded
## 
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
## 
##     corrplot
corrplot(cor(prepro), type = "upper", diag = F, )

MP’s 13 (neg), 32 (pos), and 36 (neg), look to have the highest Yield correlations. We’ll come back to this correlation matrix later on.

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(624)
trains = sample(1:176, 136)  # randomly choose 136 and test with the other 40
tests = setdiff(1:176, trains)
trainDF = prepro[trains,]
testDF = prepro[tests,]
paste("Baseline of Test Set Variance: ", round(var(testDF$Yield),3))
## [1] "Baseline of Test Set Variance:  2.826"
testVars = c()
for (seed in 625:634) {  #Bootstrap 10 sample variances
  set.seed(seed)
  testVars = c(testVars, var(sample(trainDF$Yield, dim(testDF)[1])))
}
  paste("Training set Sample Variance, just to see how optimistic validation scores are, compared to test ones: ",
      round(mean(testVars), 3))
## [1] "Training set Sample Variance, just to see how optimistic validation scores are, compared to test ones:  3.752"

It does look like predictions for the test split should do pretty well as far as RMSE, compared to the training set, based on the response variances. But the \(R^2\) will be harder to match, since there is less variance to explain in the test set, so that whatever patterns the model predicts on will probably be overfit as far as the test responses go.

First, min-max scale the values. Use the ranges of the training set, so as not to unrealistically steal info from the test set.

for (feat in names(trainDF)[2:dim(trainDF)[2]]){
  hi = max(trainDF[,feat])
  lo = min(trainDF[,feat])
  trainDF[,feat] = (trainDF[,feat] - lo) / (hi - lo)
  testDF[,feat] = (testDF[,feat] - lo) / (hi - lo)
}

Although a PLS model isn’t the best for interpreting features that are important for increasing Yield, it wouldn’t hurt to fit a PLS model anyways, just to have a sort of comparison baseline. And after tuning for the number of components, we can run a model using a method that allows the VIP heuristic to estimate the features that were most important to the chosen number of components.

set.seed(624)
ctrl = trainControl(method = "cv", number = 8)
plsTune = train(Yield ~., data = trainDF, method = "pls", tuneLength = 12, trControl = ctrl)
plot(plsTune, main = "8-fold Cross-Validated RMSE for Training")

I chose 8-fold CV because 8 divides evenly into the training set size of 136, and because this plot was typical of all CV fold sizes I tried: It improves quickly up to 3 or 4 components, and then levels off. It’s not clear whether anything past 3 components would help on unseen data like test data, even though the RMSE continues to shrink all the way up to 5 components. I’ll go with 3.

noquote(paste("CV training R-sq for 3 components: ", round(plsTune$results$Rsquared[3], 3)))
## [1] CV training R-sq for 3 components:  0.606

And now that baseline PLS model on the test set:

PLSpreds = predict(plsTune, testDF, ncomp = 3)
plsResults = data.frame(obs = testDF$Yield, pred = PLSpreds)
defaultSummary(plsResults)
##     RMSE Rsquared      MAE 
## 1.272187 0.465287 1.007249

Quite a dropoff in \(R^2\), as feared.

Run a separate model with method = "oscorespls" to find important features, now that a good number of components is known.

Here are the top 15 most important features in that 3-component PLS model:

library(chillR)
## Warning: package 'chillR' was built under R version 4.0.5
## 
## Attaching package: 'chillR'
## The following object is masked from 'package:pls':
## 
##     RMSEP
plsFit = plsr(Yield ~., data = trainDF, method = "oscorespls", ncomp = 3, trControl = ctrl)
vip = VIP(plsFit)
mostImp = vip[,order(-colSums(vip))[1:15]]
colMeans(mostImp)
## ManufacturingProcess28 ManufacturingProcess12 ManufacturingProcess32 
##               2.383449               2.311615               1.879525 
##   BiologicalMaterial02 ManufacturingProcess13   BiologicalMaterial06 
##               1.733612               1.597900               1.489294 
## ManufacturingProcess31 ManufacturingProcess36 ManufacturingProcess09 
##               1.488125               1.449520               1.428692 
## ManufacturingProcess02 ManufacturingProcess17   BiologicalMaterial03 
##               1.409458               1.397304               1.350621 
##   BiologicalMaterial08 ManufacturingProcess33   BiologicalMaterial11 
##               1.348792               1.286223               1.214325

MP’s #28 and #12 were most important to this model’s 3 components, despite not having huge direct correlations with the Yield response. #32 was 3rd most important, and did have high correlation with the response directly.

Try to use ElasticNet to remove insignificant features.

set.seed(624)
lassoGrid = expand.grid(.lambda = c(0, 0.033, 0.1, 0.33), .fraction = seq(.1, 1, length = 10))
lassoTune = train(Yield ~., data = trainDF, method = 'enet',
                  tuneGrid = lassoGrid, trControl = ctrl)
lassoTune
## Elasticnet 
## 
## 136 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 119, 118, 119, 119, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.000   0.1       1.116059  0.6756750  0.8903974
##   0.000   0.2       1.121678  0.6733601  0.9206355
##   0.000   0.3       1.170147  0.6499737  0.9705279
##   0.000   0.4       1.251857  0.6019600  1.0208761
##   0.000   0.5       1.314627  0.5697722  1.0535342
##   0.000   0.6       1.367438  0.5458011  1.0903394
##   0.000   0.7       1.416817  0.5236316  1.1259032
##   0.000   0.8       1.454659  0.5057967  1.1518887
##   0.000   0.9       1.487780  0.4899723  1.1769317
##   0.000   1.0       1.520760  0.4759886  1.2024101
##   0.033   0.1       1.396945  0.6346758  1.1360204
##   0.033   0.2       1.148951  0.6590703  0.9083812
##   0.033   0.3       1.107264  0.6780254  0.8879987
##   0.033   0.4       1.108920  0.6771411  0.9028130
##   0.033   0.5       1.127816  0.6674918  0.9320018
##   0.033   0.6       1.146833  0.6604335  0.9559716
##   0.033   0.7       1.162844  0.6535368  0.9729133
##   0.033   0.8       1.176343  0.6467818  0.9830184
##   0.033   0.9       1.193255  0.6379106  0.9943950
##   0.033   1.0       1.215179  0.6265708  1.0099776
##   0.100   0.1       1.502108  0.6170029  1.2220311
##   0.100   0.2       1.233344  0.6494447  0.9942807
##   0.100   0.3       1.127471  0.6625898  0.8939303
##   0.100   0.4       1.116866  0.6696148  0.9021902
##   0.100   0.5       1.114611  0.6732466  0.9117684
##   0.100   0.6       1.123233  0.6706142  0.9274933
##   0.100   0.7       1.138240  0.6662002  0.9436407
##   0.100   0.8       1.158103  0.6599212  0.9662322
##   0.100   0.9       1.170746  0.6556813  0.9800475
##   0.100   1.0       1.182938  0.6509149  0.9921183
##   0.330   0.1       1.541399  0.6137824  1.2551412
##   0.330   0.2       1.292069  0.6384255  1.0484441
##   0.330   0.3       1.147888  0.6525697  0.9122188
##   0.330   0.4       1.146815  0.6516299  0.9224663
##   0.330   0.5       1.162780  0.6557183  0.9489530
##   0.330   0.6       1.171668  0.6574460  0.9611523
##   0.330   0.7       1.182928  0.6587762  0.9759807
##   0.330   0.8       1.196021  0.6600107  0.9864035
##   0.330   0.9       1.210167  0.6604534  0.9966103
##   0.330   1.0       1.226355  0.6594973  1.0113366
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.3 and lambda = 0.033.

Use those optimal hyperparameters to make predictions and see which features were important.

enetMod = enet(as.matrix(trainDF[,2:57]), trainDF$Yield[], lambda = 0.033, normalize = T)
enetPreds = predict(enetMod, as.matrix(trainDF[,2:57]), s = 0.3, mode = "fraction", type = "fit")
enetFeats = predict(enetMod, as.matrix(trainDF[,2:57]), s = 0.3, mode = "fraction", type = "coefficients")
enetFeats$coefficients[order(-abs(enetFeats$coefficients))]
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess17 
##             4.44853042             3.00693486            -1.74973075 
## ManufacturingProcess31 ManufacturingProcess13   BiologicalMaterial03 
##            -1.19568214            -1.02205261             0.94146453 
## ManufacturingProcess36 ManufacturingProcess34 ManufacturingProcess06 
##            -0.72108365             0.60348833             0.52055986 
## ManufacturingProcess37 ManufacturingProcess04 ManufacturingProcess45 
##            -0.47013775             0.34857981             0.23319643 
##   BiologicalMaterial05 ManufacturingProcess39 ManufacturingProcess28 
##             0.19445567             0.17069532            -0.14372558 
## ManufacturingProcess07 ManufacturingProcess40 ManufacturingProcess42 
##            -0.08988187            -0.07799239             0.07004882 
## ManufacturingProcess29 ManufacturingProcess43 ManufacturingProcess44 
##             0.04173388             0.03091762             0.02493506 
## ManufacturingProcess15   BiologicalMaterial01   BiologicalMaterial02 
##             0.02090755             0.00000000             0.00000000 
##   BiologicalMaterial04   BiologicalMaterial06   BiologicalMaterial08 
##             0.00000000             0.00000000             0.00000000 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##             0.00000000             0.00000000             0.00000000 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess03 ManufacturingProcess05 ManufacturingProcess08 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess14 ManufacturingProcess16 ManufacturingProcess18 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess30 ManufacturingProcess33 ManufacturingProcess35 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess38 ManufacturingProcess41 
##             0.00000000             0.00000000

Processes 32 and 09 were the most important ones here. Processes 28 and 12, which were important in the PLS components, didn’t matter much here. #12 was actually pushed to zero by the lasso penalty, as were over half of the features.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The elastic net predictions turn out to fare better than the PLS model (.514 \(R^2\) vs .465), but still not great –>

enetTest = predict(enetMod, as.matrix(testDF[,2:57]), s = 0.3, mode = "fraction", type = "fit")
enetResults = data.frame(obs = testDF$Yield, pred = enetTest$fit)
defaultSummary(enetResults)
##      RMSE  Rsquared       MAE 
## 1.2053919 0.5138496 0.9886975

Still searching for a better model, I’ll fit a linear model to all features, and back out the ones that aren’t contributing to the fit, one or two at a time. This is an imperfect way to reduce features, in the sense that it’s a (non-optimal) greedy algorithm, removing the least significant feature/s at each step. But although not necessarily optimal, it does produce decent results in general. Here’s what was left for features:

linmod = lm(Yield ~ . - ManufacturingProcess20 - ManufacturingProcess21 - 
              ManufacturingProcess13 - ManufacturingProcess15 - ManufacturingProcess06 -
              ManufacturingProcess24 - ManufacturingProcess30 - ManufacturingProcess35 -
              ManufacturingProcess36 - ManufacturingProcess05 - ManufacturingProcess38 -
              ManufacturingProcess34 - ManufacturingProcess26 - ManufacturingProcess19 -
              ManufacturingProcess12 - ManufacturingProcess10 - ManufacturingProcess02 -
              ManufacturingProcess42 - ManufacturingProcess18 - ManufacturingProcess03 -
              ManufacturingProcess44 - ManufacturingProcess16 - ManufacturingProcess14 -
              ManufacturingProcess08 - ManufacturingProcess07 - ManufacturingProcess22 -
              ManufacturingProcess01 - ManufacturingProcess45 - ManufacturingProcess11 -
              ManufacturingProcess25 - ManufacturingProcess31 - ManufacturingProcess41 -
              ManufacturingProcess40 - ManufacturingProcess23 - ManufacturingProcess39 -
              BiologicalMaterial06 - BiologicalMaterial01 - BiologicalMaterial12 -
              BiologicalMaterial10 - BiologicalMaterial09 - BiologicalMaterial08 -
              BiologicalMaterial04 - BiologicalMaterial11, trainDF)
summary(linmod)
## 
## Call:
## lm(formula = Yield ~ . - ManufacturingProcess20 - ManufacturingProcess21 - 
##     ManufacturingProcess13 - ManufacturingProcess15 - ManufacturingProcess06 - 
##     ManufacturingProcess24 - ManufacturingProcess30 - ManufacturingProcess35 - 
##     ManufacturingProcess36 - ManufacturingProcess05 - ManufacturingProcess38 - 
##     ManufacturingProcess34 - ManufacturingProcess26 - ManufacturingProcess19 - 
##     ManufacturingProcess12 - ManufacturingProcess10 - ManufacturingProcess02 - 
##     ManufacturingProcess42 - ManufacturingProcess18 - ManufacturingProcess03 - 
##     ManufacturingProcess44 - ManufacturingProcess16 - ManufacturingProcess14 - 
##     ManufacturingProcess08 - ManufacturingProcess07 - ManufacturingProcess22 - 
##     ManufacturingProcess01 - ManufacturingProcess45 - ManufacturingProcess11 - 
##     ManufacturingProcess25 - ManufacturingProcess31 - ManufacturingProcess41 - 
##     ManufacturingProcess40 - ManufacturingProcess23 - ManufacturingProcess39 - 
##     BiologicalMaterial06 - BiologicalMaterial01 - BiologicalMaterial12 - 
##     BiologicalMaterial10 - BiologicalMaterial09 - BiologicalMaterial08 - 
##     BiologicalMaterial04 - BiologicalMaterial11, data = trainDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2162 -0.5475 -0.0145  0.5594  2.5787 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             36.0184     1.4994  24.022  < 2e-16 ***
## BiologicalMaterial02    -3.6142     1.3400  -2.697 0.007981 ** 
## BiologicalMaterial03     3.6472     1.2506   2.916 0.004216 ** 
## BiologicalMaterial05     1.8028     0.7429   2.427 0.016703 *  
## ManufacturingProcess04   2.5564     0.6248   4.092 7.72e-05 ***
## ManufacturingProcess09   2.4006     1.1894   2.018 0.045749 *  
## ManufacturingProcess17  -3.0712     1.0573  -2.905 0.004363 ** 
## ManufacturingProcess27  -4.9842     1.3366  -3.729 0.000293 ***
## ManufacturingProcess28  -0.5086     0.2677  -1.900 0.059794 .  
## ManufacturingProcess29   7.2043     1.3369   5.389 3.51e-07 ***
## ManufacturingProcess32   8.0879     1.0349   7.815 2.17e-12 ***
## ManufacturingProcess33  -4.2092     1.0576  -3.980 0.000118 ***
## ManufacturingProcess37  -1.0896     0.4513  -2.415 0.017241 *  
## ManufacturingProcess43   2.0628     1.0273   2.008 0.046843 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9595 on 122 degrees of freedom
## Multiple R-squared:  0.7688, Adjusted R-squared:  0.7442 
## F-statistic: 31.21 on 13 and 122 DF,  p-value: < 2.2e-16

The Adjusted \(R^2\) is still a bit lower than the non-adjusted, but they’re both better than the other models. Let’s see how things translate to the test set.

linmodpred = predict(linmod, testDF)
lmDF = data.frame(obs = testDF$Yield, pred = linmodpred)
defaultSummary(lmDF)
##      RMSE  Rsquared       MAE 
## 1.0949590 0.5955950 0.8548047

Even after pulling out all features that weren’t statistically significant towards predicting the response, the model proves to be overfit. That is, the adjusted \(R^2\) in training drops off the table in testing, from .741 to an \(R^2\) of .596 for the testDF. But we were already ready for that, after seeing how different the test and training splits were, in terms of variance. And again, these are small splits, so it’s going to be unlikely that a model fit to one split predicts the other one well. Think of dealing a bridge hand to 4 players, and modeling outcomes on three of the hands, and then expecting the predictions on the fourth hand to be as accurate as the training fit. If the deck were 520 cards instead of 52, you’d have better predictions.

At any rate, the best I could do was this last model, OLS without penalties, just hand-reduced features.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The most important manufacturing predictor was #32, which agrees with the correlation matrix above. #29 is also important, moreso than in the correlation matrix. It was correlated with a lot of the biological features that were positively correlated with Yield, and when those were dropped from the model, their positive coefficients seem to have been given to #29 in their place. #36 was expected to have a large negative correlation with the Yield, yet it doesn’t factor into the linear model. The likely reason, if you glance at the correlation matrix, is that #36 has a huge negative correlation with #32, such that only one of them is needed in the linear model.

The Biological features in general don’t seem as important, but when backing out features one by one, the Bio feats stayed in till the end, when they all came out in a large batch. So they are only less important to the model because it can give their importance to correlated Manufacturing features which remain in the model. I’d think it would be worth the experimenters’ time to try to understand why a handful of manufacturing processes were so highly correlated with biological features. Those processes end up being the most important ones in all models, but unless they get decoupled from the biological properties, they don’t really provide as much meaning to the question we’re trying to answer here.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

At first it seems like increasing #32 and #29 would be the best way to increase Yield. They have the highest positive coefficients. But if you look at the correlation matrix, they also are highly correlated with the Bio features, which are all positively correlated with Yield. In effect, those two processes were highest when the Bio levels were highest, so of course the Yield was higher in those runs. And the model assigns higher coefficients to 29 and 32 when it drops out all the Bio features. If you look back at the correlation matrix, you see that it’s actually Process #09 that has a high correlation with the Yield without being correlated much with the Bio feats. #13 is similarly negative where #09 is positive, again without much correlation with the Bio feats. And those were 2 of the most important 5 features in the ElasticNet model. But 13 and 09 are very highly negatively correlated, so that 13 ended up being insignificant in the model (it was important in the PLS model components earlier, where 09 happened to not be). The point of all that is that either increasing #09 or decreasing #13 may have a stronger effect on increasing Yield than raising #29 and #32 does.