library(tidyverse)
library(elasticnet)
library(caret)
library(lars)
library(pls)
library(stats)
library(corrplot)
library(MASS)
library(skimr)
library(DataExplorer)
library(cowplot)

Introduction

In this document we work through exercises 6.2 and 6.3 from Applied Predictive Modeling by Kuhn and Johnson.

Exercises

6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

6.2.a

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)
glimpse(fingerprints)
##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

6.2.b

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

preProcess(fingerprints, method = c("nzv")) |>
  predict(fingerprints) -> fp2

glimpse(fp2)
##  num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...

After filtering out predictors with near-zero variance we have gone from 1,107 predictors down to 388, a significant reduction in dimensionality.
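
The exercise points to nearZeroVar directly; as a sketch, the same filtering can be done by dropping the flagged column indices, which should leave the same 388 predictors:

# nearZeroVar returns the column positions of the near-zero-variance
# predictors; dropping them by index is equivalent to the preProcess approach.
nzv_idx <- nearZeroVar(fingerprints)
fp_direct <- fingerprints[, -nzv_idx]
ncol(fp_direct)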

6.2.c

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

First we combine the permeability vector with our filtered predictors, then pre-process the data with a Box-Cox transformation, centering, scaling, and KNN imputation of any missing values.
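
For reference, the Box-Cox step estimates a power lambda for each predictor; note that caret only applies it to strictly positive variables, so the binary fingerprint columns are left untouched:

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \log x & \lambda = 0 \end{cases}$$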

cbind(fp2, permeability) -> perm0

preProcess(perm0, method = c("BoxCox", "center", "scale", "knnImpute")) |>
  predict(perm0) -> perm

glimpse(perm)
##  num [1:165, 1:389] -0.545 -0.545 -0.545 -0.545 -0.545 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:389] "X1" "X2" "X3" "X4" ...

Next we split our perm matrix into a training set containing seventy-five percent of the data and a test set containing the remaining twenty-five percent.

part <- createDataPartition(perm[, "permeability"], p = 0.75, list = FALSE)

perm_train <- perm[part,]
perm_test <- perm[-part,]

glimpse(perm_train)
##  num [1:102, 1:389] -0.545 -0.545 -0.545 -0.545 -0.545 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:102] "1" "2" "3" "4" ...
##   ..$ : chr [1:389] "X1" "X2" "X3" "X4" ...

Finally, we fit a partial least squares model to our pre-processed data with leave-one-out cross-validation (our sample size is too small for other types of cross-validation to give stable estimates) and get an RMSE of 0.853 at the optimal latent variable count of 3; the R^2 estimate at this point is 0.236.

ctrl <- trainControl(method = "LOOCV")

plsMod <- train(permeability ~ .,
  data = perm_train,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl)

plsMod
## Partial Least Squares 
## 
## 102 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.9010919  0.1217495  0.6939066
##    2     0.8590802  0.2075768  0.6247377
##    3     0.8526899  0.2358577  0.6260563
##    4     0.8725298  0.2266161  0.6694598
##    5     0.8930519  0.2228040  0.6851284
##    6     0.8979930  0.2278254  0.6678485
##    7     0.8934648  0.2386564  0.6370833
##    8     0.8841138  0.2629411  0.6277170
##    9     0.8811162  0.2830333  0.6241630
##   10     0.8688959  0.3031262  0.6167156
##   11     0.8606853  0.3180188  0.6141173
##   12     0.8603868  0.3275751  0.6069377
##   13     0.8654464  0.3331123  0.6028758
##   14     0.8707803  0.3421078  0.6073016
##   15     0.8970733  0.3294813  0.6199620
##   16     0.9049859  0.3302663  0.6294052
##   17     0.9172635  0.3272540  0.6346216
##   18     0.9388844  0.3185650  0.6444598
##   19     0.9453448  0.3182420  0.6571713
##   20     0.9561492  0.3107406  0.6703285
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
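
The chosen tuning value and its resampled metrics can be read straight off the train object (a small sketch):

plsMod$bestTune
subset(plsMod$results, ncomp == plsMod$bestTune$ncomp)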

6.2.d

Predict the response for the test set. What is the test set estimate of R2?

Our test set estimate of R^2 improves to 0.56, which indicates that we are not overfitting; our RMSE also decreases, to 0.72.

predict(plsMod, perm_test) %>% 
  data.frame(pred = .,obs = perm_test[,"permeability"]) %>% 
  defaultSummary()
##      RMSE  Rsquared       MAE 
## 0.7208822 0.5640783 0.5972188

6.2.e

Try building other models discussed in this chapter. Do any have better predictive performance?

We know that ordinary least squares linear regression typically fails when the data has more predictors than samples, because the design matrix is rank-deficient and the normal equations have no unique solution. However, we will attempt it anyway just to see what happens. If we did not suppress warnings, we would see a multitude of them about predictions from a rank-deficient fit.

lmMod <- train(permeability ~ .,
  data = perm_train,
  method = "lm",
  trControl = ctrl)

lmMod
## Linear Regression 
## 
## 102 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   2.511697  0.05111571  1.521113
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
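
We can see the rank deficiency directly in the fitted model: with more predictors than samples, lm leaves the coefficients it cannot estimate as NA. A small sketch, using the lm object that caret stores in finalModel:

# Count the coefficients lm could not estimate; with p > n many come back NA.
sum(is.na(coef(lmMod$finalModel)))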

Interestingly, our R^2 improves slightly from the resampled estimate to the test set, but the RMSE gets worse, and both remain far behind our PLS model.

predict(lmMod, perm_test) %>% 
  data.frame(pred = .,obs = perm_test[,"permeability"]) %>% 
  defaultSummary()
##       RMSE   Rsquared        MAE 
## 3.17606062 0.09142992 2.08020438

Robust linear regression handles these situations somewhat better than OLS regression, but we still get warnings that convergence did not occur. This is likely because, even with PCA pre-processing, we retain a large number of components relative to the number of samples.

rlmMod <- train(permeability ~ .,
  data = perm_train,
  method = "rlm",
  trControl = ctrl,
  preProcess = "pca")

rlmMod
## Robust Linear Model 
## 
## 102 samples
## 388 predictors
## 
## Pre-processing: principal component signal extraction (388), centered
##  (388), scaled (388) 
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE       Rsquared   MAE      
##   FALSE      psi.huber     0.8559909  0.2410611  0.6418483
##   FALSE      psi.hampel    0.8473473  0.2532282  0.6351584
##   FALSE      psi.bisquare  0.8482686  0.2540469  0.6304235
##    TRUE      psi.huber     0.8622666  0.2330546  0.6304642
##    TRUE      psi.hampel    0.8506020  0.2527969  0.6255035
##    TRUE      psi.bisquare  0.8716683  0.2340878  0.6357662
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = FALSE and psi = psi.hampel.
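
As a sketch (assuming caret stores the pre-processing recipe on the train object, which it does under preProcess), we can check how many principal components the "pca" step kept; by default preProcess retains enough components to capture 95% of the variance:

rlmMod$preProcess$numComp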

Once more, the test-set RMSE and R^2 come in worse than our PLS model's.

predict(rlmMod, perm_test) %>% 
  data.frame(pred = .,obs = perm_test[,"permeability"]) %>% 
  defaultSummary()
##      RMSE  Rsquared       MAE 
## 0.8353002 0.4028382 0.6644438

Next we move on to ridge regression, which imposes a penalty on the coefficients controlled by the lambda values we train over. Unfortunately, we have to switch to bootstrap resampling: k-fold cross-validation would leave too few samples per fold, and LOOCV takes too long with the ridge regression computation. Because we are bootstrap sampling, we also get warnings about columns without variance, since a bootstrap sample can happen to draw only rows on which a sparse column is constant.
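
For reference, ridge regression chooses its coefficients by minimizing the penalized least-squares objective, with lambda controlling how strongly the coefficients are shrunk toward zero:

$$\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$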

ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(123)
ctrl2 <- trainControl(method = "boot", number = 10)

ridgeMod <- train(permeability ~ .,
  data = perm_train,
  method = "ridge",
  tuneGrid = ridgeGrid,
  trControl = ctrl2)

ridgeMod
## Ridge Regression 
## 
## 102 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (10 reps) 
## Summary of sample sizes: 102, 102, 102, 102, 102, 102, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE          Rsquared   MAE         
##   0.000000000  2.338772e+15  0.2129040  7.867978e+14
##   0.007142857  2.613045e+01  0.1745175  1.763005e+01
##   0.014285714  2.392851e+01  0.2655391  1.692635e+01
##   0.021428571  1.015868e+00  0.2962509  7.690285e-01
##   0.028571429  9.809925e-01  0.3044725  7.409401e-01
##   0.035714286  9.985125e-01  0.3038649  7.396621e-01
##   0.042857143  9.495400e-01  0.3160033  7.146635e-01
##   0.050000000  9.464283e-01  0.3185995  7.118317e-01
##   0.057142857  9.478012e-01  0.3190852  7.074453e-01
##   0.064285714  9.497998e-01  0.3173699  7.021505e-01
##   0.071428571  9.850421e-01  0.3046736  7.203057e-01
##   0.078571429  1.374722e+01  0.2464708  6.288617e+00
##   0.085714286  9.691437e-01  0.3100106  7.213918e-01
##   0.092857143  9.681900e-01  0.3102202  7.138643e-01
##   0.100000000  9.680734e-01  0.3089661  7.033478e-01
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.05.

As with the robust model, the test-set RMSE and R^2 fall short of our PLS model.

predict(ridgeMod, perm_test) %>% 
  data.frame(pred = .,obs = perm_test[,"permeability"]) %>% 
  defaultSummary()
##      RMSE  Rsquared       MAE 
## 0.8759110 0.5407593 0.6901475

The final model we fit is an elastic net, which also imposes a lambda penalty on the coefficients but adds a fraction parameter from lasso regression. The lasso penalty can shrink coefficients exactly to zero, removing variables entirely, whereas ridge regression only shrinks them toward zero.
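
As a sketch of that variable-removal behavior, we can fit the elastic net directly with enet() from the elasticnet package (loaded at the top) at the tuning values train ends up selecting below; the names x, y, and enetFit are ours:

x <- perm_train[, colnames(perm_train) != "permeability"]
y <- perm_train[, "permeability"]
enetFit <- enet(x, y, lambda = 0.01)
# Coefficients at fraction 0.05 of the full solution path; most are exactly zero.
beta <- predict(enetFit, s = 0.05, mode = "fraction", type = "coefficients")$coefficients
sum(beta != 0)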

Against the training resamples the selected elastic net (RMSE 0.918, R^2 0.242) does not actually beat PLS. However, it is the test set evaluation that matters.

enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

enetMod <- train(permeability ~ .,
  data = perm_train,
  method = "enet",
  tuneGrid = enetGrid,
  trControl = ctrl2)

enetMod
## Elasticnet 
## 
## 102 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (10 reps) 
## Summary of sample sizes: 102, 102, 102, 102, 102, 102, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE          Rsquared    MAE         
##   0.00    0.05      1.060145e+14  0.18845346  6.597065e+13
##   0.00    0.10      2.129071e+14  0.19625912  1.325340e+14
##   0.00    0.15      3.198032e+14  0.18995562  1.990973e+14
##   0.00    0.20      4.267002e+14  0.18040069  2.656606e+14
##   0.00    0.25      5.335975e+14  0.17997353  3.322239e+14
##   0.00    0.30      6.404951e+14  0.18685581  3.987872e+14
##   0.00    0.35      7.473927e+14  0.19085562  4.653505e+14
##   0.00    0.40      8.542904e+14  0.19369172  5.319138e+14
##   0.00    0.45      9.611881e+14  0.19125990  5.984771e+14
##   0.00    0.50      1.068086e+15  0.18173668  6.650404e+14
##   0.00    0.55      1.174984e+15  0.16727275  7.316037e+14
##   0.00    0.60      1.281881e+15  0.15989829  7.981671e+14
##   0.00    0.65      1.388779e+15  0.14826789  8.647304e+14
##   0.00    0.70      1.495677e+15  0.13620642  9.312937e+14
##   0.00    0.75      1.602575e+15  0.12314743  9.978570e+14
##   0.00    0.80      1.709473e+15  0.11369709  1.064420e+15
##   0.00    0.85      1.816371e+15  0.10692291  1.130984e+15
##   0.00    0.90      1.923268e+15  0.10240820  1.197547e+15
##   0.00    0.95      2.030166e+15  0.09620196  1.264110e+15
##   0.00    1.00      2.137064e+15  0.09137284  1.330674e+15
##   0.01    0.05      9.175360e-01  0.24222323  6.935241e-01
##   0.01    0.10      9.591997e-01  0.23641083  6.953621e-01
##   0.01    0.15      1.034065e+00  0.22512951  7.319238e-01
##   0.01    0.20      1.110411e+00  0.21650154  7.799446e-01
##   0.01    0.25      1.178201e+00  0.21122211  8.281039e-01
##   0.01    0.30      1.232041e+00  0.21262792  8.701561e-01
##   0.01    0.35      1.266409e+00  0.21749900  8.995095e-01
##   0.01    0.40      1.298237e+00  0.21991166  9.260274e-01
##   0.01    0.45      1.337567e+00  0.21581253  9.603022e-01
##   0.01    0.50      1.381381e+00  0.21137879  9.938310e-01
##   0.01    0.55      1.422838e+00  0.20854893  1.027205e+00
##   0.01    0.60      1.456440e+00  0.20841888  1.054094e+00
##   0.01    0.65      1.486400e+00  0.20969902  1.077776e+00
##   0.01    0.70      1.520476e+00  0.20754871  1.102941e+00
##   0.01    0.75      1.553639e+00  0.20610418  1.126797e+00
##   0.01    0.80      1.585162e+00  0.20707134  1.147158e+00
##   0.01    0.85      1.621052e+00  0.20668720  1.169150e+00
##   0.01    0.90      1.655344e+00  0.20558711  1.191131e+00
##   0.01    0.95      1.689932e+00  0.20357306  1.214191e+00
##   0.01    1.00      1.850922e+00  0.19584410  1.307557e+00
##   0.10    0.05      9.390592e-01  0.25394962  7.135089e-01
##   0.10    0.10      9.285588e-01  0.25354829  6.890072e-01
##   0.10    0.15      9.717944e-01  0.24480147  7.128103e-01
##   0.10    0.20      1.028864e+00  0.23674815  7.481613e-01
##   0.10    0.25      1.090746e+00  0.23110519  7.875675e-01
##   0.10    0.30      1.156757e+00  0.22485969  8.320143e-01
##   0.10    0.35      1.213240e+00  0.22004034  8.684658e-01
##   0.10    0.40      1.265975e+00  0.21877074  9.036774e-01
##   0.10    0.45      1.320024e+00  0.21649521  9.384309e-01
##   0.10    0.50      1.366260e+00  0.21509150  9.673767e-01
##   0.10    0.55      1.407809e+00  0.21608438  9.943209e-01
##   0.10    0.60      1.449386e+00  0.21635825  1.021547e+00
##   0.10    0.65      1.489571e+00  0.21694761  1.047668e+00
##   0.10    0.70      1.530593e+00  0.21666726  1.074705e+00
##   0.10    0.75      1.569451e+00  0.21703201  1.099568e+00
##   0.10    0.80      1.605907e+00  0.21875068  1.122804e+00
##   0.10    0.85      1.643328e+00  0.21977008  1.147523e+00
##   0.10    0.90      1.679449e+00  0.22166528  1.171132e+00
##   0.10    0.95      1.715297e+00  0.22371092  1.193962e+00
##   0.10    1.00      1.820179e+00  0.22655763  1.245938e+00
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.01.

On the test set the elastic net ends up being our best model, with the lowest RMSE (0.646) and the highest R^2 (0.660) of any model we fit.

predict(enetMod, perm_test) %>% 
  data.frame(pred = .,obs = perm_test[,"permeability"]) %>% 
  defaultSummary()
##      RMSE  Rsquared       MAE 
## 0.6459984 0.6598375 0.5390329
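
To put all five models side by side, a quick sketch that gathers each model's test-set metrics into a single table:

models <- list(pls = plsMod, lm = lmMod, rlm = rlmMod, ridge = ridgeMod, enet = enetMod)
sapply(models, function(m)
  defaultSummary(data.frame(pred = predict(m, perm_test),
                            obs = perm_test[, "permeability"])))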

6.2.f

Would you recommend any of your models to replace the permeability laboratory experiment?

I would not recommend any of the models trained here to replace the laboratory experiment for permeability. Even the best model only achieved an R^2 of 0.66, meaning roughly a third of the variability is still unaccounted for. When dealing with medical use cases, like drug development, we need to be as close to certain as possible in our conclusions, and our R^2 does not inspire much confidence in the permeability predictions. Even an R^2 of 0.99 would not be enough: predictive models can never replace experiments, because they cannot establish the causality that experiments can.

What could work is using a model to flag the compounds likely to have high permeability and then performing the laboratory test only on those promising candidates, saving time and money while respecting the limitations of predictive models, especially comparatively weak ones like ours.

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

6.3.a

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176  58

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

6.3.b

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

We use k-nearest-neighbors imputation within the caret preProcess function, which fills each missing value from the five nearest neighbors (the default k) of the row containing it.
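
Before imputing, it is worth a quick check (a sketch) of how much is actually missing:

# Total missing cells, and the predictors with the most missing values.
sum(is.na(ChemicalManufacturingProcess))
head(sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE))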

preProcess(ChemicalManufacturingProcess, method = c("knnImpute")) |>
  predict(ChemicalManufacturingProcess) -> cmp0

colSums(is.na(cmp0))
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      0                      0 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                      0                      0                      0 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      0                      0                      0 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      0                      0 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      0                      0                      0 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      0                      0 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      0                      0                      0 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      0                      0                      0 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      0                      0                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      0                      0                      0 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      0                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      0                      0 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0

6.3.c

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

We pre-process the data with a Box-Cox transformation, centering, and scaling via the caret preProcess function.

preProcess(cmp0, method = c("BoxCox", "center", "scale")) |>
  predict(cmp0) -> cmp

head(cmp)
##        Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 -1.1792673           -0.2261036           -1.5140979          -2.68303622
## 2  1.2263678            2.2391498            1.3089960          -0.05623504
## 3  1.0042258            2.2391498            1.3089960          -0.05623504
## 4  0.6737219            2.2391498            1.3089960          -0.05623504
## 5  1.2534583            1.4827653            1.8939391           1.13594780
## 6  1.8386128           -0.4081962            0.6620886          -0.59859075
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1            0.2201765            0.4941942           -1.3828880
## 2            1.2964386            0.4128555            1.1290767
## 3            1.2964386            0.4128555            1.1290767
## 4            1.2964386            0.4128555            1.1290767
## 5            0.9414412           -0.3734185            1.5348350
## 6            1.5894524            1.7305423            0.6192092
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1           -0.1313107            -1.233131           -3.3962895
## 2           -0.1313107             2.282619           -0.7227225
## 3           -0.1313107             2.282619           -0.7227225
## 4           -0.1313107             2.282619           -0.7227225
## 5           -0.1313107             1.071310           -0.1205678
## 6           -0.1313107             1.189487           -1.7343424
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1            1.1005296            -1.838655           -1.7709224
## 2            1.1005296             1.393395            1.0989855
## 3            1.1005296             1.393395            1.0989855
## 4            1.1005296             1.393395            1.0989855
## 5            0.4162193             0.136256            1.0989855
## 6            1.6346255             1.022062            0.7240877
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1              0.2147727              0.5601078             0.34571766
## 2             -6.1677588             -1.9906382             0.16161385
## 3             -6.1677588             -1.9906382             0.06956195
## 4             -6.1677588             -1.9906382             0.43776956
## 5             -0.2804237             -1.9906382             0.06956195
## 6              0.4348600             -1.9906382             0.52982146
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1              0.5634453             -0.4444208             -0.5369781
## 2             -2.3748177              1.0041713              0.9728317
## 3             -3.1732587              0.0651474             -0.1056039
## 4             -3.3329470              0.4263104              2.2000169
## 5             -2.2151295              0.8498562             -0.6262279
## 6             -1.2570002              0.4985430              0.5637699
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1             -0.1592067             -0.3085582             -1.7201524
## 2             -0.9597892              0.8982473              0.5883746
## 3              1.0416669              0.8982473             -0.3815947
## 4             -0.9597892             -1.1130952             -0.4785917
## 5              1.0416669              0.8982473             -0.4527258
## 6              1.0416669              0.8982473             -0.2199332
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1            -0.09987539            -0.12370689             -0.4790178
## 2             0.50808790             1.05469693             -0.4790178
## 3             0.29662241             0.52160949             -0.4790178
## 4            -0.04700902             0.77412459             -0.4790178
## 5            -0.41707364             0.07269374             -0.4790178
## 6             0.27018922             1.39138374             -0.4790178
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1             0.97711512              0.8146112              1.1846438
## 2            -0.50030980              0.2819864              0.9617071
## 3             0.28765016              0.4472838              0.8245152
## 4             0.28765016              0.7962448              1.0817499
## 5             0.09066017              2.5410501              3.3282665
## 6            -0.50030980              2.4124855              3.1396277
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1              0.3303945              0.9263296              0.1505348
## 2              0.1455765             -0.2753953              0.1559773
## 3              0.1455765              0.3655246              0.1831898
## 4              0.1967569              0.3655246              0.1695836
## 5              0.4754056             -0.3555103              0.2076811
## 6              0.6261033             -0.7560852              0.1423710
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1              0.4563798              0.3109942              0.2109804
## 2              1.5095063              0.1849230              0.2109804
## 3              1.0926437              0.1849230              0.2109804
## 4              0.9829430              0.1562704              0.2109804
## 5              1.6192070              0.2938027             -0.6884239
## 6              1.9044287              0.3998171             -0.5599376
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1             0.05816752              0.8277813              0.8862028
## 2            -0.72469885             -1.8211188             -1.0116829
## 3            -0.42359640             -1.2190961             -0.8391478
## 4            -0.12249395             -0.6170733             -0.6666127
## 5             0.78081340              0.5869722              1.5763431
## 6             1.08191585             -1.2190961             -1.3567530
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1              0.1216629              0.1270590              0.3499400
## 2              0.1107966              0.1990831              0.1923020
## 3              0.1868610              0.2187260              0.2123650
## 4              0.1732780              0.2078133              0.1923020
## 5              0.2765083              0.2951152              0.3470739
## 6              0.1162297              0.2449166              0.3556723
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1              0.8102213              0.6061835              0.7545135
## 2              0.9053457              0.8500424              0.7545135
## 3              0.8863208              0.7890777              0.2361095
## 4              0.8863208              0.7890777              0.2361095
## 5              0.9243705              0.9719719             -0.1786138
## 6              0.9433954              1.0329366              0.9618751
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1             -0.1962205             -0.4568829               1.008364
## 2             -0.2692173              1.9517531               1.008364
## 3             -0.1597221              2.6928719               1.008364
## 4             -0.1597221              2.3223125               1.824702
## 5             -0.1414729              2.3223125               2.641041
## 6             -0.3604634              2.6928719               2.641041
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1             -1.7288406            -0.86659833             -0.6556938
## 2              1.9863724             1.17444104             -0.6556938
## 3              1.9863724             1.26721556             -1.8143496
## 4              0.1287659             0.06114684             -1.8143496
## 5              0.1287659            -2.53653963             -2.9730053
## 6              0.1287659            -0.49550026             -1.8143496
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1             -1.1540243              0.7174727              0.2317270
## 2              2.2161351             -0.8224687              0.2317270
## 3             -0.7046697             -0.8224687              0.2317270
## 4              0.4187168             -0.8224687              0.2317270
## 5             -1.8280562             -0.8224687              0.2981503
## 6             -1.3787016             -0.8224687              0.2317270
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1             0.05952767            -0.06881159             0.20279570
## 2             2.15490152             2.35335647            -0.05472265
## 3            -0.46431580            -0.44145283             0.40881037
## 4            -0.46431580            -0.44145283            -0.31224099
## 5            -0.46431580            -0.44145283            -0.10622632
## 6            -0.46431580            -0.44145283             0.15129203
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1             2.40564734            -0.01588055             0.64371849
## 2            -0.01374656             0.29467248             0.15220242
## 3             0.10146268            -0.01588055             0.39796046
## 4             0.21667191            -0.01588055            -0.09355562
## 5             0.21667191            -0.32643359            -0.09355562
## 6             1.48397347            -0.01588055            -0.33931365

Next we split our cmp dataset into a training set containing seventy-five percent of the data and a test set containing the remaining twenty-five percent.

part <- createDataPartition(cmp$Yield, p = 0.75, list = FALSE)

cmp_train <- cmp[part,]
cmp_test <- cmp[-part,]

dim(cmp_train)
## [1] 132  58

Finally, we fit a partial least squares model to our pre-processed data with leave-one-out cross-validation (again because the sample size is small) and get an RMSE of 0.643 at the optimal latent variable count of 5; the R^2 estimate at this point is 0.571.

ctrl <- trainControl(method = "LOOCV")

plsMod <- train(Yield ~ .,
  data = cmp_train,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl)

plsMod
## Partial Least Squares 
## 
## 132 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 131, 131, 131, 131, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.7400162  0.4228175  0.5913992
##    2     0.6756481  0.5200159  0.5330866
##    3     0.6487015  0.5589877  0.5260593
##    4     0.6435842  0.5681294  0.5211681
##    5     0.6429131  0.5711725  0.5227984
##    6     0.6517865  0.5646730  0.5253235
##    7     0.6716447  0.5468849  0.5373863
##    8     0.6880985  0.5324103  0.5514444
##    9     0.7191428  0.5077167  0.5619498
##   10     0.7568289  0.4805845  0.5853519
##   11     0.7906695  0.4534789  0.5962430
##   12     0.8389162  0.4204543  0.6057354
##   13     0.8902487  0.3923241  0.6087229
##   14     0.9448500  0.3637252  0.6111238
##   15     0.9848760  0.3448316  0.6104914
##   16     1.0363313  0.3205594  0.6166124
##   17     1.0840917  0.3008371  0.6241746
##   18     1.1469213  0.2747819  0.6356008
##   19     1.2653162  0.2356819  0.6560169
##   20     1.3202770  0.2186103  0.6667299
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.

6.3.d

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Our test set estimate of R^2 actually improves slightly, to 0.61, while the RMSE increases modestly to 0.69, so the test performance is broadly in line with the resampled estimates.

predict(plsMod, cmp_test) %>% 
  data.frame(pred = .,obs = cmp_test$Yield) %>% 
  defaultSummary()
##      RMSE  Rsquared       MAE 
## 0.6914523 0.6128964 0.5620114
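
As a quick visual check (a sketch), we can plot observed against predicted yield for the test set; points near the dashed identity line are well predicted:

data.frame(pred = predict(plsMod, cmp_test), obs = cmp_test$Yield) |>
  ggplot(aes(x = pred, y = obs)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")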

6.3.e

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Our six most important predictors are all manufacturing process variables: ManufacturingProcess32, 09, 17, 13, 36, and 06, so the process predictors clearly dominate the list. Four of them have importance scores of 90% and above, indicating that efforts to improve yield should focus on modifying these manufacturing processes.

plot(varImp(plsMod), top = 25)
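
To read the exact importance scores rather than eyeballing the plot, a small sketch:

imp <- varImp(plsMod)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 6)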

6.3.f

Explore the relationships between each of the top predictors and the re- sponse. How could this information be helpful in improving yield in future runs of the manufacturing process?

We see that as ManufacturingProcess32 and 09 increase, so does yield, meaning we should focus on increasing those settings; their high importance in the model suggests the relationship is more than incidental correlation. For ManufacturingProcess17 and 13, yield decreases as the process values increase, so we should rein those processes in to try to increase yield.

p1 <- ggplot(data = ChemicalManufacturingProcess, aes(y = Yield, x = ManufacturingProcess32)) +
  geom_point()
p2 <- ggplot(data = ChemicalManufacturingProcess, aes(y = Yield, x = ManufacturingProcess09)) +
  geom_point()
p3 <- ggplot(data = ChemicalManufacturingProcess, aes(y = Yield, x = ManufacturingProcess17)) +
  geom_point()
p4 <- ggplot(data = ChemicalManufacturingProcess, aes(y = Yield, x = ManufacturingProcess13)) +
  geom_point()

cowplot::plot_grid(
  p1,p2,p3,p4
)
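
As a final sketch, the sign of each simple correlation with Yield should match what the scatterplots show: positive for processes 32 and 09, negative for 17 and 13.

top4 <- c("ManufacturingProcess32", "ManufacturingProcess09",
          "ManufacturingProcess17", "ManufacturingProcess13")
sapply(top4, function(v)
  cor(ChemicalManufacturingProcess$Yield, ChemicalManufacturingProcess[[v]],
      use = "pairwise.complete.obs"))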