DATA 624: Assignment 07

Author

Curtis Elsasser

Published

April 6, 2025

Setup

library(AppliedPredictiveModeling)
library(caret)
library(RANN)
library(tidyverse)

set.seed(314159)

Assignment

In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

Problem 6.2

Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.

In this project there were 165 unique compounds; 1,107 molecular fingerprints were determined for each. A molecular fingerprint is a binary sequence of numbers that represents the presence or absence of a specific molecular substructure.

permeability: permeability values for each compound.
fingerprints: a matrix of binary fingerprint indicator variables.

6.2.a

Start R and use these commands to load the data. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

data(permeability)
dim(permeability)

[1] 165   1

dim(fingerprints)

[1]  165 1107

summary(apply(fingerprints, 2, mean))

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.006061 0.024242 0.154767 0.181818 1.000000

as_data_frame(permeability) |>
  ggplot(mapping = aes(x = permeability)) +
  geom_histogram(bins = 50) +
  labs(title = "Permeability Histogram")

Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
  tibble, or `as.data.frame()` to convert to a data frame.

6.2.b

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

# get columns (predictors) that have a low frequency.
nz <- nearZeroVar(fingerprints, saveMetrics = FALSE)

It looks like we got back an index array? The documentation says the following:

if saveMetrics = FALSE, a vector of integers corresponding to the column positions of the problematic predictors. If saveMetrics = TRUE, a data frame with columns.

So, we have to remove the columns that are in the nz array or we want to keep the complement of the nz array. We can use the - operator to remove the columns.

fnz <- fingerprints[, -nz]
paste("Number of predictors before filtering:", ncol(fingerprints))

[1] "Number of predictors before filtering: 1107"

paste("Number of predictors after filtering:", ncol(fnz))

[1] "Number of predictors after filtering: 388"

6.2.c

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

Let’s split the data into a training and test set. We will use 70% of the data for training and 30% for testing.

x <- fnz
y <- permeability
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
train_x <- x[train_index, ]
train_y <- y[train_index]
test_x <- x[-train_index, ]
test_y <- y[-train_index]

Now we can pre-process the data. We will use the preProcess function from the caret package to center and scale the data.

pp = preProcess(x = train_x, method = c("center", "scale"))
train_x = predict(pp, train_x)
test_x = predict(pp, test_x)

Now we shall fit a model using PLS

pls_model <- train(
  x = train_x,
  y = train_y,
  method = "pls",
  tuneLength = 10
)
summary(pls_model)

Data:   X dimension: 117 388 
    Y dimension: 117 1
Fit method: oscorespls
Number of components considered: 5
TRAINING: % variance explained
          1 comps  2 comps  3 comps  4 comps  5 comps
X           23.25    34.70    40.27    45.89    52.83
.outcome    33.37    50.98    59.39    65.97    71.21

print(pls_model)

Partial Least Squares 

117 samples
388 predictors

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 117, 117, 117, 117, 117, 117, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE      
   1     13.76865  0.2991406  10.857122
   2     12.67955  0.4068872   9.057553
   3     12.75722  0.4127669   9.296703
   4     12.87457  0.4112108   9.522711
   5     12.63621  0.4337232   9.283996
   6     12.67895  0.4394412   9.214388
   7     12.74123  0.4420929   9.276826
   8     12.67530  0.4483525   9.285716
   9     12.71561  0.4487545   9.363156
  10     13.00803  0.4321720   9.549234

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 5.

The optimal number of latent variables appears to change when I run it repeatedly. I’ll do this programmatically. The answer is as follows:

optimal_ncomp <- pls_model$bestTune$ncomp
optimal_r2 <- max(pls_model$resample$Rsquared)
paste("Optimal number of latent variables:", optimal_ncomp)

[1] "Optimal number of latent variables: 5"

paste("Resampled estimate of R2:", optimal_r2)

[1] "Resampled estimate of R2: 0.606339030925167"

6.2.d

Predict the response for the test set. What is the test set estimate of R2?

pls_preds <- predict(pls_model, newdata = test_x)
pls_metrics <- postResample(pls_preds, test_y)
print("PLS metrics:")

[1] "PLS metrics:"

print(pls_metrics)

     RMSE  Rsquared       MAE 
11.711837  0.432058  8.795413

6.2.e

Try building other models discussed in this chapter. Do any have better predictive performance?

We shall build an OLS model, a ridge model, lasso model and elastic-net model and compare their performance.

# TODO: there are a number of warnings. Why?

# OLS model
ols_model <- train(
  x = train_x,
  y = train_y,
  method = "lm",
  tuneLength = 10
)

# Ridge Model: this model takes a loooong time to run on my machine. 
# I've trimmed the number of tuning parameters to 3. Same for the fellas below.
ridge_model <- train(
  x = train_x,
  y = train_y,
  method = "ridge",
  tuneLength = 3
)

# Lasso model
lasso_model <- train(
  x = train_x,
  y = train_y,
  method = "lasso",
  tuneLength = 5
)

# Elastic-net model
en_model <- train(
  x = train_x,
  y = train_y,
  method = "enet",
  tuneLength = 3
)

Now let’s see how they perform. We will use the resamples function from the caret package to compare the models. But first we need to make some predictions on the test set.

ols_preds <- predict(ols_model, newdata = test_x)

Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases

ridge_preds <- predict(ridge_model, newdata = test_x)
lasso_preds <- predict(lasso_model, newdata = test_x)
en_preds <- predict(en_model, newdata = test_x)

Now we shall see how they compare.

ols_metrics <- postResample(ols_preds, test_y)
ridge_metrics <- postResample(ridge_preds, test_y)
lasso_metrics <- postResample(lasso_preds, test_y)
en_metrics <- postResample(en_preds, test_y)

print("PLS metrics:")

[1] "PLS metrics:"

print(pls_metrics)

     RMSE  Rsquared       MAE 
11.711837  0.432058  8.795413

print("OLS metrics:")

[1] "OLS metrics:"

print(ols_metrics)

       RMSE    Rsquared         MAE 
28.58135167  0.08927716 18.94034521

print("Ridge metrics:")

[1] "Ridge metrics:"

print(ridge_metrics)

      RMSE   Rsquared        MAE 
11.8937084  0.4884249  8.6162418

print("Lasso metrics:")

[1] "Lasso metrics:"

print(lasso_metrics)

      RMSE   Rsquared        MAE 
12.7176597  0.3268589  8.3039603

print("Elastic Net metrics:")

[1] "Elastic Net metrics:"

print(en_metrics)

      RMSE   Rsquared        MAE 
12.5201378  0.3022864  9.1476482

6.2.f

Would you recommend any of your models to replace the permeability laboratory experiment?

Yes, going by metrics alone I would. The PLS model has an R2 of ~0.432 where the Ridge model has an R2 of ~0.488. All of the other models’ R2 is lower. I am noticing that not all of the metrics agree with one another. For example, Ridge has an RMSE of ~11.89 when PLS’s is ~11.71. But, the ridge model takes a long time to build. I am going to say that for the meager gain we get in R2, I would not recommend the ridge model. The PLS model is a good compromise between performance and time to build.

Problem 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

From its vignette, the ChemicalManufacturingProcess data set contains the following variables:

The data set consisted of 177 samples of biological material for which 57 characteristics were measured. Of the 57 characteristics, there were 12 measurements of the biological starting material, and 45 measurements of the manufacturing process. The process variables included measurements such as temperature, drying time, washing time, and concentrations of by–products at various steps. Some of the process measurements can be controlled, while others are observed. Predictors are continuous, count, categorical; some are correlated, and some contain missing values. Samples are not independent because sets of samples come from the same batch of biological starting material.

6.3.a

Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)
class(ChemicalManufacturingProcess)

[1] "data.frame"

glimpse(ChemicalManufacturingProcess)

Rows: 176
Columns: 58
$ Yield                  <dbl> 38.00, 42.44, 42.03, 41.42, 42.49, 43.57, 43.12…
$ BiologicalMaterial01   <dbl> 6.25, 8.01, 8.01, 8.01, 7.47, 6.12, 7.48, 6.94,…
$ BiologicalMaterial02   <dbl> 49.58, 60.97, 60.97, 60.97, 63.33, 58.36, 64.47…
$ BiologicalMaterial03   <dbl> 56.97, 67.48, 67.48, 67.48, 72.25, 65.31, 72.41…
$ BiologicalMaterial04   <dbl> 12.74, 14.65, 14.65, 14.65, 14.02, 15.17, 13.82…
$ BiologicalMaterial05   <dbl> 19.51, 19.36, 19.36, 19.36, 17.91, 21.79, 17.71…
$ BiologicalMaterial06   <dbl> 43.73, 53.14, 53.14, 53.14, 54.66, 51.23, 54.45…
$ BiologicalMaterial07   <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
$ BiologicalMaterial08   <dbl> 16.66, 19.04, 19.04, 19.04, 18.22, 18.30, 18.72…
$ BiologicalMaterial09   <dbl> 11.44, 12.55, 12.55, 12.55, 12.80, 12.13, 12.95…
$ BiologicalMaterial10   <dbl> 3.46, 3.46, 3.46, 3.46, 3.05, 3.78, 3.04, 3.85,…
$ BiologicalMaterial11   <dbl> 138.09, 153.67, 153.67, 153.67, 147.61, 151.88,…
$ BiologicalMaterial12   <dbl> 18.83, 21.05, 21.05, 21.05, 21.05, 20.76, 20.75…
$ ManufacturingProcess01 <dbl> NA, 0.0, 0.0, 0.0, 10.7, 12.0, 11.5, 12.0, 12.0…
$ ManufacturingProcess02 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ManufacturingProcess03 <dbl> NA, NA, NA, NA, NA, NA, 1.56, 1.55, 1.56, 1.55,…
$ ManufacturingProcess04 <dbl> NA, 917, 912, 911, 918, 924, 933, 929, 928, 938…
$ ManufacturingProcess05 <dbl> NA, 1032.2, 1003.6, 1014.6, 1027.5, 1016.8, 988…
$ ManufacturingProcess06 <dbl> NA, 210.0, 207.1, 213.3, 205.7, 208.9, 210.0, 2…
$ ManufacturingProcess07 <dbl> NA, 177, 178, 177, 178, 178, 177, 178, 177, 177…
$ ManufacturingProcess08 <dbl> NA, 178, 178, 177, 178, 178, 178, 178, 177, 177…
$ ManufacturingProcess09 <dbl> 43.00, 46.57, 45.07, 44.92, 44.96, 45.32, 49.36…
$ ManufacturingProcess10 <dbl> NA, NA, NA, NA, NA, NA, 11.6, 10.2, 9.7, 10.1, …
$ ManufacturingProcess11 <dbl> NA, NA, NA, NA, NA, NA, 11.5, 11.3, 11.1, 10.2,…
$ ManufacturingProcess12 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ManufacturingProcess13 <dbl> 35.5, 34.0, 34.8, 34.8, 34.6, 34.0, 32.4, 33.6,…
$ ManufacturingProcess14 <dbl> 4898, 4869, 4878, 4897, 4992, 4985, 4745, 4854,…
$ ManufacturingProcess15 <dbl> 6108, 6095, 6087, 6102, 6233, 6222, 5999, 6105,…
$ ManufacturingProcess16 <dbl> 4682, 4617, 4617, 4635, 4733, 4786, 4486, 4626,…
$ ManufacturingProcess17 <dbl> 35.5, 34.0, 34.8, 34.8, 33.9, 33.4, 33.8, 33.6,…
$ ManufacturingProcess18 <dbl> 4865, 4867, 4877, 4872, 4886, 4862, 4758, 4766,…
$ ManufacturingProcess19 <dbl> 6049, 6097, 6078, 6073, 6102, 6115, 6013, 6022,…
$ ManufacturingProcess20 <dbl> 4665, 4621, 4621, 4611, 4659, 4696, 4522, 4552,…
$ ManufacturingProcess21 <dbl> 0.0, 0.0, 0.0, 0.0, -0.7, -0.6, 1.4, 0.0, 0.0, …
$ ManufacturingProcess22 <dbl> NA, 3, 4, 5, 8, 9, 1, 2, 3, 4, 6, 7, 8, 10, 11,…
$ ManufacturingProcess23 <dbl> NA, 0, 1, 2, 4, 1, 1, 2, 3, 1, 3, 4, 1, 2, 3, 4…
$ ManufacturingProcess24 <dbl> NA, 3, 4, 5, 18, 1, 1, 2, 3, 4, 6, 7, 8, 2, 15,…
$ ManufacturingProcess25 <dbl> 4873, 4869, 4897, 4892, 4930, 4871, 4795, 4806,…
$ ManufacturingProcess26 <dbl> 6074, 6107, 6116, 6111, 6151, 6128, 6057, 6059,…
$ ManufacturingProcess27 <dbl> 4685, 4630, 4637, 4630, 4684, 4687, 4572, 4586,…
$ ManufacturingProcess28 <dbl> 10.7, 11.2, 11.1, 11.1, 11.3, 11.4, 11.2, 11.1,…
$ ManufacturingProcess29 <dbl> 21.0, 21.4, 21.3, 21.3, 21.6, 21.7, 21.2, 21.2,…
$ ManufacturingProcess30 <dbl> 9.9, 9.9, 9.4, 9.4, 9.0, 10.1, 11.2, 10.9, 10.5…
$ ManufacturingProcess31 <dbl> 69.1, 68.7, 69.3, 69.3, 69.4, 68.2, 67.6, 67.9,…
$ ManufacturingProcess32 <dbl> 156, 169, 173, 171, 171, 173, 159, 161, 160, 16…
$ ManufacturingProcess33 <dbl> 66, 66, 66, 68, 70, 70, 65, 65, 65, 66, 67, 67,…
$ ManufacturingProcess34 <dbl> 2.4, 2.6, 2.6, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.…
$ ManufacturingProcess35 <dbl> 486, 508, 509, 496, 468, 490, 475, 478, 491, 48…
$ ManufacturingProcess36 <dbl> 0.019, 0.019, 0.018, 0.018, 0.017, 0.018, 0.019…
$ ManufacturingProcess37 <dbl> 0.5, 2.0, 0.7, 1.2, 0.2, 0.4, 0.8, 1.0, 1.2, 1.…
$ ManufacturingProcess38 <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 3,…
$ ManufacturingProcess39 <dbl> 7.2, 7.2, 7.2, 7.2, 7.3, 7.2, 7.3, 7.3, 7.4, 7.…
$ ManufacturingProcess40 <dbl> NA, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
$ ManufacturingProcess41 <dbl> NA, 0.15, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
$ ManufacturingProcess42 <dbl> 11.6, 11.1, 12.0, 10.6, 11.0, 11.5, 11.7, 11.4,…
$ ManufacturingProcess43 <dbl> 3.0, 0.9, 1.0, 1.1, 1.1, 2.2, 0.7, 0.8, 0.9, 0.…
$ ManufacturingProcess44 <dbl> 1.8, 1.9, 1.8, 1.8, 1.7, 1.8, 2.0, 2.0, 1.9, 1.…
$ ManufacturingProcess45 <dbl> 2.4, 2.2, 2.3, 2.1, 2.1, 2.0, 2.2, 2.2, 2.1, 2.…

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

6.3.b

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).

pp <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(pp, ChemicalManufacturingProcess)
df |>
  select(Yield, BiologicalMaterial01:BiologicalMaterial10, ManufacturingProcess01:ManufacturingProcess10) |>
  summary()

     Yield         BiologicalMaterial01 BiologicalMaterial02
 Min.   :-2.6692   Min.   :-2.5653      Min.   :-2.1858     
 1st Qu.:-0.7716   1st Qu.:-0.6078      1st Qu.:-0.7457     
 Median :-0.1119   Median :-0.1491      Median :-0.1484     
 Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.0000     
 3rd Qu.: 0.7035   3rd Qu.: 0.6423      3rd Qu.: 0.7557     
 Max.   : 3.3394   Max.   : 3.3597      Max.   : 2.2459     
 BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
 Min.   :-2.6830      Min.   :-1.6731      Min.   :-2.90576    
 1st Qu.:-0.6811      1st Qu.:-0.6222      1st Qu.:-0.73944    
 Median :-0.1212      Median :-0.1405      Median :-0.05891    
 Mean   : 0.0000      Mean   : 0.0000      Mean   : 0.00000    
 3rd Qu.: 0.6804      3rd Qu.: 0.4907      3rd Qu.: 0.70568    
 Max.   : 2.6355      Max.   : 6.0523      Max.   : 3.38985    
 BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
 Min.   :-2.2184      Min.   :-0.1313      Min.   :-2.38535    
 1st Qu.:-0.7622      1st Qu.:-0.1313      1st Qu.:-0.64225    
 Median :-0.1202      Median :-0.1313      Median : 0.02249    
 Mean   : 0.0000      Mean   : 0.0000      Mean   : 0.00000    
 3rd Qu.: 0.6499      3rd Qu.:-0.1313      3rd Qu.: 0.56906    
 Max.   : 2.7948      Max.   : 7.5723      Max.   : 2.43034    
 BiologicalMaterial09 BiologicalMaterial10 ManufacturingProcess01
 Min.   :-3.39629     Min.   :-1.7202      Min.   :-6.149703     
 1st Qu.:-0.59627     1st Qu.:-0.5685      1st Qu.:-0.223563     
 Median :-0.03627     Median :-0.1513      Median : 0.105667     
 Mean   : 0.00000     Mean   : 0.0000      Mean   : 0.001224     
 3rd Qu.: 0.67428     3rd Qu.: 0.3161      3rd Qu.: 0.503487     
 Max.   : 2.96246     Max.   : 6.7920      Max.   : 1.587202     
 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
 Min.   :-1.969253      Min.   :-3.10582       Min.   :-3.323233     
 1st Qu.: 0.308956      1st Qu.:-0.42705       1st Qu.:-0.613828     
 Median : 0.509627      Median : 0.37658       Median : 0.342432     
 Mean   : 0.009518      Mean   : 0.04123       Mean   : 0.003213     
 3rd Qu.: 0.568648      3rd Qu.: 0.46587       3rd Qu.: 0.661186     
 Max.   : 0.686690      Max.   : 2.69818       Max.   : 2.254953     
 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
 Min.   :-2.577803      Min.   :-1.630631      Min.   :-0.9580199    
 1st Qu.:-0.487046      1st Qu.:-0.630408      1st Qu.:-0.9580199    
 Median :-0.086583      Median :-0.222910      Median :-0.9580199    
 Mean   :-0.002534      Mean   :-0.006574      Mean   :-0.0009072    
 3rd Qu.: 0.230347      3rd Qu.: 0.480950      3rd Qu.: 1.0378549    
 Max.   : 5.686954      Max.   : 7.408415      Max.   : 1.0378549    
 ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
 Min.   :-1.111973      Min.   :-4.37787       Min.   :-2.18999      
 1st Qu.:-1.111973      1st Qu.:-0.49799       1st Qu.:-0.62482      
 Median : 0.894164      Median : 0.04519       Median :-0.10310      
 Mean   :-0.001759      Mean   : 0.00000       Mean   : 0.02156      
 3rd Qu.: 0.894164      3rd Qu.: 0.55281       3rd Qu.: 0.54906      
 Max.   : 0.894164      Max.   : 2.39252       Max.   : 3.15768

6.3.c

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

First, we shall get rid of near-zero variance columns

nz <- nearZeroVar(df, saveMetrics = FALSE)
df <- df[, -nz]

We shall split the data into a training and test set. We will use 70% of the data for training and 30% for testing.

x <- df[, -1]
y <- df$Yield
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
train_x <- x[train_index, ]
train_y <- y[train_index]
test_x <- x[-train_index, ]
test_y <- y[-train_index]

Now we can pre-process the data. We will use the preProcess function from the caret package to center and scale the data.

pp = preProcess(x = train_x, method = c("center", "scale"))
train_x = predict(pp, train_x)
test_x = predict(pp, test_x)

Now we shall fit a model using plain old ordinary linear regression. The reason being, because I don’t have a very good command of regression, nor do I know the various types. So think it’s best to start with the simplest model and work up from there.

ols_model <- train(
  x = train_x,
  y = train_y,
  method = "lm"
)
summary(ols_model)


Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.13026 -0.30489 -0.01857  0.28034  1.19769 

Coefficients: (1 not defined because of singularities)
                         Estimate Std. Error t value Pr(>|t|)   
(Intercept)             0.0113158  0.0540546   0.209   0.8348   
BiologicalMaterial01   -0.0029958  0.2095800  -0.014   0.9886   
BiologicalMaterial02   -0.3271515  0.3512272  -0.931   0.3549   
BiologicalMaterial03    0.4123031  0.6980873   0.591   0.5567   
BiologicalMaterial04    0.2243085  0.8179353   0.274   0.7847   
BiologicalMaterial05    0.0670686  0.1639000   0.409   0.6837   
BiologicalMaterial06    0.1113903  0.8229524   0.135   0.8927   
BiologicalMaterial08    0.1853403  0.3461276   0.535   0.5941   
BiologicalMaterial09   -0.1787805  0.3884041  -0.460   0.6468   
BiologicalMaterial10   -0.1445224  0.7103661  -0.203   0.8394   
BiologicalMaterial11   -0.3537948  0.3169114  -1.116   0.2682   
BiologicalMaterial12    0.2345370  0.3834923   0.612   0.5429   
ManufacturingProcess01 -0.0383975  0.1338948  -0.287   0.7752   
ManufacturingProcess02  0.0200441  0.3087179   0.065   0.9484   
ManufacturingProcess03 -0.0743048  0.0914729  -0.812   0.4194   
ManufacturingProcess04  0.3410369  0.1351014   2.524   0.0139 * 
ManufacturingProcess05 -0.0911598  0.0987751  -0.923   0.3593   
ManufacturingProcess06 -0.0171473  0.0905210  -0.189   0.8503   
ManufacturingProcess07 -0.0213464  0.0835230  -0.256   0.7991   
ManufacturingProcess08  0.0052438  0.1045576   0.050   0.9601   
ManufacturingProcess09 -0.0903041  0.1955562  -0.462   0.6457   
ManufacturingProcess10 -0.3779426  0.3449609  -1.096   0.2771   
ManufacturingProcess11  0.3806085  0.4542818   0.838   0.4051   
ManufacturingProcess12  0.2209485  0.1273567   1.735   0.0873 . 
ManufacturingProcess13 -0.2941555  0.3317023  -0.887   0.3783   
ManufacturingProcess14 -0.3065264  0.6877300  -0.446   0.6572   
ManufacturingProcess15  0.0991896  0.5841598   0.170   0.8657   
ManufacturingProcess16  0.1475591  0.2834882   0.521   0.6044   
ManufacturingProcess17 -0.0173720  0.3078581  -0.056   0.9552   
ManufacturingProcess18  1.5885318  1.5474785   1.027   0.3083   
ManufacturingProcess19  0.1135276  0.2707592   0.419   0.6763   
ManufacturingProcess20 -1.5347109  1.5432315  -0.994   0.3235   
ManufacturingProcess21         NA         NA      NA       NA   
ManufacturingProcess22 -0.0406860  0.1005572  -0.405   0.6870   
ManufacturingProcess23  0.0159373  0.1092315   0.146   0.8844   
ManufacturingProcess24 -0.0862917  0.1038739  -0.831   0.4090   
ManufacturingProcess25 -4.3655538  5.1343559  -0.850   0.3982   
ManufacturingProcess26  4.5547304  5.0330211   0.905   0.3687   
ManufacturingProcess27 -1.2390901  2.5722956  -0.482   0.6316   
ManufacturingProcess28 -0.2998207  0.1243015  -2.412   0.0186 * 
ManufacturingProcess29  1.2347236  1.2754306   0.968   0.3364   
ManufacturingProcess30 -0.3458454  0.5978704  -0.578   0.5649   
ManufacturingProcess31  0.2113403  0.5236676   0.404   0.6878   
ManufacturingProcess32  0.7822739  0.2610018   2.997   0.0038 **
ManufacturingProcess33 -0.4434718  0.2366853  -1.874   0.0653 . 
ManufacturingProcess34  0.0002518  0.1217067   0.002   0.9984   
ManufacturingProcess35 -0.1582873  0.1378626  -1.148   0.2549   
ManufacturingProcess36  0.0641419  0.2164028   0.296   0.7678   
ManufacturingProcess37 -0.1276992  0.1119200  -1.141   0.2579   
ManufacturingProcess38  0.0379693  0.1129182   0.336   0.7377   
ManufacturingProcess39  0.0796509  0.1507351   0.528   0.5989   
ManufacturingProcess40 -0.0189772  0.1948236  -0.097   0.9227   
ManufacturingProcess41 -0.0559618  0.1955561  -0.286   0.7756   
ManufacturingProcess42  0.1584192  0.3413444   0.464   0.6441   
ManufacturingProcess43  0.1586926  0.0816100   1.945   0.0560 . 
ManufacturingProcess44 -0.1994278  0.3221932  -0.619   0.5380   
ManufacturingProcess45  0.1364654  0.1827490   0.747   0.4578   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6019 on 68 degrees of freedom
Multiple R-squared:  0.8065,    Adjusted R-squared:   0.65 
F-statistic: 5.153 on 55 and 68 DF,  p-value: 2.407e-10

Let’e find the optimal value of the performance metric. We shall use the R2 value as the performance metric.

print(paste("Optimal R-Squared", ols_model$results$Rsquared))

[1] "Optimal R-Squared 0.111745777763209"

Oh dear, that’s not very good, is it.

6.3.d

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

predictions <- predict(ols_model, newdata = test_x)
test_r2 <- cor(test_y, predictions)^2
paste("Test set estimate of R2:", test_r2)

[1] "Test set estimate of R2: 0.127782678908639"

Well, that is double the R2 value of the training set. It’s still pretty crummy though.

6.3.e

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Let’s see. We can use the varImp function to get the variable importance of the predictors.

varImp(ols_model)

lm variable importance

  only 20 most important variables shown (out of 55)

                       Overall
ManufacturingProcess32  100.00
ManufacturingProcess04   84.21
ManufacturingProcess28   80.46
ManufacturingProcess43   64.85
ManufacturingProcess33   62.49
ManufacturingProcess12   57.85
ManufacturingProcess35   38.26
ManufacturingProcess37   38.03
BiologicalMaterial11     37.20
ManufacturingProcess10   36.51
ManufacturingProcess18   34.20
ManufacturingProcess20   33.13
ManufacturingProcess29   32.25
BiologicalMaterial02     31.03
ManufacturingProcess05   30.74
ManufacturingProcess26   30.15
ManufacturingProcess13   29.54
ManufacturingProcess25   28.32
ManufacturingProcess11   27.90
ManufacturingProcess24   27.67

Wow, how interesting and amazing. ManufacturingProcess32 scored 100. That’s high, but I’m not sure the value is meaningful. I read that caret uses the f-score for linear models? I have my doubts. Nonetheless, ManufacturingProcess32 is our champion.

Who dominates? Let’s see:

df_imp <- varImp(ols_model)$importance |>
  arrange(desc(Overall)) |>
  mutate(Index = row_number()) |>
  print()

                           Overall Index
ManufacturingProcess32 100.0000000     1
ManufacturingProcess04  84.2112599     2
ManufacturingProcess28  80.4631816     3
ManufacturingProcess43  64.8538418     4
ManufacturingProcess33  62.4884204     5
ManufacturingProcess12  57.8543025     6
ManufacturingProcess35  38.2649373     7
ManufacturingProcess37  38.0256614     8
BiologicalMaterial11    37.2042557     9
ManufacturingProcess10  36.5106556    10
ManufacturingProcess18  34.2042194    11
ManufacturingProcess20  33.1341354    12
ManufacturingProcess29  32.2528714    13
BiologicalMaterial02    31.0298498    14
ManufacturingProcess05  30.7444067    15
ManufacturingProcess26  30.1456411    16
ManufacturingProcess13  29.5391977    17
ManufacturingProcess25  28.3191327    18
ManufacturingProcess11  27.9038396    19
ManufacturingProcess24  27.6671361    20
ManufacturingProcess03  27.0521286    21
ManufacturingProcess45  24.8626451    22
ManufacturingProcess44  20.5968078    23
BiologicalMaterial12    20.3501518    24
BiologicalMaterial03    19.6502236    25
ManufacturingProcess30  19.2443663    26
BiologicalMaterial08    17.8089030    27
ManufacturingProcess39  17.5734565    28
ManufacturingProcess16  17.3095533    29
ManufacturingProcess27  16.0139109    30
ManufacturingProcess42  15.4262121    31
ManufacturingProcess09  15.3486614    32
BiologicalMaterial09    15.2990524    33
ManufacturingProcess14  14.8120100    34
ManufacturingProcess19  13.9301138    35
BiologicalMaterial05    13.5932682    36
ManufacturingProcess22  13.4397047    37
ManufacturingProcess31  13.4053833    38
ManufacturingProcess38  11.1576650    39
ManufacturingProcess36   9.8270195    40
ManufacturingProcess01   9.5055929    41
ManufacturingProcess41   9.4853534    42
BiologicalMaterial04     9.0870487    43
ManufacturingProcess07   8.4639529    44
BiologicalMaterial10     6.7235521    45
ManufacturingProcess06   6.2554956    46
ManufacturingProcess15   5.6000945    47
ManufacturingProcess23   4.8022988    48
BiologicalMaterial06     4.4500872    49
ManufacturingProcess40   3.1831096    50
ManufacturingProcess02   2.0986780    51
ManufacturingProcess17   1.8149466    52
ManufacturingProcess08   1.6053969    53
BiologicalMaterial01     0.4081823    54
ManufacturingProcess34   0.0000000    55

We know that there are more manufacturing process predictors than there are biological material predictors. So let’s consider the ratio of manufacturing process predictors in the first half compared to the second half. There are 4 biological material predictors in the first half and 7 in the second half. It is clear that the manufacturing process predictors dominate the list.

6.3.f

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Let’s see how much correlation there is between the correlation of the ranking of the manufacturing-process/biological-material importance list and the Yield variable.

map(rownames(df_imp), function(name) {
  c(name, cor(df$Yield, df[[name]]))
}) |>
  head(n = 20)

[[1]]
[1] "ManufacturingProcess32" "0.608332149747848"     

[[2]]
[1] "ManufacturingProcess04" "-0.26607332782866"     

[[3]]
[1] "ManufacturingProcess28" "0.267564814390742"     

[[4]]
[1] "ManufacturingProcess43" "0.159268513493375"     

[[5]]
[1] "ManufacturingProcess33" "0.425916169727246"     

[[6]]
[1] "ManufacturingProcess12" "0.351303654798767"     

[[7]]
[1] "ManufacturingProcess35" "-0.168589183302461"    

[[8]]
[1] "ManufacturingProcess37" "-0.159314119899988"    

[[9]]
[1] "BiologicalMaterial11" "0.354914346197819"   

[[10]]
[1] "ManufacturingProcess10" "0.216704136578715"     

[[11]]
[1] "ManufacturingProcess18" "-0.0589292500070705"   

[[12]]
[1] "ManufacturingProcess20" "-0.0655114643844646"   

[[13]]
[1] "ManufacturingProcess29" "0.153857214490886"     

[[14]]
[1] "BiologicalMaterial02" "0.481515794019022"   

[[15]]
[1] "ManufacturingProcess05" "0.11287519567964"      

[[16]]
[1] "ManufacturingProcess26" "0.0385058771120283"    

[[17]]
[1] "ManufacturingProcess13" "-0.503679718408746"    

[[18]]
[1] "ManufacturingProcess25" "0.00701190869255503"   

[[19]]
[1] "ManufacturingProcess11" "0.354100992969725"     

[[20]]
[1] "ManufacturingProcess24" "-0.214890881693355"

I am noticing that there is more correlation (0.608) between the first importance predictor ManufacturingProcess32 and the Yield variable than the R2 value of the model (0.37). Actually, it is a hair under. I was thinking that we could do better by using the ManufacturingProcess32 alone. It would be simpler, but not better. We’ll come back to this if we have time.