library(AppliedPredictiveModeling)
library(caret)
library(RANN)
library(tidyverse)
set.seed(314159)
DATA 624: Assignment 07
Setup
Assignment
In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
Problem 6.2
Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.
In this project there were 165 unique compounds; 1,107 molecular fingerprints were determined for each. A molecular fingerprint is a binary sequence of numbers that represents the presence or absence of a specific molecular substructure.
permeability
: permeability values for each compound.fingerprints
: a matrix of binary fingerprint indicator variables.
6.2.a
Start R and use these commands to load the data. The matrix fingerprints
contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
data(permeability)
dim(permeability)
[1] 165 1
dim(fingerprints)
[1] 165 1107
summary(apply(fingerprints, 2, mean))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.006061 0.024242 0.154767 0.181818 1.000000
as_data_frame(permeability) |>
ggplot(mapping = aes(x = permeability)) +
geom_histogram(bins = 50) +
labs(title = "Permeability Histogram")
Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
tibble, or `as.data.frame()` to convert to a data frame.
6.2.b
The fingerprint
predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar
function from the caret package. How many predictors are left for modeling?
# get columns (predictors) that have a low frequency.
<- nearZeroVar(fingerprints, saveMetrics = FALSE) nz
It looks like we got back an index array? The documentation says the following:
if saveMetrics = FALSE, a vector of integers corresponding to the column positions of the problematic predictors. If saveMetrics = TRUE, a data frame with columns.
So, we have to remove the columns that are in the nz
array or we want to keep the complement of the nz
array. We can use the -
operator to remove the columns.
<- fingerprints[, -nz]
fnz paste("Number of predictors before filtering:", ncol(fingerprints))
[1] "Number of predictors before filtering: 1107"
paste("Number of predictors after filtering:", ncol(fnz))
[1] "Number of predictors after filtering: 388"
6.2.c
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
Let’s split the data into a training and test set. We will use 70% of the data for training and 30% for testing.
<- fnz
x <- permeability
y <- createDataPartition(y, p = 0.7, list = FALSE)
train_index <- x[train_index, ]
train_x <- y[train_index]
train_y <- x[-train_index, ]
test_x <- y[-train_index] test_y
Now we can pre-process the data. We will use the preProcess
function from the caret package to center and scale the data.
= preProcess(x = train_x, method = c("center", "scale"))
pp = predict(pp, train_x)
train_x = predict(pp, test_x) test_x
Now we shall fit a model using PLS
<- train(
pls_model x = train_x,
y = train_y,
method = "pls",
tuneLength = 10
)summary(pls_model)
Data: X dimension: 117 388
Y dimension: 117 1
Fit method: oscorespls
Number of components considered: 5
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps
X 23.25 34.70 40.27 45.89 52.83
.outcome 33.37 50.98 59.39 65.97 71.21
print(pls_model)
Partial Least Squares
117 samples
388 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 117, 117, 117, 117, 117, 117, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 13.76865 0.2991406 10.857122
2 12.67955 0.4068872 9.057553
3 12.75722 0.4127669 9.296703
4 12.87457 0.4112108 9.522711
5 12.63621 0.4337232 9.283996
6 12.67895 0.4394412 9.214388
7 12.74123 0.4420929 9.276826
8 12.67530 0.4483525 9.285716
9 12.71561 0.4487545 9.363156
10 13.00803 0.4321720 9.549234
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 5.
The optimal number of latent variables appears to change when I run it repeatedly. I’ll do this programmatically. The answer is as follows:
<- pls_model$bestTune$ncomp
optimal_ncomp <- max(pls_model$resample$Rsquared)
optimal_r2 paste("Optimal number of latent variables:", optimal_ncomp)
[1] "Optimal number of latent variables: 5"
paste("Resampled estimate of R2:", optimal_r2)
[1] "Resampled estimate of R2: 0.606339030925167"
6.2.d
Predict the response for the test set. What is the test set estimate of R2?
<- predict(pls_model, newdata = test_x)
pls_preds <- postResample(pls_preds, test_y)
pls_metrics print("PLS metrics:")
[1] "PLS metrics:"
print(pls_metrics)
RMSE Rsquared MAE
11.711837 0.432058 8.795413
6.2.e
Try building other models discussed in this chapter. Do any have better predictive performance?
We shall build an OLS model, a ridge model, lasso model and elastic-net model and compare their performance.
# TODO: there are a number of warnings. Why?
# OLS model
<- train(
ols_model x = train_x,
y = train_y,
method = "lm",
tuneLength = 10
)
# Ridge Model: this model takes a loooong time to run on my machine.
# I've trimmed the number of tuning parameters to 3. Same for the fellas below.
<- train(
ridge_model x = train_x,
y = train_y,
method = "ridge",
tuneLength = 3
)
# Lasso model
<- train(
lasso_model x = train_x,
y = train_y,
method = "lasso",
tuneLength = 5
)
# Elastic-net model
<- train(
en_model x = train_x,
y = train_y,
method = "enet",
tuneLength = 3
)
Now let’s see how they perform. We will use the resamples
function from the caret package to compare the models. But first we need to make some predictions on the test set.
<- predict(ols_model, newdata = test_x) ols_preds
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
<- predict(ridge_model, newdata = test_x)
ridge_preds <- predict(lasso_model, newdata = test_x)
lasso_preds <- predict(en_model, newdata = test_x) en_preds
Now we shall see how they compare.
<- postResample(ols_preds, test_y)
ols_metrics <- postResample(ridge_preds, test_y)
ridge_metrics <- postResample(lasso_preds, test_y)
lasso_metrics <- postResample(en_preds, test_y)
en_metrics
print("PLS metrics:")
[1] "PLS metrics:"
print(pls_metrics)
RMSE Rsquared MAE
11.711837 0.432058 8.795413
print("OLS metrics:")
[1] "OLS metrics:"
print(ols_metrics)
RMSE Rsquared MAE
28.58135167 0.08927716 18.94034521
print("Ridge metrics:")
[1] "Ridge metrics:"
print(ridge_metrics)
RMSE Rsquared MAE
11.8937084 0.4884249 8.6162418
print("Lasso metrics:")
[1] "Lasso metrics:"
print(lasso_metrics)
RMSE Rsquared MAE
12.7176597 0.3268589 8.3039603
print("Elastic Net metrics:")
[1] "Elastic Net metrics:"
print(en_metrics)
RMSE Rsquared MAE
12.5201378 0.3022864 9.1476482
6.2.f
Would you recommend any of your models to replace the permeability laboratory experiment?
Yes, going by metrics alone I would. The PLS model has an R2 of ~0.432 where the Ridge model has an R2 of ~0.488. All of the other models’ R2 is lower. I am noticing that not all of the metrics agree with one another. For example, Ridge has an RMSE of ~11.89 when PLS’s is ~11.71. But, the ridge model takes a long time to build. I am going to say that for the meager gain we get in R2, I would not recommend the ridge model. The PLS model is a good compromise between performance and time to build.
Problem 6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
From its vignette, the ChemicalManufacturingProcess data set contains the following variables:
The data set consisted of 177 samples of biological material for which 57 characteristics were measured. Of the 57 characteristics, there were 12 measurements of the biological starting material, and 45 measurements of the manufacturing process. The process variables included measurements such as temperature, drying time, washing time, and concentrations of by–products at various steps. Some of the process measurements can be controlled, while others are observed. Predictors are continuous, count, categorical; some are correlated, and some contain missing values. Samples are not independent because sets of samples come from the same batch of biological starting material.
6.3.a
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
class(ChemicalManufacturingProcess)
[1] "data.frame"
glimpse(ChemicalManufacturingProcess)
Rows: 176
Columns: 58
$ Yield <dbl> 38.00, 42.44, 42.03, 41.42, 42.49, 43.57, 43.12…
$ BiologicalMaterial01 <dbl> 6.25, 8.01, 8.01, 8.01, 7.47, 6.12, 7.48, 6.94,…
$ BiologicalMaterial02 <dbl> 49.58, 60.97, 60.97, 60.97, 63.33, 58.36, 64.47…
$ BiologicalMaterial03 <dbl> 56.97, 67.48, 67.48, 67.48, 72.25, 65.31, 72.41…
$ BiologicalMaterial04 <dbl> 12.74, 14.65, 14.65, 14.65, 14.02, 15.17, 13.82…
$ BiologicalMaterial05 <dbl> 19.51, 19.36, 19.36, 19.36, 17.91, 21.79, 17.71…
$ BiologicalMaterial06 <dbl> 43.73, 53.14, 53.14, 53.14, 54.66, 51.23, 54.45…
$ BiologicalMaterial07 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
$ BiologicalMaterial08 <dbl> 16.66, 19.04, 19.04, 19.04, 18.22, 18.30, 18.72…
$ BiologicalMaterial09 <dbl> 11.44, 12.55, 12.55, 12.55, 12.80, 12.13, 12.95…
$ BiologicalMaterial10 <dbl> 3.46, 3.46, 3.46, 3.46, 3.05, 3.78, 3.04, 3.85,…
$ BiologicalMaterial11 <dbl> 138.09, 153.67, 153.67, 153.67, 147.61, 151.88,…
$ BiologicalMaterial12 <dbl> 18.83, 21.05, 21.05, 21.05, 21.05, 20.76, 20.75…
$ ManufacturingProcess01 <dbl> NA, 0.0, 0.0, 0.0, 10.7, 12.0, 11.5, 12.0, 12.0…
$ ManufacturingProcess02 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ManufacturingProcess03 <dbl> NA, NA, NA, NA, NA, NA, 1.56, 1.55, 1.56, 1.55,…
$ ManufacturingProcess04 <dbl> NA, 917, 912, 911, 918, 924, 933, 929, 928, 938…
$ ManufacturingProcess05 <dbl> NA, 1032.2, 1003.6, 1014.6, 1027.5, 1016.8, 988…
$ ManufacturingProcess06 <dbl> NA, 210.0, 207.1, 213.3, 205.7, 208.9, 210.0, 2…
$ ManufacturingProcess07 <dbl> NA, 177, 178, 177, 178, 178, 177, 178, 177, 177…
$ ManufacturingProcess08 <dbl> NA, 178, 178, 177, 178, 178, 178, 178, 177, 177…
$ ManufacturingProcess09 <dbl> 43.00, 46.57, 45.07, 44.92, 44.96, 45.32, 49.36…
$ ManufacturingProcess10 <dbl> NA, NA, NA, NA, NA, NA, 11.6, 10.2, 9.7, 10.1, …
$ ManufacturingProcess11 <dbl> NA, NA, NA, NA, NA, NA, 11.5, 11.3, 11.1, 10.2,…
$ ManufacturingProcess12 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ManufacturingProcess13 <dbl> 35.5, 34.0, 34.8, 34.8, 34.6, 34.0, 32.4, 33.6,…
$ ManufacturingProcess14 <dbl> 4898, 4869, 4878, 4897, 4992, 4985, 4745, 4854,…
$ ManufacturingProcess15 <dbl> 6108, 6095, 6087, 6102, 6233, 6222, 5999, 6105,…
$ ManufacturingProcess16 <dbl> 4682, 4617, 4617, 4635, 4733, 4786, 4486, 4626,…
$ ManufacturingProcess17 <dbl> 35.5, 34.0, 34.8, 34.8, 33.9, 33.4, 33.8, 33.6,…
$ ManufacturingProcess18 <dbl> 4865, 4867, 4877, 4872, 4886, 4862, 4758, 4766,…
$ ManufacturingProcess19 <dbl> 6049, 6097, 6078, 6073, 6102, 6115, 6013, 6022,…
$ ManufacturingProcess20 <dbl> 4665, 4621, 4621, 4611, 4659, 4696, 4522, 4552,…
$ ManufacturingProcess21 <dbl> 0.0, 0.0, 0.0, 0.0, -0.7, -0.6, 1.4, 0.0, 0.0, …
$ ManufacturingProcess22 <dbl> NA, 3, 4, 5, 8, 9, 1, 2, 3, 4, 6, 7, 8, 10, 11,…
$ ManufacturingProcess23 <dbl> NA, 0, 1, 2, 4, 1, 1, 2, 3, 1, 3, 4, 1, 2, 3, 4…
$ ManufacturingProcess24 <dbl> NA, 3, 4, 5, 18, 1, 1, 2, 3, 4, 6, 7, 8, 2, 15,…
$ ManufacturingProcess25 <dbl> 4873, 4869, 4897, 4892, 4930, 4871, 4795, 4806,…
$ ManufacturingProcess26 <dbl> 6074, 6107, 6116, 6111, 6151, 6128, 6057, 6059,…
$ ManufacturingProcess27 <dbl> 4685, 4630, 4637, 4630, 4684, 4687, 4572, 4586,…
$ ManufacturingProcess28 <dbl> 10.7, 11.2, 11.1, 11.1, 11.3, 11.4, 11.2, 11.1,…
$ ManufacturingProcess29 <dbl> 21.0, 21.4, 21.3, 21.3, 21.6, 21.7, 21.2, 21.2,…
$ ManufacturingProcess30 <dbl> 9.9, 9.9, 9.4, 9.4, 9.0, 10.1, 11.2, 10.9, 10.5…
$ ManufacturingProcess31 <dbl> 69.1, 68.7, 69.3, 69.3, 69.4, 68.2, 67.6, 67.9,…
$ ManufacturingProcess32 <dbl> 156, 169, 173, 171, 171, 173, 159, 161, 160, 16…
$ ManufacturingProcess33 <dbl> 66, 66, 66, 68, 70, 70, 65, 65, 65, 66, 67, 67,…
$ ManufacturingProcess34 <dbl> 2.4, 2.6, 2.6, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.…
$ ManufacturingProcess35 <dbl> 486, 508, 509, 496, 468, 490, 475, 478, 491, 48…
$ ManufacturingProcess36 <dbl> 0.019, 0.019, 0.018, 0.018, 0.017, 0.018, 0.019…
$ ManufacturingProcess37 <dbl> 0.5, 2.0, 0.7, 1.2, 0.2, 0.4, 0.8, 1.0, 1.2, 1.…
$ ManufacturingProcess38 <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 3,…
$ ManufacturingProcess39 <dbl> 7.2, 7.2, 7.2, 7.2, 7.3, 7.2, 7.3, 7.3, 7.4, 7.…
$ ManufacturingProcess40 <dbl> NA, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
$ ManufacturingProcess41 <dbl> NA, 0.15, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
$ ManufacturingProcess42 <dbl> 11.6, 11.1, 12.0, 10.6, 11.0, 11.5, 11.7, 11.4,…
$ ManufacturingProcess43 <dbl> 3.0, 0.9, 1.0, 1.1, 1.1, 2.2, 0.7, 0.8, 0.9, 0.…
$ ManufacturingProcess44 <dbl> 1.8, 1.9, 1.8, 1.8, 1.7, 1.8, 2.0, 2.0, 1.9, 1.…
$ ManufacturingProcess45 <dbl> 2.4, 2.2, 2.3, 2.1, 2.1, 2.0, 2.2, 2.2, 2.1, 2.…
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
6.3.b
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).
<- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
pp <- predict(pp, ChemicalManufacturingProcess)
df |>
df select(Yield, BiologicalMaterial01:BiologicalMaterial10, ManufacturingProcess01:ManufacturingProcess10) |>
summary()
Yield BiologicalMaterial01 BiologicalMaterial02
Min. :-2.6692 Min. :-2.5653 Min. :-2.1858
1st Qu.:-0.7716 1st Qu.:-0.6078 1st Qu.:-0.7457
Median :-0.1119 Median :-0.1491 Median :-0.1484
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.7035 3rd Qu.: 0.6423 3rd Qu.: 0.7557
Max. : 3.3394 Max. : 3.3597 Max. : 2.2459
BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
Min. :-2.6830 Min. :-1.6731 Min. :-2.90576
1st Qu.:-0.6811 1st Qu.:-0.6222 1st Qu.:-0.73944
Median :-0.1212 Median :-0.1405 Median :-0.05891
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.6804 3rd Qu.: 0.4907 3rd Qu.: 0.70568
Max. : 2.6355 Max. : 6.0523 Max. : 3.38985
BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
Min. :-2.2184 Min. :-0.1313 Min. :-2.38535
1st Qu.:-0.7622 1st Qu.:-0.1313 1st Qu.:-0.64225
Median :-0.1202 Median :-0.1313 Median : 0.02249
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.6499 3rd Qu.:-0.1313 3rd Qu.: 0.56906
Max. : 2.7948 Max. : 7.5723 Max. : 2.43034
BiologicalMaterial09 BiologicalMaterial10 ManufacturingProcess01
Min. :-3.39629 Min. :-1.7202 Min. :-6.149703
1st Qu.:-0.59627 1st Qu.:-0.5685 1st Qu.:-0.223563
Median :-0.03627 Median :-0.1513 Median : 0.105667
Mean : 0.00000 Mean : 0.0000 Mean : 0.001224
3rd Qu.: 0.67428 3rd Qu.: 0.3161 3rd Qu.: 0.503487
Max. : 2.96246 Max. : 6.7920 Max. : 1.587202
ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
Min. :-1.969253 Min. :-3.10582 Min. :-3.323233
1st Qu.: 0.308956 1st Qu.:-0.42705 1st Qu.:-0.613828
Median : 0.509627 Median : 0.37658 Median : 0.342432
Mean : 0.009518 Mean : 0.04123 Mean : 0.003213
3rd Qu.: 0.568648 3rd Qu.: 0.46587 3rd Qu.: 0.661186
Max. : 0.686690 Max. : 2.69818 Max. : 2.254953
ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
Min. :-2.577803 Min. :-1.630631 Min. :-0.9580199
1st Qu.:-0.487046 1st Qu.:-0.630408 1st Qu.:-0.9580199
Median :-0.086583 Median :-0.222910 Median :-0.9580199
Mean :-0.002534 Mean :-0.006574 Mean :-0.0009072
3rd Qu.: 0.230347 3rd Qu.: 0.480950 3rd Qu.: 1.0378549
Max. : 5.686954 Max. : 7.408415 Max. : 1.0378549
ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
Min. :-1.111973 Min. :-4.37787 Min. :-2.18999
1st Qu.:-1.111973 1st Qu.:-0.49799 1st Qu.:-0.62482
Median : 0.894164 Median : 0.04519 Median :-0.10310
Mean :-0.001759 Mean : 0.00000 Mean : 0.02156
3rd Qu.: 0.894164 3rd Qu.: 0.55281 3rd Qu.: 0.54906
Max. : 0.894164 Max. : 2.39252 Max. : 3.15768
6.3.c
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
First, we shall get rid of near-zero variance columns
<- nearZeroVar(df, saveMetrics = FALSE)
nz <- df[, -nz] df
We shall split the data into a training and test set. We will use 70% of the data for training and 30% for testing.
<- df[, -1]
x <- df$Yield
y <- createDataPartition(y, p = 0.7, list = FALSE)
train_index <- x[train_index, ]
train_x <- y[train_index]
train_y <- x[-train_index, ]
test_x <- y[-train_index] test_y
Now we can pre-process the data. We will use the preProcess
function from the caret package to center and scale the data.
= preProcess(x = train_x, method = c("center", "scale"))
pp = predict(pp, train_x)
train_x = predict(pp, test_x) test_x
Now we shall fit a model using plain old ordinary linear regression. The reason being, because I don’t have a very good command of regression, nor do I know the various types. So think it’s best to start with the simplest model and work up from there.
<- train(
ols_model x = train_x,
y = train_y,
method = "lm"
)summary(ols_model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.13026 -0.30489 -0.01857 0.28034 1.19769
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0113158 0.0540546 0.209 0.8348
BiologicalMaterial01 -0.0029958 0.2095800 -0.014 0.9886
BiologicalMaterial02 -0.3271515 0.3512272 -0.931 0.3549
BiologicalMaterial03 0.4123031 0.6980873 0.591 0.5567
BiologicalMaterial04 0.2243085 0.8179353 0.274 0.7847
BiologicalMaterial05 0.0670686 0.1639000 0.409 0.6837
BiologicalMaterial06 0.1113903 0.8229524 0.135 0.8927
BiologicalMaterial08 0.1853403 0.3461276 0.535 0.5941
BiologicalMaterial09 -0.1787805 0.3884041 -0.460 0.6468
BiologicalMaterial10 -0.1445224 0.7103661 -0.203 0.8394
BiologicalMaterial11 -0.3537948 0.3169114 -1.116 0.2682
BiologicalMaterial12 0.2345370 0.3834923 0.612 0.5429
ManufacturingProcess01 -0.0383975 0.1338948 -0.287 0.7752
ManufacturingProcess02 0.0200441 0.3087179 0.065 0.9484
ManufacturingProcess03 -0.0743048 0.0914729 -0.812 0.4194
ManufacturingProcess04 0.3410369 0.1351014 2.524 0.0139 *
ManufacturingProcess05 -0.0911598 0.0987751 -0.923 0.3593
ManufacturingProcess06 -0.0171473 0.0905210 -0.189 0.8503
ManufacturingProcess07 -0.0213464 0.0835230 -0.256 0.7991
ManufacturingProcess08 0.0052438 0.1045576 0.050 0.9601
ManufacturingProcess09 -0.0903041 0.1955562 -0.462 0.6457
ManufacturingProcess10 -0.3779426 0.3449609 -1.096 0.2771
ManufacturingProcess11 0.3806085 0.4542818 0.838 0.4051
ManufacturingProcess12 0.2209485 0.1273567 1.735 0.0873 .
ManufacturingProcess13 -0.2941555 0.3317023 -0.887 0.3783
ManufacturingProcess14 -0.3065264 0.6877300 -0.446 0.6572
ManufacturingProcess15 0.0991896 0.5841598 0.170 0.8657
ManufacturingProcess16 0.1475591 0.2834882 0.521 0.6044
ManufacturingProcess17 -0.0173720 0.3078581 -0.056 0.9552
ManufacturingProcess18 1.5885318 1.5474785 1.027 0.3083
ManufacturingProcess19 0.1135276 0.2707592 0.419 0.6763
ManufacturingProcess20 -1.5347109 1.5432315 -0.994 0.3235
ManufacturingProcess21 NA NA NA NA
ManufacturingProcess22 -0.0406860 0.1005572 -0.405 0.6870
ManufacturingProcess23 0.0159373 0.1092315 0.146 0.8844
ManufacturingProcess24 -0.0862917 0.1038739 -0.831 0.4090
ManufacturingProcess25 -4.3655538 5.1343559 -0.850 0.3982
ManufacturingProcess26 4.5547304 5.0330211 0.905 0.3687
ManufacturingProcess27 -1.2390901 2.5722956 -0.482 0.6316
ManufacturingProcess28 -0.2998207 0.1243015 -2.412 0.0186 *
ManufacturingProcess29 1.2347236 1.2754306 0.968 0.3364
ManufacturingProcess30 -0.3458454 0.5978704 -0.578 0.5649
ManufacturingProcess31 0.2113403 0.5236676 0.404 0.6878
ManufacturingProcess32 0.7822739 0.2610018 2.997 0.0038 **
ManufacturingProcess33 -0.4434718 0.2366853 -1.874 0.0653 .
ManufacturingProcess34 0.0002518 0.1217067 0.002 0.9984
ManufacturingProcess35 -0.1582873 0.1378626 -1.148 0.2549
ManufacturingProcess36 0.0641419 0.2164028 0.296 0.7678
ManufacturingProcess37 -0.1276992 0.1119200 -1.141 0.2579
ManufacturingProcess38 0.0379693 0.1129182 0.336 0.7377
ManufacturingProcess39 0.0796509 0.1507351 0.528 0.5989
ManufacturingProcess40 -0.0189772 0.1948236 -0.097 0.9227
ManufacturingProcess41 -0.0559618 0.1955561 -0.286 0.7756
ManufacturingProcess42 0.1584192 0.3413444 0.464 0.6441
ManufacturingProcess43 0.1586926 0.0816100 1.945 0.0560 .
ManufacturingProcess44 -0.1994278 0.3221932 -0.619 0.5380
ManufacturingProcess45 0.1364654 0.1827490 0.747 0.4578
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6019 on 68 degrees of freedom
Multiple R-squared: 0.8065, Adjusted R-squared: 0.65
F-statistic: 5.153 on 55 and 68 DF, p-value: 2.407e-10
Let’e find the optimal value of the performance metric. We shall use the R2 value as the performance metric.
print(paste("Optimal R-Squared", ols_model$results$Rsquared))
[1] "Optimal R-Squared 0.111745777763209"
Oh dear, that’s not very good, is it.
6.3.d
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
<- predict(ols_model, newdata = test_x)
predictions <- cor(test_y, predictions)^2
test_r2 paste("Test set estimate of R2:", test_r2)
[1] "Test set estimate of R2: 0.127782678908639"
Well, that is double the R2 value of the training set. It’s still pretty crummy though.
6.3.e
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Let’s see. We can use the varImp
function to get the variable importance of the predictors.
varImp(ols_model)
lm variable importance
only 20 most important variables shown (out of 55)
Overall
ManufacturingProcess32 100.00
ManufacturingProcess04 84.21
ManufacturingProcess28 80.46
ManufacturingProcess43 64.85
ManufacturingProcess33 62.49
ManufacturingProcess12 57.85
ManufacturingProcess35 38.26
ManufacturingProcess37 38.03
BiologicalMaterial11 37.20
ManufacturingProcess10 36.51
ManufacturingProcess18 34.20
ManufacturingProcess20 33.13
ManufacturingProcess29 32.25
BiologicalMaterial02 31.03
ManufacturingProcess05 30.74
ManufacturingProcess26 30.15
ManufacturingProcess13 29.54
ManufacturingProcess25 28.32
ManufacturingProcess11 27.90
ManufacturingProcess24 27.67
Wow, how interesting and amazing. ManufacturingProcess32
scored 100. That’s high, but I’m not sure the value is meaningful. I read that caret
uses the f-score for linear models? I have my doubts. Nonetheless, ManufacturingProcess32
is our champion.
Who dominates? Let’s see:
<- varImp(ols_model)$importance |>
df_imp arrange(desc(Overall)) |>
mutate(Index = row_number()) |>
print()
Overall Index
ManufacturingProcess32 100.0000000 1
ManufacturingProcess04 84.2112599 2
ManufacturingProcess28 80.4631816 3
ManufacturingProcess43 64.8538418 4
ManufacturingProcess33 62.4884204 5
ManufacturingProcess12 57.8543025 6
ManufacturingProcess35 38.2649373 7
ManufacturingProcess37 38.0256614 8
BiologicalMaterial11 37.2042557 9
ManufacturingProcess10 36.5106556 10
ManufacturingProcess18 34.2042194 11
ManufacturingProcess20 33.1341354 12
ManufacturingProcess29 32.2528714 13
BiologicalMaterial02 31.0298498 14
ManufacturingProcess05 30.7444067 15
ManufacturingProcess26 30.1456411 16
ManufacturingProcess13 29.5391977 17
ManufacturingProcess25 28.3191327 18
ManufacturingProcess11 27.9038396 19
ManufacturingProcess24 27.6671361 20
ManufacturingProcess03 27.0521286 21
ManufacturingProcess45 24.8626451 22
ManufacturingProcess44 20.5968078 23
BiologicalMaterial12 20.3501518 24
BiologicalMaterial03 19.6502236 25
ManufacturingProcess30 19.2443663 26
BiologicalMaterial08 17.8089030 27
ManufacturingProcess39 17.5734565 28
ManufacturingProcess16 17.3095533 29
ManufacturingProcess27 16.0139109 30
ManufacturingProcess42 15.4262121 31
ManufacturingProcess09 15.3486614 32
BiologicalMaterial09 15.2990524 33
ManufacturingProcess14 14.8120100 34
ManufacturingProcess19 13.9301138 35
BiologicalMaterial05 13.5932682 36
ManufacturingProcess22 13.4397047 37
ManufacturingProcess31 13.4053833 38
ManufacturingProcess38 11.1576650 39
ManufacturingProcess36 9.8270195 40
ManufacturingProcess01 9.5055929 41
ManufacturingProcess41 9.4853534 42
BiologicalMaterial04 9.0870487 43
ManufacturingProcess07 8.4639529 44
BiologicalMaterial10 6.7235521 45
ManufacturingProcess06 6.2554956 46
ManufacturingProcess15 5.6000945 47
ManufacturingProcess23 4.8022988 48
BiologicalMaterial06 4.4500872 49
ManufacturingProcess40 3.1831096 50
ManufacturingProcess02 2.0986780 51
ManufacturingProcess17 1.8149466 52
ManufacturingProcess08 1.6053969 53
BiologicalMaterial01 0.4081823 54
ManufacturingProcess34 0.0000000 55
We know that there are more manufacturing process predictors than there are biological material predictors. So let’s consider the ratio of manufacturing process predictors in the first half compared to the second half. There are 4 biological material predictors in the first half and 7 in the second half. It is clear that the manufacturing process predictors dominate the list.
6.3.f
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Let’s see how much correlation there is between the correlation of the ranking of the manufacturing-process/biological-material importance list and the Yield
variable.
map(rownames(df_imp), function(name) {
c(name, cor(df$Yield, df[[name]]))
|>
}) head(n = 20)
[[1]]
[1] "ManufacturingProcess32" "0.608332149747848"
[[2]]
[1] "ManufacturingProcess04" "-0.26607332782866"
[[3]]
[1] "ManufacturingProcess28" "0.267564814390742"
[[4]]
[1] "ManufacturingProcess43" "0.159268513493375"
[[5]]
[1] "ManufacturingProcess33" "0.425916169727246"
[[6]]
[1] "ManufacturingProcess12" "0.351303654798767"
[[7]]
[1] "ManufacturingProcess35" "-0.168589183302461"
[[8]]
[1] "ManufacturingProcess37" "-0.159314119899988"
[[9]]
[1] "BiologicalMaterial11" "0.354914346197819"
[[10]]
[1] "ManufacturingProcess10" "0.216704136578715"
[[11]]
[1] "ManufacturingProcess18" "-0.0589292500070705"
[[12]]
[1] "ManufacturingProcess20" "-0.0655114643844646"
[[13]]
[1] "ManufacturingProcess29" "0.153857214490886"
[[14]]
[1] "BiologicalMaterial02" "0.481515794019022"
[[15]]
[1] "ManufacturingProcess05" "0.11287519567964"
[[16]]
[1] "ManufacturingProcess26" "0.0385058771120283"
[[17]]
[1] "ManufacturingProcess13" "-0.503679718408746"
[[18]]
[1] "ManufacturingProcess25" "0.00701190869255503"
[[19]]
[1] "ManufacturingProcess11" "0.354100992969725"
[[20]]
[1] "ManufacturingProcess24" "-0.214890881693355"
I am noticing that there is more correlation (0.608) between the first importance predictor ManufacturingProcess32
and the Yield
variable than the R2 value of the model (0.37). Actually, it is a hair under. I was thinking that we could do better by using the ManufacturingProcess32
alone. It would be simpler, but not better. We’ll come back to this if we have time.