library(tidyverse)
library(AppliedPredictiveModeling)
library(skimr)
library(caret)
library(GGally)
set.seed(813098)Homework 7
6.2
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
- Start R and use these commands to load the data:
data(permeability)The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
- The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]
n_predictors_left <- ncol(filtered_fingerprints)
n_predictors_left[1] 388
There are 388 predictors left for modeling after filtering out the near-zero variance predictors.
- Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
# Split the data into a training and a test set
trainingRows <- createDataPartition(
permeability,
p = .80,
list = FALSE
)
head(trainingRows) Resample1
[1,] 1
[2,] 4
[3,] 5
[4,] 6
[5,] 7
[6,] 8
# Subset the data into objects for training using integer sub-setting.
trainPredictors <- filtered_fingerprints[trainingRows, ]
trainClasses <- permeability[trainingRows]
# Do the same for the test set using negative integers.
testPredictors <- filtered_fingerprints[-trainingRows, ]
testClasses <- permeability[-trainingRows]
# Pre-process the data and tune a PLS model
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- train(
trainPredictors,
trainClasses,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
plsTunePartial Least Squares
133 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 119, 118, 120, 121, 120, 120, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 13.04495 0.3889410 9.855879
2 12.07025 0.4167704 8.578359
3 11.76663 0.4344682 8.818290
4 11.82458 0.4049298 9.049706
5 11.50111 0.4259716 8.743398
6 11.81385 0.4098471 9.017622
7 11.72507 0.4174966 9.009777
8 11.79303 0.4172816 8.951220
9 11.83927 0.4234698 8.873060
10 11.96716 0.4186799 8.918212
11 12.34722 0.4011802 9.121489
12 12.33673 0.4030007 9.230783
13 12.58661 0.3882106 9.417677
14 12.80737 0.3829754 9.531202
15 12.86936 0.3874828 9.690045
16 13.22190 0.3753557 9.977137
17 13.34250 0.3748184 10.015780
18 13.53016 0.3825441 10.148752
19 13.61111 0.3886850 10.214828
20 13.65062 0.3864161 10.207473
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 5.
- Predict the response for the test set. What is the test set estimate of R2?
# Predict the response for the test set
predictions <- predict(plsTune, newdata = testPredictors)
# Get test set estimate of R^2
plsValues <- data.frame(obs = testClasses, pred = predictions)
defaultSummaryValues <- defaultSummary(plsValues)
Rsquared <- defaultSummaryValues["Rsquared"]The test set estimate of R2 is 0.5815661.
- Try building other models discussed in this chapter. Do any have better predictive performance?
# Ordinary Linear Regression model
lmFit <- train(
x = trainPredictors,
y = trainClasses,
method = "lm",
trControl = ctrl
)
lmFitLinear Regression
133 samples
388 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 119, 119, 120, 121, 118, 120, ...
Resampling results:
RMSE Rsquared MAE
42.3375 0.2665748 27.43291
Tuning parameter 'intercept' was held constant at a value of TRUE
lm_predictions <- predict(lmFit, newdata = testPredictors)
lm_values <- data.frame(obs = testClasses, pred = lm_predictions)
lm_defaultSummaryValues <- defaultSummary(lm_values)
lm_Rsquared <- lm_defaultSummaryValues["Rsquared"]
# Robust linear regression model
rlmPCA <- train(
trainPredictors,
trainClasses,
method = "rlm",
preProcess = "pca",
trControl = ctrl
)
rlmPCARobust Linear Model
133 samples
388 predictors
Pre-processing: principal component signal extraction (388), centered
(388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 119, 121, 121, 118, 121, 119, ...
Resampling results across tuning parameters:
intercept psi RMSE Rsquared MAE
FALSE psi.huber 17.20786 0.4532193 13.683070
FALSE psi.hampel 17.20015 0.4663720 13.885365
FALSE psi.bisquare 17.15836 0.4501079 13.611705
TRUE psi.huber 11.85322 0.5044184 8.740722
TRUE psi.hampel 12.33251 0.4653272 9.248913
TRUE psi.bisquare 12.60861 0.5347240 8.304136
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were intercept = TRUE and psi = psi.huber.
rlm_predictions <- predict(rlmPCA, newdata = testPredictors)
rlm_values <- data.frame(obs = testClasses, pred = rlm_predictions)
rlm_defaultSummaryValues <- defaultSummary(rlm_values)
rlm_Rsquared <- rlm_defaultSummaryValues["Rsquared"]
# Elastic net model
enetGrid <- expand.grid(
.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20)
)
enetTune <- train(
trainPredictors,
trainClasses,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale")
)
enetTuneElasticnet
133 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 120, 120, 119, 120, 120, ...
Resampling results across tuning parameters:
lambda fraction RMSE Rsquared MAE
0.00 0.05 12.26290 0.5122591 9.240691
0.00 0.10 12.04153 0.5026511 8.860188
0.00 0.15 11.89661 0.5102722 8.687994
0.00 0.20 11.74986 0.5202451 8.574839
0.00 0.25 11.87210 0.5118214 8.665897
0.00 0.30 11.97120 0.5119878 8.862492
0.00 0.35 12.22086 0.5016257 9.094315
0.00 0.40 12.53193 0.4896627 9.265691
0.00 0.45 12.77545 0.4798137 9.452927
0.00 0.50 13.01862 0.4712435 9.597859
0.00 0.55 13.32409 0.4625014 9.827843
0.00 0.60 13.62263 0.4568342 10.057174
0.00 0.65 14.07472 0.4458742 10.334246
0.00 0.70 14.59850 0.4326191 10.613620
0.00 0.75 14.91685 0.4230139 10.780312
0.00 0.80 15.19519 0.4145982 10.936113
0.00 0.85 15.45128 0.4077523 11.048518
0.00 0.90 15.73520 0.4023391 11.231894
0.00 0.95 15.96312 0.4009913 11.330718
0.00 1.00 16.18310 0.3998642 11.426169
0.01 0.05 46.69467 0.4231756 29.412260
0.01 0.10 79.96583 0.4187060 49.968999
0.01 0.15 113.31488 0.4240885 70.378471
0.01 0.20 145.27725 0.4157101 90.274706
0.01 0.25 177.47351 0.4053567 110.837883
0.01 0.30 210.07390 0.3965803 131.557844
0.01 0.35 242.90840 0.3851313 152.421193
0.01 0.40 275.92797 0.3749525 173.360531
0.01 0.45 308.96992 0.3702313 194.258573
0.01 0.50 342.05749 0.3706891 215.181464
0.01 0.55 375.61419 0.3713373 236.419309
0.01 0.60 409.24263 0.3712460 257.661287
0.01 0.65 442.97486 0.3664618 279.082377
0.01 0.70 476.78166 0.3604166 300.572167
0.01 0.75 510.64767 0.3545795 322.060905
0.01 0.80 544.56064 0.3482289 343.570192
0.01 0.85 576.88024 0.3431127 364.598634
0.01 0.90 608.83972 0.3372916 385.518813
0.01 0.95 640.78670 0.3328813 406.396622
0.01 1.00 672.70170 0.3284291 427.229363
0.10 0.05 12.32161 0.4789667 9.235035
0.10 0.10 11.84629 0.4930441 8.596401
0.10 0.15 11.84183 0.5007407 8.631976
0.10 0.20 11.71222 0.5114511 8.618887
0.10 0.25 11.64390 0.5146486 8.626041
0.10 0.30 11.64322 0.5136992 8.618967
0.10 0.35 11.59209 0.5168968 8.542478
0.10 0.40 11.61510 0.5170314 8.548679
0.10 0.45 11.63622 0.5186747 8.584014
0.10 0.50 11.70063 0.5168669 8.666544
0.10 0.55 11.75876 0.5148511 8.743543
0.10 0.60 11.82424 0.5128873 8.841233
0.10 0.65 11.89822 0.5105326 8.936577
0.10 0.70 11.95469 0.5104441 9.009389
0.10 0.75 12.01456 0.5102456 9.072426
0.10 0.80 12.07212 0.5098505 9.131490
0.10 0.85 12.13026 0.5095282 9.190574
0.10 0.90 12.19561 0.5082315 9.255955
0.10 0.95 12.26695 0.5065606 9.316829
0.10 1.00 12.34175 0.5048552 9.378287
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.35 and lambda = 0.1.
enet_predictions <- predict(enetTune, newdata = testPredictors)
enet_values <- data.frame(obs = testClasses, pred = enet_predictions)
enet_defaultSummaryValues <- defaultSummary(enet_values)
enet_Rsquared <- enet_defaultSummaryValues["Rsquared"]
# Create a data frame to compare R-squared values
model_comparison <- data.frame(
model = c("PLS", "Linear", "Robust Linear", "Elastic Net"),
r_squared = c(
Rsquared,
lm_Rsquared,
rlm_Rsquared,
enet_Rsquared
)
) |>
arrange(desc(r_squared))
model_comparison model r_squared
1 PLS 0.58156609
2 Elastic Net 0.56599643
3 Robust Linear 0.48831242
4 Linear 0.04017562
The model with the highest R2 was PLS with an R2 of 0.5815661.
- Would you recommend any of your models to replace the permeability laboratory experiment?
# PLS
defaultSummaryValues RMSE Rsquared MAE
9.9102548 0.5815661 6.3187502
# LM
lm_defaultSummaryValues RMSE Rsquared MAE
28.42499691 0.04017562 15.47919899
# RLM
rlm_defaultSummaryValues RMSE Rsquared MAE
11.5870686 0.4883124 7.4740797
# Elastic net
enet_defaultSummaryValues RMSE Rsquared MAE
10.0095069 0.5659964 7.1854308
No, I would not recommend any of my models (LM, RLM, or elastic net) over the permeability laboratory experiment PLS model since the latter has the lowest RMSE, highest R2, and lowest MAE.
6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), 6.5 Computing 139 measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
- Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
set.seed(813098)The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
- A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# Impute missing values using KNN imputation
ChemicalManufacturingProcess |> skim()| Name | ChemicalManufacturingProc… |
| Number of rows | 176 |
| Number of columns | 58 |
| _______________________ | |
| Column type frequency: | |
| numeric | 58 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Yield | 0 | 1.00 | 40.18 | 1.85 | 35.25 | 38.75 | 39.97 | 41.48 | 46.34 | ▁▇▇▃▁ |
| BiologicalMaterial01 | 0 | 1.00 | 6.41 | 0.71 | 4.58 | 5.98 | 6.30 | 6.87 | 8.81 | ▂▇▇▂▁ |
| BiologicalMaterial02 | 0 | 1.00 | 55.69 | 4.03 | 46.87 | 52.68 | 55.09 | 58.74 | 64.75 | ▂▇▆▅▃ |
| BiologicalMaterial03 | 0 | 1.00 | 67.70 | 4.00 | 56.97 | 64.98 | 67.22 | 70.43 | 78.25 | ▂▅▇▆▁ |
| BiologicalMaterial04 | 0 | 1.00 | 12.35 | 1.77 | 9.38 | 11.25 | 12.10 | 13.22 | 23.09 | ▇▆▁▁▁ |
| BiologicalMaterial05 | 0 | 1.00 | 18.60 | 1.84 | 13.24 | 17.24 | 18.49 | 19.90 | 24.85 | ▁▅▇▃▁ |
| BiologicalMaterial06 | 0 | 1.00 | 48.91 | 3.75 | 40.60 | 46.06 | 48.46 | 51.34 | 59.38 | ▂▇▆▅▁ |
| BiologicalMaterial07 | 0 | 1.00 | 100.01 | 0.11 | 100.00 | 100.00 | 100.00 | 100.00 | 100.83 | ▇▁▁▁▁ |
| BiologicalMaterial08 | 0 | 1.00 | 17.49 | 0.68 | 15.88 | 17.06 | 17.51 | 17.88 | 19.14 | ▁▅▇▃▂ |
| BiologicalMaterial09 | 0 | 1.00 | 12.85 | 0.42 | 11.44 | 12.60 | 12.84 | 13.13 | 14.08 | ▁▃▇▅▁ |
| BiologicalMaterial10 | 0 | 1.00 | 2.80 | 0.60 | 1.77 | 2.46 | 2.71 | 2.99 | 6.87 | ▇▅▁▁▁ |
| BiologicalMaterial11 | 0 | 1.00 | 146.95 | 4.82 | 135.81 | 143.82 | 146.08 | 149.60 | 158.73 | ▂▆▇▃▂ |
| BiologicalMaterial12 | 0 | 1.00 | 20.20 | 0.77 | 18.35 | 19.73 | 20.12 | 20.75 | 22.21 | ▂▆▇▃▂ |
| ManufacturingProcess01 | 1 | 0.99 | 11.21 | 1.82 | 0.00 | 10.80 | 11.40 | 12.15 | 14.10 | ▁▁▁▅▇ |
| ManufacturingProcess02 | 3 | 0.98 | 16.68 | 8.47 | 0.00 | 19.30 | 21.00 | 21.50 | 22.50 | ▂▁▁▁▇ |
| ManufacturingProcess03 | 15 | 0.91 | 1.54 | 0.02 | 1.47 | 1.53 | 1.54 | 1.55 | 1.60 | ▁▃▆▇▁ |
| ManufacturingProcess04 | 1 | 0.99 | 931.85 | 6.27 | 911.00 | 928.00 | 934.00 | 936.00 | 946.00 | ▁▂▃▇▁ |
| ManufacturingProcess05 | 1 | 0.99 | 1001.69 | 30.53 | 923.00 | 986.75 | 999.20 | 1008.85 | 1175.30 | ▁▇▁▁▁ |
| ManufacturingProcess06 | 2 | 0.99 | 207.40 | 2.70 | 203.00 | 205.70 | 206.80 | 208.70 | 227.40 | ▇▃▁▁▁ |
| ManufacturingProcess07 | 1 | 0.99 | 177.48 | 0.50 | 177.00 | 177.00 | 177.00 | 178.00 | 178.00 | ▇▁▁▁▇ |
| ManufacturingProcess08 | 1 | 0.99 | 177.55 | 0.50 | 177.00 | 177.00 | 178.00 | 178.00 | 178.00 | ▆▁▁▁▇ |
| ManufacturingProcess09 | 0 | 1.00 | 45.66 | 1.55 | 38.89 | 44.89 | 45.73 | 46.52 | 49.36 | ▁▁▅▇▂ |
| ManufacturingProcess10 | 9 | 0.95 | 9.18 | 0.77 | 7.50 | 8.70 | 9.10 | 9.55 | 11.60 | ▂▇▆▂▁ |
| ManufacturingProcess11 | 10 | 0.94 | 9.39 | 0.72 | 7.50 | 9.00 | 9.40 | 9.90 | 11.50 | ▂▆▇▅▁ |
| ManufacturingProcess12 | 1 | 0.99 | 857.81 | 1784.53 | 0.00 | 0.00 | 0.00 | 0.00 | 4549.00 | ▇▁▁▁▂ |
| ManufacturingProcess13 | 0 | 1.00 | 34.51 | 1.02 | 32.10 | 33.90 | 34.60 | 35.20 | 38.60 | ▃▇▇▁▁ |
| ManufacturingProcess14 | 1 | 0.99 | 4853.87 | 54.52 | 4701.00 | 4828.00 | 4856.00 | 4882.50 | 5055.00 | ▁▅▇▂▁ |
| ManufacturingProcess15 | 0 | 1.00 | 6038.92 | 58.31 | 5904.00 | 6010.00 | 6031.50 | 6061.00 | 6233.00 | ▂▇▆▂▁ |
| ManufacturingProcess16 | 0 | 1.00 | 4565.80 | 351.70 | 0.00 | 4560.75 | 4588.00 | 4619.00 | 4852.00 | ▁▁▁▁▇ |
| ManufacturingProcess17 | 0 | 1.00 | 34.34 | 1.25 | 31.30 | 33.50 | 34.40 | 35.10 | 40.00 | ▂▇▆▁▁ |
| ManufacturingProcess18 | 0 | 1.00 | 4809.68 | 367.48 | 0.00 | 4813.00 | 4835.00 | 4862.00 | 4971.00 | ▁▁▁▁▇ |
| ManufacturingProcess19 | 0 | 1.00 | 6028.20 | 45.58 | 5890.00 | 6000.75 | 6022.00 | 6050.25 | 6146.00 | ▁▃▇▃▂ |
| ManufacturingProcess20 | 0 | 1.00 | 4556.46 | 349.01 | 0.00 | 4552.75 | 4582.00 | 4609.50 | 4759.00 | ▁▁▁▁▇ |
| ManufacturingProcess21 | 0 | 1.00 | -0.16 | 0.78 | -1.80 | -0.60 | -0.30 | 0.00 | 3.60 | ▂▇▂▁▁ |
| ManufacturingProcess22 | 1 | 0.99 | 5.41 | 3.33 | 0.00 | 3.00 | 5.00 | 8.00 | 12.00 | ▇▇▇▅▅ |
| ManufacturingProcess23 | 1 | 0.99 | 3.02 | 1.66 | 0.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▆▇▆▇ |
| ManufacturingProcess24 | 1 | 0.99 | 8.83 | 5.80 | 0.00 | 4.00 | 8.00 | 14.00 | 23.00 | ▇▇▅▆▁ |
| ManufacturingProcess25 | 5 | 0.97 | 4828.18 | 373.48 | 0.00 | 4832.00 | 4855.00 | 4877.00 | 4990.00 | ▁▁▁▁▇ |
| ManufacturingProcess26 | 5 | 0.97 | 6015.60 | 464.87 | 0.00 | 6019.50 | 6047.00 | 6070.50 | 6161.00 | ▁▁▁▁▇ |
| ManufacturingProcess27 | 5 | 0.97 | 4562.51 | 353.98 | 0.00 | 4560.00 | 4587.00 | 4609.00 | 4710.00 | ▁▁▁▁▇ |
| ManufacturingProcess28 | 5 | 0.97 | 6.59 | 5.25 | 0.00 | 0.00 | 10.40 | 10.75 | 11.50 | ▅▁▁▁▇ |
| ManufacturingProcess29 | 5 | 0.97 | 20.01 | 1.66 | 0.00 | 19.70 | 19.90 | 20.40 | 22.00 | ▁▁▁▁▇ |
| ManufacturingProcess30 | 5 | 0.97 | 9.16 | 0.98 | 0.00 | 8.80 | 9.10 | 9.70 | 11.20 | ▁▁▁▅▇ |
| ManufacturingProcess31 | 5 | 0.97 | 70.18 | 5.56 | 0.00 | 70.10 | 70.80 | 71.40 | 72.50 | ▁▁▁▁▇ |
| ManufacturingProcess32 | 0 | 1.00 | 158.47 | 5.40 | 143.00 | 155.00 | 158.00 | 162.00 | 173.00 | ▁▃▇▃▁ |
| ManufacturingProcess33 | 5 | 0.97 | 63.54 | 2.48 | 56.00 | 62.00 | 64.00 | 65.00 | 70.00 | ▁▃▇▅▁ |
| ManufacturingProcess34 | 5 | 0.97 | 2.49 | 0.05 | 2.30 | 2.50 | 2.50 | 2.50 | 2.60 | ▁▂▁▇▁ |
| ManufacturingProcess35 | 5 | 0.97 | 495.60 | 10.82 | 463.00 | 490.00 | 495.00 | 501.50 | 522.00 | ▁▂▇▅▂ |
| ManufacturingProcess36 | 5 | 0.97 | 0.02 | 0.00 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | ▂▇▇▁▃ |
| ManufacturingProcess37 | 0 | 1.00 | 1.01 | 0.45 | 0.00 | 0.70 | 1.00 | 1.30 | 2.30 | ▂▇▇▃▁ |
| ManufacturingProcess38 | 0 | 1.00 | 2.53 | 0.65 | 0.00 | 2.00 | 3.00 | 3.00 | 3.00 | ▁▁▁▅▇ |
| ManufacturingProcess39 | 0 | 1.00 | 6.85 | 1.51 | 0.00 | 7.10 | 7.20 | 7.30 | 7.50 | ▁▁▁▁▇ |
| ManufacturingProcess40 | 1 | 0.99 | 0.02 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | ▇▁▁▁▂ |
| ManufacturingProcess41 | 1 | 0.99 | 0.02 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | ▇▁▁▁▁ |
| ManufacturingProcess42 | 0 | 1.00 | 11.21 | 1.94 | 0.00 | 11.40 | 11.60 | 11.70 | 12.10 | ▁▁▁▁▇ |
| ManufacturingProcess43 | 0 | 1.00 | 0.91 | 0.87 | 0.00 | 0.60 | 0.80 | 1.02 | 11.00 | ▇▁▁▁▁ |
| ManufacturingProcess44 | 0 | 1.00 | 1.81 | 0.32 | 0.00 | 1.80 | 1.90 | 1.90 | 2.10 | ▁▁▁▁▇ |
| ManufacturingProcess45 | 0 | 1.00 | 2.14 | 0.41 | 0.00 | 2.10 | 2.20 | 2.30 | 2.60 | ▁▁▁▂▇ |
Some of the predictors have missing values. The skim() function shows that the ChemicalManufacturingProcess data set has 176 rows and 58 columns, with some columns having missing values.
impute <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute")
)
imputeCreated from 152 samples and 58 variables
Pre-processing:
- centered (58)
- ignored (0)
- 5 nearest neighbor imputation (58)
- scaled (58)
chemical_impute <- predict(
impute,
ChemicalManufacturingProcess
)
# Browse the imputed data to confirm completeness.
chemical_impute |> skim()| Name | chemical_impute |
| Number of rows | 176 |
| Number of columns | 58 |
| _______________________ | |
| Column type frequency: | |
| numeric | 58 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Yield | 0 | 1 | 0.00 | 1.00 | -2.67 | -0.77 | -0.11 | 0.70 | 3.34 | ▁▇▇▃▁ |
| BiologicalMaterial01 | 0 | 1 | 0.00 | 1.00 | -2.57 | -0.61 | -0.15 | 0.64 | 3.36 | ▂▇▇▂▁ |
| BiologicalMaterial02 | 0 | 1 | 0.00 | 1.00 | -2.19 | -0.75 | -0.15 | 0.76 | 2.25 | ▂▇▆▅▃ |
| BiologicalMaterial03 | 0 | 1 | 0.00 | 1.00 | -2.68 | -0.68 | -0.12 | 0.68 | 2.64 | ▂▅▇▆▁ |
| BiologicalMaterial04 | 0 | 1 | 0.00 | 1.00 | -1.67 | -0.62 | -0.14 | 0.49 | 6.05 | ▇▆▁▁▁ |
| BiologicalMaterial05 | 0 | 1 | 0.00 | 1.00 | -2.91 | -0.74 | -0.06 | 0.71 | 3.39 | ▁▅▇▃▁ |
| BiologicalMaterial06 | 0 | 1 | 0.00 | 1.00 | -2.22 | -0.76 | -0.12 | 0.65 | 2.79 | ▂▇▆▅▁ |
| BiologicalMaterial07 | 0 | 1 | 0.00 | 1.00 | -0.13 | -0.13 | -0.13 | -0.13 | 7.57 | ▇▁▁▁▁ |
| BiologicalMaterial08 | 0 | 1 | 0.00 | 1.00 | -2.39 | -0.64 | 0.02 | 0.57 | 2.43 | ▁▅▇▃▂ |
| BiologicalMaterial09 | 0 | 1 | 0.00 | 1.00 | -3.40 | -0.60 | -0.04 | 0.67 | 2.96 | ▁▃▇▅▁ |
| BiologicalMaterial10 | 0 | 1 | 0.00 | 1.00 | -1.72 | -0.57 | -0.15 | 0.32 | 6.79 | ▇▅▁▁▁ |
| BiologicalMaterial11 | 0 | 1 | 0.00 | 1.00 | -2.31 | -0.65 | -0.18 | 0.55 | 2.44 | ▂▆▇▃▂ |
| BiologicalMaterial12 | 0 | 1 | 0.00 | 1.00 | -2.39 | -0.61 | -0.10 | 0.71 | 2.60 | ▂▆▇▃▂ |
| ManufacturingProcess01 | 0 | 1 | 0.00 | 1.00 | -6.15 | -0.22 | 0.11 | 0.50 | 1.59 | ▁▁▁▅▇ |
| ManufacturingProcess02 | 0 | 1 | 0.01 | 0.99 | -1.97 | 0.31 | 0.51 | 0.57 | 0.69 | ▂▁▁▁▇ |
| ManufacturingProcess03 | 0 | 1 | 0.04 | 0.97 | -3.11 | -0.43 | 0.38 | 0.47 | 2.70 | ▁▂▆▇▁ |
| ManufacturingProcess04 | 0 | 1 | 0.00 | 1.00 | -3.32 | -0.61 | 0.34 | 0.66 | 2.25 | ▁▂▃▇▂ |
| ManufacturingProcess05 | 0 | 1 | 0.00 | 1.00 | -2.58 | -0.49 | -0.09 | 0.23 | 5.69 | ▁▇▁▁▁ |
| ManufacturingProcess06 | 0 | 1 | -0.01 | 1.00 | -1.63 | -0.63 | -0.22 | 0.48 | 7.41 | ▇▃▁▁▁ |
| ManufacturingProcess07 | 0 | 1 | 0.00 | 1.00 | -0.96 | -0.96 | -0.96 | 1.04 | 1.04 | ▇▁▁▁▇ |
| ManufacturingProcess08 | 0 | 1 | 0.00 | 1.00 | -1.11 | -1.11 | 0.89 | 0.89 | 0.89 | ▆▁▁▁▇ |
| ManufacturingProcess09 | 0 | 1 | 0.00 | 1.00 | -4.38 | -0.50 | 0.05 | 0.55 | 2.39 | ▁▁▅▇▂ |
| ManufacturingProcess10 | 0 | 1 | 0.02 | 0.99 | -2.19 | -0.62 | -0.10 | 0.55 | 3.16 | ▂▇▇▂▁ |
| ManufacturingProcess11 | 0 | 1 | 0.03 | 1.00 | -2.63 | -0.54 | 0.02 | 0.72 | 2.95 | ▁▇▇▆▁ |
| ManufacturingProcess12 | 0 | 1 | 0.00 | 1.00 | -0.48 | -0.48 | -0.48 | -0.48 | 2.07 | ▇▁▁▁▂ |
| ManufacturingProcess13 | 0 | 1 | 0.00 | 1.00 | -2.37 | -0.60 | 0.09 | 0.68 | 4.03 | ▃▆▇▁▁ |
| ManufacturingProcess14 | 0 | 1 | 0.00 | 1.00 | -2.80 | -0.49 | 0.03 | 0.52 | 3.69 | ▁▅▇▂▁ |
| ManufacturingProcess15 | 0 | 1 | 0.00 | 1.00 | -2.31 | -0.50 | -0.13 | 0.38 | 3.33 | ▂▇▆▂▁ |
| ManufacturingProcess16 | 0 | 1 | 0.00 | 1.00 | -12.98 | -0.01 | 0.06 | 0.15 | 0.81 | ▁▁▁▁▇ |
| ManufacturingProcess17 | 0 | 1 | 0.00 | 1.00 | -2.44 | -0.68 | 0.05 | 0.61 | 4.53 | ▂▇▆▁▁ |
| ManufacturingProcess18 | 0 | 1 | 0.00 | 1.00 | -13.09 | 0.01 | 0.07 | 0.14 | 0.44 | ▁▁▁▁▇ |
| ManufacturingProcess19 | 0 | 1 | 0.00 | 1.00 | -3.03 | -0.60 | -0.14 | 0.48 | 2.58 | ▁▃▇▃▂ |
| ManufacturingProcess20 | 0 | 1 | 0.00 | 1.00 | -13.06 | -0.01 | 0.07 | 0.15 | 0.58 | ▁▁▁▁▇ |
| ManufacturingProcess21 | 0 | 1 | 0.00 | 1.00 | -2.10 | -0.56 | -0.17 | 0.21 | 4.84 | ▂▇▂▁▁ |
| ManufacturingProcess22 | 0 | 1 | 0.00 | 1.00 | -1.62 | -0.72 | -0.12 | 0.78 | 1.98 | ▇▇▇▅▅ |
| ManufacturingProcess23 | 0 | 1 | 0.00 | 1.00 | -1.81 | -0.61 | -0.01 | 0.59 | 1.79 | ▇▆▇▆▇ |
| ManufacturingProcess24 | 0 | 1 | 0.01 | 1.00 | -1.52 | -0.83 | -0.14 | 0.89 | 2.44 | ▇▇▅▆▁ |
| ManufacturingProcess25 | 0 | 1 | 0.00 | 0.99 | -12.93 | 0.00 | 0.07 | 0.13 | 0.43 | ▁▁▁▁▇ |
| ManufacturingProcess26 | 0 | 1 | 0.00 | 0.99 | -12.94 | 0.01 | 0.06 | 0.12 | 0.31 | ▁▁▁▁▇ |
| ManufacturingProcess27 | 0 | 1 | 0.00 | 0.99 | -12.89 | 0.00 | 0.07 | 0.13 | 0.42 | ▁▁▁▁▇ |
| ManufacturingProcess28 | 0 | 1 | -0.03 | 1.00 | -1.26 | -1.26 | 0.73 | 0.78 | 0.94 | ▅▁▁▁▇ |
| ManufacturingProcess29 | 0 | 1 | 0.00 | 0.99 | -12.03 | -0.19 | -0.07 | 0.23 | 1.20 | ▁▁▁▁▇ |
| ManufacturingProcess30 | 0 | 1 | 0.01 | 0.99 | -9.39 | -0.37 | 0.04 | 0.55 | 2.09 | ▁▁▁▅▇ |
| ManufacturingProcess31 | 0 | 1 | 0.00 | 0.99 | -12.63 | -0.02 | 0.11 | 0.22 | 0.42 | ▁▁▁▁▇ |
| ManufacturingProcess32 | 0 | 1 | 0.00 | 1.00 | -2.87 | -0.64 | -0.09 | 0.65 | 2.69 | ▁▃▇▃▁ |
| ManufacturingProcess33 | 0 | 1 | -0.01 | 0.99 | -3.04 | -0.62 | 0.18 | 0.59 | 2.60 | ▁▃▇▅▁ |
| ManufacturingProcess34 | 0 | 1 | -0.01 | 0.99 | -3.56 | 0.12 | 0.12 | 0.12 | 1.96 | ▁▂▁▇▁ |
| ManufacturingProcess35 | 0 | 1 | -0.02 | 1.00 | -3.01 | -0.52 | -0.06 | 0.50 | 2.44 | ▁▃▇▅▂ |
| ManufacturingProcess36 | 0 | 1 | -0.01 | 0.99 | -2.94 | -0.66 | -0.08 | 0.49 | 2.78 | ▂▇▁▇▃ |
| ManufacturingProcess37 | 0 | 1 | 0.00 | 1.00 | -2.28 | -0.70 | -0.03 | 0.64 | 2.89 | ▂▇▇▃▁ |
| ManufacturingProcess38 | 0 | 1 | 0.00 | 1.00 | -3.90 | -0.82 | 0.72 | 0.72 | 0.72 | ▁▁▁▅▇ |
| ManufacturingProcess39 | 0 | 1 | 0.00 | 1.00 | -4.55 | 0.17 | 0.23 | 0.30 | 0.43 | ▁▁▁▁▇ |
| ManufacturingProcess40 | 0 | 1 | 0.00 | 1.00 | -0.46 | -0.46 | -0.46 | -0.46 | 2.15 | ▇▁▁▁▂ |
| ManufacturingProcess41 | 0 | 1 | 0.00 | 1.00 | -0.44 | -0.44 | -0.44 | -0.44 | 3.28 | ▇▁▁▁▁ |
| ManufacturingProcess42 | 0 | 1 | 0.00 | 1.00 | -5.77 | 0.10 | 0.20 | 0.25 | 0.46 | ▁▁▁▁▇ |
| ManufacturingProcess43 | 0 | 1 | 0.00 | 1.00 | -1.05 | -0.36 | -0.13 | 0.13 | 11.62 | ▇▁▁▁▁ |
| ManufacturingProcess44 | 0 | 1 | 0.00 | 1.00 | -5.61 | -0.02 | 0.29 | 0.29 | 0.92 | ▁▁▁▁▇ |
| ManufacturingProcess45 | 0 | 1 | 0.00 | 1.00 | -5.25 | -0.09 | 0.15 | 0.40 | 1.14 | ▁▁▁▂▇ |
I used KNN to impute to fill in missing values. The data is now 100% complete.
- Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# remove nzv predictors
nzv <- nearZeroVar(chemical_impute)
filtered_chemical <- chemical_impute[, -nzv]
filtered_chemical |> ncol()[1] 57
# Split the data into a training and a test set
trainingRows <- createDataPartition(
filtered_chemical$Yield,
p = .80,
list = FALSE
)
chemical_train <- filtered_chemical[trainingRows, ]
chemical_test <- filtered_chemical[-trainingRows, ]
# Preprocess and tune a PLS model
plsTune <- train(
chemical_train[
,
!names(chemical_train) %in% "Yield"
],
chemical_train$Yield,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
plsTunePartial Least Squares
144 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 129, 129, 131, 130, 130, 129, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 0.7735224 0.4192347 0.6342773
2 0.8287404 0.5360269 0.6014268
3 0.7082333 0.5900823 0.5361202
4 0.8332205 0.5694967 0.5802218
5 0.8900838 0.5486265 0.6036344
6 0.8909049 0.5520740 0.6064889
7 0.9067980 0.5417347 0.6193941
8 0.9717348 0.5325744 0.6454026
9 0.9587565 0.5346397 0.6445537
10 0.9692858 0.5230863 0.6510060
11 0.9455037 0.5193371 0.6492300
12 0.9189663 0.5335323 0.6425027
13 0.9533670 0.5408934 0.6468623
14 1.0262358 0.5237221 0.6747596
15 1.0751780 0.5114347 0.6884491
16 1.0517792 0.5036778 0.6841665
17 1.0757681 0.4974535 0.6884337
18 1.1338358 0.4942898 0.7045341
19 1.1811687 0.4917023 0.7160439
20 1.2733157 0.4895829 0.7401724
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 3.
RMSE <- plsTune$results$RMSE[plsTune$bestTune$ncomp]The optimal value of the performance metric RMSE was 0.7082333. Note that this value is likely to be highly optimistic as it has been derived by re-predicting the training set data.
- Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# Predict the response for the test set
predictions <- predict(plsTune, newdata = chemical_test)
# Get value of the performance metric
plsValues <- data.frame(obs = chemical_test$Yield, pred = predictions)
performance_metric <- defaultSummary(plsValues)["RMSE"]The value of the performance metric RMSE is 0.8726064, which is higher (worse) than the resampled performance metric on the training set.
- Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
importance <- varImp(plsTune, scale = FALSE)$importance |> arrange(-Overall)
top_predictors <- head(importance)
top_predictors Overall
ManufacturingProcess32 0.06759997
ManufacturingProcess36 0.05713581
ManufacturingProcess13 0.05702366
ManufacturingProcess17 0.05467426
ManufacturingProcess09 0.05089636
ManufacturingProcess06 0.04741952
The most important predictors in this model are all manufacturing processes, as printed above.
- Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
top_predictors_data <- ChemicalManufacturingProcess[, c(
top_predictors |> rownames(),
"Yield"
)]
# Create correlation plot
ggpairs(
top_predictors_data
) +
theme_classic()I used ggpairs to plot the relationships between each of the top predictors and the response in the bottom row, and the rightmost column provides the correlation between the predictors and the response variable. This information can help identify which predictors have the strongest effects on the yield, guiding future improvements in the manufacturing process. For example, ManufacutringProcess32 has the highest positive correlation with Yield (as ManufacturingProcess32 increases, Yield increases), so this would be a measure to aim to increase for a greater yield.