library(tidyverse)
library(caret)
library(pls)
library(glmnet)
library(corrplot)
library(e1071)
library(lattice)
library(car)
library(RANN)
library(AppliedPredictiveModeling)
Assignment 7: Linear Regression and Its Cousins
This assignment covers Exercises 6.2 and 6.3 from Kuhn and Johnson’s Applied Predictive Modeling. While there are only two problems, each includes multiple steps and requires building and evaluating several models.
6.2.
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a)
Start R and use these commands to load the data:
> library(AppliedPredictiveModeling)
> data(permeability)
data(permeability)
head(permeability)
permeability
1 12.520
2 1.120
3 19.405
4 1.730
5 1.680
6 0.510
glimpse(permeability)
num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:165] "1" "2" "3" "4" ...
..$ : chr "permeability"
glimpse(fingerprints)
num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:165] "1" "2" "3" "4" ...
..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
nearzerovar_fingerprints <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nearzerovar_fingerprints]
ncol(filtered_fingerprints)
[1] 388
After filtering out near-zero variance predictors using the nearZeroVar function from the caret package, 388 predictors remain for modeling.
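As a quick sanity check (a sketch, not part of the original output), the filter dropped 1107 - 388 = 719 near-zero variance predictors:
# Number of predictors removed by the near-zero variance filter
ncol(fingerprints) - ncol(filtered_fingerprints)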
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?
First we’ll start with splitting the data into a training and a test set.
set.seed(5889)
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
trainX <- filtered_fingerprints[trainIndex, ]
testX <- filtered_fingerprints[-trainIndex, ]
trainY <- permeability[trainIndex]
testY <- permeability[-trainIndex]
Next we’ll pre-process the data. Heavy pre-processing is not required because the fingerprint predictors are binary, but centering and scaling are still applied within the model training step.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
The ctrl object above will be passed to train() below as trControl = ctrl. As noted previously, since these are binary predictors, there’s no need for manual scaling outside the model pipeline.
We now tune a PLS model using repeated cross-validation, with centering and scaling applied during model training.
plsFit <- train(trainX, trainY,
                method = "pls",
                preProc = c("center", "scale"),
                tuneLength = 20,
                trControl = ctrl)
Now we’ll evaluate the model:
predictions <- predict(plsFit, testX)
results <- postResample(predictions, testY)
results
RMSE Rsquared MAE
13.9266124 0.2791087 10.7374415
plsFit$bestTune
ncomp
3 3
This indicates that the optimal number of latent variables (PLS components) is 3, based on repeated cross-validation.
max(plsFit$results$Rsquared)
[1] 0.5114109
Note that max(plsFit$results$Rsquared) returns the largest resampled R-squared across all candidate component counts (0.5114 here), which does not have to occur at the RMSE-selected optimum. At the selected ncomp = 3, the corresponding resampled estimate of R-squared is 0.5047.
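To read that value off directly, we can filter the resampling results for the chosen component count (a small sketch; the same pattern is used again in 6.3):
plsFit$results |>
  filter(ncomp == plsFit$bestTune$ncomp) |>
  select(ncomp, RMSE, Rsquared)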
So, to summarize the answer to (c), the optimal number of latent variables is 3, and the corresponding R-squared estimate is 0.50465.
The best cross-validated R-squared at the selected tune was 0.5047 using 3 components, while the test set R-squared was 0.2791. This difference shows the importance of validating on unseen data to estimate model performance.
ggplot(plsFit$results, aes(x = ncomp, y = Rsquared)) +
geom_line() +
geom_point() +
theme_minimal() +
labs(title = "Cross-Validated R-squared vs. Number of PLS Components",
x = "Number of Components",
y = "R-squared")
This plot shows how cross-validated R-squared varies with the number of PLS components, making it easy to see why a small number of components (three, per the tuning results) was selected.
(d) Predict the response for the test set. What is the test set estimate of R^2?
results
RMSE Rsquared MAE
13.9266124 0.2791087 10.7374415
Previously, we used the fitted PLS model with 3 components to predict permeability on the held-out test set. The test set R-squared — computed by comparing the model’s predictions against the true test outcomes — was 0.2791.
This is notably lower than the resampled R-squared from training (0.5047), indicating that the model may not generalize as well as suggested by cross-validation. This performance gap highlights the importance of using an independent test set to assess prediction accuracy.
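One way to visualize this gap (a sketch, not part of the original analysis) is to plot predicted against observed permeability for the test set; the farther points fall from the dashed identity line, the larger the prediction errors:
# Predicted vs. observed permeability on the held-out test set
tibble(observed = as.vector(testY), predicted = as.vector(predictions)) |>
  ggplot(aes(x = observed, y = predicted)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "PLS Test Set: Predicted vs. Observed Permeability")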
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
In the models below we’ll continue to use the ctrl object created previously.
Ridge Regression (alpha = 0):
set.seed(5889)
ridgeFit <- train(trainX, trainY,
                  method = "glmnet",
                  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 20)),
                  preProc = c("center", "scale"),
                  trControl = ctrl)
Lasso Regression (alpha = 1):
set.seed(5889)
lassoFit <- train(trainX, trainY,
                  method = "glmnet",
                  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 20)),
                  preProc = c("center", "scale"),
                  trControl = ctrl)
Elastic Net (alpha between 0 and 1):
set.seed(5889)
enetFit <- train(trainX, trainY,
                 method = "glmnet",
                 tuneLength = 10, # will search over alpha and lambda
                 preProc = c("center", "scale"),
                 trControl = ctrl)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
Evaluation of models on the test set:
# Predictions
ridgePred <- predict(ridgeFit, testX)
lassoPred <- predict(lassoFit, testX)
enetPred <- predict(enetFit, testX)

# R-squared and RMSE
ridgePerf <- postResample(ridgePred, testY)
lassoPerf <- postResample(lassoPred, testY)
enetPerf <- postResample(enetPred, testY)
model_results <- tibble(
  Model = c("PLS", "Ridge", "Lasso", "Elastic Net"),
  Rsq = c(results["Rsquared"], ridgePerf["Rsquared"],
          lassoPerf["Rsquared"], enetPerf["Rsquared"]),
  RMSE = c(results["RMSE"], ridgePerf["RMSE"],
           lassoPerf["RMSE"], enetPerf["RMSE"])
)
model_results
# A tibble: 4 × 3
Model Rsq RMSE
<chr> <dbl> <dbl>
1 PLS 0.279 13.9
2 Ridge 0.369 12.6
3 Lasso 0.445 11.9
4 Elastic Net 0.469 11.5
We fit ridge, lasso, and elastic net models using 10-fold cross-validation repeated 5 times. All models used the same training and test splits as the PLS model. Test set performance is the best estimate of generalization. The test set R-squared values are below:
PLS: R-squared = 0.2791
Ridge: R-squared = 0.3686
Lasso: R-squared = 0.4446
Elastic Net: R-squared = 0.46898
Elastic Net regression performed the best, resulting in the highest R-squared value and the lowest RMSE. This indicates that it explains the most variability in permeability and had the most accurate predictions. The Lasso regression performed second-best, with better R-squared and RMSE scores than Ridge or PLS. Ridge regression performed better than PLS but not better than Lasso or Elastic Net regression.
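A quick bar chart of the model_results tibble (added here as an illustrative sketch) makes the ranking easy to see:
ggplot(model_results, aes(x = reorder(Model, Rsq), y = Rsq)) +
  geom_col() +
  coord_flip() +
  labs(title = "Test Set R-squared by Model",
       x = "Model",
       y = "Test set R-squared")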
These results show that models that automatically focus on the most useful predictors, rather than weighting all of them equally, worked better in this case. Techniques that shrink or drop uninformative predictors help prevent overfitting, which matters here because the dataset contains many binary (0/1) variables, and not all of them are important for predicting permeability.
Elastic net seems to have benefitted from its hybrid penalty (L1 and L2), balancing feature selection (like Lasso) with stability in the presence of multicollinearity (like Ridge).
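To see exactly which penalty mix the tuning settled on, we can inspect the selected hyperparameters (a quick check; an alpha near 0 behaves like ridge, near 1 like lasso):
# Selected mixing proportion (alpha) and penalty strength (lambda)
enetFit$bestTune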
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
While some of the models (especially Elastic Net regression) performed well, I would be cautious about fully replacing the laboratory experiment at this stage. The best model achieved an R-squared of 0.469, meaning it explains just under half of the variability in permeability. While that’s a promising result, it also shows that more than half of the variation remains unexplained.
These models could still be useful in practice, for example to screen compounds before running a lab test. If model accuracy improves (like with more training data, more informative predictors, or better feature engineering, for instance), then a stronger case could be made for partial or full replacement. But at present, I would recommend using the model as a complement to (not a substitute for) the lab process.
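As an illustrative sketch of the screening idea (the cutoff below is hypothetical, chosen only for demonstration), a model like the elastic net could flag which compounds are worth sending to the lab:
# Hypothetical screening rule: flag test compounds predicted to exceed
# a permeability cutoff (cutoff value is illustrative, not validated)
screen_cutoff <- 10
which(enetPred > screen_cutoff)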
6.3.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
[1] 176 58
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
head(ChemicalManufacturingProcess)
Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
1 38.00 6.25 49.58 56.97
2 42.44 8.01 60.97 67.48
3 42.03 8.01 60.97 67.48
4 41.42 8.01 60.97 67.48
5 42.49 7.47 63.33 72.25
6 43.57 6.12 58.36 65.31
BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
1 12.74 19.51 43.73
2 14.65 19.36 53.14
3 14.65 19.36 53.14
4 14.65 19.36 53.14
5 14.02 17.91 54.66
6 15.17 21.79 51.23
BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
1 100 16.66 11.44
2 100 19.04 12.55
3 100 19.04 12.55
4 100 19.04 12.55
5 100 18.22 12.80
6 100 18.30 12.13
BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
1 3.46 138.09 18.83
2 3.46 153.67 21.05
3 3.46 153.67 21.05
4 3.46 153.67 21.05
5 3.05 147.61 21.05
6 3.78 151.88 20.76
ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
1 NA NA NA
2 0.0 0 NA
3 0.0 0 NA
4 0.0 0 NA
5 10.7 0 NA
6 12.0 0 NA
ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
1 NA NA NA
2 917 1032.2 210.0
3 912 1003.6 207.1
4 911 1014.6 213.3
5 918 1027.5 205.7
6 924 1016.8 208.9
ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
1 NA NA 43.00
2 177 178 46.57
3 178 178 45.07
4 177 177 44.92
5 178 178 44.96
6 178 178 45.32
ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
1 NA NA NA
2 NA NA 0
3 NA NA 0
4 NA NA 0
5 NA NA 0
6 NA NA 0
ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
1 35.5 4898 6108
2 34.0 4869 6095
3 34.8 4878 6087
4 34.8 4897 6102
5 34.6 4992 6233
6 34.0 4985 6222
ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
1 4682 35.5 4865
2 4617 34.0 4867
3 4617 34.8 4877
4 4635 34.8 4872
5 4733 33.9 4886
6 4786 33.4 4862
ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
1 6049 4665 0.0
2 6097 4621 0.0
3 6078 4621 0.0
4 6073 4611 0.0
5 6102 4659 -0.7
6 6115 4696 -0.6
ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
1 NA NA NA
2 3 0 3
3 4 1 4
4 5 2 5
5 8 4 18
6 9 1 1
ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
1 4873 6074 4685
2 4869 6107 4630
3 4897 6116 4637
4 4892 6111 4630
5 4930 6151 4684
6 4871 6128 4687
ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
1 10.7 21.0 9.9
2 11.2 21.4 9.9
3 11.1 21.3 9.4
4 11.1 21.3 9.4
5 11.3 21.6 9.0
6 11.4 21.7 10.1
ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
1 69.1 156 66
2 68.7 169 66
3 69.3 173 66
4 69.3 171 68
5 69.4 171 70
6 68.2 173 70
ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
1 2.4 486 0.019
2 2.6 508 0.019
3 2.6 509 0.018
4 2.5 496 0.018
5 2.5 468 0.017
6 2.5 490 0.018
ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
1 0.5 3 7.2
2 2.0 2 7.2
3 0.7 2 7.2
4 1.2 2 7.2
5 0.2 2 7.3
6 0.4 2 7.2
ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
1 NA NA 11.6
2 0.1 0.15 11.1
3 0.0 0.00 12.0
4 0.0 0.00 10.6
5 0.0 0.00 11.0
6 0.0 0.00 11.5
ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
1 3.0 1.8 2.4
2 0.9 1.9 2.2
3 1.0 1.8 2.3
4 1.1 1.8 2.1
5 1.1 1.7 2.1
6 2.2 1.8 2.0
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
We’ll use KNN (k-nearest neighbors) imputation to fill in the missing values.
First, we’ll separate the predictors from the response variable, Yield.
cmpY <- ChemicalManufacturingProcess$Yield
cmpX <- ChemicalManufacturingProcess |> select(-Yield)
The prompt told us there is missing data, but let’s check just to be sure.
cmp_missing_summary <- sapply(cmpX, function(x) sum(is.na(x)))
cmp_missing_summary[cmp_missing_summary > 0]
ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
1 3 15
ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
1 1 2
ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10
1 1 9
ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14
10 1 1
ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
1 1 1
ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
5 5 5
ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
5 5 5
ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34
5 5 5
ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40
5 5 1
ManufacturingProcess41
1
We can now validate that there is indeed missing data that needs to be imputed. Let’s go ahead and do that using KNN imputation.
set.seed(5889)
cmp_impute_model <- preProcess(cmpX, method = "knnImpute")
Now we will apply the imputation to our predictors.
cmpX_imputed <- predict(cmp_impute_model, cmpX)
KNN fills in missing values based on the most similar rows (nearest neighbors). This is recommended by Applied Predictive Modeling as a practical and effective method for imputation in our scenario where missingness is relatively low and predictors are numeric variables.
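As a quick sanity check (not part of the original output), we can confirm that no missing values remain. One side effect worth noting: caret’s knnImpute also centers and scales the predictors, so cmpX_imputed is on the standardized scale.
# Should return 0 if imputation filled every missing cell
sum(is.na(cmpX_imputed))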
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
We’ll use the imputed predictor data to fit a regression model and then figure out which variables matter most.
First, we split the data into a training and a test set:
set.seed(5889)
cmp_split <- createDataPartition(cmpY, p = 0.8, list = FALSE)
cmp_trainX <- cmpX_imputed[cmp_split, ]
cmp_trainY <- cmpY[cmp_split]
cmp_testX <- cmpX_imputed[-cmp_split, ]
cmp_testY <- cmpY[-cmp_split]
For the cross-validation and PLS training below, we can reuse the same ctrl object we created in 6.2. We’ll also request a Box-Cox transformation to normalize the predictors; note, though, that knnImpute already centered and scaled them, and Box-Cox applies only to strictly positive variables, so it may have little effect here.
cmp_plsFit <- train(cmp_trainX, cmp_trainY,
                    method = "pls",
                    preProc = c("center", "scale", "BoxCox"),
                    tuneLength = 20,
                    trControl = ctrl)
To find out how many PLS components were optimal:
cmp_plsFit$bestTune
ncomp
3 3
cmp_plsFit$results |>
  filter(ncomp == cmp_plsFit$bestTune$ncomp)
ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 3 1.446177 0.5462859 1.085197 0.6617454 0.1551433 0.2762623
As with 6.2, the best number of components is 3, as found through cross-validation on the training set. The resampled R-squared at that best number of components (3) is 0.5463.
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Now we’ll predict on the test set and evaluate:
cmp_preds <- predict(cmp_plsFit, cmp_testX)
cmp_results <- postResample(cmp_preds, cmp_testY)
cmp_results
RMSE Rsquared MAE
0.9518583 0.7325815 0.7905189
The R-squared on the test set is 0.7326, which is actually higher than the resampled R-squared of 0.5463 on the training set. This suggests the model may have generalized better than expected, possibly due to a favorable test split or reduced noise in the test data.
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
cmp_importance <- varImp(cmp_plsFit, scale = TRUE)
plot(cmp_importance, top = 20, main = "Top 20 Most Important Predictors for Yield")
The manufacturing process predictors dominate this view of predictor importance. ManufacturingProcess32 is the most important, followed by ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, and ManufacturingProcess36.
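To quantify that dominance (a sketch, relying on the fact that varImp stores scores in an importance data frame with predictor names as row names), we can tally the top 20 by predictor type:
cmp_importance$importance |>
  rownames_to_column("Predictor") |>
  arrange(desc(Overall)) |>
  slice_head(n = 20) |>
  mutate(Type = if_else(startsWith(Predictor, "Biological"),
                        "Biological", "Process")) |>
  count(Type)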
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
First, we’ll bring the yield and predictors back into one data frame for plotting:
cmp_data_full <- cmpX_imputed |>
  mutate(Yield = cmpY)
Now we’ll pick the top three predictors, ManufacturingProcess32, ManufacturingProcess13, and ManufacturingProcess09, and plot each against Yield to make the relationships visually interpretable.
<- c("ManufacturingProcess32", "ManufacturingProcess13",
top_vars "ManufacturingProcess09")
for (var in top_vars) {
  print(
    # .data[[var]] is the tidy-eval replacement for the deprecated aes_string()
    ggplot(cmp_data_full, aes(x = .data[[var]], y = Yield)) +
      geom_point(alpha = 0.6) +
      geom_smooth(method = "loess", se = FALSE, color = "goldenrod") +
      labs(title = paste("Yield vs.", var))
  )
}
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
We created scatterplots with smooth trend lines to explore how the most important process variables influence product yield for the top three predictors: ManufacturingProcess32, ManufacturingProcess13, and ManufacturingProcess09. For ManufacturingProcess32, we observed a nonlinear relationship, where yield appears to increase as the process value increases from around 0 to 1.5, and then levels off. In contrast, ManufacturingProcess13 showed a negative association with yield — as values increase, yield tends to decrease. With ManufacturingProcess09 we see a positive linear relationship, where higher values are associated with higher yield. Because these are process predictors, they could be adjusted in future manufacturing processes and tests. Understanding these relationships provides insights and opportunities to fine-tune specific steps in production to maximize yield.
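As a numeric complement to the loess plots (a small sketch added for illustration), the linear correlations of the top three predictors with Yield summarize these directions:
# Pearson correlations of the top three process predictors with Yield
sapply(top_vars, function(v) cor(cmp_data_full[[v]], cmp_data_full$Yield))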