Assignment 7: Linear Regression and Its Cousins

Author

Amanda Rose Knudsen

Published

April 6, 2025

This assignment covers Exercises 6.2 and 6.3 from Kuhn and Johnson’s Applied Predictive Modeling. While there are only two problems, each includes multiple steps and requires building and evaluating several models. See Applied Predictive Modeling for reference.

library(tidyverse)
library(caret)
library(pls)
library(glmnet)
library(corrplot)
library(e1071)
library(lattice)
library(car)
library(RANN)
library(AppliedPredictiveModeling)

6.2.

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a)

Start R and use these commands to load the data:

> library(AppliedPredictiveModeling)
> data(permeability)

data(permeability)
head(permeability)
  permeability
1       12.520
2        1.120
3       19.405
4        1.730
5        1.680
6        0.510
glimpse(permeability)
 num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:165] "1" "2" "3" "4" ...
  ..$ : chr "permeability"
glimpse(fingerprints)
 num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:165] "1" "2" "3" "4" ...
  ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nearzerovar_fingerprints <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nearzerovar_fingerprints]
ncol(filtered_fingerprints)
[1] 388

After filtering out near-zero variance predictors using the nearZeroVar function from the caret package, 388 predictors remain for modeling.
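
If we want to see why individual predictors were flagged, nearZeroVar can also return its per-predictor diagnostics (a supplementary check, not required by the exercise):

# Frequency-ratio and percent-unique diagnostics for every predictor
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics)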

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?

First, we’ll split the data into a training and a test set.

set.seed(5889)
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
trainX <- filtered_fingerprints[trainIndex, ]
testX  <- filtered_fingerprints[-trainIndex, ]
trainY <- permeability[trainIndex]
testY  <- permeability[-trainIndex]

Now we’ll set up the pre-processing. Because the fingerprint predictors are binary, transformations such as Box-Cox are unnecessary, but centering and scaling are still applied within the model training step.

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

The ctrl object above specifies 10-fold cross-validation repeated 5 times (50 resamples) and will be passed to train() below as trControl = ctrl. We’re now ready to tune a PLS model using repeated cross-validation; as noted, the binary predictors need no transformations outside the model pipeline.

plsFit <- train(trainX, trainY,
                method = "pls",
                preProc = c("center", "scale"),
                tuneLength = 20,
                trControl = ctrl)

Now we’ll evaluate the model:

predictions <- predict(plsFit, testX)
results <- postResample(predictions, testY)
results
      RMSE   Rsquared        MAE 
13.9266124  0.2791087 10.7374415 
plsFit$bestTune
  ncomp
3     3

This indicates that the optimal number of latent variables (PLS components) is 3, based on repeated cross-validation.

plsFit$results |>
  filter(ncomp == plsFit$bestTune$ncomp) |>
  pull(Rsquared)
[1] 0.5046502

Note that max(plsFit$results$Rsquared) would instead return 0.5114, the largest resampled R-squared across all candidate component counts; because caret selects the number of components by RMSE, the resampled estimate that corresponds to the chosen model (ncomp = 3) is 0.5047.

So, to summarize the answer to (c): the optimal number of latent variables is 3, and the corresponding resampled R-squared estimate is 0.5047.

The best cross-validated R-squared at 3 components was 0.5047, while the test set R-squared was 0.2791. This difference shows the importance of validating on unseen data to estimate model performance.

ggplot(plsFit$results, aes(x = ncomp, y = Rsquared)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = "Cross-Validated R-squared vs. Number of PLS Components",
       x = "Number of Components",
       y = "R-squared")

The plot shows the resampled R-squared rising steeply up to 3 components and gaining little beyond that, which is consistent with caret’s RMSE-based choice of 3 components.

(d) Predict the response for the test set. What is the test set estimate of R^2?

results
      RMSE   Rsquared        MAE 
13.9266124  0.2791087 10.7374415 

Previously, we used the fitted PLS model with 3 components to predict permeability on the held-out test set. The test set R-squared — computed by comparing the model’s predictions against the true test outcomes — was 0.2791.

This is notably lower than the resampled R-squared from training (0.5047), indicating that the model may not generalize as well as suggested by cross-validation. This performance gap highlights the importance of using an independent test set to assess prediction accuracy.
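
For reference, the R-squared reported by postResample is the squared correlation between predicted and observed values, so the test set estimate can be reproduced directly:

# Equivalent to the Rsquared component of postResample(predictions, testY)
cor(predictions, testY)^2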

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

In the models below we’ll continue to use the ctrl object created previously.

Ridge Regression (alpha = 0):

set.seed(5889)
ridgeFit <- train(trainX, trainY,
                  method = "glmnet",
                  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 20)),
                  preProc = c("center", "scale"),
                  trControl = ctrl)

Lasso Regression (alpha = 1):

set.seed(5889)
lassoFit <- train(trainX, trainY,
                  method = "glmnet",
                  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 20)),
                  preProc = c("center", "scale"),
                  trControl = ctrl)
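
Because the lasso penalty can shrink coefficients exactly to zero, a quick supplementary check is to count how many predictors survive at the tuned lambda (the exact count depends on the run):

# Nonzero coefficients (including the intercept) at the selected lambda
lasso_coefs <- coef(lassoFit$finalModel, s = lassoFit$bestTune$lambda)
sum(lasso_coefs != 0)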

Elastic Net (alpha between 0 and 1):

set.seed(5889)
enetFit <- train(trainX, trainY,
                 method = "glmnet",
                 tuneLength = 10, # will search over alpha and lambda
                 preProc = c("center", "scale"),
                 trControl = ctrl)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.

The warning above indicates that, for some candidate alpha/lambda combinations, predictions within a resample were essentially constant, making R-squared undefined for those folds; those resamples are excluded from the summaries, and it does not affect the selected model.

Evaluation of models on the test set:

# Predictions
ridgePred <- predict(ridgeFit, testX)
lassoPred <- predict(lassoFit, testX)
enetPred  <- predict(enetFit, testX)

# R-squared and RMSE
ridgePerf <- postResample(ridgePred, testY)
lassoPerf <- postResample(lassoPred, testY)
enetPerf  <- postResample(enetPred, testY)

model_results <- tibble(
  Model  = c("PLS", "Ridge", "Lasso", "Elastic Net"),
  Rsq    = c(results["Rsquared"], ridgePerf["Rsquared"], 
             lassoPerf["Rsquared"], enetPerf["Rsquared"]),
  RMSE   = c(results["RMSE"], ridgePerf["RMSE"], 
             lassoPerf["RMSE"], enetPerf["RMSE"])
)

model_results
# A tibble: 4 × 3
  Model         Rsq  RMSE
  <chr>       <dbl> <dbl>
1 PLS         0.279  13.9
2 Ridge       0.369  12.6
3 Lasso       0.445  11.9
4 Elastic Net 0.469  11.5

We fit ridge, lasso, and elastic net models using 10-fold cross-validation repeated 5 times. All models used the same training and test splits as the PLS model. Test set performance is the best estimate of generalization. The test set R-squared values are below:

PLS: R-squared = 0.2791
Ridge: R-squared = 0.3686
Lasso: R-squared = 0.4446
Elastic Net: R-squared = 0.4690

Elastic Net regression performed best, with the highest R-squared and the lowest RMSE, meaning it explained the most variability in permeability and made the most accurate predictions. Lasso was second-best, with better R-squared and RMSE than Ridge or PLS, and Ridge in turn outperformed PLS.

These results suggest that models that automatically focus on the most useful predictors, rather than weighting all of them, worked better in this case. Penalization that shrinks or removes uninformative predictors helped accuracy, which makes sense given that the dataset contains hundreds of binary (0/1) variables, only some of which matter for predicting permeability.

Elastic net seems to have benefitted from its hybrid penalty (L1 and L2), balancing feature selection (like Lasso) with stability in the presence of multicollinearity (like Ridge).
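
To see where on that ridge-to-lasso spectrum the tuned model landed, we can inspect the selected mixing proportion alpha and penalty lambda (a supplementary check; the values depend on the run):

# alpha near 0 behaves like ridge, alpha near 1 like lasso
enetFit$bestTune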

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

While some of the models (especially Elastic Net regression) performed well, I would be cautious about fully replacing the laboratory experiment at this stage. The best model achieved an R-squared of 0.469, meaning it explains just under half of the variability in permeability. While that’s a promising result, it also shows that more than half of the variation remains unexplained.

These models could still be useful in practice, for example to screen compounds before running a lab test. If model accuracy improves (for example, with more training data, more informative predictors, or better feature engineering), then a stronger case could be made for partial or full replacement. But at present, I would recommend using the model as a complement to, not a substitute for, the lab process.

6.3.

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
[1] 176  58

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
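
As a supplementary check, we can confirm the 12/45 split of biological and process predictors by tabulating the column-name prefixes:

# Strip the trailing run numbers and count each predictor type
pred_names <- setdiff(names(ChemicalManufacturingProcess), "Yield")
table(sub("[0-9]+$", "", pred_names))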

head(ChemicalManufacturingProcess)
  Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
1 38.00                 6.25                49.58                56.97
2 42.44                 8.01                60.97                67.48
3 42.03                 8.01                60.97                67.48
4 41.42                 8.01                60.97                67.48
5 42.49                 7.47                63.33                72.25
6 43.57                 6.12                58.36                65.31
  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
1                12.74                19.51                43.73
2                14.65                19.36                53.14
3                14.65                19.36                53.14
4                14.65                19.36                53.14
5                14.02                17.91                54.66
6                15.17                21.79                51.23
  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
1                  100                16.66                11.44
2                  100                19.04                12.55
3                  100                19.04                12.55
4                  100                19.04                12.55
5                  100                18.22                12.80
6                  100                18.30                12.13
  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
1                 3.46               138.09                18.83
2                 3.46               153.67                21.05
3                 3.46               153.67                21.05
4                 3.46               153.67                21.05
5                 3.05               147.61                21.05
6                 3.78               151.88                20.76
  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
1                     NA                     NA                     NA
2                    0.0                      0                     NA
3                    0.0                      0                     NA
4                    0.0                      0                     NA
5                   10.7                      0                     NA
6                   12.0                      0                     NA
  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
1                     NA                     NA                     NA
2                    917                 1032.2                  210.0
3                    912                 1003.6                  207.1
4                    911                 1014.6                  213.3
5                    918                 1027.5                  205.7
6                    924                 1016.8                  208.9
  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
1                     NA                     NA                  43.00
2                    177                    178                  46.57
3                    178                    178                  45.07
4                    177                    177                  44.92
5                    178                    178                  44.96
6                    178                    178                  45.32
  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
1                     NA                     NA                     NA
2                     NA                     NA                      0
3                     NA                     NA                      0
4                     NA                     NA                      0
5                     NA                     NA                      0
6                     NA                     NA                      0
  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
1                   35.5                   4898                   6108
2                   34.0                   4869                   6095
3                   34.8                   4878                   6087
4                   34.8                   4897                   6102
5                   34.6                   4992                   6233
6                   34.0                   4985                   6222
  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
1                   4682                   35.5                   4865
2                   4617                   34.0                   4867
3                   4617                   34.8                   4877
4                   4635                   34.8                   4872
5                   4733                   33.9                   4886
6                   4786                   33.4                   4862
  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
1                   6049                   4665                    0.0
2                   6097                   4621                    0.0
3                   6078                   4621                    0.0
4                   6073                   4611                    0.0
5                   6102                   4659                   -0.7
6                   6115                   4696                   -0.6
  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
1                     NA                     NA                     NA
2                      3                      0                      3
3                      4                      1                      4
4                      5                      2                      5
5                      8                      4                     18
6                      9                      1                      1
  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
1                   4873                   6074                   4685
2                   4869                   6107                   4630
3                   4897                   6116                   4637
4                   4892                   6111                   4630
5                   4930                   6151                   4684
6                   4871                   6128                   4687
  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
1                   10.7                   21.0                    9.9
2                   11.2                   21.4                    9.9
3                   11.1                   21.3                    9.4
4                   11.1                   21.3                    9.4
5                   11.3                   21.6                    9.0
6                   11.4                   21.7                   10.1
  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
1                   69.1                    156                     66
2                   68.7                    169                     66
3                   69.3                    173                     66
4                   69.3                    171                     68
5                   69.4                    171                     70
6                   68.2                    173                     70
  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
1                    2.4                    486                  0.019
2                    2.6                    508                  0.019
3                    2.6                    509                  0.018
4                    2.5                    496                  0.018
5                    2.5                    468                  0.017
6                    2.5                    490                  0.018
  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
1                    0.5                      3                    7.2
2                    2.0                      2                    7.2
3                    0.7                      2                    7.2
4                    1.2                      2                    7.2
5                    0.2                      2                    7.3
6                    0.4                      2                    7.2
  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
1                     NA                     NA                   11.6
2                    0.1                   0.15                   11.1
3                    0.0                   0.00                   12.0
4                    0.0                   0.00                   10.6
5                    0.0                   0.00                   11.0
6                    0.0                   0.00                   11.5
  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
1                    3.0                    1.8                    2.4
2                    0.9                    1.9                    2.2
3                    1.0                    1.8                    2.3
4                    1.1                    1.8                    2.1
5                    1.1                    1.7                    2.1
6                    2.2                    1.8                    2.0

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

We’ll use k-nearest neighbors (KNN) to impute missing values.

First, we’ll separate the predictors from the response variable, Yield.

cmpY <- ChemicalManufacturingProcess$Yield
cmpX <- ChemicalManufacturingProcess |> select(-Yield)

The prompt told us there is missing data, but let’s check just to be sure.

cmp_missing_summary <- sapply(cmpX, function(x) sum(is.na(x)))
cmp_missing_summary[cmp_missing_summary > 0]
ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 
                     1                      3                     15 
ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 
                     1                      1                      2 
ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 
                     1                      1                      9 
ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 
                    10                      1                      1 
ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 
                     1                      1                      1 
ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 
                     5                      5                      5 
ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 
                     5                      5                      5 
ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 
                     5                      5                      5 
ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 
                     5                      5                      1 
ManufacturingProcess41 
                     1 

We can now confirm that there is indeed missing data that needs to be imputed.
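
The exercise describes the missingness as a small percentage of cells; as a quick supplementary check, we can compute the overall fraction of missing predictor cells:

# Proportion of all predictor cells that are NA
mean(is.na(cmpX))

Let’s go ahead and impute using KNN.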

set.seed(5889)  
cmp_impute_model <- preProcess(cmpX, method = "knnImpute")

Now we will apply the imputation to our predictors.

cmpX_imputed <- predict(cmp_impute_model, cmpX)

KNN fills in missing values based on the most similar rows (nearest neighbors). This is recommended by Applied Predictive Modeling as a practical and effective method for imputation in our scenario where missingness is relatively low and predictors are numeric variables.
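
As a quick sanity check (supplementary to the exercise), we can confirm that no missing values remain after imputation:

# Should be zero after knnImpute
sum(is.na(cmpX_imputed))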

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

We’ll use the imputed predictor data to fit a regression model and then figure out which variables matter most.

First, we split the data into a training and a test set:

set.seed(5889)
cmp_split <- createDataPartition(cmpY, p = 0.8, list = FALSE)
cmp_trainX <- cmpX_imputed[cmp_split, ]
cmp_trainY <- cmpY[cmp_split]
cmp_testX  <- cmpX_imputed[-cmp_split, ]
cmp_testY  <- cmpY[-cmp_split]

For the cross-validation and training of the PLS model below, we can reuse the ctrl object created in 6.2. We also request a Box-Cox transformation to reduce skewness. One caveat: knnImpute in caret’s preProcess also centers and scales the data as part of imputation, and Box-Cox can only be estimated for strictly positive variables, so after imputation the Box-Cox step is likely to apply to few, if any, predictors.

cmp_plsFit <- train(cmp_trainX, cmp_trainY,
                    method = "pls",
                    preProc = c("center", "scale", "BoxCox"),  
                    tuneLength = 20,
                    trControl = ctrl)

To find out how many PLS components were optimal:

cmp_plsFit$bestTune
  ncomp
3     3
cmp_plsFit$results |> 
  filter(ncomp == cmp_plsFit$bestTune$ncomp)
  ncomp     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1     3 1.446177 0.5462859 1.085197 0.6617454  0.1551433 0.2762623

As with 6.2, the best number of components is 3, as found through cross-validation on the training set. The resampled R-squared at that best number of components (3) is 0.5463.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Now we’ll predict on the test set and evaluate:

cmp_preds <- predict(cmp_plsFit, cmp_testX)
cmp_results <- postResample(cmp_preds, cmp_testY)
cmp_results
     RMSE  Rsquared       MAE 
0.9518583 0.7325815 0.7905189 

The R-squared on the test set is 0.7326, which is actually higher than the resampled R-squared of 0.5463 on the training set. This suggests the model generalized better than expected, possibly due to a favorable test split or reduced noise in the test data; with only about 35 of the 176 runs held out, the test-set estimate is itself based on few samples and can vary considerably.
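
One supplementary way to inspect the test set fit (not part of the original output) is a predicted-versus-observed plot; points near the dashed identity line indicate accurate predictions:

ggplot(data.frame(Observed = cmp_testY, Predicted = cmp_preds),
       aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(title = "Predicted vs. Observed Yield (Test Set)")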

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

cmp_importance <- varImp(cmp_plsFit, scale = TRUE)
plot(cmp_importance, top = 20, main = "Top 20 Most Important Predictors for Yield")

The manufacturing process predictors dominate this view of predictor importance. ManufacturingProcess32 is the most important, followed by ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, and ManufacturingProcess36.
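
To quantify that dominance rather than just read it off the plot, we can tabulate the predictor types among the top 20 (a supplementary computation; exact counts depend on the fitted model):

# Count biological vs. process predictors among the 20 most important
imp_df <- cmp_importance$importance
top20  <- rownames(imp_df)[order(imp_df$Overall, decreasing = TRUE)][1:20]
table(sub("[0-9]+$", "", top20))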

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

First we’ll bring the yield and predictors back into one data frame for plotting.

cmp_data_full <- cmpX_imputed |> 
  mutate(Yield = cmpY)

Now we’ll pick the top 3 predictors, ManufacturingProcess32, ManufacturingProcess13, and ManufacturingProcess09, and plot each against yield to make the relationships visually interpretable.

top_vars <- c("ManufacturingProcess32", "ManufacturingProcess13", 
              "ManufacturingProcess09")

for (var in top_vars) {
  print(
    # .data[[var]] is the tidy-evaluation replacement for the deprecated aes_string()
    ggplot(cmp_data_full, aes(x = .data[[var]], y = Yield)) +
      geom_point(alpha = 0.6) +
      geom_smooth(method = "loess", se = FALSE, color = "goldenrod") +
      labs(title = paste("Yield vs.", var), x = var)
  )
}
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

We created scatterplots with smoothed trend lines to explore how the three most important process variables influence product yield: ManufacturingProcess32, ManufacturingProcess13, and ManufacturingProcess09. For ManufacturingProcess32 we observed a nonlinear relationship: yield appears to increase as the (scaled) process value rises from around 0 to 1.5, then levels off. In contrast, ManufacturingProcess13 showed a negative association, with yield tending to decrease as values increase. ManufacturingProcess09 showed a positive, roughly linear relationship, with higher values associated with higher yield. Because these are process predictors, they can be adjusted in future manufacturing runs and tests; understanding these relationships provides opportunities to fine-tune specific production steps to maximize yield.
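
As a compact numeric complement to the plots (a supplementary check), the linear correlations between the top three predictors and yield should match the signs of the trends described above:

# Pearson correlations of the top three process predictors with yield
cor(cmp_data_full[, top_vars], cmp_data_full$Yield)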