LFMG-Lab-7

Author

Luis Munoz Grass

Linear Regression

The following assignment will explore exercises 6.2 and 6.3 from Kuhn and Johnson’s book: Applied Predictive Modeling.

6.2

Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have sufficient permeability to become a drug.

a)

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
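
As a quick sanity check (not required by the exercise), the dimensions of the loaded objects can be confirmed; a minimal sketch:

# Optional sanity check on the loaded objects
dim(fingerprints)   # should be 165 x 1107
dim(permeability)   # should be 165 x 1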

b)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)
library(tidyverse)
# identifying low-frequency predictors
nzv <- nearZeroVar(fingerprints)

# Subset to keep only predictors that are NOT near-zero variance
filtered_fingerprints <- fingerprints[, -nzv]

# Check how many predictors are left
num_predictors_left <- ncol(filtered_fingerprints)
num_predictors_left
[1] 388

Filtering out the predictors with low frequency leaves 388 predictors for modeling.
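
For a bit more insight into why columns were dropped, nearZeroVar can also return the diagnostics it uses (frequency ratio and percent unique values); a small optional sketch:

# Optional: inspect the near-zero-variance diagnostics
nzv_stats <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_stats[nzv_stats$nzv, ])   # flagged predictors
table(nzv_stats$nzv)               # TRUE = near-zero variance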

c)

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R²?

nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]
set.seed(89)

# Splitting the data into training and testing sets (in this case 75% train)
trainIndex <- createDataPartition(permeability, p = 0.75, list = FALSE)
trainX <- filtered_fingerprints[trainIndex, ]
trainY <- permeability[trainIndex]
testX <- filtered_fingerprints[-trainIndex, ]
testY <- permeability[-trainIndex]

# Training PLS model with preprocessing and tuning
pls_fit <- train(
  x = trainX,
  y = trainY,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,  
  trControl = trainControl(method = "cv")
)


print(pls_fit)
Partial Least Squares 

125 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 113, 113, 113, 113, 111, 113, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE      
   1     13.15321  0.4045443  10.187733
   2     12.22625  0.4972448   8.533606
   3     12.82276  0.4544406   9.463389
   4     13.34343  0.4391333   9.942130
   5     13.40057  0.4409403  10.056353
   6     13.34910  0.4544959   9.986091
   7     13.18515  0.4656899   9.839149
   8     13.16804  0.4672193  10.011044
   9     13.32517  0.4492940   9.821597
  10     13.54787  0.4390452   9.977466
  11     13.69196  0.4427291  10.173060
  12     13.98956  0.4296119  10.484832
  13     14.35397  0.4141411  10.598345
  14     14.72419  0.3958930  10.793907
  15     14.91636  0.3909869  10.813344
  16     15.15453  0.3868005  10.930828
  17     15.13150  0.3938837  11.021855
  18     15.00725  0.4015361  10.901139
  19     15.16523  0.3933851  11.004089
  20     15.14499  0.3896002  11.166589

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 2.

The optimal model uses 2 latent variables (lowest cross-validated RMSE), and the corresponding resampled estimate of R² is about 0.497; that is, the model explains roughly 49.7% of the variance in permeability during cross-validation.
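
These values can also be pulled directly from the fitted object; a small sketch, assuming the pls_fit object trained above:

# Resampled performance at the selected number of components
subset(pls_fit$results, ncomp == pls_fit$bestTune$ncomp)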

d)

Predict the response for the test set. What is the test set estimate of R²?

# Predict on the test set
test_pred <- predict(pls_fit, newdata = testX)

# Calculate R-squared manually
test_r2 <- cor(test_pred, testY)^2
test_r2
[1] 0.4062325

The PLS model with 2 latent variables, which explained about 49.7% of the variance during cross-validation, explains around 40.6% of the variance on the unseen test set. This drop is expected and reflects generalization performance.
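
An equivalent check uses caret's postResample, which reports RMSE, R², and MAE in one call (its R² is the squared correlation, so it should agree with the manual value above):

# Alternative test-set summary via caret
postResample(pred = test_pred, obs = testY)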

e)

Try building other models discussed in this chapter. Do any have better predictive performance?

# Shared resampling set-up for all the models below
ctrl <- trainControl(method = "cv")
# Linear Regression
lm_fit <- train(
  x = trainX, y = trainY,
  method = "lm",
  preProcess = c("center", "scale"),
  trControl = ctrl
)
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
(warning repeated once per resampling fold)
lm_pred <- predict(lm_fit, newdata = testX)
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
lm_r2 <- cor(lm_pred, testY)^2
lm_r2
[1] 0.05946262
# Principal Component Regression
pcr_fit <- train(
  x = trainX, y = trainY,
  method = "pcr",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl
)

pcr_pred <- predict(pcr_fit, newdata = testX)
pcr_r2 <- cor(pcr_pred, testY)^2
pcr_r2
[1] 0.3937476
results <- data.frame(
  Model = c("PLS", "Linear Regression", "PCR"),
  R2 = c(0.406, lm_r2, pcr_r2)
)
print(results)
              Model         R2
1               PLS 0.40600000
2 Linear Regression 0.05946262
3               PCR 0.39374761

In this case, PLS achieves the best test-set performance because it handles the highly collinear fingerprint predictors well and its dimension reduction is supervised: the latent components are chosen with the response in mind.

PCR is weaker because its principal components are chosen without regard to the response variable.

OLS is not suitable here: there are far more predictors than samples, so the fit is rank-deficient (hence the warnings above) and the coefficient estimates are unstable.
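
A penalized model (ridge/lasso/elastic net) from the same chapter would be a natural further comparison, since it is designed to cope with many correlated predictors. A sketch using caret's glmnet interface (not run above; the grid values are illustrative and the glmnet package must be installed):

# Elastic net as an additional chapter-6 model (sketch, not run above)
enet_grid <- expand.grid(alpha = c(0, 0.5, 1),
                         lambda = seq(0.01, 1, length.out = 20))
enet_fit <- train(
  x = trainX, y = trainY,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = enet_grid,
  trControl = ctrl
)
enet_r2 <- cor(predict(enet_fit, newdata = testX), testY)^2
enet_r2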

f)

Would you recommend any of your models to replace the permeability laboratory experiment?

The best-performing model was PLS, so among the models tested it would be the one to recommend. That said, with a test-set R² of only about 0.41 it is better suited to screening and prioritizing compounds than to fully replacing the assay, and other models, such as random forests or gradient boosting, could be explored to further reduce the need for lab experiments.


6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

a)

Start R and use these commands to load the data

data(ChemicalManufacturingProcess)
cmp <- data.frame(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

library(RANN)

# Impute missing values 
preProc <- preProcess(cmp, method = "knnImpute")  # or "medianImpute"

# Applying the preprocessing to fill in missing values
cmp_imputed <- predict(preProc, newdata = cmp)

# Checking if any NAs remain
sum(is.na(cmp_imputed))
[1] 0
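
Before imputing, it can be helpful to see where the gaps actually are; an optional sketch over the raw data:

# Optional: count missing values per column before imputation
missing_counts <- colSums(is.na(cmp))
sort(missing_counts[missing_counts > 0], decreasing = TRUE)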

c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

# Use the imputed data from earlier
df <- cmp_imputed

# Remove near-zero variance predictors
nzv <- nearZeroVar(df)
df <- df[, -nzv]

# Splitting data into training and testing
set.seed(48)
split_index <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train_data <- df[split_index, ]
test_data <- df[-split_index, ]

# Train PLS model
pls_model <- train(
  Yield ~ .,
  data = train_data,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 25,
  trControl = trainControl(method = "cv", number = 10)
)

# View model summary
print(pls_model)
Partial Least Squares 

144 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 129, 131, 131, 128, 129, 130, ... 
Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared   MAE      
   1     0.8696566  0.4485956  0.6747876
   2     0.8832284  0.5141262  0.6352635
   3     0.7321021  0.5737668  0.5693147
   4     0.7121333  0.5901837  0.5764981
   5     0.7128960  0.5910616  0.5759662
   6     0.7034105  0.5947756  0.5811536
   7     0.7171826  0.5934955  0.5846424
   8     0.7227329  0.5947471  0.5883194
   9     0.7551031  0.5786290  0.6035384
  10     0.7719882  0.5680701  0.6121337
  11     0.7835313  0.5601828  0.6189402
  12     0.8058889  0.5597877  0.6245392
  13     0.8498842  0.5546695  0.6400986
  14     0.8906845  0.5481590  0.6520406
  15     0.8964256  0.5415421  0.6560256
  16     0.8611798  0.5389956  0.6467946
  17     0.8508748  0.5325659  0.6444111
  18     0.8483101  0.5323715  0.6423289
  19     0.8775443  0.5277776  0.6547670
  20     0.9064226  0.5160710  0.6675249
  21     0.9246043  0.5077374  0.6763689
  22     0.9375248  0.5002282  0.6839633
  23     0.9415472  0.4967013  0.6870277
  24     0.9571943  0.4932545  0.6927797
  25     0.9846519  0.4887565  0.7037988

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 6.

The optimal partial least squares model uses 6 latent components, with a cross-validated RMSE of about 0.703 and a resampled R² of about 0.595; that is, it explains roughly 59.5% of the variance in yield across the cross-validation folds, which is quite strong for a manufacturing prediction model.
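
The tuning profile and the selected model's resampled metrics can also be inspected directly from the fitted object; a small optional sketch:

# Optional: resampling profile and selected-model performance
plot(pls_model)          # RMSE vs. number of PLS components
getTrainPerf(pls_model)  # resampled RMSE / R² / MAE for the chosen ncomp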

d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

# we can predict on the test set
test_preds <- predict(pls_model, newdata = test_data)

# R-squared on test set
test_r2 <- cor(test_preds, test_data$Yield)^2

test_rmse <- sqrt(mean((test_preds - test_data$Yield)^2))
test_mae <- mean(abs(test_preds - test_data$Yield))

# Show results
test_r2
[1] 0.04955763
test_rmse
[1] 3.500394
test_mae
[1] 1.191646

There is a big drop in performance on the test set. The model explained around 59% of the variance during cross-validation but only about 5% on the test data, and the test RMSE (about 3.50) is far larger than the resampled RMSE (about 0.70). This suggests the model may be overfitting, performing well on seen data but failing to generalize.
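
One way to check how optimistic the single 10-fold estimate is would be repeated cross-validation; a sketch (not run here), reusing the same training data:

# Sketch: repeated 10-fold CV for a more stable resampled estimate
ctrl_rep <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
pls_model_rep <- train(
  Yield ~ ., data = train_data,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 25,
  trControl = ctrl_rep
)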

e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

# organizing variables by importance
importance <- varImp(pls_model, scale = TRUE)

#  top 10:
top_vars <- importance$importance
top_vars_sorted <- top_vars[order(-top_vars$Overall), , drop = FALSE]
head(top_vars_sorted, 10)
                         Overall
ManufacturingProcess32 100.00000
ManufacturingProcess36  86.10209
ManufacturingProcess13  82.45127
ManufacturingProcess17  80.07164
ManufacturingProcess09  79.83923
ManufacturingProcess06  61.91819
ManufacturingProcess33  61.32494
BiologicalMaterial02    58.37694
BiologicalMaterial06    56.67930
BiologicalMaterial08    54.42899

Seven of the top ten predictors are manufacturing process variables, so process variables dominate the most influential predictors in the PLS model. In practical terms this is encouraging, because process variables can be actively adjusted to improve product yield, whereas the biological predictors cannot.
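
The full ranking is easier to scan visually; an optional one-liner on the varImp object created above:

# Optional: plot the top 20 importance scores
plot(importance, top = 20)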

f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

# Define top variables
top_vars <- c("ManufacturingProcess32", "ManufacturingProcess36",
              "ManufacturingProcess13", "ManufacturingProcess17",
              "ManufacturingProcess09", "ManufacturingProcess06",
              "BiologicalMaterial02", "BiologicalMaterial06", "BiologicalMaterial08")

# Select only Yield and top variables
df_subset <- df %>% select(Yield, all_of(top_vars))

# Reshaping into long format
df_long <- df_subset %>%
  pivot_longer(cols = -Yield, names_to = "Variable", values_to = "Value")

# Faceted scatter plots with linear trend lines
ggplot(df_long, aes(x = Value, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linewidth = 0.8) +
  facet_wrap(~ Variable, scales = "free_x", ncol = 3) +
  theme_minimal() +
  labs(title = "Yield vs Top Predictors",
       x = "Predictor Value",
       y = "Yield")

Several variables such as ManufacturingProcess32, ManufacturingProcess06, and BiologicalMaterial06 show positive relationships with yield, suggesting that increasing these values may lead to improved performance. For biological variables, this insight can be used to screen raw materials before processing.

In contrast, ManufacturingProcess13, 17, and 36 show negative trends, indicating they may be over-applied or poorly tuned in current operations. Reducing or optimizing these could help boost yield.
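
As a quick numeric complement to the plots, the sign of each correlation with yield should match the fitted trend lines; an optional sketch using the df_subset created above:

# Optional: correlations between the top predictors and yield
round(cor(df_subset[, top_vars], df_subset$Yield), 2)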