library(AppliedPredictiveModeling)
data(permeability)
LFMG-Lab-7
Linear Regression
The following assignment explores exercises 6.2 and 6.3 from Kuhn and Johnson's book, Applied Predictive Modeling.
6.2
Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.
a)
Start R and use these commands to load the data:
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
b)
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
Loading required package: ggplot2
Loading required package: lattice
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::lift() masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# identifying low-frequency predictors
nzv <- nearZeroVar(fingerprints)

# Subset to keep only predictors that are NOT near-zero variance
filtered_fingerprints <- fingerprints[, -nzv]

# Check how many predictors are left
num_predictors_left <- ncol(filtered_fingerprints)
num_predictors_left
[1] 388
Filtering out the predictors with low frequency leaves 388 predictors for modeling.
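As a quick check on how aggressive this filter is (a small sketch reusing the fingerprints matrix and the nzv index vector from above), the number and share of dropped predictors can be computed directly:

# Number of near-zero-variance predictors removed, and their fraction of the original 1,107
length(nzv)
length(nzv) / ncol(fingerprints)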
c)
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]

set.seed(89)

# Splitting the data into training and testing sets (in this case 75% train)
trainIndex <- createDataPartition(permeability, p = 0.75, list = FALSE)
trainX <- filtered_fingerprints[trainIndex, ]
trainY <- permeability[trainIndex]
testX <- filtered_fingerprints[-trainIndex, ]
testY <- permeability[-trainIndex]

# Training PLS model with preprocessing and tuning
pls_fit <- train(
  x = trainX,
  y = trainY,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = trainControl(method = "cv")
)
print(pls_fit)
Partial Least Squares
125 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 113, 113, 113, 113, 111, 113, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 13.15321 0.4045443 10.187733
2 12.22625 0.4972448 8.533606
3 12.82276 0.4544406 9.463389
4 13.34343 0.4391333 9.942130
5 13.40057 0.4409403 10.056353
6 13.34910 0.4544959 9.986091
7 13.18515 0.4656899 9.839149
8 13.16804 0.4672193 10.011044
9 13.32517 0.4492940 9.821597
10 13.54787 0.4390452 9.977466
11 13.69196 0.4427291 10.173060
12 13.98956 0.4296119 10.484832
13 14.35397 0.4141411 10.598345
14 14.72419 0.3958930 10.793907
15 14.91636 0.3909869 10.813344
16 15.15453 0.3868005 10.930828
17 15.13150 0.3938837 11.021855
18 15.00725 0.4015361 10.901139
19 15.16523 0.3933851 11.004089
20 15.14499 0.3896002 11.166589
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 2.
The results tell us that the PLS model with 2 components provided the best predictive performance (lowest RMSE) and explains about 49.7% of the variance in the permeability values based on cross-validation.
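The same conclusion can be seen from the resampling profile; a minimal sketch assuming the pls_fit object above:

# Cross-validated RMSE versus number of PLS components; the minimum sits at ncomp = 2
plot(pls_fit, metric = "RMSE")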
d)
Predict the response for the test set. What is the test set estimate of \(R^2\)?
# Predict on the test set
test_pred <- predict(pls_fit, newdata = testX)

# Calculate R-squared manually
test_r2 <- cor(test_pred, testY)^2
test_r2
[1] 0.4062325
The PLS model with 2 latent variables, which explained about 49.7% of the variance during cross-validation, explains around 40.6% of the variance on the unseen test set. This drop is expected and reflects generalization performance.
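Equivalently, caret's postResample() returns RMSE, R-squared, and MAE for the test set in one call; a small sketch reusing the objects above:

# Test-set RMSE, Rsquared, and MAE computed by caret
postResample(pred = test_pred, obs = testY)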
e)
Try building other models discussed in this chapter. Do any have better predictive performance?
# Set up the resampling control shared by all models
ctrl <- trainControl(method = "cv")

# Linear Regression
lm_fit <- train(
  x = trainX, y = trainY,
  method = "lm",
  preProcess = c("center", "scale"),
  trControl = ctrl
)
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
(The same warning is emitted for each of the 10 cross-validation folds.)
lm_pred <- predict(lm_fit, newdata = testX)
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
lm_r2 <- cor(lm_pred, testY)^2
lm_r2
[1] 0.05946262
# Principal Component Regression
pcr_fit <- train(
  x = trainX, y = trainY,
  method = "pcr",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl
)

pcr_pred <- predict(pcr_fit, newdata = testX)
pcr_r2 <- cor(pcr_pred, testY)^2
pcr_r2
[1] 0.3937476
results <- data.frame(
  Model = c("PLS", "Linear Regression", "PCR"),
  R2 = c(0.406, lm_r2, pcr_r2)
)
print(results)
Model R2
1 PLS 0.40600000
2 Linear Regression 0.05946262
3 PCR 0.39374761
In this case PLS achieves the best performance because it handles collinearity well and its dimension reduction is supervised by the response.
PCR is weaker because its components are chosen without regard to the response variable.
OLS is not suitable here: with far more predictors than samples the fit is rank-deficient, so the coefficient estimates are unstable.
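A penalized model from this chapter would be another natural candidate. The sketch below is not run here; it assumes the same trainX/trainY/testX/testY split and the ctrl object, and requires the glmnet package:

# Elastic net: penalization copes with the many correlated binary predictors
enet_fit <- train(
  x = trainX, y = trainY,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneLength = 10,
  trControl = ctrl
)
enet_pred <- predict(enet_fit, newdata = testX)
cor(enet_pred, testY)^2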
f)
Would you recommend any of your models to replace the permeability laboratory experiment?
The best-performing model tested here was PLS, so among these candidates it would be the one to choose. However, other models such as random forests or gradient boosting could be explored to further reduce, or potentially replace, the need for the laboratory experiment; a sketch follows below.
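As a sketch of that follow-up (not run here; it assumes the same training objects and requires the randomForest package), the same caret workflow would carry over directly:

# Random forest with the same cross-validation control used above
rf_fit <- train(
  x = trainX, y = trainY,
  method = "rf",
  trControl = ctrl,
  importance = TRUE
)
rf_pred <- predict(rf_fit, newdata = testX)
cor(rf_pred, testY)^2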
6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
a)
Start R and use these commands to load the data
data(ChemicalManufacturingProcess)
cmp <- data.frame(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
b)
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
library(RANN)

# Impute missing values
preProc <- preProcess(cmp, method = "knnImpute") # or "medianImpute"

# Applying the preprocessing to fill in missing values
cmp_imputed <- predict(preProc, newdata = cmp)

# Checking if any NAs remain
sum(is.na(cmp_imputed))
[1] 0
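For context, the extent of the missingness can be checked before imputing; a small sketch using the cmp data frame loaded above:

# Number of missing cells and their share of the full predictor/response table
sum(is.na(cmp))
mean(is.na(cmp))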
c)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# Use the imputed data from earlier
df <- cmp_imputed

# Remove near-zero variance predictors
nzv <- nearZeroVar(df)
df <- df[, -nzv]

# Splitting data into training and testing
set.seed(48)
split_index <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train_data <- df[split_index, ]
test_data <- df[-split_index, ]

# Train PLS model
pls_model <- train(
  Yield ~ .,
  data = train_data,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 25,
  trControl = trainControl(method = "cv", number = 10)
)

# View model summary
print(pls_model)
Partial Least Squares
144 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 129, 131, 131, 128, 129, 130, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 0.8696566 0.4485956 0.6747876
2 0.8832284 0.5141262 0.6352635
3 0.7321021 0.5737668 0.5693147
4 0.7121333 0.5901837 0.5764981
5 0.7128960 0.5910616 0.5759662
6 0.7034105 0.5947756 0.5811536
7 0.7171826 0.5934955 0.5846424
8 0.7227329 0.5947471 0.5883194
9 0.7551031 0.5786290 0.6035384
10 0.7719882 0.5680701 0.6121337
11 0.7835313 0.5601828 0.6189402
12 0.8058889 0.5597877 0.6245392
13 0.8498842 0.5546695 0.6400986
14 0.8906845 0.5481590 0.6520406
15 0.8964256 0.5415421 0.6560256
16 0.8611798 0.5389956 0.6467946
17 0.8508748 0.5325659 0.6444111
18 0.8483101 0.5323715 0.6423289
19 0.8775443 0.5277776 0.6547670
20 0.9064226 0.5160710 0.6675249
21 0.9246043 0.5077374 0.6763689
22 0.9375248 0.5002282 0.6839633
23 0.9415472 0.4967013 0.6870277
24 0.9571943 0.4932545 0.6927797
25 0.9846519 0.4887565 0.7037988
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 6.
The partial least squares model with 6 components explains about 59.5% of the variance in yield across cross-validation folds, which is quite strong for a manufacturing prediction model.
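The resampled metrics for the selected model can also be pulled out directly rather than read off the table; a small sketch assuming the pls_model object above:

# Cross-validated RMSE, Rsquared, and MAE at the chosen number of components
subset(pls_model$results, ncomp == pls_model$bestTune$ncomp)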
d)
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# we can predict on the test set
test_preds <- predict(pls_model, newdata = test_data)

# R-squared on test set
test_r2 <- cor(test_preds, test_data$Yield)^2
test_rmse <- sqrt(mean((test_preds - test_data$Yield)^2))
test_mae <- mean(abs(test_preds - test_data$Yield))

# Show results
test_r2
[1] 0.04955763
test_rmse
[1] 3.500394
test_mae
[1] 1.191646
There is a big drop in performance on the test set. The model explained about 59% of the variance during cross-validation, but only about 5% on the held-out test data. This suggests the model may be overfitting: it performs well on the data it was trained on but fails to generalize.
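A compact way to line the two estimates up (a sketch reusing the pls_model object and the test metrics above) is caret's getTrainPerf(), which reports the resampled performance of the final model for direct comparison with test_rmse, test_r2, and test_mae:

# Resampled (cross-validated) performance of the final PLS model
getTrainPerf(pls_model)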
e)
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
# organizing variables by importance
importance <- varImp(pls_model, scale = TRUE)
Attaching package: 'pls'
The following object is masked from 'package:caret':
R2
The following object is masked from 'package:stats':
loadings
# top 10:
top_vars <- importance$importance
top_vars_sorted <- top_vars[order(-top_vars$Overall), , drop = FALSE]
head(top_vars_sorted, 10)
Overall
ManufacturingProcess32 100.00000
ManufacturingProcess36 86.10209
ManufacturingProcess13 82.45127
ManufacturingProcess17 80.07164
ManufacturingProcess09 79.83923
ManufacturingProcess06 61.91819
ManufacturingProcess33 61.32494
BiologicalMaterial02 58.37694
BiologicalMaterial06 56.67930
BiologicalMaterial08 54.42899
7 of the top 10 predictors are manufacturing process variables, so process variables dominate the most influential predictors in the PLS model. In practical terms this is encouraging, because unlike the biological measurements these variables can be actively adjusted to improve product yield.
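That split can be confirmed programmatically; a small sketch using the top_vars_sorted table from above:

# Count process vs. biological predictors among the ten most important variables
top10 <- head(rownames(top_vars_sorted), 10)
table(ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological"))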
f)
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
# Define top variables
top_vars <- c("ManufacturingProcess32", "ManufacturingProcess36",
              "ManufacturingProcess13", "ManufacturingProcess17",
              "ManufacturingProcess09", "ManufacturingProcess06",
              "BiologicalMaterial02", "BiologicalMaterial06", "BiologicalMaterial08")

# Select only Yield and top variables
df_subset <- df %>% select(Yield, all_of(top_vars))

# Reshaping into long format
df_long <- df_subset %>%
  pivot_longer(cols = -Yield, names_to = "Variable", values_to = "Value")

# Faceted scatter plots with linear trend lines
ggplot(df_long, aes(x = Value, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", size = 0.8) +
  facet_wrap(~ Variable, scales = "free_x", ncol = 3) +
  theme_minimal() +
  labs(title = "Yield vs Top Predictors",
       x = "Predictor Value",
       y = "Yield")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_smooth()` using formula = 'y ~ x'
Several variables such as ManufacturingProcess32, ManufacturingProcess06, and BiologicalMaterial06 show positive relationships with yield, suggesting that increasing these values may lead to improved performance. For biological variables, this insight can be used to screen raw materials before processing.
In contrast, ManufacturingProcess13, 17, and 36 show negative trends, indicating they may be over-applied or poorly tuned in current operations. Reducing or optimizing these could help boost yield.
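As a quick numeric complement to the plots (a sketch reusing df_subset and top_vars from above), the sign of each predictor's correlation with yield mirrors the fitted trend lines:

# Correlation of each top predictor with Yield, sorted from most positive to most negative
sort(cor(df_subset[, top_vars], df_subset$Yield)[, 1], decreasing = TRUE)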