# Load the packages used throughout this analysis
library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(GGally)
library(questionr)
library(corrplot)
library(car)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

dim(fingerprints)
## [1] 165 1107
# Identify near-zero variance predictors
nzv <- nearZeroVar(fingerprints)
# Number of predictors removed
length(nzv)
## [1] 719
# Create a new matrix without near-zero variance predictors
filtered_fingerprints <- fingerprints[, -nzv]
# Number of predictors left
ncol(filtered_fingerprints)
## [1] 388
Using the nearZeroVar function from the caret package, I identified and removed 719 predictors with near-zero variance, which are likely uninformative due to their low frequency across the compounds. After filtering, 388 predictors remained, which will be used for building the predictive models.
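For context, nearZeroVar can also report the diagnostics behind the filter (frequency ratio and percent of unique values) rather than just the column indices to drop; a quick look, assuming the same fingerprints matrix:

# Metrics underlying the near-zero variance filter
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
# Inspect a few of the columns flagged as near-zero variance
head(nzv_metrics[nzv_metrics$nzv, ])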
library(caret)
# Step 1: Split the data
set.seed(123)
X <- as.data.frame(filtered_fingerprints)
y <- permeability
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
# Step 2: Train the PLS model with preprocessing and tuning
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 3,
allowParallel = FALSE
)
pls_model <- train(
x = X_train,
y = y_train,
method = "pls",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 20
)
# Step 3: Print model results
print(pls_model)
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.32803 0.3508009 10.265049
## 2 11.79962 0.5048932 8.512469
## 3 11.83009 0.5075425 9.089975
## 4 11.97583 0.5081234 9.426044
## 5 11.64485 0.5301825 9.040893
## 6 11.41754 0.5315204 8.669031
## 7 11.60589 0.5152487 8.980866
## 8 11.65838 0.5092180 9.150813
## 9 11.76861 0.5125644 9.149243
## 10 12.20179 0.4886723 9.420890
## 11 12.31785 0.4805831 9.532834
## 12 12.52519 0.4665283 9.694581
## 13 12.70958 0.4509723 9.700061
## 14 12.87680 0.4383729 9.782727
## 15 13.05306 0.4262476 9.873035
## 16 13.30800 0.4138615 10.055077
## 17 13.44408 0.4085319 10.178847
## 18 13.70015 0.3940378 10.398658
## 19 13.78761 0.3902866 10.452242
## 20 13.95208 0.3857356 10.602233
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
To build the PLS model, I performed 10-fold cross-validation repeated 3 times while tuning the number of latent variables from 1 to 20. Based on the results, the optimal model was selected at 6 components, which yielded the lowest RMSE of approximately 11.42 and a cross-validated \(R^2\) of about 0.5315. This suggests that the model was able to explain roughly 53% of the variance in the training data. The tuning plot, shown below, further supports this selection: RMSE broadly decreased as components were added up to 6, after which model performance steadily declined. Overall, this tuning process helped balance model complexity with predictive accuracy, avoiding overfitting while retaining good explanatory power.
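This tuning profile can be drawn directly with caret's plot method for train objects:

plot(pls_model)  # cross-validated RMSE vs. number of PLS components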
# Make predictions on the test set
pls_predictions <- predict(pls_model, newdata = X_test)
# Calculate performance metrics on test set
test_metrics <- postResample(pred = pls_predictions, obs = y_test)
# View test-set R-squared
test_metrics["Rsquared"]
## Rsquared
## 0.3244542
pls_results <- data.frame(
Actual = y_test,
Predicted = pls_predictions
)
ggplot(pls_results, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.7) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
ggtitle("PLS Model: Predicted vs. Actual Permeability") +
theme_minimal()
To evaluate the model’s ability to generalize, I tested the PLS model on
the held-out test set. The test-set \(R^2\) was approximately
0.32, indicating that the model was able to explain
about 32% of the variance in the unseen data. While this is lower than
the cross-validated \(R^2\) of 0.53
observed during training, it still shows some predictive capability. The
predicted vs. actual plot supports this, with points generally following
the expected trend but showing noticeable scatter, particularly for
higher permeability values. This suggests that while the model captures
the overall relationship, it may struggle with extreme values and could
benefit from further refinement or additional data.
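One way to substantiate the point about extreme values is to rank the test-set residuals and see where the largest misses fall, reusing the pls_results frame built above:

# Largest absolute prediction errors first
pls_results$Residual <- pls_results$Actual - pls_results$Predicted
head(pls_results[order(-abs(pls_results$Residual)), ])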
# For comparison, fit a principal component regression (PCR) model with the same setup
pca_mod <- train(
X_train,
y_train,
method = "pcr",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 20
)
plot(pca_mod) # Shows RMSE/R² vs number of components
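Since the PCR fit is only plotted, its best cross-validated setting can also be read directly off the resampling table, for example:

# Best PCR tuning row by cross-validated RMSE
pca_mod$results[which.min(pca_mod$results$RMSE), c("ncomp", "RMSE", "Rsquared", "MAE")]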
To evaluate the effectiveness of the PLS model, I examined its performance on both the training and test sets. The model achieved a cross-validated RMSE of 11.42 and an \(R^2\) of approximately 0.53 on the training data, indicating solid predictive ability during model tuning. On the test set, the RMSE was higher and the \(R^2\) dropped to 0.32, so the model captures meaningful patterns but loses a noticeable amount of accuracy on unseen data. Some gap between resampled and test-set performance is expected, though a drop of this size suggests the cross-validated estimates are somewhat optimistic. Given these results, the PLS model generalizes moderately well and remains the strongest candidate so far for predicting permeability in this context.
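For a compact summary, the cross-validated metrics at the selected ncomp can be placed next to the held-out test metrics using objects already defined above:

# Resampled vs. test-set performance for the tuned PLS model
cv_row <- subset(pls_model$results, ncomp == pls_model$bestTune$ncomp)
data.frame(
  Set = c("Repeated 10-fold CV", "Held-out test"),
  RMSE = c(cv_row$RMSE, unname(test_metrics["RMSE"])),
  Rsquared = c(cv_row$Rsquared, unname(test_metrics["Rsquared"]))
)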
# Train a simple linear regression model
lm_model <- train(
x = X_train,
y = y_train,
method = "lm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "none") # no resampling, just fit
)
# Predict on the test set
lm_predictions <- predict(lm_model, newdata = X_test)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
# Evaluate test performance
lm_metrics <- postResample(pred = lm_predictions, obs = y_test)
# View test-set R-squared
lm_metrics["Rsquared"]
## Rsquared
## 0.07853504
model_comparison <- data.frame(
Model = c("PLS", "Linear Regression"),
Rsquared = c(test_metrics["Rsquared"], lm_metrics["Rsquared"])
)
ggplot(model_comparison, aes(x = Model, y = Rsquared, fill = Model)) +
geom_bar(stat = "identity", width = 0.6) +
ggtitle("Test-Set R² Comparison: PLS vs. Linear Regression") +
theme_minimal() +
ylim(0, 1)
To compare the PLS model with a simpler baseline, I trained a standard linear regression model using the same predictors. The model produced a test-set \(R^2\) of approximately 0.08, substantially lower than the 0.32 achieved by the PLS model. Additionally, a warning was generated indicating a rank-deficient fit: with 388 predictors and only 133 training samples, ordinary least squares cannot uniquely estimate all of the coefficients, and the sparse, binary, highly correlated fingerprint features make matters worse. The comparison plot highlights this performance gap, clearly showing that the PLS model captures more meaningful variance in the permeability data. Based on this comparison, the linear regression model is not recommended as a substitute for the laboratory experiment, whereas the PLS model demonstrates more potential as a predictive tool.
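A minimal base-R check of that rank deficiency, using the training matrix from above:

# Numerical rank of the design matrix vs. number of coefficients OLS must estimate
qr(as.matrix(X_train))$rank
ncol(X_train)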
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
data(ChemicalManufacturingProcess)
# Dimensions: 176 runs, Yield plus 57 predictors
dim(ChemicalManufacturingProcess)
## [1] 176 58
# Total number of missing values in the data
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
# Impute missing predictor values with k-nearest neighbors (column 1 is Yield)
knn_impute <- preProcess(
ChemicalManufacturingProcess[, -c(1)],
method = "knnImpute"
)
cmp_independent <- predict(
knn_impute,
ChemicalManufacturingProcess[, -c(1)]
)
cmp_dependent <- ChemicalManufacturingProcess[, c(1), drop=FALSE]
# Confirm that no missing values remain after imputation
sum(is.na(cmp_independent))
## [1] 0
# Separate response and predictors
yield <- ChemicalManufacturingProcess$Yield
predictors <- ChemicalManufacturingProcess[, -1] # remove 'Yield'
# Handle missing values using median imputation (this imputed version, rather
# than the kNN one above, is what the models below are fit on)
preProc <- preProcess(predictors, method = "medianImpute")
predictors_imputed <- predict(preProc, predictors)
# Split data into training (80%) and testing (20%) sets
set.seed(19940211)
# Note: this stratifies the split on a predictor; stratifying on the response (yield) is more typical
train_index <- createDataPartition(predictors_imputed$BiologicalMaterial01, p = 0.8, list = FALSE)
X_train <- predictors_imputed[train_index, ]
X_test <- predictors_imputed[-train_index, ]
y_train <- yield[train_index]
y_test <- yield[-train_index]
# Train and tune a Partial Least Squares (PLS) regression model
set.seed(19940211)
pls_model <- train(
x = X_train,
y = y_train,
method = "pls",
metric = "Rsquared",
tuneLength = 20,
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10, allowParallel = FALSE)
)
# Extract optimal result
pls_optimal <- pls_model$results %>%
arrange(desc(Rsquared)) %>%
slice(1)
# Output results
print(pls_model)
## Partial Least Squares
##
## 143 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 128, 128, 130, 127, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.508845 0.4372147 1.156529
## 2 1.905536 0.5182542 1.196363
## 3 1.551984 0.5424755 1.082999
## 4 1.717136 0.5432908 1.168567
## 5 1.923156 0.5327576 1.222648
## 6 1.957574 0.5213663 1.238775
## 7 1.979112 0.5167924 1.271330
## 8 2.086025 0.5107705 1.323336
## 9 2.151975 0.5041849 1.341312
## 10 2.311532 0.4909551 1.407259
## 11 2.505450 0.4756624 1.475288
## 12 2.632880 0.4625950 1.517593
## 13 2.762860 0.4577664 1.547332
## 14 2.756858 0.4554283 1.542380
## 15 2.725724 0.4635361 1.513460
## 16 2.676922 0.4635087 1.489720
## 17 2.637168 0.4620516 1.469289
## 18 2.674401 0.4572934 1.477633
## 19 2.662912 0.4559475 1.473053
## 20 2.647243 0.4476091 1.471752
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 4.
pls_optimal
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 4 1.717136 0.5432908 1.168567 1.396869 0.2232473 0.6699024
To model product yield, I trained a Partial Least Squares (PLS) regression model using 10-fold cross-validation on the training data. Before training, the predictors were preprocessed with median imputation, centering, and scaling. The model was tuned over 1 to 20 components, and the optimal number selected was 4. At this setting, the model achieved a cross-validated \(R^2\) of approximately 0.543, meaning it explained about 54.3% of the variance in the training data. The corresponding cross-validated RMSE was 1.72, a typical prediction error of roughly 1.7 percentage points of yield. As shown in the tuning plot below, \(R^2\) peaked at 4 components and declined afterward, helping to justify the choice of optimal complexity.
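As before, the tuning profile can be drawn with caret's plotting methods; here the ggplot variant:

ggplot(pls_model) + ggtitle("PLS Tuning Profile: Yield Model")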
# Make predictions on the test set (predict.train automatically applies the
# optimal ncomp found during tuning)
pls_test_pred <- predict(pls_model, newdata = X_test)
# Evaluate model performance on the test set
pls_test_results <- postResample(
pred = pls_test_pred,
obs = y_test
)
# Extract test-set R-squared
pls_test_results[["Rsquared"]]
## [1] 0.5928885
After training the PLS model with 4 components, I evaluated its performance on the test set. The model achieved a test-set \(R^2\) of approximately 0.593, which is slightly higher than the cross-validated training \(R^2\) of 0.543. This indicates that the model generalized well to unseen data and was able to capture meaningful patterns in the predictors related to yield. The consistency between training and test performance suggests that the model is neither overfitting nor underfitting, making it a strong candidate for future prediction tasks.
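A predicted-vs.-actual plot analogous to the permeability one is a quick way to visualize this agreement, using the test predictions computed above:

# Predicted vs. actual yield on the held-out test set
ggplot(data.frame(Actual = y_test, Predicted = pls_test_pred),
       aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  ggtitle("PLS Model: Predicted vs. Actual Yield") +
  theme_minimal()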
# Calculate variable importance from the PLS model
pls_var_importance <- varImp(pls_model, scale = FALSE)
# Plot the variable importance
plot(pls_var_importance, top = 20, main = "Top 20 Most Important Predictors (PLS)")
To better understand which predictors had the strongest influence on yield, I used the varImp() function to extract variable importance from the trained PLS model. The plot of the top 20 predictors reveals that manufacturing process variables dominated the top of the list, with features like ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess36 showing the highest importance scores. While a few biological variables (such as BiologicalMaterial02 and BiologicalMaterial06) also appeared in the top 20, their importance was generally lower. This suggests that yield is more heavily influenced by factors within the manufacturing process, which are potentially actionable and can be optimized for better outcomes.
# Coefficients from the final 4-component PLS model (standardized predictors)
coef(pls_model$finalModel)
## , , 4 comps
##
## .outcome
## BiologicalMaterial01 -0.011721494
## BiologicalMaterial02 0.068153677
## BiologicalMaterial03 0.123961274
## BiologicalMaterial04 0.049567330
## BiologicalMaterial05 0.013274467
## BiologicalMaterial06 0.114898541
## BiologicalMaterial07 -0.129001355
## BiologicalMaterial08 -0.017637719
## BiologicalMaterial09 -0.104984874
## BiologicalMaterial10 -0.071502259
## BiologicalMaterial11 -0.039431064
## BiologicalMaterial12 0.021577880
## ManufacturingProcess01 0.042448826
## ManufacturingProcess02 -0.037351561
## ManufacturingProcess03 -0.056590916
## ManufacturingProcess04 0.151453875
## ManufacturingProcess05 -0.007649450
## ManufacturingProcess06 0.189083872
## ManufacturingProcess07 -0.007515804
## ManufacturingProcess08 -0.071315288
## ManufacturingProcess09 0.305608134
## ManufacturingProcess10 0.059320024
## ManufacturingProcess11 0.143265386
## ManufacturingProcess12 0.026449179
## ManufacturingProcess13 -0.214402348
## ManufacturingProcess14 0.034409025
## ManufacturingProcess15 0.136407668
## ManufacturingProcess16 -0.054003482
## ManufacturingProcess17 -0.222009179
## ManufacturingProcess18 0.053348689
## ManufacturingProcess19 0.068584583
## ManufacturingProcess20 0.040350618
## ManufacturingProcess21 -0.077412939
## ManufacturingProcess22 -0.080826415
## ManufacturingProcess23 -0.026132274
## ManufacturingProcess24 -0.044538349
## ManufacturingProcess25 -0.023449098
## ManufacturingProcess26 -0.006007568
## ManufacturingProcess27 -0.028950475
## ManufacturingProcess28 -0.110626348
## ManufacturingProcess29 0.039985617
## ManufacturingProcess30 0.071797707
## ManufacturingProcess31 -0.044995830
## ManufacturingProcess32 0.420125182
## ManufacturingProcess33 0.122880696
## ManufacturingProcess34 0.326795856
## ManufacturingProcess35 -0.024791751
## ManufacturingProcess36 -0.309155264
## ManufacturingProcess37 -0.239733864
## ManufacturingProcess38 -0.104833706
## ManufacturingProcess39 0.086784096
## ManufacturingProcess40 -0.048963105
## ManufacturingProcess41 -0.033124317
## ManufacturingProcess42 0.056723540
## ManufacturingProcess43 0.095680369
## ManufacturingProcess44 0.086464047
## ManufacturingProcess45 0.061543135
To further explore the relationship between predictors and yield, I extracted the coefficients from the final PLS model. Because the predictors were centered and scaled before fitting, these coefficients are directly comparable and show both the magnitude and direction of each predictor's impact on yield. Among the most influential variables were ManufacturingProcess32 (+0.420), ManufacturingProcess34 (+0.327), and ManufacturingProcess09 (+0.306), all of which had strong positive coefficients. This suggests that increases in these process variables are associated with higher product yield. Conversely, variables like ManufacturingProcess17 (−0.222) and ManufacturingProcess36 (−0.309) had strong negative coefficients, indicating that higher values may reduce yield. This information is highly actionable, as it points to specific stages in the manufacturing process that could be optimized or more closely monitored to improve future outcomes. Since the process variables are adjustable, focusing on the most impactful ones provides a clear path toward yield enhancement.
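For reference, the rankings quoted above can be pulled programmatically by flattening the coefficient array returned by coef(); a minimal sketch, assuming the 4-component final model from above:

# Flatten the [predictor x outcome x ncomp] coefficient array to a named vector
pls_coefs <- drop(coef(pls_model$finalModel))
# Strongest positive and negative associations with yield
head(sort(pls_coefs, decreasing = TRUE), 3)
head(sort(pls_coefs), 3)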