# Load the packages used throughout this analysis
library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(GGally)
library(questionr)
library(corrplot)
library(car)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

dim(fingerprints)
## [1] 165 1107
# Identify near-zero variance predictors
nzv <- nearZeroVar(fingerprints)
# Number of predictors removed
length(nzv)
## [1] 719
# Create a new matrix without near-zero variance predictors
filtered_fingerprints <- fingerprints[, -nzv]
# Number of predictors left
ncol(filtered_fingerprints)
## [1] 388
Using the nearZeroVar function from the caret package, I identified and removed 719 predictors with near-zero variance, which are likely uninformative due to their low frequency across the compounds. After filtering, 388 predictors remained, which will be used for building the predictive models.
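For context, nearZeroVar can also report the diagnostics behind the filter (frequency ratio and percent of unique values) rather than just the column indices to drop; a quick look, assuming the same fingerprints matrix:

# Metrics underlying the near-zero variance filter
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
# Inspect a few of the columns flagged as near-zero variance
head(nzv_metrics[nzv_metrics$nzv, ])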
library(caret)
# Step 1: Split the data
set.seed(123)
X <- as.data.frame(filtered_fingerprints)
y <- permeability
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
# Step 2: Train the PLS model with preprocessing and tuning
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 3,
allowParallel = FALSE
)
pls_model <- train(
x = X_train,
y = y_train,
method = "pls",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 20
)
# Step 3: Print model results
print(pls_model)
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.32803 0.3508009 10.265049
## 2 11.79962 0.5048932 8.512469
## 3 11.83009 0.5075425 9.089975
## 4 11.97583 0.5081234 9.426044
## 5 11.64485 0.5301825 9.040893
## 6 11.41754 0.5315204 8.669031
## 7 11.60589 0.5152487 8.980866
## 8 11.65838 0.5092180 9.150813
## 9 11.76861 0.5125644 9.149243
## 10 12.20179 0.4886723 9.420890
## 11 12.31785 0.4805831 9.532834
## 12 12.52519 0.4665283 9.694581
## 13 12.70958 0.4509723 9.700061
## 14 12.87680 0.4383729 9.782727
## 15 13.05306 0.4262476 9.873035
## 16 13.30800 0.4138615 10.055077
## 17 13.44408 0.4085319 10.178847
## 18 13.70015 0.3940378 10.398658
## 19 13.78761 0.3902866 10.452242
## 20 13.95208 0.3857356 10.602233
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
To build the PLS model, I performed 10-fold cross-validation repeated 3 times while tuning the number of latent variables from 1 to 20. Based on the results, the optimal model was selected at 6 components, which yielded the lowest RMSE of approximately 11.42 and a cross-validated \(R^2\) of about 0.5315. This suggests that the model was able to explain roughly 53% of the variance in the training data. The tuning plot, shown below, further supports this selection: RMSE broadly decreased as components were added up to 6, after which model performance steadily declined. Overall, this tuning process helped balance model complexity with predictive accuracy, avoiding overfitting while retaining good explanatory power.
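This tuning profile can be drawn directly with caret's plot method for train objects:

plot(pls_model)  # cross-validated RMSE vs. number of PLS components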
# Make predictions on the test set
pls_predictions <- predict(pls_model, newdata = X_test)
# Calculate performance metrics on test set
test_metrics <- postResample(pred = pls_predictions, obs = y_test)
# View test-set R-squared
test_metrics["Rsquared"]
## Rsquared
## 0.3244542
pls_results <- data.frame(
Actual = y_test,
Predicted = pls_predictions
)
ggplot(pls_results, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.7) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
ggtitle("PLS Model: Predicted vs. Actual Permeability") +
theme_minimal()
To evaluate the model’s ability to generalize, I tested the PLS model on
the held-out test set. The test-set \(R^2\) was approximately
0.32, indicating that the model was able to explain
about 32% of the variance in the unseen data. While this is lower than
the cross-validated \(R^2\) of 0.53
observed during training, it still shows some predictive capability. The
predicted vs. actual plot supports this, with points generally following
the expected trend but showing noticeable scatter, particularly for
higher permeability values. This suggests that while the model captures
the overall relationship, it may struggle with extreme values and could
benefit from further refinement or additional data.
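One way to substantiate the point about extreme values is to rank the test-set residuals and see where the largest misses fall, reusing the pls_results frame built above:

# Largest absolute prediction errors first
pls_results$Residual <- pls_results$Actual - pls_results$Predicted
head(pls_results[order(-abs(pls_results$Residual)), ])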
# For comparison, fit a principal component regression (PCR) model with the same setup
pca_mod <- train(
X_train,
y_train,
method = "pcr",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 20
)
plot(pca_mod) # Shows RMSE/R² vs number of components
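Since the PCR fit is only plotted, its best cross-validated setting can also be read directly off the resampling table, for example:

# Best PCR tuning row by cross-validated RMSE
pca_mod$results[which.min(pca_mod$results$RMSE), c("ncomp", "RMSE", "Rsquared", "MAE")]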
To evaluate the effectiveness of the PLS model, I examined its performance on both the training and test sets. The model achieved a cross-validated RMSE of 11.42 and an \(R^2\) of approximately 0.53 on the training data, indicating solid predictive ability during model tuning. On the test set, the RMSE was higher and the \(R^2\) dropped to 0.32, so the model captures meaningful patterns but loses a noticeable amount of accuracy on unseen data. Some gap between resampled and test-set performance is expected, though a drop of this size suggests the cross-validated estimates are somewhat optimistic. Given these results, the PLS model generalizes moderately well and remains the strongest candidate so far for predicting permeability in this context.
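For a compact summary, the cross-validated metrics at the selected ncomp can be placed next to the held-out test metrics using objects already defined above:

# Resampled vs. test-set performance for the tuned PLS model
cv_row <- subset(pls_model$results, ncomp == pls_model$bestTune$ncomp)
data.frame(
  Set = c("Repeated 10-fold CV", "Held-out test"),
  RMSE = c(cv_row$RMSE, unname(test_metrics["RMSE"])),
  Rsquared = c(cv_row$Rsquared, unname(test_metrics["Rsquared"]))
)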
# Train a simple linear regression model
lm_model <- train(
x = X_train,
y = y_train,
method = "lm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "none") # no resampling, just fit
)
# Predict on the test set
lm_predictions <- predict(lm_model, newdata = X_test)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
# Evaluate test performance
lm_metrics <- postResample(pred = lm_predictions, obs = y_test)
# View test-set R-squared
lm_metrics["Rsquared"]
## Rsquared
## 0.07853504
model_comparison <- data.frame(
Model = c("PLS", "Linear Regression"),
Rsquared = c(test_metrics["Rsquared"], lm_metrics["Rsquared"])
)
ggplot(model_comparison, aes(x = Model, y = Rsquared, fill = Model)) +
geom_bar(stat = "identity", width = 0.6) +
ggtitle("Test-Set R² Comparison: PLS vs. Linear Regression") +
theme_minimal() +
ylim(0, 1)
To compare the PLS model with a simpler baseline, I trained a standard linear regression model using the same predictors. The model produced a test-set \(R^2\) of approximately 0.08, substantially lower than the 0.32 achieved by the PLS model. Additionally, a warning was generated indicating a rank-deficient fit: with 388 predictors and only 133 training samples, ordinary least squares cannot uniquely estimate all of the coefficients, and the sparse, binary, highly correlated fingerprint features make matters worse. The comparison plot highlights this performance gap, clearly showing that the PLS model captures more meaningful variance in the permeability data. Based on this comparison, the linear regression model is not recommended as a substitute for the laboratory experiment, whereas the PLS model demonstrates more potential as a predictive tool.
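A minimal base-R check of that rank deficiency, using the training matrix from above:

# Numerical rank of the design matrix vs. number of coefficients OLS must estimate
qr(as.matrix(X_train))$rank
ncol(X_train)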
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
data(ChemicalManufacturingProcess)
# Dimensions: 176 runs, Yield plus 57 predictors
dim(ChemicalManufacturingProcess)
## [1] 176 58
# Total number of missing values in the data
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
# Impute missing predictor values with k-nearest neighbors (column 1 is Yield)
knn_impute <- preProcess(
ChemicalManufacturingProcess[, -c(1)],
method = "knnImpute"
)
cmp_independent <- predict(
knn_impute,
ChemicalManufacturingProcess[, -c(1)]
)
cmp_dependent <- ChemicalManufacturingProcess[, c(1), drop=FALSE]
# Confirm that no missing values remain after imputation
sum(is.na(cmp_independent))
## [1] 0
# Separate response and predictors
yield <- ChemicalManufacturingProcess$Yield
predictors <- ChemicalManufacturingProcess[, -1] # remove 'Yield'
# Handle missing values using median imputation (this imputed version, rather
# than the kNN one above, is what the models below are fit on)
preProc <- preProcess(predictors, method = "medianImpute")
predictors_imputed <- predict(preProc, predictors)
# Split data into training (80%) and testing (20%) sets
set.seed(19940211)
# Note: this stratifies the split on a predictor; stratifying on the response (yield) is more typical
train_index <- createDataPartition(predictors_imputed$BiologicalMaterial01, p = 0.8, list = FALSE)
X_train <- predictors_imputed[train_index, ]
X_test <- predictors_imputed[-train_index, ]
y_train <- yield[train_index]
y_test <- yield[-train_index]
# Train and tune a Partial Least Squares (PLS) regression model
set.seed(19940211)
pls_model <- train(
x = X_train,
y = y_train,
method = "pls",
metric = "Rsquared",
tuneLength = 20,
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10, allowParallel = FALSE)
)
# Extract optimal result
pls_optimal <- pls_model$results %>%
arrange(desc(Rsquared)) %>%
slice(1)
# Output results
print(pls_model)
## Partial Least Squares
##
## 143 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 128, 128, 130, 127, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.508845 0.4372147 1.156529
## 2 1.905536 0.5182542 1.196363
## 3 1.551984 0.5424755 1.082999
## 4 1.717136 0.5432908 1.168567
## 5 1.923156 0.5327576 1.222648
## 6 1.957574 0.5213663 1.238775
## 7 1.979112 0.5167924 1.271330
## 8 2.086025 0.5107705 1.323336
## 9 2.151975 0.5041849 1.341312
## 10 2.311532 0.4909551 1.407259
## 11 2.505450 0.4756624 1.475288
## 12 2.632880 0.4625950 1.517593
## 13 2.762860 0.4577664 1.547332
## 14 2.756858 0.4554283 1.542380
## 15 2.725724 0.4635361 1.513460
## 16 2.676922 0.4635087 1.489720
## 17 2.637168 0.4620516 1.469289
## 18 2.674401 0.4572934 1.477633
## 19 2.662912 0.4559475 1.473053
## 20 2.647243 0.4476091 1.471752
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 4.
pls_optimal
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 4 1.717136 0.5432908 1.168567 1.396869 0.2232473 0.6699024
To model product yield, I trained a Partial Least Squares (PLS) regression model using 10-fold cross-validation on the training data. Before training, the predictors were preprocessed with median imputation, centering, and scaling. The model was tuned over 1 to 20 components, and the optimal number selected was 4. At this setting, the model achieved a cross-validated \(R^2\) of approximately 0.543, meaning it explained about 54.3% of the variance in the training data. The corresponding cross-validated RMSE was 1.72, a typical prediction error of roughly 1.7 percentage points of yield. As shown in the tuning plot below, \(R^2\) peaked at 4 components and declined afterward, helping to justify the choice of optimal complexity.
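As before, the tuning profile can be drawn with caret's plotting methods; here the ggplot variant:

ggplot(pls_model) + ggtitle("PLS Tuning Profile: Yield Model")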
# Make predictions on the test set (predict.train automatically applies the
# optimal ncomp found during tuning)
pls_test_pred <- predict(pls_model, newdata = X_test)
# Evaluate model performance on the test set
pls_test_results <- postResample(
pred = pls_test_pred,
obs = y_test
)
# Extract test-set R-squared
pls_test_results[["Rsquared"]]
## [1] 0.5928885
After training the PLS model with 4 components, I evaluated its performance on the test set. The model achieved a test-set \(R^2\) of approximately 0.593, which is slightly higher than the cross-validated training \(R^2\) of 0.543. This indicates that the model generalized well to unseen data and was able to capture meaningful patterns in the predictors related to yield. The consistency between training and test performance suggests that the model is neither overfitting nor underfitting, making it a strong candidate for future prediction tasks.
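A predicted-vs.-actual plot analogous to the permeability one is a quick way to visualize this agreement, using the test predictions computed above:

# Predicted vs. actual yield on the held-out test set
ggplot(data.frame(Actual = y_test, Predicted = pls_test_pred),
       aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  ggtitle("PLS Model: Predicted vs. Actual Yield") +
  theme_minimal()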
# Calculate variable importance from the PLS model
pls_var_importance <- varImp(pls_model, scale = FALSE)
# Plot the variable importance
plot(pls_var_importance, top = 20, main = "Top 20 Most Important Predictors (PLS)")
To better understand which predictors had the strongest influence on yield, I used the varImp() function to extract variable importance from the trained PLS model. The plot of the top 20 predictors reveals that manufacturing process variables dominated the top of the list, with features like ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess36 showing the highest importance scores. While a few biological variables (such as BiologicalMaterial02 and BiologicalMaterial06) also appeared in the top 20, their importance was generally lower. This suggests that yield is more heavily influenced by factors within the manufacturing process, which are potentially actionable and can be optimized for better outcomes.
# Coefficients from the final 4-component PLS model (standardized predictors)
coef(pls_model$finalModel)
## , , 4 comps
##
## .outcome
## BiologicalMaterial01 -0.011721494
## BiologicalMaterial02 0.068153677
## BiologicalMaterial03 0.123961274
## BiologicalMaterial04 0.049567330
## BiologicalMaterial05 0.013274467
## BiologicalMaterial06 0.114898541
## BiologicalMaterial07 -0.129001355
## BiologicalMaterial08 -0.017637719
## BiologicalMaterial09 -0.104984874
## BiologicalMaterial10 -0.071502259
## BiologicalMaterial11 -0.039431064
## BiologicalMaterial12 0.021577880
## ManufacturingProcess01 0.042448826
## ManufacturingProcess02 -0.037351561
## ManufacturingProcess03 -0.056590916
## ManufacturingProcess04 0.151453875
## ManufacturingProcess05 -0.007649450
## ManufacturingProcess06 0.189083872
## ManufacturingProcess07 -0.007515804
## ManufacturingProcess08 -0.071315288
## ManufacturingProcess09 0.305608134
## ManufacturingProcess10 0.059320024
## ManufacturingProcess11 0.143265386
## ManufacturingProcess12 0.026449179
## ManufacturingProcess13 -0.214402348
## ManufacturingProcess14 0.034409025
## ManufacturingProcess15 0.136407668
## ManufacturingProcess16 -0.054003482
## ManufacturingProcess17 -0.222009179
## ManufacturingProcess18 0.053348689
## ManufacturingProcess19 0.068584583
## ManufacturingProcess20 0.040350618
## ManufacturingProcess21 -0.077412939
## ManufacturingProcess22 -0.080826415
## ManufacturingProcess23 -0.026132274
## ManufacturingProcess24 -0.044538349
## ManufacturingProcess25 -0.023449098
## ManufacturingProcess26 -0.006007568
## ManufacturingProcess27 -0.028950475
## ManufacturingProcess28 -0.110626348
## ManufacturingProcess29 0.039985617
## ManufacturingProcess30 0.071797707
## ManufacturingProcess31 -0.044995830
## ManufacturingProcess32 0.420125182
## ManufacturingProcess33 0.122880696
## ManufacturingProcess34 0.326795856
## ManufacturingProcess35 -0.024791751
## ManufacturingProcess36 -0.309155264
## ManufacturingProcess37 -0.239733864
## ManufacturingProcess38 -0.104833706
## ManufacturingProcess39 0.086784096
## ManufacturingProcess40 -0.048963105
## ManufacturingProcess41 -0.033124317
## ManufacturingProcess42 0.056723540
## ManufacturingProcess43 0.095680369
## ManufacturingProcess44 0.086464047
## ManufacturingProcess45 0.061543135
To further explore the relationship between predictors and yield, I extracted the coefficients from the final PLS model. Because the predictors were centered and scaled before fitting, these coefficients are directly comparable and show both the magnitude and direction of each predictor's impact on yield. Among the most influential variables were ManufacturingProcess32 (+0.420), ManufacturingProcess34 (+0.327), and ManufacturingProcess09 (+0.306), all of which had strong positive coefficients. This suggests that increases in these process variables are associated with higher product yield. Conversely, variables like ManufacturingProcess17 (−0.222) and ManufacturingProcess36 (−0.309) had strong negative coefficients, indicating that higher values may reduce yield. This information is highly actionable, as it points to specific stages in the manufacturing process that could be optimized or more closely monitored to improve future outcomes. Since the process variables are adjustable, focusing on the most impactful ones provides a clear path toward yield enhancement.
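For reference, the rankings quoted above can be pulled programmatically by flattening the coefficient array returned by coef(); a minimal sketch, assuming the 4-component final model from above:

# Flatten the [predictor x outcome x ncomp] coefficient array to a named vector
pls_coefs <- drop(coef(pls_model$finalModel))
# Strongest positive and negative associations with yield
head(sort(pls_coefs, decreasing = TRUE), 3)
head(sort(pls_coefs), 3)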