library(tidyverse)   # data manipulation plus ggplot2 graphics
library(caret)       # model training, tuning, and resampling
library(elasticnet)  # backend for caret's "ridge" and "enet" methods
library(ggfortify)   # autoplot() diagnostics for lm objects
library(ggplot2)     # already attached by tidyverse; kept for explicitness
library(RANN)        # fast nearest-neighbour search
Prompt: Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
# Provided code
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
# Filter fingerprints using nearZeroVar
near_zero_fprint <- nearZeroVar(fingerprints)
filtered_fprint <- fingerprints[, -near_zero_fprint]
# Original count
dim(fingerprints)
## [1] 165 1107
# nearZeroVar count
ncol(filtered_fprint)
## [1] 388
The original fingerprints data has 1,107 predictors. Filtering out the near-zero-variance (low-frequency) predictors leaves 388 predictors for modeling.
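For transparency about what gets dropped, the frequency metrics behind the filter can be inspected directly; this is a quick sketch using nearZeroVar's saveMetrics argument (thresholds shown are caret's defaults):
# Inspect the metrics nearZeroVar uses: freqRatio (most common value vs.
# second most common) and percentUnique; the defaults flag predictors with
# freqRatio > 95/5 and percentUnique < 10
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])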
# Set seed for reproducibility
set.seed(64)
# Create an 80/20 train-test partition
index <- createDataPartition(permeability, p = .8, list = FALSE)
# Split X into training and testing
X_train <- filtered_fprint[index, ]
X_test <- filtered_fprint[-index, ]
# Split y into training and testing
y_train <- permeability[index, ]
y_test <- permeability[-index, ]
# Train the pls model
pls_finger <- train(
x = X_train,
y = y_train,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = trainControl(method = "cv", number = 10)
)
# Visualize the tuning profile (RMSE vs. number of components)
plot(pls_finger)
# Show the cross-validated metrics for the best-tuned row
pls_finger$results[pls_finger$results$ncomp == pls_finger$bestTune$ncomp, ]
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 2 2 11.66697 0.4816941 8.311813 3.312448 0.2278563 2.057526
The optimal number of latent variables is 2, based on the plot and the lowest cross-validated RMSE. The corresponding resampled estimate of R^2 is 0.482.
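The same optimum can be confirmed programmatically rather than read off the plot (a small sketch using the fitted train object):
# The row with the minimum cross-validated RMSE matches bestTune
pls_finger$results[which.min(pls_finger$results$RMSE), "ncomp"]
pls_finger$bestTune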
# Predict on the test data
predicted_finger <- predict(pls_finger, X_test)
# Score the test-set predictions with caret's defaultSummary()
predicted_finger_y <- data.frame(obs = y_test, pred = predicted_finger)
defaultSummary(predicted_finger_y)
## RMSE Rsquared MAE
## 11.9475566 0.4092106 7.5129098
The test-set predictions give RMSE = 11.95, R^2 = 0.409, and MAE = 7.513. The R^2 of 0.409 means the model explains 40.9% of the variability in the test-set permeability values.
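An observed-versus-predicted plot makes the test-set fit easier to judge than the summary numbers alone (a quick sketch using the data frame built above):
# Points near the dashed 45-degree line indicate accurate predictions
ggplot(predicted_finger_y, aes(x = pred, y = obs)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Predicted permeability", y = "Observed permeability")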
“In addition to ordinary linear regression, these types of models include partial least squares (PLS) and penalized models such as ridge regression, the lasso, and the elastic net.” (KJ 101)
Ordinary Linear Regression
# Find multicollinear columns
combo_info <- findLinearCombos(X_train)
# Remove those from both sets
olr_X_train <- X_train[, -combo_info$remove]
olr_X_test <- X_test[, -combo_info$remove]
# Ordinary Linear Regression model
olr_fit_finger <- train(
x = olr_X_train,
y = y_train,
method = "lm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10)
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
(The warning above repeats once per cross-validation fold. Even after removing exact linear combinations, each fold has roughly as many predictors as training rows, so the least-squares fit is rank-deficient and some coefficients are not estimable.)
# Extract the final lm model from caret
final_model <- olr_fit_finger$finalModel
# ggfortify's autoplot() produces a residuals-vs-fitted diagnostic for lm objects
autoplot(final_model, which = 1)
# Predict on the test data
olr_predicted_finger <- predict(olr_fit_finger, olr_X_test)
# Score the test-set predictions with caret's defaultSummary()
olr_pred_finger_y <- data.frame(obs = y_test, pred = olr_predicted_finger)
defaultSummary(olr_pred_finger_y)
## RMSE Rsquared MAE
## 24.63281224 0.08197337 17.10432209
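As the rank-deficiency warnings suggest, ordinary least squares generalizes poorly here. As a sanity check, the reported metrics can be reproduced by hand (a minimal sketch; defaultSummary computes the same quantities internally):
# RMSE, R^2 (squared correlation), and MAE computed directly
with(olr_pred_finger_y, c(
  RMSE     = sqrt(mean((obs - pred)^2)),
  Rsquared = cor(obs, pred)^2,
  MAE      = mean(abs(obs - pred))
))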
Ridge Regression
# Grid search for the lambda (ridge penalty) minimizing cross-validated RMSE
ridge_grid_finger <- data.frame(.lambda = seq(0, .1, length = 15))
# Train the ridge model
ridge_fit_finger <- train(
X_train,
y_train,
method = "ridge",
tuneGrid = ridge_grid_finger,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale")
)
# Visualize the tuning profile (RMSE vs. lambda)
plot(ridge_fit_finger)
# Predict on the test data
ridge_finger <- predict(ridge_fit_finger, X_test)
# Score the test-set predictions with caret's defaultSummary()
ridge_pred_finger_y <- data.frame(obs = y_test, pred = ridge_finger)
defaultSummary(ridge_pred_finger_y)
## RMSE Rsquared MAE
## 12.4402796 0.4489676 8.9131521
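The penalty selected by cross-validation can be pulled out of the fitted object (a one-line sketch):
# Lambda value chosen by cross-validation
ridge_fit_finger$bestTune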
Elastic Net
# Grid search over lambda (ridge penalty) and fraction (lasso shrinkage) minimizing cross-validated RMSE
elastic_grid_finger <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
# Train the elastic net model
elastic_fit_finger <- train(
X_train,
y_train,
method = "enet",
tuneGrid = elastic_grid_finger,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale")
)
# Visualize the tuning profile (RMSE vs. fraction, by lambda)
plot(elastic_fit_finger)
# Predict on the test data
elastic_finger <- predict(elastic_fit_finger, X_test)
# Score the test-set predictions with caret's defaultSummary()
elastic_pred_finger_y <- data.frame(obs = y_test, pred = elastic_finger)
defaultSummary(elastic_pred_finger_y)
## RMSE Rsquared MAE
## 10.4792876 0.5469005 7.3754463
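Again, the winning combination of penalty and shrinkage fraction can be read off the fitted object (a quick sketch):
# Best (fraction, lambda) pair chosen by cross-validation
elastic_fit_finger$bestTune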
print("Permeability Laboratory Experiment Results:")
## [1] "Permeability Laboratory Experiment Results:"
defaultSummary(predicted_finger_y)
## RMSE Rsquared MAE
## 11.9475566 0.4092106 7.5129098
print("Ordinary Linear Regression Results:")
## [1] "Ordinary Linear Regression Results:"
defaultSummary(olr_pred_finger_y)
## RMSE Rsquared MAE
## 24.63281224 0.08197337 17.10432209
print("Ridge Regression Results:")
## [1] "Ridge Regression Results:"
defaultSummary(ridge_pred_finger_y)
## RMSE Rsquared MAE
## 12.4402796 0.4489676 8.9131521
print("Elastic Net Regression Results:")
## [1] "Elastic Net Regression Results:"
defaultSummary(elastic_pred_finger_y)
## RMSE Rsquared MAE
## 10.4792876 0.5469005 7.3754463
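The same four summaries can also be stacked into a single table for easier side-by-side comparison (a small sketch reusing the vectors computed above):
# One row per model: test-set RMSE, Rsquared, MAE
rbind(
  PLS        = defaultSummary(predicted_finger_y),
  OLS        = defaultSummary(olr_pred_finger_y),
  Ridge      = defaultSummary(ridge_pred_finger_y),
  ElasticNet = defaultSummary(elastic_pred_finger_y)
)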
The ideal model is the one with the lowest RMSE, highest R^2, and lowest MAE.
The PLS model has the second lowest RMSE (11.95) and the second lowest MAE (7.51), but only the third highest R^2 (0.409).
The Elastic Net model has the lowest RMSE (10.48), the highest R^2 (0.547), and the lowest MAE (7.38).
Since the Elastic Net model leads on all three metrics, I would recommend it over the PLS model.
Prompt: A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
# Predict missing values using KNN imputation function
chem_impute_vals <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute")
)
# Fill the predicted values
chem_imputed <- predict(
chem_impute_vals,
ChemicalManufacturingProcess
)
# Check for any remaining missing values
sum(is.na(chem_imputed))
## [1] 0
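Before trusting the imputation it is worth seeing how much was actually missing; a quick sketch on the raw data:
# Total missing cells and the most affected columns in the raw data
sum(is.na(ChemicalManufacturingProcess))
head(sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE))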
# As in 6.2(b), filter out near-zero-variance predictors
near_zero_chem <- nearZeroVar(chem_imputed)
filtered_chem <- chem_imputed[, -near_zero_chem]
# Original count
dim(chem_imputed)
## [1] 176 58
# nearZeroVar count
ncol(filtered_chem)
## [1] 57
# Set seed for reproducibility
set.seed(64)
# Create an 80/20 train-test partition
index_chem <- createDataPartition(filtered_chem$Yield, p = .8, list = FALSE)
# Split X into training and testing (exclude 'Yield' column)
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]
# Split y into training and testing (only the 'Yield' column)
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem <- filtered_chem[-index_chem, "Yield", drop = TRUE]
# Train the pls model
pls_chem <- train(
x = X_train_chem,
y = y_train_chem,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = trainControl(method = "cv", number = 10)
)
# Visualize the tuned data
plot(pls_chem)
# Show the cross-validated metrics for the best-tuned row
pls_chem$results[pls_chem$results$ncomp == pls_chem$bestTune$ncomp, ]
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.7505754 0.4490387 0.6030918 0.07953122 0.1446725 0.08384717
The optimal number of PLS components is 1: the single-component model produced the lowest cross-validated RMSE, 0.751.
# Predict on the test data
predicted_chem <- predict(pls_chem, X_test_chem)
# Score the test-set predictions with caret's defaultSummary()
predicted_chem_y <- data.frame(obs = y_test_chem, pred = predicted_chem)
defaultSummary(predicted_chem_y)
## RMSE Rsquared MAE
## 1.3147047 0.2071462 0.7344713
The test-set RMSE is 1.315, considerably worse than the resampled training RMSE of 0.751 (roughly 1.75 times higher). A gap of this size between training and test performance suggests the model is overfitting. (Note that knnImpute in preProcess centers and scales every column, including Yield, so both RMSE values are in standardized units.)
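The size of the gap can be quantified directly (a small sketch comparing the cross-validated and test-set RMSE):
# Ratio > 1 means performance degrades on unseen data
cv_rmse   <- min(pls_chem$results$RMSE)
test_rmse <- defaultSummary(predicted_chem_y)[["RMSE"]]
test_rmse / cv_rmse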
# Get the most important predictors with varImp
chem_importance <- varImp(pls_chem, scale = FALSE)$importance |>
arrange(-Overall)
# See the features
head(chem_importance, 10)
## Overall
## ManufacturingProcess32 0.07978184
## ManufacturingProcess09 0.07155332
## ManufacturingProcess36 0.06864897
## ManufacturingProcess13 0.06703950
## BiologicalMaterial02 0.06037569
## BiologicalMaterial06 0.06024218
## BiologicalMaterial03 0.05943981
## ManufacturingProcess17 0.05737441
## ManufacturingProcess33 0.05634232
## ManufacturingProcess12 0.05183224
The four most important predictors, and seven of the top ten, are manufacturing process variables, so process predictors dominate the importance ranking.
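The split between predictor types in the top ten can be tallied directly (a quick sketch on the importance table):
# Count manufacturing vs. biological predictors among the top 10
top10_names <- rownames(head(chem_importance, 10))
table(ifelse(grepl("^Manufacturing", top10_names), "Process", "Biological"))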
# Get the top 10 important predictors
top10 <- head(chem_importance, 10)
# Select Yield and the top 10 predictors
correlation_matrix <- filtered_chem |>
select(Yield, all_of(rownames(top10))) |>
cor(use = "complete.obs")
# Visualize the correlation matrix using corrplot
corrplot::corrplot(correlation_matrix,
method = "color",
type = "upper",
order = "hclust",
tl.cex = 0.8)
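Reading signs and magnitudes off a color map can be imprecise, so the Yield row of the matrix can also be printed directly (a small sketch):
# Correlation of each top-10 predictor with Yield, strongest first
yield_cor <- correlation_matrix["Yield", ]
sort(yield_cor[names(yield_cor) != "Yield"], decreasing = TRUE)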
# Loop through top 10 predictors and create scatter plots
top10_predictors <- rownames(top10)
for (predictor in top10_predictors) {
p <- ggplot(filtered_chem, aes(x = .data[[predictor]], y = Yield)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "salmon") + # Add a linear regression line
labs(x = predictor, y = "Yield", title = paste("Relationship between", predictor, "and Yield")) +
theme_minimal()
print(p)
}
## `geom_smooth()` using formula = 'y ~ x'
(The message above repeats once for each of the ten plots.)
Positive relationships:
ManufacturingProcess32 - Stronger
ManufacturingProcess09 - Stronger
BiologicalMaterial02 - Weaker
BiologicalMaterial06 - Weaker
BiologicalMaterial03 - Weaker
ManufacturingProcess33 - Weaker
Negative relationships:
ManufacturingProcess36 - Stronger
ManufacturingProcess13 - Stronger
ManufacturingProcess17 - Stronger
ManufacturingProcess12 - Weaker; the points do not show a solid linear trend
There are four groups: strong positive, strong negative, weak positive, and weak negative. Within these groups, the stronger relationships will be the most useful for improving yield in future runs of the manufacturing process.
ManufacturingProcess32 and ManufacturingProcess09 have strong positive relationships with Yield, meaning higher values for those processes correlate with higher yield. This is actionable information: increasing their settings within an efficient range could raise yields.
ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17 have strong negative relationships with Yield, meaning higher values for those processes correlate with lower yield. Tighter regulation of these processes to keep their values down could likewise raise yields.