6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a) Start R and use these commands to load the data. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

library(AppliedPredictiveModeling)
data(permeability)
nrow(fingerprints)
## [1] 165
ncol(fingerprints)
## [1] 1107

b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

After applying the nearZeroVar filter, 388 predictors are left for modeling.

library(caret)
library(dplyr)   # for filter(), select(), mutate(), and count() used below

# Flag near-zero-variance predictors
nzv_result <- nearZeroVar(fingerprints, saveMetrics = TRUE)

# Summary of the filtering
table(nzv_result$nzv)
## 
## FALSE  TRUE 
##   388   719
# Final answer
predictors_left <- sum(!nzv_result$nzv)
cat("Predictors left:", predictors_left, "\n")
## Predictors left: 388
# Create the new dataset
fingerprints_nearZeroVar  <- fingerprints[, !nzv_result$nzv]
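
To see why particular fingerprints get flagged, the metrics saved by nearZeroVar can be inspected. This is a small sketch reusing the nzv_result object created above:

# Inspect the frequency-ratio and percent-unique metrics for a few flagged predictors
head(nzv_result[nzv_result$nzv, c("freqRatio", "percentUnique")], 3)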

c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

Using an 80/20 training/test split, the optimal model (smallest resampled RMSE) is reached at 2 latent variables, which captures close to 51% of the variation in permeability (resampled R2 of about 0.51).

perm <- data.frame(permeability = permeability, fingerprints_nearZeroVar)

set.seed(88)

train_index <- createDataPartition(perm$permeability, times = 1, p = 0.8, list = FALSE)
train_df <- perm[train_index, ]
test_df <- perm[-train_index, ]

pls_model <- train(permeability ~ ., data = train_df, method = "pls", tuneLength = 24, 
    trControl = trainControl(method = "repeatedcv", repeats = 8), 
    preProcess = c("center", "scale"))

# Plot RMSE vs. number of components
ggplot(pls_model) + 
    xlab("Number of Components") + 
    theme(plot.title = element_blank())

cat("Training Set at optimal model with the smallest RMSE at", 
    pls_model$bestTune$ncomp, "components\n")
## Training Set at optimal model with the smallest RMSE at 2 components
pls_model$results |>
  filter(ncomp == pls_model$bestTune$ncomp) |>
  select(ncomp, RMSE, Rsquared) |>
  print(row.names = FALSE)
##  ncomp    RMSE  Rsquared
##      2 11.1205 0.5096598

d) Predict the response for the test set. What is the test set estimate of R2?

The test set estimate of R2 is approximately 0.35.

x_test <- test_df[, -1]  # Remove the first column (permeability)
y_test <- test_df$permeability

pls_predictions <- predict(pls_model, newdata = x_test)
pls_results <- data.frame(
  Model = "PLS",
  RMSE = caret::RMSE(pls_predictions, y_test),
  Rsquared = caret::R2(pls_predictions, y_test)
)

print(pls_results, row.names = FALSE)
##  Model     RMSE  Rsquared
##    PLS 13.87306 0.3462492

e) Try building other models discussed in this chapter. Do any have better predictive performance?

First, I tried an elastic net model, which gives a test set R2 of approximately 0.39.

elas_net_model <- train(permeability ~ ., data = train_df, method = "glmnet", 
    trControl = trainControl("repeatedcv", repeats = 8), 
    tuneLength = 4)

# Make predictions with Elastic Net
elas_net_pred <- predict(elas_net_model, x_test)

# Model performance metrics using postResample
elas_net_performance <- postResample(pred = elas_net_pred, obs = test_df$permeability)

print(elas_net_performance)
##       RMSE   Rsquared        MAE 
## 13.5393766  0.3865869  9.1680644
elas_net_results <- data.frame(
  Model = "Elastic Net",
  RMSE = elas_net_performance["RMSE"],
  Rsquared = elas_net_performance["Rsquared"]
)

Second, I used lasso regression, which gives a test set R2 of approximately 0.34.

# Lasso regression via glmnet
library(glmnet)

x_train <- as.matrix(train_df[, -which(names(train_df) == "permeability")])
y_train <- train_df$permeability
x_test_mat <- as.matrix(x_test)

# Fit glmnet
cv_fit <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)
best_lambda <- cv_fit$lambda.min

# Predictions at the lambda selected by cross-validation
lr_predictions <- predict(cv_fit, newx = x_test_mat, s = "lambda.min")


r_squared <- cor(lr_predictions, y_test)^2


lr_results <- data.frame(Model = "Lasso Regression",
                      RMSE = sqrt(mean((lr_predictions - y_test)^2)),
                      Rsquared = r_squared,
                      Lambda = best_lambda)

print(lr_results, row.names = FALSE)
##             Model     RMSE Rsquared   Lambda
##  Lasso Regression 14.24873 0.341191 2.537167
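
A side benefit of the lasso is sparsity. As a quick check (a sketch reusing cv_fit from above), the number of coefficients retained at lambda.min can be counted:

# Count non-zero coefficients (excluding the intercept) at the selected lambda
lasso_coefs <- coef(cv_fit, s = "lambda.min")
sum(lasso_coefs[-1, ] != 0)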

Lastly, I fit a ridge regression model, which gives a test set R2 of approximately 0.38.

ridge_fit <- train(permeability ~ ., data = train_df, 
                  method = 'glmnet',
                  metric = 'Rsquared',
                  tuneGrid = expand.grid(
                    alpha = 0,  # 0 = Ridge, 1 = Lasso
                    lambda = 10^seq(-3, 2, length = 20)  # Wider range of lambda
                  ),
                  trControl = trainControl(method = 'cv', number = 10),
                  preProcess = c('center','scale'))

# Plot
plot(ridge_fit, main = "Ridge Regression Tuning (glmnet)")

# Predictions
ridge_predictions <- predict(ridge_fit, newdata = x_test)
ridge_results <- data.frame(Model = "Ridge Regression",
                      RMSE = caret::RMSE(ridge_predictions, y_test),
                      Rsquared = caret::R2(ridge_predictions, y_test))

print(ridge_results, row.names = FALSE)
##             Model     RMSE  Rsquared
##  Ridge Regression 13.66522 0.3799685

Comparing the predictive models above (PLS, ridge regression, lasso regression, and elastic net) on the test set, the elastic net and ridge models have slightly higher R2 (roughly 0.38–0.39 vs. roughly 0.35 for PLS) and slightly lower RMSE, but the differences are small, so none offers a clear improvement over PLS. PLS also achieves the highest resampled R2 on the training set (close to 51%).
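
For a side-by-side view, the per-model result data frames built above can be combined into one table. This is a minimal sketch; it assumes pls_results, elas_net_results, lr_results, and ridge_results are still in the workspace:

# Combine the test-set results for all four models into one comparison table
all_results <- data.frame(
  Model    = c(pls_results$Model, elas_net_results$Model, lr_results$Model, ridge_results$Model),
  RMSE     = c(pls_results$RMSE, elas_net_results$RMSE, lr_results$RMSE, ridge_results$RMSE),
  Rsquared = c(pls_results$Rsquared, elas_net_results$Rsquared, lr_results$Rsquared, ridge_results$Rsquared)
)
print(all_results[order(all_results$RMSE), ], row.names = FALSE)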

f) Would you recommend any of your models to replace the permeability laboratory experiment?

No, I would not recommend any of these models to replace the permeability laboratory experiment. Even the best-performing models explain only a modest fraction of the variation in permeability on the test set (R2 of roughly 0.35–0.39), which is not accurate enough to substitute for the laboratory measurement; at best they could help prioritize compounds for testing.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

a) Start R and use these commands to load the data:

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176  58
b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

About 1% of the cells in the dataset are missing. I'll use KNN imputation to fill in the missing values; after imputation, 0% of the data is missing.

table(is.na(ChemicalManufacturingProcess))
## 
## FALSE  TRUE 
## 10102   106
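
As a quick sanity check on the "1%" figure, the proportion of missing cells can also be computed directly (a one-line sketch):

# Proportion of missing cells in the full data set (106 of 10,208 cells)
round(100 * mean(is.na(ChemicalManufacturingProcess)), 2)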
library(Amelia)   # for missmap()

missmap(ChemicalManufacturingProcess, 
        main = "Missing Data Pattern - Before Imputation",
        col = c("yellow", "black"))

# Pre-process with KNN imputation (note: knnImpute also centers and scales the data)
preproc <- preProcess(ChemicalManufacturingProcess, 
                      method = c("knnImpute"),
                      k = 5,  # Number of neighbors
                      knnSummary = mean)

# Apply the imputation
ChemicalManufacturingProcess_imputed <- predict(preproc, ChemicalManufacturingProcess)

# Check missing values 
print(table(is.na(ChemicalManufacturingProcess_imputed)))
## 
## FALSE 
## 10208
cat("Total missing:", sum(is.na(ChemicalManufacturingProcess_imputed)), "\n")
## Total missing: 0
missmap(ChemicalManufacturingProcess_imputed, 
        main = "Missing Data Pattern - After KNN Imputation",
        col = c("yellow", "black"))

c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I split the data into training and test sets with an 80/20 split. After tuning a PLS model, the optimum is at 1 component, with a cross-validated RMSE of 0.7809 and R2 of 0.4798 (the response is on the standardized scale produced by knnImpute).

set.seed(88)
preproc_impute <- preProcess(ChemicalManufacturingProcess, method = "knnImpute", k = 5)
data_imputed <- predict(preproc_impute, ChemicalManufacturingProcess)
train_index <- createDataPartition(data_imputed$Yield, p = 0.8, list = FALSE)

train_data <- data_imputed[train_index, ]
test_data <- data_imputed[-train_index, ]

preproc_params <- preProcess(train_data, method = c("center", "scale"))
train_processed <- predict(preproc_params, train_data)
test_processed <- predict(preproc_params, test_data)

#Tune PLS 
pls_model <- train(Yield ~ ., 
                  data = train_processed,
                  method = "pls",
                  tuneLength = 20,  # Test 20 components
                  trControl = trainControl(
                    method = "repeatedcv",
                    number = 10,     # 10-fold cross-validation
                    repeats = 5,     # 5 repetitions
                    savePredictions = "final"
                  ),
                  preProcess = c("center", "scale"))

optimal_components <- pls_model$bestTune$ncomp
optimal_performance <- pls_model$results |>
  filter(ncomp == optimal_components)

cat("Optimal number of components:", optimal_components, "\n")
## Optimal number of components: 1
cat("Cross-Validated RMSE:", round(optimal_performance$RMSE, 4), "\n")
## Cross-Validated RMSE: 0.7809
cat("Cross-Validated R-squared:", round(optimal_performance$Rsquared, 4), "\n")
## Cross-Validated R-squared: 0.4798
cat("Cross-Validated MAE:", round(optimal_performance$MAE, 4), "\n")
## Cross-Validated MAE: 0.6164
#Plot
ggplot(pls_model) +
  ggtitle(paste("PLS Model Tuning - Optimal at", optimal_components, "Components")) +
  xlab("Number of Components") +
  ylab("RMSE (Cross-Validation)") +
  theme_minimal()

d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The optimal number of components was 1, with a cross-validated RMSE of 0.7809 and R2 of 0.4798. The model is noticeably less accurate on unseen data (test RMSE 0.8906, R2 0.2599). This gap between the resampled and test-set performance is a sign of overfitting and should be addressed before trusting this model for real-world predictions.

test_predictions <- predict(pls_model, newdata = test_processed)
test_performance <- postResample(pred = test_predictions, obs = test_processed$Yield)

cat("Test RMSE:", round(test_performance["RMSE"], 4), "\n")
## Test RMSE: 0.8906
cat("Test R-squared:", round(test_performance["Rsquared"], 4), "\n")
## Test R-squared: 0.2599
cat("Test MAE:", round(test_performance["MAE"], 4), "\n")
## Test MAE: 0.7183
# Compare resampled (training) vs. test set performance
training_results <- data.frame(
  Metric = c("RMSE", "R-squared", "MAE"),
  Training_CV = c(
    round(optimal_performance$RMSE, 4),
    round(optimal_performance$Rsquared, 4),
    round(optimal_performance$MAE, 4)
  ),
  Test_Set = c(
    round(test_performance["RMSE"], 4),
    round(test_performance["Rsquared"], 4),
    round(test_performance["MAE"], 4)
  )
)

print(training_results, row.names = FALSE)
##     Metric Training_CV Test_Set
##       RMSE      0.7809   0.8906
##  R-squared      0.4798   0.2599
##        MAE      0.6164   0.7183
cat("The optimal PLS model uses", optimal_components, 
    "components with a cross-validated RMSE of", 
    round(optimal_performance$RMSE, 4), "and R² of", 
    round(optimal_performance$Rsquared, 4), "\n")
## The optimal PLS model uses 1 components with a cross-validated RMSE of 0.7809 and R² of 0.4798
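
To visualize the gap between the resampled and test-set performance, a quick predicted-versus-observed plot on the test set can help. This is a sketch reusing test_predictions and test_processed from above:

# Predicted vs. observed yield on the test set; points far from the dashed
# identity line indicate where the model misses
plot_df <- data.frame(Observed = test_processed$Yield,
                      Predicted = as.numeric(test_predictions))
ggplot(plot_df, aes(x = Observed, y = Predicted)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "PLS: Predicted vs. Observed Yield (Test Set)") +
  theme_minimal()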
e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Looking at the list of top predictors, the manufacturing process variables dominate the list.

library(tibble)   # for rownames_to_column()

varImp_2 <- varImp(pls_model)$importance |>
  as.data.frame() |>
  rownames_to_column(var = "Predictor") |>
  arrange(desc(Overall)) |> 
  top_n(10, Overall)
#  plot
varImp_2 |>
  mutate(Predictor = forcats::fct_inorder(Predictor)) |>
  ggplot(aes(x = Overall, y = Predictor)) + 
  geom_col(fill = "steelblue", alpha = 0.8) + 
  labs(
    title = "Top Important Predictors in PLS Model",
    x = "Variable Importance",
    y = "Predictor"
  ) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
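
To quantify the claim that process predictors dominate, the top-10 list can be split by predictor type. This is a sketch reusing varImp_2; it assumes names beginning with "Manufacturing" are process predictors and the rest are biological:

# Tally predictor types among the top-10 most important variables
varImp_2 |>
  mutate(Type = ifelse(grepl("^Manufacturing", Predictor), "Process", "Biological")) |>
  count(Type)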

f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

From the correlation plot below, there are strong relationships between the response Yield and several of the top predictors: ManufacturingProcess32 and ManufacturingProcess09 show strong positive correlations, whereas ManufacturingProcess36 and ManufacturingProcess13 are inversely correlated with Yield. Because the process predictors can be adjusted, these relationships suggest where to focus in future runs: increasing the settings of the positively correlated process steps and reducing, or tightly controlling, the negatively correlated ones could help improve yield.

response_var <- colnames(train_processed)[1]

train_processed |>
  select(all_of(c(response_var, varImp_2$Predictor))) |>
  cor(use = "complete.obs") |> 
  corrplot::corrplot(tl.cex = 0.6, 
                     method = "color",
                     title = paste("relationship:", response_var, "vs Predictors"),
                     mar = c(0, 0, 1, 0))
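
To put numbers on these relationships, the pairwise correlations between Yield and each top predictor can be computed directly (a sketch reusing train_processed and varImp_2 from above):

# Correlation of each top predictor with Yield, sorted from most positive to most negative
yield_cors <- sapply(varImp_2$Predictor,
                     function(p) cor(train_processed[[p]], train_processed$Yield))
sort(round(yield_cors, 2), decreasing = TRUE)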