6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

library(AppliedPredictiveModeling)
library(caret)   # provides nearZeroVar(), createDataPartition(), train(), etc.
data(permeability)

dim(fingerprints)  
## [1]  165 1107
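Before filtering, it is worth a quick look at the response itself; the permeability values are strongly right-skewed, which is useful context for the linear models below. This check is my own addition, not part of the exercise:

hist(permeability, main = "Permeability distribution", xlab = "Permeability")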
  2. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling? (A quick sparsity check is sketched after the filtering code below.)
nzv_indices <- nearZeroVar(fingerprints)   # indices of near-zero-variance predictors
filtered_fingerprints <- fingerprints[, -nzv_indices]

dim(filtered_fingerprints)  
## [1] 165 388

Filtering removes 719 near-zero-variance predictors, leaving 388 of the original 1,107 for modeling.
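As a quick illustration of the sparsity claim (my own check, not part of the exercise), the fraction of compounds containing each substructure can be computed directly from the binary matrix:

substructure_freq <- colMeans(fingerprints)  # fraction of compounds containing each substructure
summary(substructure_freq)
hist(substructure_freq, xlab = "Fraction of compounds", main = "Substructure frequency")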
  3. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?
perm_data <- data.frame(permeability = permeability, fingerprints)

set.seed(623)
trainIndex <- createDataPartition(perm_data$permeability, p = 0.7, list = FALSE)
train_data <- perm_data[trainIndex, ]
test_data <- perm_data[-trainIndex, ]

# Recompute the near-zero-variance filter on the training set only, so no
# information from the test set leaks into the preprocessing, then apply the
# same filter to both sets. Column 1 is the response, hence the nzv + 1 offset.
nzv <- nearZeroVar(train_data[, -1])
train_filtered <- train_data[, -c(1, nzv + 1)]   # predictors only
test_filtered <- test_data[, -c(1, nzv + 1)]


pls_model <- train(
  x = train_filtered,
  y = train_data$permeability,
  method = "pls",
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  tuneLength = 20,                                       # try 1-20 latent variables
  preProc = c("center", "scale")                         # applied within each resample
)

optimal_components <- pls_model$bestTune$ncomp
cat("Optimal number of latent variables:", optimal_components, "\n")
## Optimal number of latent variables: 2
resampled_results <- pls_model$results  
resampled_r2 <- resampled_results[resampled_results$ncomp == optimal_components, "Rsquared"]
cat("Resampled estimate of R^2:", resampled_r2, "\n")
## Resampled estimate of R^2: 0.5120463
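The choice of two latent variables is easy to sanity-check visually; caret's plot method for train objects draws the cross-validated metric against the number of components. This diagnostic is my addition:

plot(pls_model, metric = "Rsquared")  # resampled R^2 vs. number of PLS components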
  4. Predict the response for the test set. What is the test set estimate of R^2?
pls_pred <- predict(pls_model, newdata = test_filtered)
test_r2 <- cor(test_data$permeability, pls_pred)^2
cat("PLS Test R^2:", test_r2, "\n")
## PLS Test R^2: 0.3603431
test_rmse <- sqrt(mean((test_data$permeability - pls_pred)^2))
cat("Test set RMSE:", test_rmse, "\n")
## Test set RMSE: 12.51836
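As a cross-check on the manual calculations above, caret's postResample() returns RMSE, R^2, and MAE in one call (an alternative I added, not part of the original solution; its R^2 is the squared correlation, so it matches the computation above):

postResample(pred = pls_pred, obs = test_data$permeability)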
  5. Try building other models discussed in this chapter. Do any have better predictive performance?

Yes. Of the models tried below, the lasso (glmnet with alpha fixed at 1, a special case of the elastic net) performs best on the test set, with the lowest RMSE (11.54) and the highest R^2 (0.458), ahead of both the tuned elastic net and the PLS model.

enet_model <- train(
  x = train_filtered, y = train_data$permeability,
  method = "glmnet", trControl = trainControl(method = "cv", number = 10),
  tuneGrid = expand.grid(alpha = seq(0, 1, length = 5),   # alpha = 0 is ridge, alpha = 1 is lasso
                         lambda = seq(0, 1, length = 20)),
  preProc = c("center", "scale")
)
enet_pred <- predict(enet_model, newdata = test_filtered)
enet_test_r2 <- cor(test_data$permeability, enet_pred)^2
cat("Elastic Net Test R^2:", enet_test_r2, "\n")
## Elastic Net Test R^2: 0.4152464
test2_rmse <- sqrt(mean((test_data$permeability - enet_pred)^2))
cat("Test set RMSE:", test2_rmse, "\n")
## Test set RMSE: 12.11642
lasso_model <- train(
  x = train_filtered, y = train_data$permeability,
  method = "glmnet", trControl = trainControl(method = "cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0, 1, length = 20)),  # alpha = 1: pure lasso
  preProc = c("center", "scale")
)
lasso_pred <- predict(lasso_model, newdata = test_filtered)
lasso_test_r2 <- cor(test_data$permeability, lasso_pred)^2
cat("Lasso Test R^2:", lasso_test_r2, "\n")
## Lasso Test R^2: 0.4581254
test3_rmse <- sqrt(mean((test_data$permeability - lasso_pred)^2))
cat("Test set RMSE:", test3_rmse, "\n")
## Test set RMSE: 11.53593
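The three models can also be compared on their resampling distributions with caret's resamples(). A minimal sketch (my addition; for a strictly apples-to-apples comparison the models should share fold assignments, e.g., by supplying the same index list to trainControl):

model_list <- list(PLS = pls_model, ENet = enet_model, Lasso = lasso_model)
summary(resamples(model_list))  # side-by-side RMSE, R^2, and MAE across folds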
  6. Would you recommend any of your models to replace the permeability laboratory experiment?

Probably not as a full replacement: even the best model here (the lasso) explains less than half of the test-set variance (R^2 of about 0.46), so its predictions are too imprecise to substitute for the assay. It could, however, serve as an inexpensive pre-screen to prioritize which compounds are sent to the laboratory.

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process) for the 176 manufacturing runs. yield contains the percent yield for each run.

The dataset name given in the book, data(chemicalManufacturing), is incorrect or has been updated; the package actually ships data(ChemicalManufacturingProcess), which I used. It is a single 58-column data frame: Yield plus the 57 predictors.

library(AppliedPredictiveModeling)
# data(package = "AppliedPredictiveModeling")  # lists the datasets shipped with the package
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176  58
  2. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
processPredictors <- ChemicalManufacturingProcess[, 2:58]   # column 1 is Yield
imputation_model <- preProcess(processPredictors, method = "knnImpute")
# Note: caret's knnImpute centers and scales the predictors as a side effect,
# since k-nearest-neighbor distances require standardized data.
processPredictors_imputed <- predict(imputation_model, processPredictors)
ChemicalManufacturingProcess_imputed <- data.frame(Yield = ChemicalManufacturingProcess[, 1],
                                                   processPredictors_imputed)

sum(is.na(processPredictors_imputed))
## [1] 0
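It can also help to see where the 106 missing cells are concentrated. This quick check is my own addition, not part of the exercise:

na_counts <- colSums(is.na(processPredictors))  # missing cells per predictor
head(sort(na_counts, decreasing = TRUE))        # predictors with the most NAs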
  3. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(633)

trainIndex <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.7, list = FALSE)
trainData <- ChemicalManufacturingProcess[trainIndex, ]
testData <- ChemicalManufacturingProcess[-trainIndex, ]

trainPredictors <- trainData[, 2:58] 
trainYield <- trainData[, 1]
testPredictors <- testData[, 2:58]
testYield <- testData[, 1]

preProc <- preProcess(trainPredictors, method = c("knnImpute", "center", "scale"))

trainPredictorsProc <- predict(preProc, trainPredictors)
testPredictorsProc <- predict(preProc, testPredictors)

trainDataProc <- data.frame(Yield = trainYield, trainPredictorsProc)
testDataProc <- data.frame(Yield = testYield, testPredictorsProc)

tuneGrid <- expand.grid(mtry = seq(5, 25, by = 5))   # predictors sampled at each split

rfModel <- train(Yield ~ .,
                 data = trainDataProc,
                 method = "rf",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = tuneGrid,
                 ntree = 500)

print(rfModel)
## Random Forest 
## 
## 124 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 100, 99, 100, 99, 98 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    5    1.240684  0.6303002  0.9718862
##   10    1.217709  0.6298186  0.9346331
##   15    1.189827  0.6433231  0.9051572
##   20    1.195483  0.6336262  0.9077871
##   25    1.193571  0.6320215  0.9023918
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 15.
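The tuning profile can also be inspected visually with caret's plot method for train objects (my addition):

plot(rfModel)  # cross-validated RMSE vs. mtry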
optimalRMSE <- min(rfModel$results$RMSE)
cat("Optimal Cross-Validated RMSE:", optimalRMSE, "\n")
## Optimal Cross-Validated RMSE: 1.189827
  4. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
resampledRMSE <- min(rfModel$results$RMSE)
cat("Resampled RMSE from Training Set (5-fold CV):", resampledRMSE, "\n")
## Resampled RMSE from Training Set (5-fold CV): 1.189827
testPredictions <- predict(rfModel, newdata = testDataProc)

testRMSE <- sqrt(mean((testYield - testPredictions)^2))
cat("Test Set RMSE:", testRMSE, "\n")
## Test Set RMSE: 1.084039
difference <- testRMSE - resampledRMSE
cat("Difference (Test RMSE - Resampled RMSE):", difference, "\n")
## Difference (Test RMSE - Resampled RMSE): -0.105788

The test-set RMSE is slightly lower than the cross-validated RMSE, so the model generalizes to the held-out data at least as well as the resampling estimate suggested; there is no sign of overfitting.
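For context (my addition), a useful yardstick is the RMSE of a mean-only baseline that always predicts the average training-set yield, which is roughly the standard deviation of the yield values:

baselineRMSE <- sqrt(mean((testYield - mean(trainYield))^2))
cat("Baseline (mean-only) RMSE:", baselineRMSE, "\n")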
  5. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

No, the process predictors dominate the list. ManufacturingProcess32 has by far the highest importance score and ManufacturingProcess13 ranks second; in aggregate, process predictors account for roughly 65% of the total importance versus about 34% for the biological predictors.

The top biological predictors (BiologicalMaterial03, BiologicalMaterial12, BiologicalMaterial06) also rank highly and contribute substantially to the model's predictive accuracy, but they trail the leading process predictors.

importance <- varImp(rfModel, scale = FALSE)  
importance_df <- importance$importance
importance_df$Predictor <- rownames(importance_df)
colnames(importance_df)[1] <- "Importance"

importance_df <- importance_df[order(-importance_df$Importance), ]

biological_predictors <- paste0("BiologicalMaterial", sprintf("%02d", 1:12))
process_predictors <- setdiff(names(ChemicalManufacturingProcess)[2:58], biological_predictors)

importance_df$Category <- ifelse(importance_df$Predictor %in% biological_predictors, "Biological", "Process")

print(head(importance_df, 10))
##                        Importance              Predictor   Category
## ManufacturingProcess32  67.178963 ManufacturingProcess32    Process
## ManufacturingProcess13  37.171125 ManufacturingProcess13    Process
## BiologicalMaterial03    31.913118   BiologicalMaterial03 Biological
## BiologicalMaterial12    31.220375   BiologicalMaterial12 Biological
## BiologicalMaterial06    27.568548   BiologicalMaterial06 Biological
## ManufacturingProcess09  19.923338 ManufacturingProcess09    Process
## ManufacturingProcess17  18.214656 ManufacturingProcess17    Process
## BiologicalMaterial02    12.491904   BiologicalMaterial02 Biological
## BiologicalMaterial11    12.161797   BiologicalMaterial11 Biological
## ManufacturingProcess31   9.861275 ManufacturingProcess31    Process
bio_importance <- sum(importance_df$Importance[importance_df$Category == "Biological"])
proc_importance <- sum(importance_df$Importance[importance_df$Category == "Process"])
cat("Total Importance - Biological:", bio_importance, "\n")
## Total Importance - Biological: 149.2273
cat("Total Importance - Process:", proc_importance, "\n")
## Total Importance - Process: 283.5934
cat("Percentage Biological:", bio_importance / (bio_importance + proc_importance) * 100, "%\n")
## Percentage Biological: 34.47786 %
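caret can also draw the ranked importances directly, which makes the process/biological split easy to see at a glance (a visual I added):

plot(varImp(rfModel, scale = FALSE), top = 15)  # top 15 predictors by importance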
  6. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Based on the partial dependence plots (PDPs), most of the top predictors show a rising curve: increasing them leads to a higher predicted yield. The exception is ManufacturingProcess13, whose falling curve indicates that larger values are associated with a lower predicted yield.

These plots show how each variable influences the model's yield predictions. Because the manufacturing process predictors can actually be adjusted (unlike the biological ones), this suggests concrete levers for future runs: increase the process settings that are positively associated with yield and reduce ManufacturingProcess13.

importance <- varImp(rfModel, scale = FALSE)
top_predictors <- rownames(importance$importance)[order(-importance$importance[, 1])][1:5]

library(pdp)   # provides partial() for partial dependence plots

par(mfrow = c(3, 2))
for (pred in top_predictors) {
  # Partial dependence: predicted yield as a function of one predictor,
  # averaging over the remaining predictors in the training data.
  pd <- partial(rfModel$finalModel, pred.var = pred, train = trainDataProc)
  plot(pd, main = paste("PDP for", pred), xlab = pred, ylab = "Predicted Yield")
}

par(mfrow = c(3, 2))

for (pred in top_predictors) {
  # Raw relationship in the training data, with a lowess smoother overlaid.
  plot(trainDataProc[[pred]], trainDataProc$Yield,
       xlab = pred, ylab = "Yield", main = paste("Scatterplot for", pred))
  lines(lowess(trainDataProc[[pred]], trainDataProc$Yield), col = "blue")
}
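As a numeric complement to the plots (my addition), the linear association of each top predictor with yield can be summarized with simple correlations on the processed training data:

sapply(top_predictors, function(p) cor(trainDataProc[[p]], trainDataProc$Yield))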