library(AppliedPredictiveModeling)
library(caret)
library(pls)
data(permeability)
nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]
ncol(filtered_fingerprints)
## [1] 388
After filtering predictors with near-zero variance, we’re left with 388 predictors for modeling. This represents a substantial reduction from the original 1,107 predictors, which makes sense given the sparse nature of molecular fingerprints where many substructures are rarely present.
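For reference, the dimensions of the raw fingerprint matrix confirm the starting point of 165 compounds by 1,107 binary fingerprints:
dim(fingerprints)
## [1]  165 1107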
set.seed(123)
train_index <- createDataPartition(permeability, p = 0.75, list = FALSE)
train_x <- filtered_fingerprints[train_index, ]
train_y <- permeability[train_index]
test_x <- filtered_fingerprints[-train_index, ]
test_y <- permeability[-train_index]
ctrl <- trainControl(method = "cv", number = 10)
# Tune the number of PLS components (1-20) with 10-fold CV,
# centering and scaling the predictors within each resample
pls_model <- train(train_x, train_y,
                   method = "pls",
                   tuneLength = 20,
                   trControl = ctrl,
                   preProcess = c("center", "scale"))
plot(pls_model, main = "PLS Model Performance Across Components")
pls_model$bestTune
## ncomp
## 5 5
pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, ]
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 5 5 11.36902 0.5208998 8.524744 3.08177 0.254211 2.361927
The optimal PLS model uses 5 latent variables with a cross-validated R² of 0.521. It’s interesting that the model achieves reasonable performance with relatively few components, suggesting the molecular fingerprints contain structured information that can be efficiently compressed.
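To see how efficient that compression is, the pls package (already loaded above) can report the percentage of predictor variance captured by each latent variable; a quick diagnostic, output not shown:
# Percent of fingerprint variance explained by each of the 5 components
explvar(pls_model$finalModel)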
pls_pred <- predict(pls_model, test_x)
pls_test_r2 <- cor(pls_pred, test_y)^2
pls_test_rmse <- sqrt(mean((pls_pred - test_y)^2))
plot(test_y, pls_pred,
     xlab = "Observed Permeability",
     ylab = "Predicted Permeability",
     main = "PLS: Predicted vs Observed on Test Set")
abline(0, 1, col = "red", lty = 2)
text(40, 5, paste("R² =", round(pls_test_r2, 3)))
pls_test_r2
## [1] 0.3593983
The test set R² is 0.359, noticeably lower than the cross-validated R² of 0.521. Some drop is expected with only about 40 held-out compounds, but it suggests the resampling estimate was somewhat optimistic; the model still generalizes with useful, if modest, accuracy.
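As a cross-check on the manual calculations, caret's postResample() returns RMSE, R², and MAE in one call (it computes R² the same way, as the squared correlation between predictions and observations):
postResample(pred = pls_pred, obs = test_y)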
# Principal component regression with up to 20 components
pcr_model <- train(train_x, train_y,
                   method = "pcr",
                   tuneLength = 20,
                   trControl = ctrl,
                   preProcess = c("center", "scale"))
# Ridge regression over 15 values of the penalty
ridge_model <- train(train_x, train_y,
                     method = "ridge",
                     tuneLength = 15,
                     trControl = ctrl,
                     preProcess = c("center", "scale"))
# Lasso via glmnet (alpha = 1) over a fixed lambda grid
lasso_model <- train(train_x, train_y,
                     method = "glmnet",
                     tuneGrid = expand.grid(alpha = 1,
                                            lambda = seq(0.01, 2, length = 15)),
                     trControl = ctrl,
                     preProcess = c("center", "scale"))
# Elastic net via glmnet with the mixing parameter fixed at alpha = 0.5
elastic_model <- train(train_x, train_y,
                       method = "glmnet",
                       tuneGrid = expand.grid(alpha = 0.5,
                                              lambda = seq(0.01, 2, length = 10)),
                       trControl = ctrl,
                       preProcess = c("center", "scale"))
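The lasso and elastic net fits above fix alpha at 1 and 0.5, respectively. An alternative, sketched here but not used in the comparison below (enet_grid and glmnet_tuned are new names), is to let caret tune alpha and lambda jointly:
# Sketch: search the mixing parameter and penalty together
enet_grid <- expand.grid(alpha = seq(0, 1, by = 0.25),
                         lambda = seq(0.01, 2, length = 10))
glmnet_tuned <- train(train_x, train_y,
                      method = "glmnet",
                      tuneGrid = enet_grid,
                      trControl = ctrl,
                      preProcess = c("center", "scale"))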
pcr_pred <- predict(pcr_model, test_x)
ridge_pred <- predict(ridge_model, test_x)
lasso_pred <- predict(lasso_model, test_x)
elastic_pred <- predict(elastic_model, test_x)
pcr_test_r2 <- cor(pcr_pred, test_y)^2
ridge_test_r2 <- cor(ridge_pred, test_y)^2
lasso_test_r2 <- cor(lasso_pred, test_y)^2
elastic_test_r2 <- cor(elastic_pred, test_y)^2
results_df <- data.frame(
  Model = c("PLS", "PCR", "Ridge", "Lasso", "Elastic Net"),
  Test_R2 = c(pls_test_r2, pcr_test_r2, ridge_test_r2,
              lasso_test_r2, elastic_test_r2),
  Test_RMSE = c(pls_test_rmse,
                sqrt(mean((pcr_pred - test_y)^2)),
                sqrt(mean((ridge_pred - test_y)^2)),
                sqrt(mean((lasso_pred - test_y)^2)),
                sqrt(mean((elastic_pred - test_y)^2)))
)
results_df
## Model Test_R2 Test_RMSE
## 1 PLS 0.3593983 12.16543
## 2 PCR 0.3157953 12.00417
## 3 Ridge 0.3886867 12.91751
## 4 Lasso 0.3390555 11.49045
## 5 Elastic Net 0.3542840 11.37516
Comparing the linear regression methods from Chapter 6, ridge regression achieves the best test set R² (0.389) but also the worst test RMSE (12.92), while the elastic net has the lowest RMSE (11.38); no single model dominates on both metrics. The differences are modest overall, suggesting these approaches all capture similar underlying relationships in the molecular fingerprint data.
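Because the test set holds only about 40 compounds, single-split rankings like these are noisy. A complementary check, sketched below (strictly paired comparisons would require the models to share CV folds, e.g. via trainControl's index argument), is to compare the resampling distributions with caret's resamples():
# Compare cross-validated RMSE and R-squared across the five models
cv_compare <- resamples(list(PLS = pls_model, PCR = pcr_model,
                             Ridge = ridge_model, Lasso = lasso_model,
                             ENet = elastic_model))
summary(cv_compare)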
Based on the test set R² of approximately 0.39, the model explains about 39% of the variance in permeability. While this provides useful information, the remaining unexplained variance means predictions could be off by a meaningful amount. For initial screening to eliminate clearly poor candidates, this could be valuable. However, for final decision-making, laboratory experiments would still be necessary to confirm permeability values. The model serves best as a complement to reduce the number of compounds requiring full experimental testing rather than a complete replacement.
data(ChemicalManufacturingProcess)
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
# Impute missing values via k-nearest neighbors
# (note: knnImpute also centers and scales the predictors)
preproc_impute <- preProcess(ChemicalManufacturingProcess[, -1],
                             method = c("knnImpute"))
imputed_predictors <- predict(preproc_impute,
                              ChemicalManufacturingProcess[, -1])
sum(is.na(imputed_predictors))
## [1] 0
The dataset initially contained 106 missing values. After k-nearest neighbors imputation, all missing values have been filled, allowing us to proceed with modeling.
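For context on where those 106 gaps sit, a quick tabulation (a sketch, output not shown; na_counts is a new helper name) lists the predictors with the most missing values:
na_counts <- colSums(is.na(ChemicalManufacturingProcess[, -1]))
head(sort(na_counts, decreasing = TRUE))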
set.seed(456)
train_idx <- createDataPartition(ChemicalManufacturingProcess$Yield,
                                 p = 0.75, list = FALSE)
train_x_chem <- imputed_predictors[train_idx, ]
train_y_chem <- ChemicalManufacturingProcess$Yield[train_idx]
test_x_chem <- imputed_predictors[-train_idx, ]
test_y_chem <- ChemicalManufacturingProcess$Yield[-train_idx]
ctrl_chem <- trainControl(method = "cv", number = 10)
pls_chem <- train(train_x_chem, train_y_chem,
                  method = "pls",
                  tuneLength = 20,
                  trControl = ctrl_chem,
                  preProcess = c("center", "scale"))
plot(pls_chem, main = "Manufacturing Process: PLS Cross-Validation")
pls_chem$bestTune
## ncomp
## 1 1
max(pls_chem$results$Rsquared)
## [1] 0.5597893
The model selected by cross-validated RMSE uses a single latent variable, and the best cross-validated R² across the tuning grid is 0.56. Note that caret picks ncomp by RMSE, so the R² at ncomp = 1 may differ slightly from this grid-wide maximum; either way, the resampled fit suggests a meaningful, though not overwhelming, relationship between the process and biological predictors and yield.
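To read off the metrics actually attained at the selected component count (rather than the grid maximum), the results table can be filtered as before; output not shown:
pls_chem$results[pls_chem$results$ncomp == pls_chem$bestTune$ncomp, ]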
pred_chem <- predict(pls_chem, test_x_chem)
test_r2_chem <- cor(pred_chem, test_y_chem)^2
test_rmse_chem <- sqrt(mean((pred_chem - test_y_chem)^2))
plot(test_y_chem, pred_chem,
     xlab = "Actual Yield (%)",
     ylab = "Predicted Yield (%)",
     main = "Manufacturing Yield: Predicted vs Actual")
abline(0, 1, col = "blue", lty = 2)
text(42, 35, paste("Test R² =", round(test_r2_chem, 3)))
test_r2_chem
## [1] 0.2022605
test_rmse_chem
## [1] 1.899277
The test set R² is 0.202, notably lower than the cross-validated R² of 0.56. With a test set of only about 44 samples, some of this gap reflects split-to-split variability, but it also suggests the resampled estimate is optimistic and the model may be overfitting the training data.
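With roughly 130 training samples, a single run of 10-fold CV can give an unstable estimate. One mitigation, sketched here under the same tuning setup (ctrl_repeat and pls_chem_repeat are new names), is repeated cross-validation:
# Repeated 10-fold CV (5 repeats) for a more stable resampling estimate
ctrl_repeat <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
pls_chem_repeat <- train(train_x_chem, train_y_chem,
                         method = "pls",
                         tuneLength = 20,
                         trControl = ctrl_repeat,
                         preProcess = c("center", "scale"))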
var_imp <- varImp(pls_chem)
plot(var_imp, top = 20, main = "Top 20 Most Influential Predictors")
top_predictors <- rownames(var_imp$importance)[
  order(var_imp$importance$Overall, decreasing = TRUE)[1:20]]
manufacturing_vars <- grep("^Manufacturing", top_predictors, value = TRUE)
biological_vars <- grep("^Biological", top_predictors, value = TRUE)
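Counting each group gives the split discussed below:
length(manufacturing_vars)
## [1] 12
length(biological_vars)
## [1] 8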
Among the top 20 predictors, 12 are manufacturing process variables and 8 are biological material measurements. The manufacturing predictors hold the majority, which is encouraging from a practical standpoint: unlike the biological starting materials, these variables can actually be controlled and adjusted during production.
top_4_predictors <- rownames(var_imp$importance)[
  order(var_imp$importance$Overall, decreasing = TRUE)[1:4]]
par(mfrow = c(2, 2))
# Scatter plot of yield against each of the four most important
# predictors, with a simple least-squares fit overlaid
for (pred in top_4_predictors) {
  plot(imputed_predictors[, pred], ChemicalManufacturingProcess$Yield,
       xlab = pred,
       ylab = "Yield (%)",
       main = paste("Yield vs", pred))
  abline(lm(ChemicalManufacturingProcess$Yield ~ imputed_predictors[, pred]),
         col = "red")
}
par(mfrow = c(1, 1))
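A quick numeric companion to the plots (a sketch, output not shown) is each top predictor's correlation with yield:
# Pearson correlation of each top-4 predictor with yield
sapply(top_4_predictors,
       function(p) cor(imputed_predictors[, p],
                       ChemicalManufacturingProcess$Yield))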
The scatter plots show clear relationships between the top predictors and yield; the fitted lines reveal pronounced positive or negative trends for several manufacturing process variables. This suggests practical optimization strategies: monitoring these key process parameters closely and adjusting them within operational constraints could produce meaningful yield improvements. Because manufacturing variables lead the importance rankings and are controllable, focusing optimization effort on these specific process conditions offers the most direct path to boosting yields and increasing revenue.