6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
library(AppliedPredictiveModeling)
data(permeability)
nrow(fingerprints)
## [1] 165
ncol(fingerprints)
## [1] 1107
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
After applying the nearZeroVar function, 388 predictors are left for modeling.
library(caret)
library(dplyr)   # used below for filter(), select(), and arrange()
# Identify near-zero-variance predictors
nzv_result <- nearZeroVar(fingerprints, saveMetrics = TRUE)
# Summary of the filtering
table(nzv_result$nzv)
##
## FALSE TRUE
## 388 719
# Final answer
predictors_left <- sum(!nzv_result$nzv)
cat("Predictors left:", predictors_left, "\n")
## Predictors left: 388
# Create the new dataset
fingerprints_nearZeroVar <- fingerprints[, !nzv_result$nzv]
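As a quick sanity check (not part of the original output), the dimensions of the filtered matrix can be confirmed; it should have 165 rows (compounds) and 388 columns (retained predictors):
# Sanity check: 165 compounds x 388 retained predictors
dim(fingerprints_nearZeroVar)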
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
Using an 80/20 training/test split, the model attains its lowest resampled RMSE at 2 components, which captures close to 51% of the variation in permeability.
perm <- data.frame(permeability = permeability, fingerprints_nearZeroVar)
set.seed(88)
train_index <- createDataPartition(perm$permeability, times = 1, p = 0.8, list = FALSE)
train_df <- perm[train_index, ]
test_df <- perm[-train_index, ]
pls_model <- train(permeability ~ ., data = train_df, method = "pls", tuneLength = 24,
trControl = trainControl(method = "repeatedcv", repeats = 8),
preProcess = c("center", "scale"))
# Plot RMSE vs. number of components
ggplot(pls_model) +
xlab("Number of Components") +
theme(plot.title = element_blank())
cat("Training Set at optimal model with the smallest RMSE at",
pls_model$bestTune$ncomp, "components\n")
## Training Set at optimal model with the smallest RMSE at 2 components
pls_model$results |>
filter(ncomp == pls_model$bestTune$ncomp) |>
select(ncomp, RMSE, Rsquared) |>
print(row.names = FALSE)
## ncomp RMSE Rsquared
## 2 11.1205 0.5096598
(d) Predict the response for the test set. What is the test set estimate of R2?
The test set estimate of R2 is approximately 0.35.
x_test <- test_df[, -1] # Remove the first column (permeability)
y_test <- test_df$permeability
pls_predictions <- predict(pls_model, newdata = x_test)
pls_results <- data.frame(
Model = "PLS",
RMSE = caret::RMSE(pls_predictions, y_test),
Rsquared = caret::R2(pls_predictions, y_test)
)
print(pls_results, row.names = FALSE)
## Model RMSE Rsquared
## PLS 13.87306 0.3462492
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
First, I tried an elastic net regression model, which gives a test set R2 of about 0.39.
elas_net_model <- train(permeability ~ ., data = train_df, method = "glmnet",
trControl = trainControl("repeatedcv", repeats = 8),
tuneLength = 4)
# Make predictions with Elastic Net
elas_net_pred <- predict(elas_net_model, x_test)
# Model performance metrics using postResample
elas_net_performance <- postResample(pred = elas_net_pred, obs = test_df$permeability)
print(elas_net_performance)
## RMSE Rsquared MAE
## 13.5393766 0.3865869 9.1680644
elas_net_results <- data.frame(
Model = "Elastic Net",
RMSE = elas_net_performance["RMSE"],
Rsquared = elas_net_performance["Rsquared"]
)
Second, I used lasso regression, which gives a test set R2 of about 0.34.
# Lasso regression via glmnet
library(glmnet)   # cv.glmnet() is from the glmnet package
x_train <- as.matrix(train_df[, -which(names(train_df) == "permeability")])
y_train <- train_df$permeability
x_test_mat <- as.matrix(x_test)
# Fit glmnet
cv_fit <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)
best_lambda <- cv_fit$lambda.min
# predictions
lr_predictions <- predict(cv_fit, newx = x_test_mat, s = "lambda.min")
r_squared <- cor(lr_predictions, y_test)^2
lr_results <- data.frame(Model = "Lasso Regression",
RMSE = sqrt(mean((lr_predictions - y_test)^2)),
Rsquared = r_squared,
Lambda = best_lambda)
print(lr_results, row.names = FALSE)
## Model RMSE Rsquared Lambda
## Lasso Regression 14.24873 0.341191 2.537167
Lastly, I fit a ridge regression model, which gives a test set R2 of about 0.38.
ridge_fit <- train(permeability ~ ., data = train_df,
method = 'glmnet',
metric = 'Rsquared',
tuneGrid = expand.grid(
alpha = 0, # 0 = Ridge, 1 = Lasso
lambda = 10^seq(-3, 2, length = 20) # Wider range of lambda
),
trControl = trainControl(method = 'cv', number = 10),
preProcess = c('center','scale'))
# Plot
plot(ridge_fit, main = "Ridge Regression Tuning (glmnet)")
# Predictions
ridge_predictions <- predict(ridge_fit, newdata = x_test)
ridge_results <- data.frame(Model = "Ridge Regression",
RMSE = caret::RMSE(ridge_predictions, y_test),
Rsquared = caret::R2(ridge_predictions, y_test))
print(ridge_results, row.names = FALSE)
## Model RMSE Rsquared
## Ridge Regression 13.66522 0.3799685
No. Comparing the models above (PLS, ridge regression, lasso regression, and elastic net regression), PLS achieves the best resampled R2 (close to 51%) and the lowest resampled RMSE, and the test-set differences between the models are modest.
I would not recommend any of the other models: taken together, the results above indicate that PLS offers the best overall combination of RMSE and R2, so it remains the model I would use.
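To make that comparison explicit, the individual result tables built above can be stacked into one summary. This is a small sketch using the *_results objects already created in this section (output not shown):
# Collect the test-set metrics from the models above into one table
lasso_row <- data.frame(Model = "Lasso Regression",
                        RMSE = as.numeric(lr_results$RMSE),
                        Rsquared = as.numeric(lr_results$Rsquared))
all_results <- rbind(pls_results, elas_net_results, lasso_row, ridge_results)
# Sort by test-set RMSE so the strongest model appears first
print(all_results[order(all_results$RMSE), ], row.names = FALSE)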
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176 58
About 1% of the values in the dataset are missing. I'll use KNN imputation to fill in the missing data; after imputation, no missing values remain.
table(is.na(ChemicalManufacturingProcess))
##
## FALSE TRUE
## 10102 106
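To back up the "about 1%" figure, the proportion of missing cells can be computed directly (a quick check; output not shown):
# Fraction of missing cells across the whole dataset (106 of 10,208, roughly 1%)
mean(is.na(ChemicalManufacturingProcess)) * 100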
library(Amelia)   # missmap() is from the Amelia package
missmap(ChemicalManufacturingProcess,
main = "Missing Data Pattern - Before Imputation",
col = c("yellow", "black"))
# Pre-process: impute missing values with KNN
preproc <- preProcess(ChemicalManufacturingProcess,
method = c("knnImpute"),
k = 5, # Number of neighbors
knnSummary = mean)
# Apply imputation to the full dataset
ChemicalManufacturingProcess_imputed <- predict(preproc, ChemicalManufacturingProcess)
# Check missing values
print(table(is.na(ChemicalManufacturingProcess_imputed)))
##
## FALSE
## 10208
cat("Total missing:", sum(is.na(ChemicalManufacturingProcess_imputed)), "\n")
## Total missing: 0
missmap(ChemicalManufacturingProcess_imputed,
main = "Missing Data Pattern - After KNN Imputation",
col = c("yellow", "black"))
I split the data into training and test sets with an 80/20 split. After tuning the model, it is optimal at 1 component. I'll use RMSE and R-squared as the performance metrics.
set.seed(88)
preproc_impute <- preProcess(ChemicalManufacturingProcess, method = "knnImpute", k = 5)
data_imputed <- predict(preproc_impute, ChemicalManufacturingProcess)
train_index <- createDataPartition(data_imputed$Yield, p = 0.8, list = FALSE)
train_data <- data_imputed[train_index, ]
test_data <- data_imputed[-train_index, ]
preproc_params <- preProcess(train_data, method = c("center", "scale"))
train_processed <- predict(preproc_params, train_data)
test_processed <- predict(preproc_params, test_data)
# Tune the PLS model
pls_model <- train(Yield ~ .,
data = train_processed,
method = "pls",
tuneLength = 20, # Test 20 components
trControl = trainControl(
method = "repeatedcv",
number = 10, # 10-fold cross-validation
repeats = 5, # 5 repetitions
savePredictions = "final"
),
preProcess = c("center", "scale"))
optimal_components <- pls_model$bestTune$ncomp
optimal_performance <- pls_model$results |>
filter(ncomp == optimal_components)
cat("Optimal number of components:", optimal_components, "\n")
## Optimal number of components: 1
cat("Cross-Validated RMSE:", round(optimal_performance$RMSE, 4), "\n")
## Cross-Validated RMSE: 0.7809
cat("Cross-Validated R-squared:", round(optimal_performance$Rsquared, 4), "\n")
## Cross-Validated R-squared: 0.4798
cat("Cross-Validated MAE:", round(optimal_performance$MAE, 4), "\n")
## Cross-Validated MAE: 0.6164
# Plot the tuning profile
ggplot(pls_model) +
ggtitle(paste("PLS Model Tuning - Optimal at", optimal_components, "Components")) +
xlab("Number of Components") +
ylab("RMSE (Cross-Validation)") +
theme_minimal()
The optimal number of components is 1, with a cross-validated RMSE of 0.7809 and R-squared of 0.4798, but the model is noticeably less accurate on unseen data. The large gap between the cross-validated and test-set performance is a concern that needs to be addressed before trusting this model for real-world predictions; it is a sign of overfitting, since the model performs much better on the training data than on the test data.
test_predictions <- predict(pls_model, newdata = test_processed)
test_performance <- postResample(pred = test_predictions, obs = test_processed$Yield)
cat("Test RMSE:", round(test_performance["RMSE"], 4), "\n")
## Test RMSE: 0.8906
cat("Test R-squared:", round(test_performance["Rsquared"], 4), "\n")
## Test R-squared: 0.2599
cat("Test MAE:", round(test_performance["MAE"], 4), "\n")
## Test MAE: 0.7183
# Compare cross-validated and test-set performance
training_results <- data.frame(
Metric = c("RMSE", "R-squared", "MAE"),
Training_CV = c(
round(optimal_performance$RMSE, 4),
round(optimal_performance$Rsquared, 4),
round(optimal_performance$MAE, 4)
),
Test_Set = c(
round(test_performance["RMSE"], 4),
round(test_performance["Rsquared"], 4),
round(test_performance["MAE"], 4)
)
)
print(training_results, row.names = FALSE)
## Metric Training_CV Test_Set
## RMSE 0.7809 0.8906
## R-squared 0.4798 0.2599
## MAE 0.6164 0.7183
cat("The optimal PLS model uses", optimal_components,
"components with a cross-validated RMSE of",
round(optimal_performance$RMSE, 4), "and R² of",
round(optimal_performance$Rsquared, 4), "\n")
## The optimal PLS model uses 1 components with a cross-validated RMSE of 0.7809 and R² of 0.4798
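To visualize the gap discussed above, an observed-versus-predicted plot for the test set is a useful sketch (values are on the centered and scaled yield scale produced by the imputation pre-processing; plot not shown here):
# Observed vs. predicted yield on the test set; points far from the dashed
# 45-degree line reflect the weaker test-set performance
ggplot(data.frame(observed = test_processed$Yield,
                  predicted = test_predictions),
       aes(x = observed, y = predicted)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Observed Yield (scaled)", y = "Predicted Yield (scaled)") +
  theme_minimal()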
The ManufacturingProcess predictors dominate the list of top predictors.
varImp_2 <- varImp(pls_model)$importance |>
as.data.frame() |>
tibble::rownames_to_column(var = "Predictor") |>
arrange(desc(Overall)) |>
top_n(10, Overall)
# Plot the top predictors
varImp_2 |>
mutate(Predictor = forcats::fct_inorder(Predictor)) |>
ggplot(aes(x = Overall, y = Predictor)) +
geom_col(fill = "steelblue", alpha = 0.8) +
labs(
title = "Top Important Predictors in PLS Model",
x = "Variable Importance",
y = "Predictor"
) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
The correlation chart below shows strong relationships between the response Yield and several of the top predictors: ManufacturingProcess32 and ManufacturingProcess09 have strong positive correlations with Yield, whereas ManufacturingProcess36 and ManufacturingProcess13 show inverse (negative) correlations.
response_var <- colnames(train_processed)[1]
train_processed |>
select(all_of(c(response_var, varImp_2$Predictor))) |>
cor(use = "complete.obs") |>
corrplot::corrplot(tl.cex = 0.6,
method = "color",
title = paste("relationship:", response_var, "vs Predictors"),
mar = c(0, 0, 1, 0))
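To back the signs described above with numbers, the correlations of the top predictors with Yield can also be printed directly. This is a sketch that assumes the importance-score names in varImp_2$Predictor match the column names of train_processed, as the plot code above already relies on (output not shown):
# Correlation of each top predictor with (scaled) Yield in the training set
cor_with_yield <- sapply(varImp_2$Predictor, function(p)
  cor(train_processed[[p]], train_processed[[response_var]], use = "complete.obs"))
sort(round(cor_with_yield, 3), decreasing = TRUE)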