Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
library(AppliedPredictiveModeling)
data(permeability)
dim(fingerprints)
## [1] 165 1107
nzv <- nearZeroVar(fingerprints, saveMetrics = TRUE)
nzv_indices <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv_indices]
dim(filtered_fingerprints)
## [1] 165 388
perm_data <- data.frame(permeability = permeability, fingerprints)
set.seed(623)
trainIndex <- createDataPartition(perm_data$permeability, p = 0.7, list = FALSE)
train_data <- perm_data[trainIndex, ]
test_data <- perm_data[-trainIndex, ]
nzv <- nearZeroVar(train_data[, -1])
train_filtered <- train_data[, -c(1, nzv + 1)]
test_filtered <- test_data[, -c(1, nzv + 1)]
pls_model <- train(
x = train_filtered,
y = train_data$permeability,
method = "pls",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 20,
preProc = c("center", "scale")
)
optimal_components <- pls_model$bestTune$ncomp
cat("Optimal number of latent variables:", optimal_components, "\n")
## Optimal number of latent variables: 2
resampled_results <- pls_model$results
resampled_r2 <- resampled_results[resampled_results$ncomp == optimal_components, "Rsquared"]
cat("Resampled estimate of R^2:", resampled_r2, "\n")
## Resampled estimate of R^2: 0.5120463
pls_pred <- predict(pls_model, newdata = test_filtered)
test_r2 <- cor(test_data$permeability, pls_pred)^2
cat("PLS Test R^2:", test_r2, "\n")
## PLS Test R^2: 0.3603431
test_rmse <- sqrt(mean((test_data$permeability - pls_pred)^2))
cat("Test set RMSE:", test_rmse, "\n")
## Test set RMSE: 12.51836
Yes, Elastic Net shows the best predictive performance based on test set metrics. It had the lowest RMSE and highest R^2 on the test data compared to the others.
enet_model <- train(
x = train_filtered, y = train_data$permeability,
method = "glmnet", trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(alpha = seq(0, 1, length = 5),
lambda = seq(0, 1, length = 20)),
preProc = c("center", "scale")
)
enet_pred <- predict(enet_model, newdata = test_filtered)
enet_test_r2 <- cor(test_data$permeability, enet_pred)^2
cat("Elastic Net Test R^2:", enet_test_r2, "\n")
## Elastic Net Test R^2: 0.4152464
test2_rmse <- sqrt(mean((test_data$permeability - enet_pred)^2))
cat("Test set RMSE:", test2_rmse, "\n")
## Test set RMSE: 12.11642
lasso_model <- train(
x = train_filtered, y = train_data$permeability,
method = "glmnet", trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(alpha = 1, lambda = seq(0, 1, length = 20)),
preProc = c("center", "scale")
)
lasso_pred <- predict(lasso_model, newdata = test_filtered)
lasso_test_r2 <- cor(test_data$permeability, lasso_pred)^2
cat("Lasso Test R^2:", lasso_test_r2, "\n")
## Lasso Test R^2: 0.4581254
test3_rmse <- sqrt(mean((test_data$permeability - lasso_pred)^2))
cat("Test set RMSE:", test3_rmse, "\n")
## Test set RMSE: 11.53593
A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix process Predictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
The dataset’s name from the book data(chemicalManufacturing) is incorrect or its been updated. I used the dataset data(ChemicalManufacturingProcess) from the book, which has one extra predictor.
library(AppliedPredictiveModeling)
#data(package = "AppliedPredictiveModeling")
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176 58
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
processPredictors <- ChemicalManufacturingProcess[, 2:58]
imputation_model <- preProcess(processPredictors, method = "knnImpute")
processPredictors_imputed <- predict(imputation_model, processPredictors)
ChemicalManufacturingProcess_imputed <- data.frame(Yield = ChemicalManufacturingProcess[, 1], processPredictors_imputed)
sum(is.na(processPredictors_imputed))
## [1] 0
set.seed(633)
trainIndex <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.7, list = FALSE)
trainData <- ChemicalManufacturingProcess[trainIndex, ]
testData <- ChemicalManufacturingProcess[-trainIndex, ]
trainPredictors <- trainData[, 2:58]
trainYield <- trainData[, 1]
testPredictors <- testData[, 2:58]
testYield <- testData[, 1]
preProc <- preProcess(trainPredictors, method = c("knnImpute", "center", "scale"))
trainPredictorsProc <- predict(preProc, trainPredictors)
testPredictorsProc <- predict(preProc, testPredictors)
trainDataProc <- data.frame(Yield = trainYield, trainPredictorsProc)
testDataProc <- data.frame(Yield = testYield, testPredictorsProc)
tuneGrid <- expand.grid(mtry = seq(5, 25, by = 5))
rfModel <- train(Yield ~ .,
data = trainDataProc,
method = "rf",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = tuneGrid,
ntree = 500)
print(rfModel)
## Random Forest
##
## 124 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 100, 99, 100, 99, 98
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 5 1.240684 0.6303002 0.9718862
## 10 1.217709 0.6298186 0.9346331
## 15 1.189827 0.6433231 0.9051572
## 20 1.195483 0.6336262 0.9077871
## 25 1.193571 0.6320215 0.9023918
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 15.
testPredictions <- predict(rfModel, newdata = testDataProc)
testRMSE <- sqrt(mean((testYield - testPredictions)^2))
cat("Test Set RMSE:", testRMSE, "\n")
## Test Set RMSE: 1.084039
optimalRMSE <- min(rfModel$results$RMSE)
cat("Optimal Cross-Validated RMSE:", optimalRMSE, "\n")
## Optimal Cross-Validated RMSE: 1.189827
resampledRMSE <- min(rfModel$results$RMSE)
cat("Resampled RMSE from Training Set (5-fold CV):", resampledRMSE, "\n")
## Resampled RMSE from Training Set (5-fold CV): 1.189827
testPredictions <- predict(rfModel, newdata = testDataProc)
testRMSE <- sqrt(mean((testYield - testPredictions)^2))
cat("Test Set RMSE:", testRMSE, "\n")
## Test Set RMSE: 1.084039
difference <- testRMSE - resampledRMSE
cat("Difference (Test RMSE - Resampled RMSE):", difference, "\n")
## Difference (Test RMSE - Resampled RMSE): -0.105788
I will say Yes, Biological predictors dominate the list.
Top biological predictors (BiologicalMaterial01, BiologicalMaterial02) had high importance scores, meaning they contribute significantly to the model’s predictive accuracy.
Top process predictors ( ManufacturingProcess09, ManufacturingProcess32, ManufacturingProcess06) were also high-ranking, but slightly behind the biological ones.
importance <- varImp(rfModel, scale = FALSE)
importance_df <- importance$importance
importance_df$Predictor <- rownames(importance_df)
colnames(importance_df)[1] <- "Importance"
importance_df <- importance_df[order(-importance_df$Importance), ]
biological_predictors <- paste0("BiologicalMaterial", sprintf("%02d", 1:12))
process_predictors <- setdiff(names(ChemicalManufacturingProcess)[2:58], biological_predictors)
importance_df$Category <- ifelse(importance_df$Predictor %in% biological_predictors, "Biological", "Process")
print(head(importance_df, 10))
## Importance Predictor Category
## ManufacturingProcess32 67.178963 ManufacturingProcess32 Process
## ManufacturingProcess13 37.171125 ManufacturingProcess13 Process
## BiologicalMaterial03 31.913118 BiologicalMaterial03 Biological
## BiologicalMaterial12 31.220375 BiologicalMaterial12 Biological
## BiologicalMaterial06 27.568548 BiologicalMaterial06 Biological
## ManufacturingProcess09 19.923338 ManufacturingProcess09 Process
## ManufacturingProcess17 18.214656 ManufacturingProcess17 Process
## BiologicalMaterial02 12.491904 BiologicalMaterial02 Biological
## BiologicalMaterial11 12.161797 BiologicalMaterial11 Biological
## ManufacturingProcess31 9.861275 ManufacturingProcess31 Process
bio_importance <- sum(importance_df$Importance[importance_df$Category == "Biological"])
proc_importance <- sum(importance_df$Importance[importance_df$Category == "Process"])
cat("Total Importance - Biological:", bio_importance, "\n")
## Total Importance - Biological: 149.2273
cat("Total Importance - Process:", proc_importance, "\n")
## Total Importance - Process: 283.5934
cat("Percentage Biological:", bio_importance / (bio_importance + proc_importance) * 100, "%\n")
## Percentage Biological: 34.47786 %
Based on the PDP, most predictors show a rising curve, meaning that increasing those variables leads to higher predicted yield. However, one exception is ManufacturingProcess13, which shows a falling curve, indicating that increasing its value results in lower predicted yield.
These plots help us understand how each variable influences the model’s yield predictions. This information is valuable because it allows us to identify the variables that are most positively associated with high yield.
importance <- varImp(rfModel, scale = FALSE)
top_predictors <- rownames(importance$importance)[order(-importance$importance[,1])][1:5]
par(mfrow = c(3, 2))
for (pred in top_predictors) {
pd <- partial(rfModel$finalModel, pred.var = pred, train = trainDataProc)
plot(pd, main = paste("PDP for", pred), xlab = pred, ylab = "Predicted Yield")
}
par(mfrow = c(3, 2))
for (pred in top_predictors) {
plot(trainDataProc[[pred]], trainDataProc$Yield,
xlab = pred, ylab = "Yield", main = paste("Scatterplot for", pred))
lines(lowess(trainDataProc[[pred]], trainDataProc$Yield), col = "blue")
}