library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# Define a function to install packages if not already installed
install_if_missing <- function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  }
}
# Install 'AppliedPredictiveModeling' and 'caret' if they are missing
install_if_missing("AppliedPredictiveModeling")
## Loading required package: AppliedPredictiveModeling
install_if_missing("caret")
## Loading required package: caret
## Loading required package: lattice
# Load the libraries and data
suppressPackageStartupMessages({
  library(AppliedPredictiveModeling)
  library(caret)
  library(ggplot2)
})
6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
This code loads the permeability dataset, which contains:

- fingerprints: a matrix of 1,107 binary molecular predictors.
- permeability: the response variable (permeability values) for 165 compounds.
data(permeability)
summary(permeability) %>%
  kable(caption = "Summary of Permeability Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| permeability |
|---|
| Min. : 0.06 |
| 1st Qu.: 1.55 |
| Median : 4.91 |
| Mean :12.24 |
| 3rd Qu.:15.47 |
| Max. :55.60 |
nrow(permeability) %>%
  kable(caption = "Number of Permeability Records") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| x |
|---|
| 165 |
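As a quick sanity check on the dimensions described above (a minimal sketch; it assumes data(permeability) has already been run):

# Confirm the stated dimensions: 165 compounds by 1,107 binary predictors
dim(fingerprints)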
Remove predictors with low frequencies using caret's nearZeroVar function.

To streamline the dataset and remove features that provide little information, I used the caret package's nearZeroVar() function to identify and eliminate predictors with near-zero variance. After filtering, 388 of the original 1,107 predictors remain.
# Identify predictors with near-zero variance
nzv <- nearZeroVar(fingerprints)
# Filter out those predictors
filtered_fingerprints <- fingerprints[, -nzv]
# Check the number of predictors remaining
num_predictors <- ncol(filtered_fingerprints)
print(paste("Number of predictors left:", num_predictors))
## [1] "Number of predictors left: 388"
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal, and what is the corresponding resampled estimate of \(R^2\)?
Data Splitting: The data was split into an 80% training set and a 20% test set for modeling.

Model Training and Tuning: A PLS model was tuned via 10-fold cross-validation on centered and scaled predictors, with the number of latent variables as the tuning parameter and permeability as the response.

Results:

- Optimal number of latent variables: 6. The best cross-validated performance was achieved with 6 latent variables.
- Resampled estimate of \(R^2\): 0.534. The \(R^2\) value, a measure of how well the model explains the variability in the response, indicates the model explains about 53% of the variability in the permeability data using 6 latent variables.
# Split the data into training and test sets
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
trainData <- filtered_fingerprints[trainIndex, ]
testData <- filtered_fingerprints[-trainIndex, ]
trainResponse <- permeability[trainIndex]
testResponse <- permeability[-trainIndex]
# Preprocess and tune PLS model
control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
plsFit <- train(
  x = trainData,
  y = trainResponse,
  method = "pls",
  tuneLength = 20,  # number of latent variables (components) to try
  trControl = control,
  preProcess = c("center", "scale")
)
# Optimal number of latent variables and R^2
best_pls <- plsFit$bestTune$ncomp
resampled_R2 <- max(plsFit$results$Rsquared)
print(paste("Optimal number of latent variables:", best_pls))
## [1] "Optimal number of latent variables: 6"
print(paste("Resampled estimate of R^2:", resampled_R2))
## [1] "Resampled estimate of R^2: 0.533595642093268"
Predict on the test set and calculate the \(R^2\) for these predictions.
The code below evaluates the performance of the Partial Least Squares (PLS) model on the test dataset by calculating the \(R^2\) value for the test predictions.
The test set evaluation shows that the \(R^2\) value is 0.324, indicating that the model explains approximately 32.4% of the variability in the test data. This suggests the model has moderate predictive power, but there may be room for improvement.
# Predict on the test set
test_predictions <- predict(plsFit, newdata = testData)
# Calculate R^2 for the test set
test_R2 <- cor(test_predictions, testResponse)^2
print(paste("Test set estimate of R^2:", test_R2))
## [1] "Test set estimate of R^2: 0.324454163589977"
Try building other models discussed in this chapter. Do any have better predictive performance?
The evaluation of the Random Forest model on the test set resulted in an \(R^2\) value of 0.496, indicating that it explains approximately 49.6% of the variability in the test data. This performance is an improvement over the PLS model’s \(R^2\) value of 0.324, suggesting that the Random Forest model is a better choice for predicting permeability in this case.
# Random Forest Model
rfFit <- train(
  x = trainData,
  y = trainResponse,
  method = "rf",  # random forest
  trControl = control,
  preProcess = c("center", "scale")  # not required for random forests, but harmless
)
# Evaluate Random Forest on test set
rf_predictions <- predict(rfFit, newdata = testData)
rf_R2 <- cor(rf_predictions, testResponse)^2
print(paste("Test set estimate of R^2 for Random Forest:", rf_R2))
## [1] "Test set estimate of R^2 for Random Forest: 0.495727708576195"
Based on the \(R^2\) values, the Random Forest model is the better choice for predicting permeability in this case.
This section covers the ChemicalManufacturingProcess dataset.

The code loads the ChemicalManufacturingProcess dataset from the AppliedPredictiveModeling package and splits it into predictors (processPredictors) and the response variable (yield), which represents the percentage yield. One caveat: Yield is the first column of the data frame, but the code drops the last column, so Yield is not actually removed from processPredictors (it is visible in the str() output below). This affects the modeling results later in the section.

# Load necessary library
if (!require("AppliedPredictiveModeling")) {
  install.packages("AppliedPredictiveModeling")
  library(AppliedPredictiveModeling)
}
# Load the data
data("ChemicalManufacturingProcess")
# Drop the last column; note that Yield is the FIRST column, so it remains
# among the predictors here (see the str() output below)
processPredictors <- ChemicalManufacturingProcess[, -ncol(ChemicalManufacturingProcess)]
yield <- ChemicalManufacturingProcess$Yield  # Assign the response variable explicitly
# Review the data
str(processPredictors)
## 'data.frame': 176 obs. of 57 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.6 14.6 14.6 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
## $ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
## $ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
## $ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num NA 210 207 213 206 ...
## $ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
## $ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
## $ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
## $ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
## $ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
## $ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
str(yield) # Response variable: percent yield
## num [1:176] 38 42.4 42 41.4 42.5 ...
Use caret's preProcess to fill in missing values.

The code imputes missing values in processPredictors using the median of each column (caret's medianImpute method), creating a fully imputed dataset, imputedPredictors.
# Impute missing values
library(caret)
# Create a preprocessing object to impute missing values with the median
preProcessObj <- preProcess(processPredictors, method = "medianImpute")
# Apply the preprocessing object to impute missing values in the dataset
imputedPredictors <- predict(preProcessObj, processPredictors)
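A quick before-and-after count of missing values confirms the imputation worked (a minimal sketch):

# Count missing values before and after median imputation
sum(is.na(processPredictors))
sum(is.na(imputedPredictors))  # should be 0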
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
The code below splits the data into training (80%) and test (20%) sets, trains a random forest model to predict yield, and evaluates its performance. The model achieved an optimal resampled \(R^2\) of 0.988 on the training set. Taken at face value this indicates an excellent fit, but recall the caveat above: Yield itself remained among the predictors, so this near-perfect score largely reflects that leakage rather than genuine predictive power.

# Split the data into training and test sets
set.seed(123)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainData <- imputedPredictors[trainIndex, ]
testData <- imputedPredictors[-trainIndex, ]
trainYield <- yield[trainIndex]
testYield <- yield[-trainIndex]
# Train a random forest model
control <- trainControl(method = "cv", number = 10)
rfModel <- train(
  x = trainData,
  y = trainYield,
  method = "rf",
  trControl = control,
  preProcess = c("center", "scale")
)
# Find the optimal performance metric on the training set
optimal_metric <- max(rfModel$results$Rsquared)
print(paste("Optimal performance metric (R^2) on training set:", optimal_metric))
## [1] "Optimal performance metric (R^2) on training set: 0.98756391280691"
Predict the response for the test set. What is the value of the performance metric, and how does this compare with the resampled performance metric on the training set?
# Predict on test set
test_predictions <- predict(rfModel, newdata = testData)
# Calculate R^2 for the test set
test_R2 <- cor(test_predictions, testYield)^2
print(paste("Test set R^2:", test_R2))
## [1] "Test set R^2: 0.995287236635365"
# Compare with training set R^2
cat("Resampled R^2 on training set:", optimal_metric, "\n")
## Resampled R^2 on training set: 0.9875639
cat("R^2 on test set:", test_R2, "\n")
## R^2 on test set: 0.9952872
The test-set \(R^2\) (0.995) is essentially as high as the resampled training estimate (0.988); both are inflated for the same reason noted earlier, since Yield itself sits in the predictor set.

The analysis identifies the most important predictors influencing the response variable (Yield) using the varImp() function on the Random Forest model. Based on the results, Yield unsurprisingly tops its own importance ranking at 100, confirming the leakage; the meaningful signal lies in the predictors below it, led by ManufacturingProcess13 and BiologicalMaterial02, suggesting that both process and biological variables contribute.
# Calculate variable importance
importance <- varImp(rfModel)
print(importance)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## Yield 100.00000
## ManufacturingProcess13 0.31123
## BiologicalMaterial02 0.24698
## ManufacturingProcess04 0.18206
## BiologicalMaterial11 0.17995
## BiologicalMaterial03 0.17993
## ManufacturingProcess17 0.15168
## BiologicalMaterial09 0.11197
## BiologicalMaterial01 0.09799
## BiologicalMaterial12 0.08765
## ManufacturingProcess06 0.08196
## ManufacturingProcess11 0.07111
## ManufacturingProcess21 0.06094
## ManufacturingProcess20 0.05887
## ManufacturingProcess18 0.05570
## BiologicalMaterial05 0.04687
## ManufacturingProcess27 0.04153
## ManufacturingProcess19 0.04004
## ManufacturingProcess30 0.03800
## ManufacturingProcess03 0.03723
# Plot variable importance for a quick visual assessment
plot(importance, top = 10)
The scatter plots illustrate the relationships between the top-ranked predictors (e.g., ManufacturingProcess13, BiologicalMaterial02) and the response variable (Yield). Each plot includes a fitted regression line to highlight trends.

- Some predictors show a positive association with Yield, while others trend negative; the fitted lines make the direction of each relationship easy to read.
- Predictors with strong associations are natural levers for improving Yield. For example, ensuring optimal levels of BiologicalMaterial02 could meaningfully improve outcomes.

# Visualize the relationship between top predictors and yield
# Rank predictors by importance (the rownames of importance$importance are not
# sorted), and skip Yield itself, which leads the ranking due to the leakage
imp_df <- importance$importance
ranked <- rownames(imp_df)[order(imp_df$Overall, decreasing = TRUE)]
top_predictors <- setdiff(ranked, "Yield")[1:5]
# Create scatter plots with fitted regression lines
par(mfrow = c(2, 3))
for (predictor in top_predictors) {
  plot(trainData[[predictor]], trainYield, main = predictor,
       xlab = predictor, ylab = "Yield")
  abline(lm(trainYield ~ trainData[[predictor]]), col = "blue")
}
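Since ggplot2 is already loaded, a faceted version of the same plots is a compact alternative (a sketch, reusing the top_predictors vector from above):

# Build a long-format data frame of (predictor, value, yield) triples
plot_df <- do.call(rbind, lapply(top_predictors, function(p) {
  data.frame(predictor = p, value = trainData[[p]], yield = trainYield)
}))
# One panel per predictor, each with a linear fit
ggplot(plot_df, aes(x = value, y = yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(x = "Predictor value", y = "Yield")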