6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug: (a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
data(permeability)
# Load the caret library
require(caret)
## Loading required package: caret
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.3
## Loading required package: lattice
# Load the permeability dataset which includes the fingerprints data frame
data("permeability")
# Display the dimensions of the fingerprints data frame to understand its structure
print(dim(fingerprints))
## [1] 165 1107
# Identify predictors with near-zero variance using the nearZeroVar function
nzv_indices <- nearZeroVar(fingerprints, saveMetrics = FALSE)
# Output the number of predictors with near-zero variance
print(length(nzv_indices))
## [1] 719
library(caret)
data("permeability")
data("fingerprints")
## Warning in data("fingerprints"): data set 'fingerprints' not found
fingerprints_filtered <- fingerprints[, -nearZeroVar(fingerprints)]
# Split data into training and testing sets
set.seed(123) # for reproducibility
training_indices <- createDataPartition(permeability, p=0.8, list=FALSE)
# Setup cross-validation settings for model training
control_settings <- trainControl(method = "cv", number = 10)
# Extract training and testing sets based on filtered predictors and the response vector
X_train <- fingerprints_filtered[training_indices, ]
Y_train <- permeability[training_indices]
X_test <- fingerprints_filtered[-training_indices, ]
Y_test <- permeability[-training_indices]
# Train the Partial Least Squares (PLS) model
pls_model <- train(x = X_train,
y = Y_train,
method = "pls",
metric = 'Rsquared',
tuneLength = 20,
trControl = control_settings,
preProcess = c('center', 'scale'))
print(pls_model)
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.31894 0.3442124 10.254018
## 2 11.78898 0.4830504 8.534741
## 3 11.98818 0.4792649 9.219285
## 4 12.04349 0.4923322 9.448926
## 5 11.79823 0.5193195 9.049121
## 6 11.53275 0.5335956 8.658301
## 7 11.64053 0.5229621 8.878265
## 8 11.86459 0.5144801 9.265252
## 9 11.98385 0.5188205 9.218594
## 10 12.55634 0.4808614 9.610747
## 11 12.69674 0.4758068 9.702325
## 12 13.01534 0.4538906 9.956623
## 13 13.12637 0.4367362 9.878017
## 14 13.44865 0.4140715 10.065088
## 15 13.60135 0.4034269 10.188150
## 16 13.79361 0.3943904 10.247160
## 17 14.00756 0.3845119 10.412776
## 18 14.18113 0.3711378 10.587027
## 19 14.25674 0.3703610 10.575726
## 20 14.33121 0.3723176 10.679764
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 6.
# Predict the response on the test set using the trained PLS model
predictions <- predict(pls_model, X_test)
# Calculate the R-squared value for the test set
test_R2 <- cor(Y_test, predictions)^2
# Print the R-squared value
print(test_R2)
## [1] 0.3244542
# Define training control settings for cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train the linear model
lm_model <- train(x = X_train, y = Y_train, method = "lm",
trControl = train_control, preProcess = c("center", "scale"))
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
predictions_lm <- predict(lm_model, X_test)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
test_R2_lm <- cor(Y_test, predictions_lm)^2
test_RMSE_lm <- sqrt(mean((Y_test - predictions_lm)^2))
print(list(R_squared = test_R2_lm, RMSE = test_RMSE_lm))
## $R_squared
## [1] 0.07853504
##
## $RMSE
## [1] 29.90647
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors)measurements of the manufacturing process (predictors), and the response ofproduct yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
# Setting up k-NN imputation
impute <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
# Applying the k-NN imputation
ChemicalManufacturingProcess <- predict(impute, ChemicalManufacturingProcess)
set.seed(27) # Ensure reproducibility of the random processes
# Remove features with near-zero variance from the dataset
ChemicalManufacturingProcess <- ChemicalManufacturingProcess %>%
select(-nearZeroVar(.))
# Partition the data into training and testing sets with 80% of the data allocated for training
train_indices <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
train_set <- ChemicalManufacturingProcess[train_indices, ]
test_set <- ChemicalManufacturingProcess[-train_indices, ]
# Train the PLS model using cross-validation with 10 folds and tuning over 20 component lengths
pls_model <- train(Yield ~ ., data = train_set, method = "pls",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 20)
model_title <- paste("Training Set RMSE Minimized at", pls_model$bestTune$ncomp, "Components")
optimal_results <- pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp)
summary(pls_model)
## Data: X dimension: 144 56
## Y dimension: 144 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 17.04 25.81 32.46 39.44 45.35
## .outcome 51.76 64.78 69.99 72.28 73.76
plot(pls_model)