6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug: (a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
library(caret)
library(ggplot2)
nzvfp <- nearZeroVar(fingerprints)
nonzvfp <- fingerprints[,-nzvfp]
ncol(nonzvfp)
## [1] 388
split <- createDataPartition(permeability, p = 0.8, list = FALSE, times = 1)
Xtrain.data <- nonzvfp[split, ]
Xtest.data <- nonzvfp[-split, ]
Ytrain.data <- permeability[split]
Ytest.data <- permeability[-split]
#PLS Model & scale + center
ctrl <- trainControl(method = "cv", number = 10)
plsmod <- train(x = Xtrain.data, y = Ytrain.data,
method = "pls",
tuneLength = 15,
trControl = ctrl,
preProc = c("center", "scale"))
print(plsmod)
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 121, 119, 120, 121, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.59309 0.3188302 9.598363
## 2 11.36710 0.4595255 7.749247
## 3 11.07388 0.4746904 8.057209
## 4 11.16464 0.4785656 8.382748
## 5 11.13931 0.4855600 8.043630
## 6 11.02535 0.5135675 7.981960
## 7 11.06765 0.5073874 8.127469
## 8 10.98450 0.5229158 8.197463
## 9 11.23015 0.5068022 8.431096
## 10 11.49945 0.4834073 8.591727
## 11 11.83014 0.4725379 8.710300
## 12 11.89082 0.4662821 8.809819
## 13 12.04049 0.4616730 8.933141
## 14 12.42299 0.4413192 9.361306
## 15 12.56859 0.4383558 9.479632
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(plsmod)
(d) Predict the response for the test set. What is the test set estimate
for R squared?
test_predictions <- predict(plsmod, newdata = Xtest.data)
test_r_squared <- cor(Ytest.data, test_predictions)^2
print(paste("Test set R-squared:", round(test_r_squared, 4)))
## [1] "Test set R-squared: 0.5313"
#CV
ctrl <- trainControl(method = "cv", number = 10)
#Ridge Regression Model & scale + center
ridgeGrid <- data.frame(.lambda = seq(0, 1, by = 0.1))
ridgeRegFit <- train(Xtrain.data, Ytrain.data,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("center", "scale", "knnImpute"))
print(ridgeRegFit)
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388), nearest neighbor imputation (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 120, 121, 120, 121, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 1.125874e+18 0.2800089 6.823853e+17
## 0.1 1.212636e+01 0.4170406 8.568913e+00
## 0.2 1.196254e+01 0.4449319 8.418901e+00
## 0.3 1.214782e+01 0.4569978 8.603576e+00
## 0.4 1.249251e+01 0.4629229 8.967930e+00
## 0.5 1.293384e+01 0.4661798 9.399849e+00
## 0.6 1.344670e+01 0.4681095 9.890764e+00
## 0.7 1.401751e+01 0.4691712 1.046584e+01
## 0.8 1.463229e+01 0.4696932 1.104786e+01
## 0.9 1.528873e+01 0.4699563 1.164365e+01
## 1.0 1.597278e+01 0.4700258 1.223602e+01
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.2.
ctrl <- trainControl(method = "cv", number = 10)
# Lasso Regression Model & scale + center
lassoGrid <- expand.grid(.alpha = 1, .lambda = seq(0, 1, by = 0.1))
lassoRegFit <- train(Xtrain.data, Ytrain.data,
method = "glmnet",
tuneGrid = lassoGrid,
trControl = ctrl,
preProc = c("center", "scale", "knnImpute"))
print(lassoRegFit)
## glmnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388), nearest neighbor imputation (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 119, 120, 120, 118, 121, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 12.24832 0.4611300 8.866728
## 0.1 12.23527 0.4614725 8.833927
## 0.2 11.71666 0.4965321 8.361525
## 0.3 11.62843 0.5125720 8.362615
## 0.4 11.72340 0.5054104 8.436689
## 0.5 11.76697 0.5000980 8.443677
## 0.6 11.88020 0.4889762 8.497880
## 0.7 12.02198 0.4758262 8.564787
## 0.8 12.07811 0.4701341 8.584719
## 0.9 12.05770 0.4732123 8.560465
## 1.0 12.07119 0.4743039 8.563313
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.3.
models_list <- list(PLS = plsmod, Ridge = ridgeRegFit, Lasso = lassoRegFit)
r_squared_results <- lapply(models_list, function(model) {
cor(Ytest.data, predict(model, newdata = Xtest.data))^2
})
print("Test set R-squared values:")
## [1] "Test set R-squared values:"
print(r_squared_results)
## $PLS
## [1] 0.5312741
##
## $Ridge
## [1] 0.4884316
##
## $Lasso
## [1] 0.4750754
Initially, I thought I would recommend the model with the best results which happens to be PLS model, with the highest test set R-squared value at 0.5312741, explaining more of the variance than other models. It also happens to have the lowest RMSE value which minimized predictive errors. In reference to 1.4 permeability laboratory experiment… “are effective at quantifying a compound’s permeability, but the assay is expensive labor intensive” so if a predictive modelling happens to perform better then we might be able to suggest replacement. However, even though PLS performed better than other models, an R-squared of .53 is not sufficient and only explains half of the variance in the data.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the re- lationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
library(caret)
library(RANN)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
#Initially had issues accessing it /came up as null..
yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[, -1]
# Impute using knn
preProc <- preProcess(processPredictors, method = "knnImpute")
processPredictors_imputed <- predict(preProc, processPredictors)
print(sum(is.na(processPredictors_imputed)))
## [1] 0
#start with NZV, Identify and remove
nzv_columns <- nearZeroVar(processPredictors_imputed)
Predictors_cleaned <- processPredictors_imputed[, -nzv_columns]
data_combined <- data.frame(Yield = yield, Predictors_cleaned)
trainIndex <- createDataPartition(data_combined$Yield, p = .8,
list = FALSE,
times = 1)
trainData <- data_combined[trainIndex, ]
testData <- data_combined[-trainIndex, ]
# Prepare Ridge Regression
x_train <- as.matrix(trainData[, -1])
y_train <- trainData$Yield
x_test <- as.matrix(testData[, -1])
y_test <- testData$Yield
ridge_model <- glmnet(x_train, y_train, alpha = 0)
#cv
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_lambda <- cv_ridge$lambda.min
print(best_lambda)
## [1] 1.039974
The training RMSE is lower.
y_train_pred <- predict(cv_ridge, s = best_lambda, newx = x_train)
y_test_pred <- predict(cv_ridge, s = best_lambda, newx = x_test)
rmse_train <- sqrt(mean((y_train - y_train_pred)^2))
print(paste("Training RMSE:", rmse_train))
## [1] "Training RMSE: 1.02169306158907"
rmse_test <- sqrt(mean((y_test - y_test_pred)^2))
print(paste("Test RMSE:", rmse_test))
## [1] "Test RMSE: 1.16522526334198"
Process predictor dominate the list. This is found by extracting coefficients and sortin them; then comparing the absolute coefficients for both predictors.
ridge_coef <- coef(ridge_model, s = best_lambda)
importance <- as.data.frame(as.matrix(ridge_coef))
importance <- importance[-1, , drop = FALSE]
importance <- importance[order(-abs(importance[, 1])), , drop = FALSE]
biological_importance <- importance[grep("BiologicalMaterial", rownames(importance)), , drop = FALSE]
process_importance <- importance[grep("ManufacturingProcess", rownames(importance)), , drop = FALSE]
biological_importance_text <- paste(rownames(biological_importance),
round(biological_importance[, 1], 3),
sep = ": ")
process_importance_text <- paste(rownames(process_importance),
round(process_importance[, 1], 3),
sep = ": ")
dominance <- sum(abs(biological_importance)) > sum(abs(process_importance))
if (dominance) {
cat("\nBiological predictors dominate the model.\n")
} else {
cat("\nProcess predictors dominate the model.\n")
}
##
## Process predictors dominate the model.
I already have code above to find the importance of each predictor; then the code below loops the top predictors against ‘yield’ and creates scatter plot. Next, I added linear regression line to see the strength of the relationship. The vizualizations show that the top predictors are ManufactoringProcess 26(concentrated around zero on the x-axis), ManufactoringProcess 20(concentrated around zero on the x-axis), ManufactoringProcess 32 (postive linear relationship), ManufactoringProcess 29(concentrated around zero on the x-axis) and ManufactoringProcess 13 (negative linear relationship). 3 of the variables have low variance and probably do not have much importance. So I inlcuded top 7, which included ManufactoringProcess 36 and ManufactoringProcess 6. The information could be helpful to improve yield in future runs of manufacturing process, to use these variables or perhaps top 10 if there is limited time, cost, or persons available. Similarly, this could be helpful to focus more on ManufactoringProcess than Biological Materials
top_predictors <- rownames(importance)[1:7]
for (predictor in top_predictors) {
if (predictor %in% colnames(data_combined)) {
plot <- ggplot(data_combined, aes_string(x = predictor, y = "Yield")) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = paste("Relationship between", predictor, "and Yield"),
x = predictor,
y = "Yield") +
theme_minimal()
print(plot)
ggsave(paste0("relationship_", predictor, ".png"), plot = plot)
} else {
cat(paste("Warning: Predictor", predictor, "not found in data.\n"))
}
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'