Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
set.seed(312)
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
data(permeability)
# Load caret library
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Loading required package: lattice
# view # of filtered data
length(nearZeroVar(fingerprints))
## [1] 719
There are 719 predictors filtered out. Since there were originally 1107 fingerprints, there are 388 remgaining for modeling.
# filter in new dataset
fingerprints_filter <- fingerprints[,-nearZeroVar(fingerprints)]
# Subset the data into objects for training using integer sub-setting
training_rows <- createDataPartition(permeability, p = 0.8, list = FALSE)
train_predictors <- fingerprints_filter[training_rows,]
train_permeability <- permeability[training_rows]
test_predictors <- fingerprints_filter[-training_rows,]
test_permeability <- permeability[-training_rows]
pls <- train(train_predictors, train_permeability, method = 'pls')
print(pls$results[pls$results$RMSE == min(pls$results$RMSE),])
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 3 3 12.04457 0.420321 8.87148 1.107693 0.08544581 0.8618283
There are 3 optimal components and the RMSE is 12.04. The R squared is 0.42.
# Predict the response for the test set
fingerprints_predictions <- predict(pls, test_predictors, ncomp = 3)
fingerprints_results <- data.frame(obs = test_permeability, pred = fingerprints_predictions)
defaultSummary(fingerprints_results)
## RMSE Rsquared MAE
## 11.0497837 0.5814816 8.5794172
Applying the model to the test set results in an Rsquared of 0.58.
I will try a simple linear model
# Create the model from the training data
lm_train <- as.data.frame(train_predictors)
lm_train$permeability <- train_permeability
lm <- lm(permeability ~ ., data = lm_train)
# Predict the response for the test set
lm_predictions <- predict(lm, as.data.frame(test_predictors))
## Warning in predict.lm(lm, as.data.frame(test_predictors)): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
# Report the accuracy of the predictions
lm_Values <- data.frame(obs = test_permeability, pred = lm_predictions)
defaultSummary(lm_Values)
## RMSE Rsquared MAE
## 49.87433767 0.01093671 26.23025784
The high RMSE and low Rsquared show that this model is not performing better than the original model.
The next model i will try is the ridge regression model
# Create model from training data
library(MASS)
library(lmridge)
ridgeModel <- lm.ridge(permeability ~ ., data = lm_train)
# Create a function to use the ridge model
ridge_predict <- function(model, df) {
prediction <- c()
for (i in 1:nrow(df)) {
value_predict <- model$Inter
for (j in 1:ncol(df)) {
value_predict <- value_predict + (model$coef[j] * df[i,j])}
prediction[i] <- value_predict}
return(prediction)}
# Predict response of test data
ridge_predictions <- ridge_predict(ridgeModel, as.data.frame(test_predictors))
# Reveiw accuracy
ridge_Values <- data.frame(obs = test_permeability, pred = ridge_predictions)
defaultSummary(ridge_Values)
## RMSE Rsquared MAE
## 2.664001e+14 7.774331e-02 2.561136e+14
The RMSE is very high and R squared very low so this model does not appear to a good model for the data.
All the models that I tested do not appear to have a high R squared value for a pharmaceutical company. The permeability laboratory experiment should not replaced at this time with any of these models.
A chemical manufacturing process for a pharmaceutical product was discussed in sec 1.4. In this problem, the objective is to understanding the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
# Impute missing values via knn
library(impute)
chemicals <- impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 123)
chemicals <- as.data.frame(chemicals$data)
# Check missing values
sum(is.na(chemicals))
## [1] 0
training_chemicals <- createDataPartition(chemicals$Yield, p = 0.8, list = FALSE)
train_predictors_chemicals <- chemicals[training_chemicals,-1]
train_yield <- chemicals[training_chemicals, 1]
test_predictors_chemicals <- chemicals[-training_chemicals,-1]
test_yield <- chemicals[-training_chemicals, 1]
plsModel <- train(train_predictors_chemicals,
train_yield,
method = 'pls',
preProc = c('center', 'scale'))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
print(plsModel$results[plsModel$results$RMSE == min(plsModel$results$RMSE),])
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 3 3 1.617133 0.4433487 1.120809 0.4666049 0.1591008 0.1472144
The optimal model uses 3 components and the RMSE is 1.62 with Rsquared 0.44.
yield_predictions <- predict(plsModel, test_predictors_chemicals, ncomp = 3)
yield_values <- data.frame(obs = test_yield, pred = yield_predictions)
defaultSummary(yield_values)
## RMSE Rsquared MAE
## 1.3854824 0.4649589 1.1768402
Predicting the response for the test set results in an RMSE of 1.39 and Rsquared of 0.46. The results are slightly improved in this test set compared to the training set.
varImp(plsModel, scale = FALSE)
## Warning: package 'pls' was built under R version 4.4.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 0.11937
## ManufacturingProcess17 0.10720
## ManufacturingProcess13 0.10675
## ManufacturingProcess09 0.10655
## ManufacturingProcess36 0.10587
## ManufacturingProcess06 0.08541
## ManufacturingProcess33 0.08108
## BiologicalMaterial06 0.07785
## BiologicalMaterial02 0.07691
## BiologicalMaterial03 0.07627
## BiologicalMaterial08 0.07543
## ManufacturingProcess11 0.07168
## BiologicalMaterial11 0.06999
## BiologicalMaterial12 0.06986
## BiologicalMaterial01 0.06705
## BiologicalMaterial04 0.06559
## ManufacturingProcess28 0.06119
## ManufacturingProcess12 0.05883
## ManufacturingProcess37 0.05671
## ManufacturingProcess02 0.05538
Manufacturing processes seem to dominate the list, making up 12 of the 20 variables of the top 20 most important variables. However, proportionally, there are more biological material predictors on the list since there are only 10 biological material predictors vs 37 manufacturing processes predictors.
ggplot(chemicals, aes(x = ManufacturingProcess32, y = Yield)) + geom_point()
ggplot(chemicals, aes(x = ManufacturingProcess17, y = Yield)) + geom_point()
ggplot(chemicals, aes(x = ManufacturingProcess13, y = Yield)) + geom_point()
ggplot(chemicals, aes(x = ManufacturingProcess09, y = Yield)) + geom_point()
ggplot(chemicals, aes(x = ManufacturingProcess36, y = Yield)) + geom_point()
The top 5 predictors have been plotted. Two plots seem to have a positive yield (as seen by the positive relationship in points plotted) and 3 have a negative yield (vice versa). This detail may help companies identify ways to increase yield.