6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

set.seed(312)

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
data(permeability)

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

# Load caret library
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Loading required package: lattice
# view # of filtered data
length(nearZeroVar(fingerprints))
## [1] 719

There are 719 predictors filtered out. Since there were originally 1107 fingerprints, there are 388 remgaining for modeling.

# filter in new dataset
fingerprints_filter <- fingerprints[,-nearZeroVar(fingerprints)]

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

# Subset the data into objects for training using integer sub-setting 
 
training_rows <- createDataPartition(permeability, p = 0.8, list = FALSE)
train_predictors <- fingerprints_filter[training_rows,]
train_permeability <- permeability[training_rows]
test_predictors <- fingerprints_filter[-training_rows,]
test_permeability <- permeability[-training_rows]
pls <- train(train_predictors, train_permeability, method = 'pls')
print(pls$results[pls$results$RMSE == min(pls$results$RMSE),])
##   ncomp     RMSE Rsquared     MAE   RMSESD RsquaredSD     MAESD
## 3     3 12.04457 0.420321 8.87148 1.107693 0.08544581 0.8618283

There are 3 optimal components and the RMSE is 12.04. The R squared is 0.42.

(d) Predict the response for the test set. What is the test set estimate of R2?

# Predict the response for the test set
fingerprints_predictions <- predict(pls, test_predictors, ncomp = 3)
fingerprints_results <- data.frame(obs = test_permeability, pred = fingerprints_predictions)
defaultSummary(fingerprints_results)
##       RMSE   Rsquared        MAE 
## 11.0497837  0.5814816  8.5794172

Applying the model to the test set results in an Rsquared of 0.58.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

I will try a simple linear model

# Create the model from the training data
lm_train <- as.data.frame(train_predictors)
lm_train$permeability <- train_permeability
lm <- lm(permeability ~ ., data = lm_train)

# Predict the response for the test set
lm_predictions <- predict(lm, as.data.frame(test_predictors))
## Warning in predict.lm(lm, as.data.frame(test_predictors)): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
# Report the accuracy of the predictions
lm_Values <- data.frame(obs = test_permeability, pred = lm_predictions)
defaultSummary(lm_Values)
##        RMSE    Rsquared         MAE 
## 49.87433767  0.01093671 26.23025784

The high RMSE and low Rsquared show that this model is not performing better than the original model.

The next model i will try is the ridge regression model

# Create model from training data
library(MASS)
library(lmridge)
ridgeModel <- lm.ridge(permeability ~ ., data = lm_train)

# Create a function to use the ridge model

ridge_predict <- function(model, df) {
  prediction <- c()
  for (i in 1:nrow(df)) {
    value_predict <- model$Inter
    for (j in 1:ncol(df)) {
      value_predict <- value_predict + (model$coef[j] * df[i,j])}
    prediction[i] <- value_predict}
  return(prediction)}

# Predict response of test data
ridge_predictions <- ridge_predict(ridgeModel, as.data.frame(test_predictors))

# Reveiw accuracy
ridge_Values <- data.frame(obs = test_permeability, pred = ridge_predictions)
defaultSummary(ridge_Values)
##         RMSE     Rsquared          MAE 
## 2.664001e+14 7.774331e-02 2.561136e+14

The RMSE is very high and R squared very low so this model does not appear to a good model for the data.

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

All the models that I tested do not appear to have a high R squared value for a pharmaceutical company. The permeability laboratory experiment should not replaced at this time with any of these models.

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in sec 1.4. In this problem, the objective is to understanding the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturingProcess) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

sum(is.na(ChemicalManufacturingProcess))
## [1] 106
# Impute missing values via knn 
library(impute)
chemicals <- impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 123)
chemicals <- as.data.frame(chemicals$data)

# Check missing values
sum(is.na(chemicals))
## [1] 0

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

training_chemicals <- createDataPartition(chemicals$Yield, p = 0.8, list = FALSE)
train_predictors_chemicals <- chemicals[training_chemicals,-1]
train_yield <- chemicals[training_chemicals, 1]

test_predictors_chemicals <- chemicals[-training_chemicals,-1]
test_yield <- chemicals[-training_chemicals, 1]

plsModel <- train(train_predictors_chemicals,
                    train_yield,
                    method = 'pls',
                    preProc = c('center', 'scale'))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
print(plsModel$results[plsModel$results$RMSE == min(plsModel$results$RMSE),])
##   ncomp     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 3     3 1.617133 0.4433487 1.120809 0.4666049  0.1591008 0.1472144

The optimal model uses 3 components and the RMSE is 1.62 with Rsquared 0.44.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

yield_predictions <- predict(plsModel, test_predictors_chemicals, ncomp = 3)
yield_values <- data.frame(obs = test_yield, pred = yield_predictions)
defaultSummary(yield_values)
##      RMSE  Rsquared       MAE 
## 1.3854824 0.4649589 1.1768402

Predicting the response for the test set results in an RMSE of 1.39 and Rsquared of 0.46. The results are slightly improved in this test set compared to the training set.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

varImp(plsModel, scale = FALSE)
## Warning: package 'pls' was built under R version 4.4.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 0.11937
## ManufacturingProcess17 0.10720
## ManufacturingProcess13 0.10675
## ManufacturingProcess09 0.10655
## ManufacturingProcess36 0.10587
## ManufacturingProcess06 0.08541
## ManufacturingProcess33 0.08108
## BiologicalMaterial06   0.07785
## BiologicalMaterial02   0.07691
## BiologicalMaterial03   0.07627
## BiologicalMaterial08   0.07543
## ManufacturingProcess11 0.07168
## BiologicalMaterial11   0.06999
## BiologicalMaterial12   0.06986
## BiologicalMaterial01   0.06705
## BiologicalMaterial04   0.06559
## ManufacturingProcess28 0.06119
## ManufacturingProcess12 0.05883
## ManufacturingProcess37 0.05671
## ManufacturingProcess02 0.05538

Manufacturing processes seem to dominate the list, making up 12 of the 20 variables of the top 20 most important variables. However, proportionally, there are more biological material predictors on the list since there are only 10 biological material predictors vs 37 manufacturing processes predictors.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

ggplot(chemicals, aes(x = ManufacturingProcess32, y = Yield)) + geom_point()

ggplot(chemicals, aes(x = ManufacturingProcess17, y = Yield)) + geom_point()

ggplot(chemicals, aes(x = ManufacturingProcess13, y = Yield)) + geom_point()

ggplot(chemicals, aes(x = ManufacturingProcess09, y = Yield)) + geom_point()

ggplot(chemicals, aes(x = ManufacturingProcess36, y = Yield)) + geom_point()

The top 5 predictors have been plotted. Two plots seem to have a positive yield (as seen by the positive relationship in points plotted) and 3 have a negative yield (vice versa). This detail may help companies identify ways to increase yield.