library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(GGally)
library(questionr)
library(corrplot)
library(glue)
library(car)

Exercise 6.2 (K&J)

First, let’s load the data, a package dataset describing the permeability of different chemical compounds

data(permeability)
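This loads two objects into the workspace: a fingerprints matrix of binary molecular descriptors and a permeability response. As a quick sanity-check sketch:

# Inspect the raw dimensions before any filtering
dim(fingerprints)  # expect 165 compounds x 1,107 binary fingerprints
dim(permeability)  # expect a 165 x 1 response matrix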

Next, we’ll filter out sparse predictors with the caret::nearZeroVar function and watch our feature count drop significantly

filteredPredictors <- nearZeroVar(fingerprints) # indices of near-zero-variance columns
selectedFeatures <- length(filteredPredictors)  # number of columns to drop
print(glue('Filtered features: {selectedFeatures}'))
## Filtered features: 719
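Note that nearZeroVar returns the column indices of the near-zero-variance predictors, so the count above is the number of features being removed, not kept. A one-line check of what remains:

# Predictors remaining after dropping the near-zero-variance columns
ncol(fingerprints) - selectedFeatures # 1,107 - 719 = 388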

Dropping these 719 near-zero-variance columns takes us from 1,107 predictors down to 388, which should help in reducing our dimensionality. Next we can set up training and testing sets

# Convert the filtered fingerprints to a data frame (the target stays in permeability)
df <- as.data.frame(fingerprints[, -filteredPredictors])

# Make reproducible
set.seed(12345)

# Use ~80% of the dataset as the training set and ~20% as the test set
sample <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.8,0.2))
train  <- df[sample, ]
train_perm <- permeability[sample, ]
test   <- df[!sample, ]
test_perm <-  permeability[!sample, ]
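One caveat: a purely random split like this doesn’t guarantee the response distribution is similar across the two sets. If that matters, caret offers a stratified alternative; a hypothetical sketch, not used in the analysis below:

# Alternative: stratified 80/20 split on the response with createDataPartition
# inTrain <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
# train <- df[inTrain, ]; test <- df[-inTrain, ]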

Now we can fit a PLS model to our training data, letting caret tune the number of components (method = "pls" uses the pls package under the hood)

plsTune <- train(train, train_perm,
                 method = "pls",
                 tuneLength = 25,
                 trControl = trainControl("cv", number = 10), # use 10-fold cross-validation
                 preProc = c("center", "scale"))
# Print out tuning results
plsTune
## Partial Least Squares 
## 
## 127 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 114, 115, 113, 115, 115, 114, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.05140  0.3900833   9.478672
##    2     10.60560  0.5493278   7.423262
##    3     10.63133  0.5474712   7.899130
##    4     11.16055  0.4856720   8.578469
##    5     11.33737  0.4759082   8.776896
##    6     11.23015  0.4894740   8.500899
##    7     10.83032  0.5162806   8.124944
##    8     10.67679  0.5362638   8.090790
##    9     10.78641  0.5300481   8.161808
##   10     10.81929  0.5376240   8.239349
##   11     11.35431  0.4982427   8.674058
##   12     11.60134  0.5003045   8.888635
##   13     11.75470  0.4852296   8.928292
##   14     12.06698  0.4829724   9.019597
##   15     12.57757  0.4603988   9.436261
##   16     12.90200  0.4438803   9.655609
##   17     13.29048  0.4286756  10.013094
##   18     13.48678  0.4333363  10.034143
##   19     13.64779  0.4353093  10.176847
##   20     13.94223  0.4285920  10.297675
##   21     14.16026  0.4215928  10.390949
##   22     14.39081  0.4147651  10.631837
##   23     14.74680  0.4089150  10.853707
##   24     15.04677  0.4016397  11.037444
##   25     15.00629  0.4086865  10.912050
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
plot(plsTune)
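The chosen tuning parameter can also be pulled out programmatically, which is handy in longer pipelines; a minimal sketch:

# Number of components selected by cross-validation
plsTune$bestTune # ncomp = 2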

We allowed up to 25 latent variables during tuning, as specified by tuneLength, and cross-validation selected an optimal model with just 2 components. Now we can predict on the test set to see how well our model performs

predictions <- predict(plsTune, test)

postResample(pred = predictions, obs = test_perm)
##       RMSE   Rsquared        MAE 
## 15.2047392  0.2173242 10.2671761
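Beyond the summary metrics, a quick observed-versus-predicted plot gives a feel for where the model struggles; a minimal sketch in base graphics:

# Observed vs. predicted permeability on the held-out set
plot(test_perm, predictions,
     xlab = "Observed permeability", ylab = "Predicted permeability")
abline(0, 1, lty = 2) # reference line for perfect prediction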

We get an \(R^2\) value of 0.217 on the test set, which isn’t great in terms of predictive power. Let’s see if another model specified in the chapter can do better. We’ll try a ridge regression model

ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 10)) # use the same tuning grid as K&J
ridgeRegFit <- train(train, train_perm,
                     method="ridge",
                     tuneGrid = ridgeGrid,
                     trControl = trainControl("cv", number = 5),
                     preProc = c("center", "scale"))

Now let’s plot the tuning results for our ridge regression model

plot(ridgeRegFit)

Now we can predict on the test set and check the resulting \(R^2\) value

predictions <- predict(ridgeRegFit, test)

postResample(pred = predictions, obs = test_perm)
##       RMSE   Rsquared        MAE 
## 14.7395198  0.3038846 10.7747381

With a ridge regression model, we see an improvement in our value for \(R^2\).
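For a side-by-side view of the two models on the held-out set, a small sketch reusing the fitted objects:

# Compare test-set performance of the PLS and ridge fits
rbind(
  PLS   = postResample(predict(plsTune, test), test_perm),
  Ridge = postResample(predict(ridgeRegFit, test), test_perm)
)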

Would you recommend any of your models to replace the permeability laboratory experiment?

It depends on the specific modeling problem statement and the business context in which the model will be operationalized. Between the two models, the ridge has better diagnostic metrics (\(R^2\), RMSE), but may not necessarily lead to a more explainable model. Ridge regression is also best suited to correlated sets of features, which is likely the case here: compounds with similar molecular structures likely exhibit similar permeability behavior.
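We can sanity-check that collinearity claim directly with caret::findCorrelation; a sketch, assuming no zero-variance columns remain in the training split (the 0.90 cutoff is an arbitrary choice):

# Count predictors involved in at least one highly correlated pair (|r| > 0.90)
highCorr <- findCorrelation(cor(train), cutoff = 0.90)
length(highCorr)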

Exercise 6.3 (K&J)

Similar to above, we’ll load in the dataset from an R package:

data("ChemicalManufacturingProcess")

chemical <- ChemicalManufacturingProcess
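A quick look at the shape of the data before dealing with missingness:

# One Yield column plus biological and manufacturing-process predictors
dim(chemical) # expect 176 x 58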

First, let’s quantify the missing values with questionr::freq.na. To fill them in, we can use the knnImpute method of imputation, which imputes missing values based on a given instance’s nearest neighbors.

freq.na(chemical)
##                        missing %
## ManufacturingProcess03      15 9
## ManufacturingProcess11      10 6
## ManufacturingProcess10       9 5
## ManufacturingProcess25       5 3
## ManufacturingProcess26       5 3
## ManufacturingProcess27       5 3
## ManufacturingProcess28       5 3
## ManufacturingProcess29       5 3
## ManufacturingProcess30       5 3
## ManufacturingProcess31       5 3
## ManufacturingProcess33       5 3
## ManufacturingProcess34       5 3
## ManufacturingProcess35       5 3
## ManufacturingProcess36       5 3
## ManufacturingProcess02       3 2
## ManufacturingProcess06       2 1
## ManufacturingProcess01       1 1
## ManufacturingProcess04       1 1
## ManufacturingProcess05       1 1
## ManufacturingProcess07       1 1
## ManufacturingProcess08       1 1
## ManufacturingProcess12       1 1
## ManufacturingProcess14       1 1
## ManufacturingProcess22       1 1
## ManufacturingProcess23       1 1
## ManufacturingProcess24       1 1
## ManufacturingProcess40       1 1
## ManufacturingProcess41       1 1
## Yield                        0 0
## BiologicalMaterial01         0 0
## BiologicalMaterial02         0 0
## BiologicalMaterial03         0 0
## BiologicalMaterial04         0 0
## BiologicalMaterial05         0 0
## BiologicalMaterial06         0 0
## BiologicalMaterial07         0 0
## BiologicalMaterial08         0 0
## BiologicalMaterial09         0 0
## BiologicalMaterial10         0 0
## BiologicalMaterial11         0 0
## BiologicalMaterial12         0 0
## ManufacturingProcess09       0 0
## ManufacturingProcess13       0 0
## ManufacturingProcess15       0 0
## ManufacturingProcess16       0 0
## ManufacturingProcess17       0 0
## ManufacturingProcess18       0 0
## ManufacturingProcess19       0 0
## ManufacturingProcess20       0 0
## ManufacturingProcess21       0 0
## ManufacturingProcess32       0 0
## ManufacturingProcess37       0 0
## ManufacturingProcess38       0 0
## ManufacturingProcess39       0 0
## ManufacturingProcess42       0 0
## ManufacturingProcess43       0 0
## ManufacturingProcess44       0 0
## ManufacturingProcess45       0 0
# Impute missing values in our chemical yield data
imputed <- preProcess(chemical,
                       method = c("knnImpute"))

trans <- predict(imputed, chemical)
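One caveat worth noting: caret’s knnImpute method also forces centering and scaling (the neighbor distances need a common scale), so every column of trans, including Yield, comes out standardized. A quick check that the imputation worked:

# Confirm no missing values remain after kNN imputation
sum(is.na(trans)) # expect 0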

We can set up a train-test split

# Split into train and test sets
# Use ~80% of the dataset as the training set and ~20% as the test set
sample <- sample(c(TRUE, FALSE), nrow(trans), replace=TRUE, prob=c(0.8,0.2))
train  <- trans[sample, ]
test   <- trans[!sample, ]
test_chem <- test$Yield # held-out response

Now we can preprocess and tune another PLS model, this time to predict our Yield variable. As before, we’ll center and scale during preprocessing

plsTuneChem <- train(dplyr::select(train, -Yield), train$Yield,
                 method = "pls",
                 tuneLength = 25,
                 trControl = trainControl("cv", number = 10), # use 10-fold cross-validation
                 preProc = c("center", "scale"))
# Print out tuning results
plsTuneChem
## Partial Least Squares 
## 
## 141 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 126, 127, 128, 127, 126, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8663122  0.4276692  0.6501043
##    2     1.1601272  0.4959575  0.6947183
##    3     0.7766584  0.5571534  0.5664209
##    4     0.6679411  0.5996658  0.5396495
##    5     0.7442712  0.5528027  0.5711927
##    6     0.7251357  0.5581351  0.5656480
##    7     0.7665851  0.5563854  0.5790296
##    8     0.7664510  0.5631207  0.5844252
##    9     0.9175999  0.5236779  0.6363468
##   10     1.0136695  0.4959591  0.6683134
##   11     1.0500410  0.5030719  0.6722153
##   12     1.1249017  0.4844603  0.7022117
##   13     1.2191227  0.4753188  0.7285546
##   14     1.3469311  0.4561334  0.7690914
##   15     1.3933295  0.4469013  0.7831681
##   16     1.4116852  0.4480998  0.7846823
##   17     1.4375787  0.4472155  0.7946098
##   18     1.4640746  0.4446627  0.8026953
##   19     1.4483031  0.4567912  0.7939346
##   20     1.4307592  0.4605558  0.7891952
##   21     1.4201754  0.4558201  0.7875109
##   22     1.4017377  0.4528395  0.7831241
##   23     1.3789790  0.4555035  0.7759475
##   24     1.3615982  0.4565618  0.7705764
##   25     1.3539982  0.4576828  0.7692463
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
plot(plsTuneChem)

predictions <- predict(plsTuneChem, test)

postResample(pred = predictions, obs = test_chem)
##      RMSE  Rsquared       MAE 
## 1.0911858        NA 0.8201533

We see an RMSE of roughly 1.09 for our PLS model on the held-out chemical data (note this is on the standardized scale, since the knnImpute preprocessing centered and scaled all variables, including Yield). This is worse than the best resampled RMSE on the training set (0.67 at ncomp = 4). Next we can plot the importance of individual features within our model

# Plot importance of variables to our model
imps <- varImp(plsTuneChem)
plot(imps, top = 10)

In our variable importance plot, the ManufacturingProcess variables tend to dominate over the BiologicalMaterial variables.

importance <- imps$importance
importance$Process <- rownames(importance)

# Get the top-10 most important predictors
top_importance <- head(importance[order(importance$Overall, decreasing = TRUE), ], 10)

# Plot correlation plot of most-important features
corChem <- cor(train[, top_importance$Process])

corrplot(corChem)

From our correlation matrix plot, we see some strong correlations among our most important feature variables. Knowing that certain manufacturing processes are strongly correlated with one another could lead to process streamlining. For example, if two processes are highly correlated, there’s a stronger chance (though not a guarantee) that they are in fact redundant when it comes to producing a higher pharmaceutical yield.
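As a follow-up, the same findCorrelation helper could flag which of the top processes are candidates for consolidation; a sketch where the 0.85 cutoff is a judgment call:

# Flag top-importance processes highly correlated with another (|r| > 0.85)
redundant <- findCorrelation(corChem, cutoff = 0.85, names = TRUE)
redundant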