All exercises can be found in “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.

Setup

The following packages (along with the random seed used throughout) are required to rerun this .Rmd file:

seed_num <- 200
library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(elasticnet)
library(Hmisc)

Exercise 6.2

Description

Developing a model to predict permeability (see Sect. 1.4.) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.

Part (a)

Start R and use these commands to load the data:

data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

Part (b)

Description

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

Solution

The cell below uses the nearZeroVar() function to remove the sparse, near-zero-variance predictors from the fingerprints matrix:

X <- fingerprints[,-nearZeroVar(fingerprints)]
dim(X)
## [1] 165 388

The dim output shows that only 388 of our original 1,107 predictors remain after this transformation.

Part (c)

Description

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding re-sampled estimate of \(R^2\)?

Solution

First, the cell below splits the data into a training and test set, with 70% of the observations being used for training:

set.seed(seed_num)

train_ratio <- 0.7
n <- nrow(X)
n_train <- round(train_ratio * n)

train_ind <- sample(1:n, n_train)

all_data <- cbind(X, permeability)

train <- all_data[train_ind,]
test <- all_data[-train_ind,]

The cell below handles the pre-processing of the data, centering and scaling each column of the training data (so that each column has \(\mu = 0\) and \(\sigma = 1\)). The same centering and scaling parameters computed from the training data are then applied to the test set to guarantee consistency and prevent data leakage.

# center and scale the training data
train <- scale(train)

# get the scaling and centering params for each column
center_params <- attr(train, "scaled:center")
scaling_params <- attr(train, "scaled:scale")

# apply those parameters to the test set
test <- scale(test, center = center_params, scale = scaling_params)

# split into X and y sets 
X_train <- train[, -ncol(train)]
y_train <- train[, ncol(train)]

X_test <- test[, -ncol(test)]
y_test <- test[, ncol(test)]

Finally, we can build a tuned partial least squares (PLS) regression model:

set.seed(seed_num)

# perform cross-validation with 10 folds
ctrl <- trainControl(method = "cv", number = 10)

pls <- train(X_train, y_train,
             method = 'pls',
             tuneLength = 20,
             trControl = ctrl
)

pls
## Partial Least Squares 
## 
## 115 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 104, 103, 103, 104, 103, 104, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8227107  0.3253135  0.6261086
##    2     0.7537161  0.4496527  0.5298961
##    3     0.7627574  0.4488285  0.5640499
##    4     0.7625812  0.4422559  0.5775311
##    5     0.7615072  0.4611883  0.5606068
##    6     0.7468974  0.4775009  0.5536685
##    7     0.7336268  0.4918085  0.5416747
##    8     0.7491195  0.4729930  0.5619891
##    9     0.7675694  0.4508992  0.5701325
##   10     0.7894267  0.4377419  0.5977939
##   11     0.8127757  0.4238155  0.6088677
##   12     0.8574382  0.3822877  0.6431405
##   13     0.8794986  0.3671858  0.6561052
##   14     0.8948586  0.3541570  0.6730436
##   15     0.9199390  0.3482477  0.6937459
##   16     0.9669761  0.3259265  0.7246305
##   17     0.9950649  0.3148797  0.7436594
##   18     1.0116639  0.3060393  0.7634373
##   19     1.0290408  0.2985742  0.7864603
##   20     1.0527236  0.2870957  0.8092480
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
plot(pls)

The plot makes it clear that a model with 7 latent variables (ncomp = 7) was the best choice for minimizing the root mean squared error. Based on the tuning output, the re-sampled estimate of \(R^2\) using 7 latent variables is \(\approx 0.492\).
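These values can also be pulled directly from the train object rather than read off the table; a minimal sketch using the bestTune and results components that caret stores on the fitted object:

# number of components selected by the RMSE criterion
pls$bestTune

# resampled performance for that tune
subset(pls$results, ncomp == pls$bestTune$ncomp)[, c("ncomp", "RMSE", "Rsquared", "MAE")]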

Part (d)

Description

Predict the response for the test set. What is the test set estimate of \(R^2\)?

Solution

The cell below calculates \(R^2\) using the tuned PLS model predictions on the test set data:

preds <- predict(pls, X_test)
defaultSummary(data.frame(obs = y_test, pred = preds))
##      RMSE  Rsquared       MAE 
## 0.6842800 0.4825105 0.4632900

We see that the test-set \(R^2\) (\(\approx 0.483\)) is slightly lower than the re-sampled estimate obtained from the training data (\(\approx 0.492\)), which is expected.
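As a quick sanity check (assuming caret’s default correlation-based definition of \(R^2\)), the Rsquared value reported by defaultSummary() can be reproduced as the squared Pearson correlation between the observed and predicted values:

# R^2 as the squared correlation between observations and predictions
cor(y_test, preds)^2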

Part (e)

Description

Try building other methods discussed in this chapter. Do any have better predictive performance?

Solution

The steps outlined in parts (c) and (d) are repeated below using a ridge regression model, which is a type of penalized regression model:

ridge_grid <- data.frame(.lambda = seq(0, .1, length = 15))

ridge <- train(X_train, y_train,
               method = 'ridge',
               tuneGrid = ridge_grid,
               trControl = ctrl
)

preds <- predict(ridge, X_test)
defaultSummary(data.frame(obs = y_test, pred = preds))
##      RMSE  Rsquared       MAE 
## 0.7093680 0.4935191 0.4725948

Elastic net models are also an option; they combine the penalties from ridge and lasso regression.
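For reference, the elastic net coefficient estimates minimize the residual sum of squares plus both penalty terms,

\[
\hat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda_2\sum_{j=1}^{p}\beta_j^2 + \lambda_1\sum_{j=1}^{p}\lvert\beta_j\rvert .
\]

In caret’s enet method, the lambda tuning parameter corresponds to the ridge penalty \(\lambda_2\), while the fraction parameter controls how far along the lasso path the solution is taken, effectively playing the role of \(\lambda_1\). The grid below searches over both: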

enet_grid <- expand.grid(.lambda = c(0, 0.01, .1),
                          .fraction = seq(.05, 1, length = 20))

enet <- train(X_train, y_train,
               method = 'enet',
               tuneGrid = enet_grid,
               trControl = ctrl
)

preds <- predict(enet, X_test)
defaultSummary(data.frame(obs = y_test, pred = preds))
##      RMSE  Rsquared       MAE 
## 0.7400934 0.3816758 0.4850571

Lastly, the cell below fits a plain ordinary least squares linear regression model using lm():

olr <- lm(permeability ~ ., data=data.frame(train))
preds <- predict(olr, data.frame(X_test))
## Warning in predict.lm(olr, data.frame(X_test)): prediction from rank-deficient
## fit; attr(*, "non-estim") has doubtful cases
defaultSummary(data.frame(obs = y_test, pred = preds))
##      RMSE  Rsquared       MAE 
## 1.8375596 0.1369144 0.9196858

The penalized regression models (ridge and elastic net) did not perform much differently than the PLS model implemented earlier, but all three showed a significant improvement over the ordinary linear regression model (which only achieved an \(R^2\) of \(\approx 0.14\) on the test set).
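Because the PLS, ridge, and elastic net models were all tuned through caret with the same 10-fold trainControl object, their cross-validated performance can also be compared directly with the resamples() function; a minimal sketch (the lm() fit is excluded since it was not trained through caret):

# collect and summarize the resampled performance of the three caret models
model_comparison <- resamples(list(PLS = pls, Ridge = ridge, ENet = enet))
summary(model_comparison)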

Part (f)

Description

Would you recommend any of your models to replace the permeability laboratory experiment?

Solution

While the PLS and penalized regression models explored in the previous parts of this problem did exhibit some degree of predictive power, none of them achieved an \(R^2\) value greater than 0.5. This is typically not considered impressive performance, so these models should not completely replace the current laboratory methods for measuring permeability. That being said, a model like this could be used to help prioritize and limit the number of molecules sent for laboratory testing, and it should continue to be explored as more and better data are gathered.

Exercise 6.3

Description

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

Part (a)

Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

Part (b)

Description

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

Solution

The cell below imputes missing values using the median value from each missing value’s respective column:

imputation <- ChemicalManufacturingProcess %>%
  preProcess("medianImpute")

df <- predict(imputation, ChemicalManufacturingProcess)
any(is.na(df))
## [1] FALSE

The output above confirms that all missing values have now been filled.
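For a slightly more detailed check, the per-column missing-value counts before and after imputation can be compared (a quick sketch using base R):

# columns with the most missing values before imputation
head(sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE))

# total missing values remaining after imputation
sum(is.na(df))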

Part (c)

Description

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Solution

The cell below splits the data into training and testing sets:

set.seed(seed_num)

train_ratio <- 0.7
n <- nrow(df)
n_train <- round(train_ratio * n)

train_ind <- sample(1:n, n_train)

train <- data.frame(df[train_ind,])
test <- data.frame(df[-train_ind,])

Next the data is pre-processed using the same methodology implemented in the solution for Exercise 6.2 Part (c):

# center and scale the training data
train <- scale(train)

# get the scaling and centering params for each column
center_params <- attr(train, "scaled:center")
scaling_params <- attr(train, "scaled:scale")

# apply those parameters to the test set
test <- scale(test, center = center_params, scale = scaling_params)

# split into X and y sets 
X_train <- train[, -1]
y_train <- train[, 1]

X_test <- test[, -1]
y_test <- test[, 1]

Finally, the cell below tunes an elastic net regression model and plots the results of the tuning process:

enet_grid <- expand.grid(.lambda = c(0, 0.01, .1),
                          .fraction = seq(.05, 1, length = 20))

enet <- train(X_train, y_train,
               method = 'enet',
               tuneGrid = enet_grid,
               trControl = ctrl
)

final_model <- enet$finalModel
plot(enet)

The plot reveals that the optimal parameter values are a regularization strength (the ridge penalty \(\lambda\)) of 0.01 and a fraction parameter of 0.4. The performance of each examined parameter combination is also shown in the table below:

enet
## Elasticnet 
## 
## 123 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 111, 109, 111, 111, 110, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.05      0.6733885  0.5945412  0.5565428
##   0.00    0.10      0.6052689  0.6386862  0.5075280
##   0.00    0.15      0.5722129  0.6762485  0.4675350
##   0.00    0.20      0.5781863  0.6721559  0.4607778
##   0.00    0.25      0.7152989  0.6000297  0.5219352
##   0.00    0.30      0.8339759  0.5846411  0.5678134
##   0.00    0.35      0.9352468  0.5641485  0.6044845
##   0.00    0.40      1.0572188  0.5453223  0.6512028
##   0.00    0.45      1.2021697  0.5277003  0.7016926
##   0.00    0.50      1.3208350  0.5148254  0.7440507
##   0.00    0.55      1.4477538  0.5032670  0.7879048
##   0.00    0.60      1.5054594  0.4795579  0.8191759
##   0.00    0.65      1.5776207  0.4509900  0.8512790
##   0.00    0.70      1.6600073  0.4268370  0.8831479
##   0.00    0.75      1.7439701  0.4122012  0.9123948
##   0.00    0.80      1.8295616  0.4021848  0.9409735
##   0.00    0.85      1.9337573  0.3947602  0.9750615
##   0.00    0.90      2.0124019  0.3884260  1.0034291
##   0.00    0.95      2.0789239  0.3828600  1.0281589
##   0.00    1.00      2.1439520  0.3787268  1.0520199
##   0.01    0.05      0.8080885  0.5768720  0.6549251
##   0.01    0.10      0.6905618  0.5869149  0.5711075
##   0.01    0.15      0.6440991  0.6040876  0.5375444
##   0.01    0.20      0.6121867  0.6371733  0.5139822
##   0.01    0.25      0.5917115  0.6574974  0.4923617
##   0.01    0.30      0.5786337  0.6709964  0.4753471
##   0.01    0.35      0.5674419  0.6829711  0.4609393
##   0.01    0.40      0.5669536  0.6823427  0.4565740
##   0.01    0.45      0.5930228  0.6537728  0.4736251
##   0.01    0.50      0.6775076  0.6084980  0.5119356
##   0.01    0.55      0.7629635  0.5832445  0.5439049
##   0.01    0.60      0.8243566  0.5689840  0.5669834
##   0.01    0.65      0.8654508  0.5642327  0.5824702
##   0.01    0.70      0.8966381  0.5604679  0.5942336
##   0.01    0.75      0.9277029  0.5572929  0.6054951
##   0.01    0.80      0.9583862  0.5534767  0.6166610
##   0.01    0.85      0.9909599  0.5494702  0.6277079
##   0.01    0.90      1.0453950  0.5463217  0.6439030
##   0.01    0.95      1.1168352  0.5401023  0.6679394
##   0.01    1.00      1.1903199  0.5293253  0.6962792
##   0.10    0.05      0.8826480  0.5310679  0.7139808
##   0.10    0.10      0.7861882  0.5821477  0.6410484
##   0.10    0.15      0.7161241  0.5848429  0.5903931
##   0.10    0.20      0.6666330  0.5955550  0.5538419
##   0.10    0.25      0.6412769  0.6032193  0.5342394
##   0.10    0.30      0.6255157  0.6163376  0.5240617
##   0.10    0.35      0.6090425  0.6341562  0.5095474
##   0.10    0.40      0.5959960  0.6473249  0.4969035
##   0.10    0.45      0.5867794  0.6568084  0.4867932
##   0.10    0.50      0.5799340  0.6638225  0.4792673
##   0.10    0.55      0.5755305  0.6684639  0.4729845
##   0.10    0.60      0.5743916  0.6700262  0.4702404
##   0.10    0.65      0.5801770  0.6662864  0.4725311
##   0.10    0.70      0.5881151  0.6611434  0.4793388
##   0.10    0.75      0.6112863  0.6477850  0.4874050
##   0.10    0.80      0.6306526  0.6345448  0.4986488
##   0.10    0.85      0.6477773  0.6198304  0.5096628
##   0.10    0.90      0.6662890  0.6051236  0.5212422
##   0.10    0.95      0.6821755  0.5935190  0.5299821
##   0.10    1.00      0.6943104  0.5844902  0.5367698
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.01.

The optimal model in this case has a re-sampled \(R^2\) of approximately 0.68, with the minimum RMSE of approximately 0.57.
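As before, these values can be extracted programmatically from the train object instead of being read off the table (a minimal sketch using caret’s stored bestTune and results):

# tuning parameters selected by the RMSE criterion
enet$bestTune

# resampled performance at those parameter values
merge(enet$bestTune, enet$results)[, c("lambda", "fraction", "RMSE", "Rsquared", "MAE")]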

Part (d)

Description

Predict the response for the test set. What is the value of the performance metric and how does it compare with the re-sampled performance metric on the training set?

Solution

preds <- predict(enet, X_test)
defaultSummary(data.frame(obs = y_test, pred = preds))
##      RMSE  Rsquared       MAE 
## 1.0217916 0.4413231 0.6667878

The model’s performance clearly drops when evaluated on the test set: \(R^2\) falls from \(\approx 0.68\) to \(\approx 0.44\), while the RMSE rises from \(\approx 0.57\) to \(\approx 1.02\). Part of this decrease may simply be due to the small size of the test set (only 53 observations).

Part (e)

Description

Which predictors are most important in the model you have trained? Do either biological or process predictors dominate the list?

Solution

Using the varImp() function, feature importances can be extracted from the tuned elastic net model produced previously. The cell below calculates these importances and plots the top 20:

var_imps <- data.frame(varImp(enet)$importance)
var_imps$predictor_name <- rownames(var_imps)
rownames(var_imps) <- NULL

top_imps <- var_imps %>%
  arrange(desc(Overall)) %>%
    slice_head(n = 20)

ggplot(top_imps, aes(x = reorder(predictor_name, -Overall), y = Overall)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip axes for better readability
  labs(
    title = "Variable Importance",
    x = "Variables",
    y = "Importance Score"
  ) +
  theme_minimal()

The plot reveals that both manufacturing and biological predictors were important to the model; neither seems to particularly “dominate” the list of top predictors.

man_imps <- var_imps %>%
  filter(startsWith(predictor_name, "Manufacturing"))

bio_imps <- var_imps %>%
  filter(startsWith(predictor_name, "Biological"))

avg_man_imp <- round(mean(man_imps$Overall), 2)
avg_bio_imp <- round(mean(bio_imps$Overall), 2)

print1 <- paste("Manufacturing average feature importance:", avg_man_imp)
print2 <- paste("Biological average feature importance:", avg_bio_imp)

cat(paste(print1, print2, sep='\n'))
## Manufacturing average feature importance: 24.48
## Biological average feature importance: 48.24

However, the above output shows that the average importance of the biological predictors is roughly double that of the manufacturing predictors. This hints that, on average, the individual biological features carry more predictive value, even though both groups are represented among the top predictors.
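We can also count how many of the top 20 predictors fall into each group as another way of summarizing the balance between them:

# group membership of the top 20 predictors
table(ifelse(startsWith(top_imps$predictor_name, "Manufacturing"),
             "Manufacturing", "Biological"))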

Part (f)

Description

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Solution

To explore the relationship between the top predictors and yield, the cell below determines the correlation of the top 20 predictors with the response variable:

top_names <- top_imps$predictor_name

# Pearson correlation of each top predictor with the response
rs <- sapply(top_names, function(column) cor(df[[column]], df[["Yield"]]))

rs <- data.frame(predictor_name = top_names, r_value = rs)

ggplot(rs, aes(x = predictor_name, y = r_value)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top Predictors Correlation with Yield",
    x = "Variables",
    y = "Pearson-r Correlation"
  ) +
  coord_flip() +
  theme_minimal()

The plot reveals that all of the top biological predictors are positively correlated with yield, suggesting that larger values of these measurements are associated with higher yields. The behavior of the top manufacturing predictors is less clear, with both positive and negative correlations visible. Since the manufacturing process predictors can be adjusted, these correlations help indicate which process variables might be increased or decreased in future runs to improve yield.