All exercises can be found in “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.
The following packages are required to rerun this .rmd
file:
seed_num <- 200
library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(elasticnet)
library(Hmisc)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.
Start R and use these commands to load the data:
data(permeability)
The matrix fingerprints contains the 1,107 binary
molecular predictors for the 165 compounds, while
permeability contains the permeability response.
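As a quick optional sanity check, these dimensions can be verified directly after loading the data:
dim(fingerprints)     # should be 165 compounds x 1,107 fingerprint predictors
length(permeability)  # should be 165 permeability measurements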
The fingerprint predictors indicate the presence or absence of
substructures of a molecule and are often sparse, meaning that relatively
few of the molecules contain each substructure. Filter out the
predictors that have low frequencies using the nearZeroVar
function from the caret package. How many predictors are
left for modeling?
The cell below uses the nearZeroVar() function to remove
all of the sparse predictors from the fingerprints
matrix:
X <- fingerprints[,-nearZeroVar(fingerprints)]
dim(X)
## [1] 165 388
The dim output shows that only 388 of our original 1,107
predictors remain after this transformation.
Split the data into a training and a test set, pre-process the data,
and tune a PLS model. How many latent variables are optimal
and what is the corresponding re-sampled estimate of \(R^2\)?
First, the cell below splits the data into a training and test set, with 70% of the observations being used for training:
set.seed(seed_num)
# use 70% of the observations for training
train_ratio <- 0.7
n <- nrow(X)
n_train <- round(train_ratio * n)
# randomly sample the row indices that will form the training set
train_ind <- sample(1:n, n_train)
# recombine the filtered predictors with the response before splitting
all_data <- cbind(X, permeability)
train <- all_data[train_ind,]
test <- all_data[-train_ind,]
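As an aside, caret's createDataPartition() provides a stratified alternative to this simple random split; a minimal sketch (not used for the results below) would be:
set.seed(seed_num)
# stratified split on the response instead of simple random sampling
alt_train_ind <- createDataPartition(as.vector(permeability), p = train_ratio, list = FALSE)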
The cell below handles the pre-processing of the data, ensuring that
each column of the training data is centered and scaled (each column
will have \(\mu = 0\) and \(\sigma = 1\)). The same scaling
parameters that were used on the training data are also applied to the
test set data to guarantee consistency and prevent data leakage.
# center and scale the training data
train <- scale(train)
# get the scaling and centering params for each column
center_params <- attr(train, "scaled:center")
scaling_params <- attr(train, "scaled:scale")
# apply those parameters to the test set
test <- scale(test, center = center_params, scale = scaling_params)
# split into X and y sets
X_train <- train[, -ncol(train)]
y_train <- train[, ncol(train)]
X_test <- test[, -ncol(test)]
y_test <- test[, ncol(test)]
Finally, we can build a tuned partial least squares (PLS) regression model:
set.seed(seed_num)
# perform cross-validation with 10 folds
ctrl <- trainControl(method = "cv", number = 10)
pls <- train(X_train, y_train,
method = 'pls',
tuneLength = 20,
trControl = ctrl
)
pls
## Partial Least Squares
##
## 115 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 104, 103, 103, 104, 103, 104, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8227107 0.3253135 0.6261086
## 2 0.7537161 0.4496527 0.5298961
## 3 0.7627574 0.4488285 0.5640499
## 4 0.7625812 0.4422559 0.5775311
## 5 0.7615072 0.4611883 0.5606068
## 6 0.7468974 0.4775009 0.5536685
## 7 0.7336268 0.4918085 0.5416747
## 8 0.7491195 0.4729930 0.5619891
## 9 0.7675694 0.4508992 0.5701325
## 10 0.7894267 0.4377419 0.5977939
## 11 0.8127757 0.4238155 0.6088677
## 12 0.8574382 0.3822877 0.6431405
## 13 0.8794986 0.3671858 0.6561052
## 14 0.8948586 0.3541570 0.6730436
## 15 0.9199390 0.3482477 0.6937459
## 16 0.9669761 0.3259265 0.7246305
## 17 0.9950649 0.3148797 0.7436594
## 18 1.0116639 0.3060393 0.7634373
## 19 1.0290408 0.2985742 0.7864603
## 20 1.0527236 0.2870957 0.8092480
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
plot(pls)
The plot makes it clear that a model with 7 latent variables
(ncomp=7) was the best choice for minimizing the root mean
squared error. Based on the output from the tuning, the re-sampling
estimate of \(R^2\) using 7 latent
variables is \(\approx 0.492\).
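These values can also be pulled directly from the caret train object rather than read off the table (a small convenience check using standard accessors):
# resampled metrics for the chosen number of latent variables
best_ncomp <- pls$bestTune$ncomp
subset(pls$results, ncomp == best_ncomp, select = c(ncomp, RMSE, Rsquared))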
Predict the response for the test set. What is the test set estimate of \(R^2\)?
The cell below calculates \(R^2\) using the tuned PLS model predictions on the test set data:
preds <- predict(pls, X_test)
defaultSummary(data.frame(obs = y_test, pred = preds))
## RMSE Rsquared MAE
## 0.6842800 0.4825105 0.4632900
We see that the \(R^2\) value is slightly lower than the maximum \(R^2\) achieved using the training set, which is expected.
Try building other methods discussed in this chapter. Do any have better predictive performance?
The steps outlined in parts (c) and (d) are repeated below using a ridge regression model, which is a type of penalized regression model:
ridge_grid <- data.frame(.lambda = seq(0, .1, length = 15))
ridge <- train(X_train, y_train,
method = 'ridge',
tuneGrid = ridge_grid,
trControl = ctrl
)
preds <- predict(ridge, X_test)
defaultSummary(data.frame(obs = y_test, pred = preds))
## RMSE Rsquared MAE
## 0.7093680 0.4935191 0.4725948
Elastic net models, which combine the penalties of ridge and lasso regression, are also an option.
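As a rough sketch of the objective being combined (using generic penalty weights \(\lambda_1\) and \(\lambda_2\); caret's lambda and fraction tuning parameters map onto these only indirectly, since fraction controls the proportion of the full lasso solution path):

\[
\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda_2 \lVert \beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1
\]

The cell below tunes an elastic net over a small grid of these parameters: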
enet_grid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20))
enet <- train(X_train, y_train,
method = 'enet',
tuneGrid = enet_grid,
trControl = ctrl
)
preds <- predict(enet, X_test)
defaultSummary(data.frame(obs = y_test, pred = preds))
## RMSE Rsquared MAE
## 0.7400934 0.3816758 0.4850571
Lastly, the cell below implements good old-fashioned ordinary least squares (OLS) linear regression:
olr <- lm(permeability ~ ., data=data.frame(train))
preds <- predict(olr, data.frame(X_test))
## Warning in predict.lm(olr, data.frame(X_test)): prediction from rank-deficient
## fit; attr(*, "non-estim") has doubtful cases
defaultSummary(data.frame(obs = y_test, pred = preds))
## RMSE Rsquared MAE
## 1.8375596 0.1369144 0.9196858
The penalized regression models (ridge and elastic net) did not perform much differently than the PLS model implemented earlier, but all three showed a significant improvement over the ordinary linear regression model, which achieved an \(R^2\) of only \(\approx 0.14\) on the test set. The rank-deficient-fit warning above is expected: with 388 predictors and only 115 training observations, OLS cannot estimate all coefficients uniquely, which is exactly the situation PLS and penalized regression are designed to handle.
Would you recommend any of your models to replace the permeability laboratory experiment?
While the PLS and penalized regression models explored in the previous parts of this problem did exhibit some degree of predictive power, none of them achieved an \(R^2\) value greater than 0.5. This is typically not considered impressive model performance, so these models should not completely replace the current laboratory methods used to measure molecule permeability. That said, a model like this could be used to help limit the number of molecules that need to be examined, and should continue to be explored as more and better data are gathered.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the vitality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors
(12 describing the input biological material and 45 describing the
process predictors) for the 176 manufacturing runs. yield
contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
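Before running the imputation, the extent of the missingness can be quantified with a quick optional check:
# total number of missing cells and the fraction of all cells they represent
sum(is.na(ChemicalManufacturingProcess))
mean(is.na(ChemicalManufacturingProcess))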
The cell below imputes each missing value with the median of its respective column:
imputation <- ChemicalManufacturingProcess %>%
preProcess("medianImpute")
df <- predict(imputation, ChemicalManufacturingProcess)
any(is.na(df))
## [1] FALSE
The output above confirms that all missing values have now been filled.
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
The cell below splits the data into training and testing sets:
set.seed(seed_num)
train_ratio <- 0.7
n <- nrow(df)
n_train <- round(train_ratio * n)
train_ind <- sample(1:n, n_train)
train <- data.frame(df[train_ind,])
test <- data.frame(df[-train_ind,])
Next the data is pre-processed using the same methodology implemented in the solution for Exercise 6.2 Part (c):
# center and scale the training data
train <- scale(train)
# get the scaling and centering params for each column
center_params <- attr(train, "scaled:center")
scaling_params <- attr(train, "scaled:scale")
# apply those parameters to the test set
test <- scale(test, center = center_params, scale = scaling_params)
# split into X and y sets
X_train <- train[, -1]
y_train <- train[, 1]
X_test <- test[, -1]
y_test <- test[, 1]
Finally, the cell below tunes an elastic net regression model and plots the results of the tuning process:
enet_grid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20))
enet <- train(X_train, y_train,
method = 'enet',
tuneGrid = enet_grid,
trControl = ctrl
)
final_model <- enet$finalModel
plot(enet)
The plot reveals that the optimal parameter values occur when the
regularization strength \(\lambda\) (also known as the weight decay) is equal to
0.01 and the fraction parameter is equal to 0.4. The
performance of each examined model is also shown in the table below:
enet
## Elasticnet
##
## 123 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 111, 109, 111, 111, 110, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 0.6733885 0.5945412 0.5565428
## 0.00 0.10 0.6052689 0.6386862 0.5075280
## 0.00 0.15 0.5722129 0.6762485 0.4675350
## 0.00 0.20 0.5781863 0.6721559 0.4607778
## 0.00 0.25 0.7152989 0.6000297 0.5219352
## 0.00 0.30 0.8339759 0.5846411 0.5678134
## 0.00 0.35 0.9352468 0.5641485 0.6044845
## 0.00 0.40 1.0572188 0.5453223 0.6512028
## 0.00 0.45 1.2021697 0.5277003 0.7016926
## 0.00 0.50 1.3208350 0.5148254 0.7440507
## 0.00 0.55 1.4477538 0.5032670 0.7879048
## 0.00 0.60 1.5054594 0.4795579 0.8191759
## 0.00 0.65 1.5776207 0.4509900 0.8512790
## 0.00 0.70 1.6600073 0.4268370 0.8831479
## 0.00 0.75 1.7439701 0.4122012 0.9123948
## 0.00 0.80 1.8295616 0.4021848 0.9409735
## 0.00 0.85 1.9337573 0.3947602 0.9750615
## 0.00 0.90 2.0124019 0.3884260 1.0034291
## 0.00 0.95 2.0789239 0.3828600 1.0281589
## 0.00 1.00 2.1439520 0.3787268 1.0520199
## 0.01 0.05 0.8080885 0.5768720 0.6549251
## 0.01 0.10 0.6905618 0.5869149 0.5711075
## 0.01 0.15 0.6440991 0.6040876 0.5375444
## 0.01 0.20 0.6121867 0.6371733 0.5139822
## 0.01 0.25 0.5917115 0.6574974 0.4923617
## 0.01 0.30 0.5786337 0.6709964 0.4753471
## 0.01 0.35 0.5674419 0.6829711 0.4609393
## 0.01 0.40 0.5669536 0.6823427 0.4565740
## 0.01 0.45 0.5930228 0.6537728 0.4736251
## 0.01 0.50 0.6775076 0.6084980 0.5119356
## 0.01 0.55 0.7629635 0.5832445 0.5439049
## 0.01 0.60 0.8243566 0.5689840 0.5669834
## 0.01 0.65 0.8654508 0.5642327 0.5824702
## 0.01 0.70 0.8966381 0.5604679 0.5942336
## 0.01 0.75 0.9277029 0.5572929 0.6054951
## 0.01 0.80 0.9583862 0.5534767 0.6166610
## 0.01 0.85 0.9909599 0.5494702 0.6277079
## 0.01 0.90 1.0453950 0.5463217 0.6439030
## 0.01 0.95 1.1168352 0.5401023 0.6679394
## 0.01 1.00 1.1903199 0.5293253 0.6962792
## 0.10 0.05 0.8826480 0.5310679 0.7139808
## 0.10 0.10 0.7861882 0.5821477 0.6410484
## 0.10 0.15 0.7161241 0.5848429 0.5903931
## 0.10 0.20 0.6666330 0.5955550 0.5538419
## 0.10 0.25 0.6412769 0.6032193 0.5342394
## 0.10 0.30 0.6255157 0.6163376 0.5240617
## 0.10 0.35 0.6090425 0.6341562 0.5095474
## 0.10 0.40 0.5959960 0.6473249 0.4969035
## 0.10 0.45 0.5867794 0.6568084 0.4867932
## 0.10 0.50 0.5799340 0.6638225 0.4792673
## 0.10 0.55 0.5755305 0.6684639 0.4729845
## 0.10 0.60 0.5743916 0.6700262 0.4702404
## 0.10 0.65 0.5801770 0.6662864 0.4725311
## 0.10 0.70 0.5881151 0.6611434 0.4793388
## 0.10 0.75 0.6112863 0.6477850 0.4874050
## 0.10 0.80 0.6306526 0.6345448 0.4986488
## 0.10 0.85 0.6477773 0.6198304 0.5096628
## 0.10 0.90 0.6662890 0.6051236 0.5212422
## 0.10 0.95 0.6821755 0.5935190 0.5299821
## 0.10 1.00 0.6943104 0.5844902 0.5367698
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.01.
The optimal model in this case has an \(R^2\) value of approximately 0.68 and a minimum RMSE of approximately 0.57.
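The same values can be read programmatically from the train object (a small check using standard caret accessors):
# optimal tuning parameters and their resampled performance
enet$bestTune
merge(enet$bestTune, enet$results)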
Predict the response for the test set. What is the value of the performance metric and how does it compare with the re-sampled performance metric on the training set?
preds <- predict(enet, X_test)
defaultSummary(data.frame(obs = y_test, pred = preds))
## RMSE Rsquared MAE
## 1.0217916 0.4413231 0.6667878
The model’s performance clearly drops when the metrics are calculated on the test set: \(R^2\) falls from roughly 0.68 to 0.44, while the RMSE rises from about 0.57 to 1.02. Part of this decrease could simply be due to the small size of the test set (53 observations).
Which predictors are most important in the model you have trained? Do either biological or process predictors dominate the list?
Using the varImp() function, feature importances can be
extracted from the tuned enet model produced previously. The cell
below calculates these importances and plots the top 20:
var_imps <- data.frame(varImp(enet)$importance)
var_imps$predictor_name <- rownames(var_imps)
rownames(var_imps) <- NULL
top_imps <- var_imps %>%
arrange(desc(Overall)) %>%
slice_head(n = 20)
ggplot(top_imps, aes(x = reorder(predictor_name, -Overall), y = Overall)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip axes for better readability
labs(
title = "Variable Importance",
x = "Variables",
y = "Importance Score"
) +
theme_minimal()
The plot reveals that both manufacturing and biological predictors were important to the model; neither seems to particularly “dominate” the list of top predictors.
man_imps <- var_imps %>%
filter(startsWith(predictor_name, "Manufacturing"))
bio_imps <- var_imps %>%
filter(startsWith(predictor_name, "Biological"))
avg_man_imp <- round(mean(man_imps$Overall), 2)
avg_bio_imp <- round(mean(bio_imps$Overall), 2)
print1 <- paste("Manufacturing average feature importance:", avg_man_imp)
print2 <- paste("Biological average feature importance:", avg_bio_imp)
cat(paste(print1, print2, sep='\n'))
## Manufacturing average feature importance: 24.48
## Biological average feature importance: 48.24
However, the above output shows that the average importance of the biological predictors was about double that of the manufacturing predictors. This suggests that, on a per-feature basis, the biological measurements carry more predictive value, even though neither group dominates the top of the list.
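As an additional check on this comparison, the sketch below (reusing the top_imps data frame created above) counts how many of the 20 most important predictors belong to each group:
# tally the top-20 predictors by group
top_imps %>%
mutate(group = ifelse(startsWith(predictor_name, "Biological"), "Biological", "Manufacturing")) %>%
count(group)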
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
To explore the relationship between the top predictors and
yield, the cell below determines the correlation of the top
20 predictors with the response variable:
top_names <- top_imps$predictor_name
# compute the Pearson correlation of each top-20 predictor with the response
rs <- c()
i <- 1
for(column in top_names){
r_val <- cor(df[[column]], df[["Yield"]])
rs[i] <- r_val
i <- i + 1
}
rs <- data.frame(predictor_name = top_names, r_value = rs)
ggplot(rs, aes(x = predictor_name, y = r_value)) +
geom_bar(stat = "identity") +
labs(
title = "Top Predictors Correlation with Yield",
x = "Variables",
y = "Pearson-r Correlation"
) +
coord_flip() +
theme_minimal()
The plot reveals that all of the top biological predictors are
positively correlated with yield, suggesting that larger
values of these predictors are associated with higher yields. The
behavior of the top manufacturing predictors is less clear, with both
positive and negative correlations visible. Even so, these correlations
indicate which process measurements could be adjusted, and in which
direction, to potentially improve yield in future manufacturing runs,
keeping in mind that correlation alone does not establish causation.
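To make one of these relationships concrete, the single most important predictor (the first entry of top_names) could be plotted directly against the response; a minimal sketch:
# scatter plot of the top predictor against Yield with a linear trend line
ggplot(df, aes(x = .data[[top_names[1]]], y = Yield)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Top Predictor vs. Yield", x = top_names[1], y = "Yield") +
theme_minimal()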