Applied Predictive Modeling.
Instructions
Do problems 6.2 and 6.3 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.
Exercises
6.2
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a)
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b)
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package.
How many predictors are left for modeling?
library(caret)
# Convert the fingerprint matrix to a data frame and identify the
# near-zero-variance predictors by name.
fingerprints.df <- data.frame(fingerprints)
fingerprints.ZeroVar <- nearZeroVar(fingerprints.df, names = TRUE)
# Number of predictors remaining after removing the near-zero-variance ones.
predictors <- dim(fingerprints.df)[2] - length(fingerprints.ZeroVar)
After filtering out the near-zero-variance predictors, 388 predictors remain for modeling.
(c)
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding re-sampled estimate of \(R^2\)?
To pre-process the data, each predictor is centered and scaled:
\[x_i' = \frac{x_i - \mu}{\sigma}\]
where \(\mu\) and \(\sigma\) are the predictor's mean and standard deviation. This centering and scaling is applied during training through the preProc argument of the plsTune call below.
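As a small illustration (with a hypothetical vector v, not taken from the exercise data), centering and scaling by hand matches base R's scale():
v <- c(2, 4, 6, 8)                                # illustrative data
all.equal(as.numeric(scale(v)), (v - mean(v)) / sd(v))
# [1] TRUE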
# First, we need to pre-process the data by discarding predictors
# with Near Zero Variance.
reduced <- names(fingerprints.df) %in% fingerprints.ZeroVar
fingerprints.reduced.df <- fingerprints.df[ , !reduced]
# Now, I will split the data as follows:
# Training 75%
# Test 25%
set.seed(123)
n <- nrow(fingerprints.reduced.df)
trainIndex <- sample(1:n, size = round(0.75*n), replace=FALSE)
fingerprints.train <- fingerprints.reduced.df[trainIndex ,]
fingerprints.test <- fingerprints.reduced.df[-trainIndex ,]
# Defining Control for a ten-fold Cross Validation
ctrl <- trainControl(method = "cv", number = 10)
# Defining training sets in order to tune a PLS model
x <- fingerprints.reduced.df[trainIndex ,]
y <- permeability[trainIndex]
# Tuning a PLS model
set.seed(100)
plsTune <- train(x, y,
method = "pls",
## The default tuning grid evaluates
## components 1... tuneLength
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
The cross-validation profiles for a Partial Least Squares (PLS) model.
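The profile plot can be regenerated from the tuned object with caret's plot method:
plot(plsTune)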
From the above graph, the lowest RMSE is obtained with 8 components: RMSE \(= 10.635\), with a corresponding resampled \(R^2 = 0.49\).
The table below shows the resampled results (RMSE, \(R^2\), MAE) for each number of components.
## Partial Least Squares
##
## 124 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 111, 112, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 11.98173 0.3565306 8.987053
## 2 10.98268 0.4927106 7.765846
## 3 11.17758 0.4596512 8.562117
## 4 11.35684 0.4462066 8.618259
## 5 10.97281 0.4789562 8.257085
## 6 10.99786 0.4678063 8.246938
## 7 10.84200 0.4783988 8.117766
## 8 10.63489 0.4948970 8.005125
## 9 10.88187 0.4879433 7.900238
## 10 10.89131 0.4734362 7.939967
## 11 11.04987 0.4544378 8.049652
## 12 10.97178 0.4547096 8.131693
## 13 10.98286 0.4520488 8.121929
## 14 11.20020 0.4325586 8.250437
## 15 11.38889 0.4227432 8.391429
## 16 11.34637 0.4255055 8.422120
## 17 11.29886 0.4185157 8.299045
## 18 11.43835 0.4114370 8.345948
## 19 11.40326 0.4152484 8.127962
## 20 11.54095 0.4051354 8.087660
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
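The selected tuning parameter and its resampled performance can also be read directly from the train object, for example:
plsTune$bestTune
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)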
(d)
Predict the response for the test set. What is the test set estimate of \(R^2\)?
fingerprints.train$Permeability <- permeability[trainIndex]
# PLS Fit and Predictions with up to 8 components.
library(pls)
plsFit <- plsr(Permeability ~ ., data = fingerprints.train)
predictPLS <- predict(plsFit, fingerprints.test, ncomp = 1:8)
# Creating a data frame from previous results.
predictPLS.df <- data.frame(predictPLS)
# Calculating R-squared on the testing data is a little tricky,
# as you have to remember what your baseline is.
# Your baseline projection is a mean of your training data.
get_R2 <- function(c){
# Defining the correct baseline data
y <- permeability[-trainIndex]
# y_bar is the mean of the observed data
y_bar <- mean(fingerprints.train$Permeability)
# Defining the predicted data for the model with c components
f <- predictPLS.df[,c]
# then the variability of the data set can be measured
# using three sums of squares formulas:
# The total sum of squares (proportional to the variance of the data):
SS.total <- sum((y - y_bar)^2)
# The regression sum of squares, also called the explained sum of squares:
SS.regression <- sum((f - y_bar)^2)
# The sum of squares of residuals, also called the residual sum of squares:
SS.residual <- sum((y - f)^2)
# The most general definition of the R Squared coefficient of determination is:
R2 <- 1 - SS.residual/SS.total
R2 <- round(R2,7)
return(R2)
}
# Obtaining R2 values for predicted R2.
R2_Predicted <- ""
for (i in 1:8){
R2_Predicted[i] <- get_R2(i)
}
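As a cross-check, caret's postResample computes the same test-set metrics; note that it defines \(R^2\) as the squared correlation between observed and predicted values, so it can differ slightly from the \(1 - SS_{residual}/SS_{total}\) definition used above. A minimal sketch for the eight-component predictions:
postResample(pred = predictPLS.df[, 8], obs = permeability[-trainIndex])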
Let’s compare the values; the Components column indicates the number of PLS components used.
| Components | R2_Predicted | R2_Train |
|---|---|---|
| 1 | 0.396103 | 0.3565306 |
| 2 | 0.4591568 | 0.4927106 |
| 3 | 0.3993341 | 0.4596512 |
| 4 | 0.4114313 | 0.4462066 |
| 5 | 0.3813263 | 0.4789562 |
| 6 | 0.4171841 | 0.4678063 |
| 7 | 0.4884075 | 0.4783988 |
| 8 | 0.4944606 | 0.4948970 |
As the last row shows, with 8 components the test-set and resampled \(R^2\) values are very close.
(e)
Try building other models discussed in this chapter. Do any have better predictive performance?
Ridge Model
# Tuning a Ridge model
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeRegFit <- train(x, y,
method = "ridge",
## Fit the model over many penalty values
tuneGrid = ridgeGrid,
trControl = ctrl,
## put the predictors on the same scale
preProc = c("center", "scale"))
## Ridge Regression
##
## 124 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 111, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 11.61404 0.4251840 7.979825
## 0.007142857 106.73370 0.2961840 85.376218
## 0.014285714 20.47443 0.3572114 13.821656
## 0.021428571 29.52876 0.3978892 19.833103
## 0.028571429 11.34962 0.4222886 7.914513
## 0.035714286 11.21141 0.4327152 7.859329
## 0.042857143 11.13188 0.4397708 7.834692
## 0.050000000 11.08863 0.4447829 7.846635
## 0.057142857 11.04668 0.4494877 7.857929
## 0.064285714 11.01545 0.4533167 7.864283
## 0.071428571 11.01397 0.4559008 7.888531
## 0.078571429 11.00020 0.4588167 7.900352
## 0.085714286 10.99370 0.4610162 7.915624
## 0.092857143 10.99070 0.4633150 7.930794
## 0.100000000 10.99585 0.4647631 7.947210
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.09285714.
The cross-validation profiles for a Ridge model.
From the above, the RMSE profile settles into a nearly straight line from \(\lambda = 0.0286\) onward, with the lowest RMSE obtained at \(\lambda = 0.0929\).
Let’s calculate the \(R^2\).
get_R2enet <- function(f){
# Defining the correct baseline data
y <- permeability[-trainIndex]
# y_bar is the mean of the observed data
y_bar <- mean(fingerprints.train$Permeability)
# f is the vector of predicted values (e.g. ridgePred$fit), passed in as the function argument
# then the variability of the data set can be measured
# using three sums of squares formulas:
# The total sum of squares (proportional to the variance of the data):
SS.total <- sum((y - y_bar)^2)
# The regression sum of squares, also called the explained sum of squares:
SS.regression <- sum((f - y_bar)^2)
# The sum of squares of residuals, also called the residual sum of squares:
SS.residual <- sum((y - f)^2)
# The most general definition of the R Squared coefficient of determination is:
R2 <- 1 - SS.residual/SS.total
R2 <- round(R2,7)
return(R2)
}
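The object ridgePred used below is not created in any chunk shown above; a minimal sketch of how it could be built with the elasticnet package, assuming the CV-selected \(\lambda = 0.0929\) and the test predictors from part (c):
library(elasticnet)
# Ridge fit at the CV-selected penalty; s = 1 (full fraction) keeps all coefficients.
ridgeModel <- enet(x = as.matrix(x), y = y, lambda = 0.09285714)
ridgePred <- predict(ridgeModel, newx = as.matrix(fingerprints.test),
                     s = 1, mode = "fraction", type = "fit")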
get_R2enet(ridgePred$fit)
#[1] 0.4547481
In this case the test-set \(R^2 = 0.4547481\), slightly lower than the resampled \(R^2 = 0.4633150\) obtained on the training data.
Comparing the ridge test-set \(R^2 = 0.4547481\) with the PLS test-set \(R^2 = 0.4944606\), the PLS model explains the variance in the data slightly better.
Enet Model
# Tuning a Enet model
y <- permeability[trainIndex]
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(x, y,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
## Warning: model fit failed for Fold06: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Elasticnet
##
## 124 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 111, 112, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 10.711923 0.5501899 7.977930
## 0.00 0.10 10.407607 0.5348454 7.329535
## 0.00 0.15 10.295620 0.5294794 7.182214
## 0.00 0.20 10.133571 0.5290360 7.157684
## 0.00 0.25 9.975111 0.5347711 7.042807
## 0.00 0.30 9.918872 0.5355194 7.022232
## 0.00 0.35 9.940631 0.5345060 7.032634
## 0.00 0.40 10.034343 0.5262229 7.108625
## 0.00 0.45 10.186233 0.5137433 7.200506
## 0.00 0.50 10.340369 0.4990464 7.276011
## 0.00 0.55 10.374904 0.4898779 7.263962
## 0.00 0.60 10.471261 0.4808300 7.294505
## 0.00 0.65 10.557048 0.4755929 7.365999
## 0.00 0.70 10.627163 0.4734722 7.425724
## 0.00 0.75 10.711507 0.4706411 7.446881
## 0.00 0.80 10.827727 0.4658134 7.511314
## 0.00 0.85 10.963902 0.4595376 7.600576
## 0.00 0.90 11.175606 0.4476629 7.713156
## 0.00 0.95 11.387338 0.4359785 7.838597
## 0.00 1.00 11.614037 0.4251840 7.979825
## 0.01 0.05 21.084119 0.5250403 14.377453
## 0.01 0.10 35.676932 0.5084231 23.179511
## 0.01 0.15 48.522321 0.5247744 31.705414
## 0.01 0.20 56.776652 0.5238109 37.386250
## 0.01 0.25 61.961964 0.5290693 40.911665
## 0.01 0.30 67.576149 0.5331302 44.482246
## 0.01 0.35 73.473727 0.5277792 48.251206
## 0.01 0.40 79.395211 0.5208918 52.229652
## 0.01 0.45 85.595075 0.5109801 56.246767
## 0.01 0.50 92.652097 0.4950344 60.610062
## 0.01 0.55 99.974890 0.4773596 65.040782
## 0.01 0.60 107.460371 0.4574953 69.546294
## 0.01 0.65 114.992047 0.4400249 74.046791
## 0.01 0.70 122.549989 0.4248916 78.523706
## 0.01 0.75 130.087702 0.4116705 82.970361
## 0.01 0.80 137.608462 0.4041988 88.120819
## 0.01 0.85 145.249561 0.3982796 93.463457
## 0.01 0.90 152.542654 0.3918093 98.526807
## 0.01 0.95 159.833900 0.3869120 103.578796
## 0.01 1.00 167.096488 0.3865823 108.627985
## 0.10 0.05 11.253149 0.5129544 8.586781
## 0.10 0.10 10.685980 0.5162941 7.531721
## 0.10 0.15 10.737138 0.4906719 7.420195
## 0.10 0.20 10.469545 0.4946123 7.367835
## 0.10 0.25 10.193889 0.5078322 7.229669
## 0.10 0.30 10.096661 0.5141351 7.229300
## 0.10 0.35 10.099017 0.5161346 7.214265
## 0.10 0.40 10.078473 0.5179258 7.131271
## 0.10 0.45 10.084795 0.5170338 7.130899
## 0.10 0.50 10.163397 0.5109624 7.195639
## 0.10 0.55 10.277115 0.5034110 7.282561
## 0.10 0.60 10.383843 0.4972600 7.344549
## 0.10 0.65 10.472132 0.4926200 7.411040
## 0.10 0.70 10.550363 0.4887880 7.466813
## 0.10 0.75 10.618902 0.4852205 7.531970
## 0.10 0.80 10.701186 0.4806883 7.615791
## 0.10 0.85 10.797671 0.4757570 7.716131
## 0.10 0.90 10.881163 0.4712766 7.813281
## 0.10 0.95 10.938739 0.4680404 7.885414
## 0.10 1.00 10.995850 0.4647631 7.947210
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.3 and lambda = 0.
The cross-validation profiles for an elastic net (enet) model.
enetModel <- enet(x = as.matrix(x), y = y, lambda = 0, normalize = TRUE)
# Note: s is the lasso fraction; the CV-selected value above was 0.3,
# but s = 0.1 is used here for the test-set predictions.
enetPred <- predict(enetModel, newx = as.matrix(fingerprints.test),
                    s = .1, mode = "fraction",
                    type = "fit")
get_R2enet(enetPred$fit)
#[1] 0.4296827
In this case the test-set \(R^2 = 0.4296827\), noticeably lower than the resampled \(R^2 = 0.5355194\) obtained on the training data.
Comparing the elastic net test-set \(R^2 = 0.4296827\) with the PLS test-set \(R^2 = 0.4944606\), the PLS model again explains the variance in the data better.
Do any have better predictive performance?
The two models above show that the PLS model has the better predictive performance in terms of test-set \(R^2\).
(f)
Would you recommend any of your models to replace the permeability laboratory experiment?
Answer:
Based on the above results, I would not recommend any of these models to replace the permeability laboratory experiment. Too much of the variability is still left unexplained by the models, as can be seen from the relatively low \(R^2\) values.
6.3.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
(a)
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b)
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Let’s visualize the missing values to get a better understanding of the missing data.
Since each affected predictor has only a small number of missing values, I believe it is better to replace the missing entries with the mean of that respective predictor. I believe this to be the best approach because:
Using complete cases only would dramatically reduce the data set, with the downside of losing potentially valuable information currently present in the data.
Replacing just a few records per predictor with the mean barely affects the data set, since only about 1% of the data is missing.
# Procedure to replace NAs with the mean of the respective predictor.
for (i in 1:dim(ChemicalManufacturingProcess)[2]){
  totalNA <- sum(is.na(ChemicalManufacturingProcess[i]))
  if (totalNA > 0){
    print(i)  # report which column contained missing values
    meanColumn <- mean(ChemicalManufacturingProcess[,i], na.rm = TRUE)
    ChemicalManufacturingProcess[i][is.na(ChemicalManufacturingProcess[i])] <- meanColumn
  }
}
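Alternatively, instead of the loop above, caret's preProcess can impute missing values directly; a minimal sketch using k-nearest-neighbours imputation (note that knnImpute also centers and scales the predictors; the object names here are illustrative):
library(caret)
imputeTrans <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
CMP.imputed <- predict(imputeTrans, ChemicalManufacturingProcess)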
(c)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
To pre-process the data, each column is centered and scaled:
\[x_i' = \frac{x_i - \mu}{\sigma}\]
where \(\mu\) and \(\sigma\) are the column's mean and standard deviation.
We can achieve this by employing the preProcess function from the caret library. This function has the ability to transform, center, scale, or impute values.
# Function to pre-process data.
library(caret)
trans <- preProcess(ChemicalManufacturingProcess,
method = c("center", "scale"))
# Need to obtain new transformed values
CMP.trans <- predict(trans, ChemicalManufacturingProcess)
Now, I will split the data into 75 % training data and 25 % test.
# Now, I will split the data as follows:
# Training 75%
# Test 25%
set.seed(123)
n <- nrow(CMP.trans)
trainIndex <- sample(1:n, size = round(0.75*n), replace=FALSE)
CMPtrain <- CMP.trans[trainIndex ,]
CMPtest <- CMP.trans[-trainIndex ,]
Let’s find correlations:
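The chunk that computed these correlations is not shown; a minimal sketch using caret's findCorrelation with a 0.90 cutoff (assuming the correlations are taken over the transformed predictors in CMP.trans, excluding the Yield column):
highCorr <- findCorrelation(cor(CMP.trans[, -1]), cutoff = 0.90)
highCorr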
The list below gives the column numbers of predictors that are highly correlated (correlation \(> 0.90\)) with at least one other predictor.
## [1] 3 5 13 42 55 38 40 44 31 53
PLS Model
From the above list, it is important to note that several predictors are highly correlated. Hence, I will select PLS as my starting model.
# Defining Control for a ten-fold Cross Validation
ctrl <- trainControl(method = "cv", number = 10)
# Defining training sets in order to tune a PLS model
x <- CMPtrain[, -1]
y <- CMPtrain[ , 1]
# Tuning a PLS model
set.seed(100)
plsTune <- train(x, y,
method = "pls",
## The default tuning grid evaluates
## components 1... tuneLength
tuneLength = 20,
trControl = ctrl
#preProc = c("center", "scale")
)
## Partial Least Squares
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.7658526 0.4518044 0.6171320
## 2 0.6757649 0.5505490 0.5537815
## 3 0.6546293 0.5779411 0.5459406
## 4 0.6638579 0.5715837 0.5544711
## 5 0.6835387 0.5547950 0.5811289
## 6 0.6887232 0.5601403 0.5809894
## 7 0.7001701 0.5479305 0.5886709
## 8 0.7214515 0.5421792 0.5996710
## 9 0.7553794 0.5253173 0.6218221
## 10 0.7957133 0.4972207 0.6439174
## 11 0.8457952 0.4598843 0.6746375
## 12 0.8836218 0.4424776 0.6936333
## 13 0.9010614 0.4409993 0.7013384
## 14 0.9127460 0.4408725 0.7020893
## 15 0.9280000 0.4367019 0.7101100
## 16 0.9499455 0.4306951 0.7116352
## 17 0.9802895 0.4241682 0.7105741
## 18 0.9990792 0.4208851 0.7097576
## 19 1.0276590 0.4158179 0.7120654
## 20 1.0779890 0.4015857 0.7296648
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The cross-validation profiles for a Partial Least Squares (PLS) model.
From the above results, 3 components give the lowest RMSE, with a corresponding resampled \(R^2 = 0.5779411\).
(d)
Predict the response for the test set. What is the value of the performance metric and how does this compare with the re-sampled performance metric on the training set?
# PLS Fit and Predictions with up to 8 components.
library(pls)
plsFit <- plsr(Yield ~ ., data = CMPtrain)
predictPLS <- predict(plsFit, CMPtest, ncomp = 1:8)
# Creating a data frame from previous results.
predictPLS.df <- data.frame(predictPLS)
Let’s calculate the test \(R^2\).
# Calculating R-squared on the testing data is a little tricky,
# as you have to remember what your baseline is.
# Your baseline projection is a mean of your training data.
get_R2_3 <- function(c){
# Defining the correct baseline data
y <- CMPtest[1]
# y_bar is the mean of the observed data
y_bar <- mean(CMPtrain[,1])
# Defining the predicted data for the model with c components
f <- predictPLS.df[,c]
# then the variability of the data set can be measured
# using three sums of squares formulas:
# The total sum of squares (proportional to the variance of the data):
SS.total <- sum((y - y_bar)^2)
# The regression sum of squares, also called the explained sum of squares:
SS.regression <- sum((f - y_bar)^2)
# The sum of squares of residuals, also called the residual sum of squares:
SS.residual <- sum((y - f)^2)
# The most general definition of the R Squared coefficient of determination is:
R2 <- 1 - SS.residual/SS.total
R2 <- round(R2,7)
return(R2)
}
# Obtaining R2 values for predicted R2.
R2_Predicted <- ""
for (i in 1:8){
R2_Predicted[i] <- get_R2_3(i)
}
Let’s compare the values; the Components column indicates the number of PLS components used.
| Components | R2_Predicted | R2_Train |
|---|---|---|
| 1 | 0.493486 | 0.4518044 |
| 2 | 0.6625767 | 0.5505490 |
| 3 | 0.6962113 | 0.5779411 |
| 4 | 0.6896897 | 0.5715837 |
| 5 | 0.6733863 | 0.5547950 |
| 6 | 0.6271297 | 0.5601403 |
| 7 | 0.6354934 | 0.5479305 |
| 8 | 0.6446168 | 0.5421792 |
From the above table, it is interesting to note the sizeable gap between the resampled \(R^2\) from training and the \(R^2\) obtained on the test set: the model appears to perform better on the test data. Focusing on the 3-component model, the predicted \(R^2 = 0.6962113\) versus the training (resampled) \(R^2 = 0.5779411\).
(e)
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
A diagram depicting the structure of a PLS model. PLS finds components that simultaneously summarize variation of the predictors while being optimally correlated with the outcome.
Since the PLS components are constructed as linear combinations of the original predictors, and PLS finds components that maximally summarize the variation of the predictors while requiring those components to have maximum correlation with the response, PLS strikes a compromise between dimension reduction of the predictor space and a predictive relationship with the response. In that sense every predictor contributes to the model; however, because 12 predictors describe the input biological material and 45 describe the manufacturing process, the process predictors dominate the list.
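This can also be checked directly with caret's varImp on the plsTune object from part (c); a minimal sketch:
plsImp <- varImp(plsTune)
plot(plsImp, top = 20)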
(f)
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
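The plots referred to below are not reproduced here; a sketch of how they could be generated with caret's featurePlot, where the two predictor names are illustrative stand-ins for the top predictors identified by varImp:
topPredictors <- c("ManufacturingProcess32", "ManufacturingProcess13")  # illustrative choice
featurePlot(x = CMP.trans[, topPredictors],
            y = CMP.trans$Yield,
            plot = "scatter")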
From the above plots, we can see meaningful correlations between the manufacturing process predictors and the yield; thus, by focusing on those manufacturing process steps, the yield of future runs of the manufacturing process could be improved.
References
Kuhn, M. & Johnson, K. 2018. Applied Predictive Modeling. New York: Springer. http://appliedpredictivemodeling.com/.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.