library(caret)
library(pls)
library(tidyverse)
library(AppliedPredictiveModeling)
library(corrplot)
library(psych)
library(e1071)
a.
Load the data
data(permeability)
Summarizing data
summary(permeability)
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
b.
As suggested in the question, we will use the nearZeroVar function, since it is useful for removing sparse and unbalanced variables.
fingerprints <- as.data.frame(fingerprints)
print(paste('Total predictors:', ncol(fingerprints)))
## [1] "Total predictors: 1107"
As we see, the total number of predictors is 1,107
print(paste('Non-Sparse predictors:', ncol(fingerprints[, -nearZeroVar(fingerprints)])))
## [1] "Non-Sparse predictors: 388"
nearZeroVar flags the sparse and unbalanced predictors, which are then removed, leaving 388 predictors.
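To see why columns get flagged, nearZeroVar can also return the metrics it thresholds on (freqRatio and percentUnique); nzv_stats below is just an illustrative name:
nzv_stats <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_stats[nzv_stats$nzv, ]) # a few of the flagged sparse/unbalanced predictors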
c.
Preprocess the data with nearZeroVar, which removes the sparse and unbalanced predictors.
fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
Split the dataset into training and test sets
set.seed(0)
smp_size <- floor(0.8 * nrow(fingerprints))
train_ind <- sample(seq_len(nrow(fingerprints)), size = smp_size)
Xtrain <- fingerprints[train_ind, ]
Xtest <- fingerprints[-train_ind, ]
ytrain <- permeability[train_ind, ]
ytest <- permeability[-train_ind, ]
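We use a simple random split here; as an aside, caret’s createDataPartition would give a split stratified on the response. A minimal sketch (in_train is an illustrative name; the assignments are commented out so the split above is the one actually used):
set.seed(0)
in_train <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
# Xtrain <- fingerprints[in_train, ]; ytrain <- permeability[in_train, ]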
Tuning a PLS model
#model
set.seed(0)
plsTune <- train(Xtrain, ytrain,
                 method = "pls",
                 tuneLength = 30,
                 preProc = c("center", "scale"),
                 trControl = trainControl(method = 'cv', 10))
Let’s plot the cross-validation profile of the model (RMSE versus the number of PLS components)
plot(plsTune)
The optimal number of latent variables:
(lv <- which.min(plsTune$results$RMSE))
## [1] 7
As the plot shows, the optimal number of latent variables is 7
Now, let’s find the corresponding resampled estimate of R2
print(plsTune$results[lv,3])
## [1] 0.4599706
As we see, the R-squared corresponding to 7 latent variables is 0.4599706.
d.
Predicting the response on the test set and calculating the test set estimate of R2
pls_pred <- predict(plsTune, Xtest)
plot(pls_pred, ytest, main=paste("Predicted vs Observed Permeability, PLS Model with", lv, "Components"), xlab="Predicted", ylab="Actual")
(cor(ytest, pls_pred) ^ 2)
## [1] 0.5366691
Therefore, the test set R-squared with the PLS model is 0.5366691
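Equivalently, caret’s postResample reports RMSE, R-squared, and MAE in one call; its R-squared is the squared correlation, so it should match the value above:
postResample(pred = pls_pred, obs = ytest)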
e.
Several other models discussed in this chapter can be fit to see whether they offer a better predictive response: the penalized regression models, namely ridge regression, the lasso, and the elastic net.
Ridge regression method
set.seed(0)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 20))
ridgeTune <- train(Xtrain, ytrain,
                   method = "ridge",
                   tuneGrid = ridgeGrid,
                   trControl = trainControl(method = 'cv', 10),
                   preProc = c("center", "scale"))
ridge_pred <- predict(ridgeTune, Xtest)
print(cor(ytest, ridge_pred) ^ 2)
## [1] 0.4191791
As we see, the test set R-squared with ridge regression is 0.4191791
Elastic Net method
set.seed(0)
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enetTune <- train(Xtrain, ytrain,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = trainControl(method = 'cv', 10),
                  preProc = c("center", "scale"))
enet_pred <- predict(enetTune, Xtest)
print(cor(ytest, enet_pred) ^ 2)
## [1] 0.4863764
The test set R-squared with the elastic net model is 0.4863764.
Lasso Regression method
set.seed(1)
lassoFit <- train(x = Xtrain, y = ytrain,
                  method = 'lasso',
                  metric = 'Rsquared',
                  tuneGrid = data.frame(.fraction = seq(0, 0.5, by = 0.05)),
                  trControl = trainControl(method = 'cv', 10),
                  preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = Xtrain, y = ytrain, method = "lasso", metric =
## "Rsquared", : missing values found in aggregated results
lasso_pred <- predict(lassoFit, Xtest)
print(cor(ytest, lasso_pred) ^ 2)
## [1] 0.5600347
The test set R-squared with the lasso model is 0.5600347. (The warnings above likely arise because the tuning grid includes fraction = 0, which yields constant predictions with an undefined R-squared.)
plot(lassoFit)
Therefore, among the models above, the lasso has the highest test set R-squared, ahead of PLS, the elastic net, and ridge regression.
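For reference, the tuning values caret selected for each penalized model can be read off the fitted objects:
ridgeTune$bestTune # selected lambda for ridge
enetTune$bestTune  # selected lambda and fraction for the elastic net
lassoFit$bestTune  # selected fraction for the lasso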
Now, let’s summarize the resampled performance of all the models
resamp <- resamples(list(PLS=plsTune, Ridge=ridgeTune, Lasso=lassoFit, enet=enetTune))
(resamp.s <- summary(resamp))
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, Ridge, Lasso, enet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 6.075969 7.411221 8.298590 8.973499 10.564444 13.16217 0
## Ridge 6.187568 7.729724 8.348537 9.191375 9.813148 14.30403 0
## Lasso 7.549591 7.949481 9.268991 9.488784 10.435625 13.66312 0
## enet 5.866445 6.941955 7.986778 8.566948 8.979818 13.28066 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 7.736314 9.568023 11.55162 12.18095 13.60852 18.13738 0
## Ridge 9.356785 10.052108 12.25675 12.99319 14.00416 19.55836 0
## Lasso 9.252343 10.505785 12.00145 12.60549 14.78490 16.93082 0
## enet 7.770362 9.339625 11.65216 11.95094 12.24416 19.09025 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.0917526390 0.2306658 0.5511770 0.4599706 0.6549881 0.7645322 0
## Ridge 0.1340257666 0.2124637 0.4323068 0.4278519 0.6249448 0.8067128 0
## Lasso 0.1579702758 0.4015255 0.5236534 0.5353705 0.6828748 0.9263229 0
## enet 0.0005609121 0.2450148 0.5847285 0.4813678 0.6864913 0.7904524 0
As we see here, the lasso has the highest mean resampled R-squared.
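The resampling distributions can also be compared visually, for example with a box-and-whisker plot of the 10 folds:
bwplot(resamp, metric = "Rsquared")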
Now, let’s evaluate the models on the test set
# Compute test set metrics (via postResample) for a list of fitted models
multiResample <- function(models, newdata, obs){
  res <- list()
  methods <- c()
  i <- 1
  for (model in models){
    pred <- predict(model, newdata = newdata)        # test set predictions
    metrics <- postResample(pred = pred, obs = obs)  # RMSE, Rsquared, MAE
    res[[i]] <- metrics
    methods[[i]] <- model$method
    i <- i + 1
  }
  names(res) <- methods
  return(res)
}
models <- list(plsTune, ridgeTune, lassoFit, enetTune)
(resampleResult <- multiResample(models, Xtest, ytest))
## $pls
## RMSE Rsquared MAE
## 10.5988938 0.5366691 7.8881514
##
## $ridge
## RMSE Rsquared MAE
## 12.8579086 0.4191791 8.4360630
##
## $lasso
## RMSE Rsquared MAE
## 10.2128685 0.5600347 6.8014518
##
## $enet
## RMSE Rsquared MAE
## 10.9811713 0.4863764 6.9035123
As the results above show, on the test set the lasso has the highest R-squared and the lowest RMSE.
This matches our 10-fold cross-validation results, which also favor the lasso over the other predictive methods.
f.
Based on the performance of all the models, I would not recommend replacing the permeability laboratory experiment with any of them. All of the models have an MAE between 6 and 8, which means their predictions are off by roughly 6 to 8 units on average.
To confirm that this error is large, let’s look at the distribution of the target variable permeability
hist(permeability)
As we can see, most permeability values fall in the 0-to-10 range, so an average error of 6 to 8 means the models are not accurate enough to replace the lab test.
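To make that concrete, we can set the best mean resampled MAE against the scale of the response (both objects were computed above):
median(permeability)                   # typical permeability is about 4.9
min(resamp.s$statistics$MAE[, "Mean"]) # best mean resampled MAE is about 8.6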
a.
Load the data
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
b.
As before, we will use the nearZeroVar function to remove sparse and unbalanced predictors:
cmp <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]
print(paste('Total predictors:', ncol(ChemicalManufacturingProcess)))
## [1] "Total predictors: 58"
As we see, the data frame has 58 columns in total.
print(paste('Non-Sparse predictors:', ncol(ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)])))
## [1] "Non-Sparse predictors: 57"
nearZeroVar flags one sparse, unbalanced predictor, which is removed, leaving 57 columns.
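Before imputing, it is worth quantifying the missingness; a quick check:
sum(is.na(cmp))                                    # total missing cells
head(sort(colSums(is.na(cmp)), decreasing = TRUE)) # columns with the most NAs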
Let’s use kNN imputation to fill in the missing values. The preProcess object must be applied with predict to obtain the imputed data:
pp <- preProcess(as.data.frame(cmp), method = "knnImpute", k = 10)
cmp <- predict(pp, as.data.frame(cmp)) # knnImpute also centers and scales the columns
c.
Split the dataset into test and training set.
# test train split
set.seed(0)
smp_size <- floor(0.8 * nrow(cmp))
train_ind <- sample(seq_len(nrow(cmp)), size = smp_size)
Xtrain <- cmp[train_ind, -1]
Xtest <- cmp[-train_ind, -1]
ytrain <- cmp[train_ind, 1]
ytest <- cmp[-train_ind, 1]
Now, let’s analyze the dataset for any skewness in the data points
multi.hist(Xtrain, main = '', bcol = 'blue')
The histograms of the training features show that many of them have skewed distributions.
Let’s confirm this using the skewness function
head(sort(apply(Xtrain, 2, skewness)), 8)
## ManufacturingProcess26 ManufacturingProcess25 ManufacturingProcess27
## -10.635856 -10.595888 -10.555903
## ManufacturingProcess31 ManufacturingProcess29 ManufacturingProcess39
## -10.107105 -9.556297 -7.230829
## ManufacturingProcess30 ManufacturingProcess02
## -4.947736 -2.181194
tail(sort(apply(Xtrain, 2, skewness)), 2)
## ManufacturingProcess06 ManufacturingProcess43
## 3.764174 8.986510
The skewness function confirms that several features are strongly skewed. The data will be centered and scaled before modeling, though note that centering and scaling change location and spread, not skewness.
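If we wanted to address the skewness directly, a power transformation could be folded into the preprocessing. A minimal sketch, not used for the models below (pp_yj is an illustrative name):
pp_yj <- preProcess(Xtrain, method = c("YeoJohnson", "center", "scale"))
Xtrain_yj <- predict(pp_yj, Xtrain) # transformed copy of the training predictors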
Tuning a PLS model
set.seed(0)
plsTune <- train(Xtrain, ytrain,
                 method = "pls",
                 tuneLength = 30,
                 preProc = c("center", "scale"),
                 trControl = trainControl(method = 'cv', 10))
Plot the pls model
plot(plsTune)
The optimal number of latent variables is:
lv <- which.min(plsTune$results$RMSE)
paste(lv)
## [1] "2"
According to the plot, the optimal number of latent variables is 2
R-squared value for the training data:
print( plsTune$results[lv,3])
## [1] 0.5411331
The resampled R-squared for the PLS model with 2 latent variables is 0.5411331.
d.
Predicting response on test set:
pls_pred <- predict(plsTune, Xtest)
print( cor(ytest, pls_pred) ^ 2)
## [1] 0.4985139
The test set R-squared with the PLS model is 0.4985139.
As we compare, the test set R-squared is lower than the resampled training R-squared.
e.
Let’s use varImp to analyze which predictors are most important:
plot(varImp(plsTune), top = 10)
As we see from the graph above, most important predictors are ManufacturingProcess09, ManufacturingProcess13, ManufacturingProcess32, ManufacturingProcess17, and ManufacturingProcess36.
The manufacturing process predictors are more important than the biological material predictors.
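The numeric scores behind the plot can be pulled out of the varImp object (imp is an illustrative name):
imp <- varImp(plsTune)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10) # top 10 predictors by importance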
f.
Let’s explore the relationship between each of the top predictors and the response using a correlation plot
cmp %>%
select(c('ManufacturingProcess09','ManufacturingProcess13','ManufacturingProcess32','ManufacturingProcess17','ManufacturingProcess36',
'BiologicalMaterial02', 'BiologicalMaterial03', 'BiologicalMaterial06', 'ManufacturingProcess06', 'BiologicalMaterial04', 'Yield')) %>%
cor() %>%
corrplot(method = 'circle')
As we see from the correlation plot above, the relationships between our top five predictors and yield are as follows:
ManufacturingProcess09 and yield: moderate and positive
ManufacturingProcess13 and yield: strong and negative
ManufacturingProcess32 and yield: moderate and positive
ManufacturingProcess17 and yield: moderate and negative
ManufacturingProcess36 and yield: moderate and negative
Also, the correlation plot shows substantial correlations among the predictors themselves, and this information can be useful for improving yield.
Based on this analysis, processes with negative correlations with yield can be reduced, while processes with positive correlations can be enhanced.
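To back the visual reading with numbers, the correlations of the top five processes with yield can be computed directly (top5 is an illustrative name; the signs should match the plot above):
top5 <- c('ManufacturingProcess09', 'ManufacturingProcess13', 'ManufacturingProcess32',
          'ManufacturingProcess17', 'ManufacturingProcess36')
cor(cmp[, top5], cmp$Yield)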