The dataset contains 165 observations (compounds) on 1107 binary molecular predictors. The target is permeability.
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.0.4
data("permeability")
#dim(fingerprints)
Filter low-frequency predictors. How many predictors remain?
Variables with sparse observations are removed from the dataset using the nearZeroVar
function. After removal of sparse variables, 388 molecular predictors remain.
library(caret)
library(dplyr)
# function removes sparse variables with no predictive power
fp_filter <- fingerprints[,-nearZeroVar(fingerprints)]
# get new dimensions
dim(fp_filter)
## [1] 165 388
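As a check on what the filter removed, nearZeroVar can also return its per-predictor diagnostics with saveMetrics = TRUE; a minimal sketch, assuming the caret package and fingerprints matrix loaded above:
# inspect the diagnostics behind the filter: frequency ratio, percent unique, and the nzv flag
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])
# number of predictors flagged for removal
sum(nzv_metrics$nzv)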
Split the data, preprocess, and tune the model. How many latent variables are optimal? What is the corresponding resampled estimate of R2?
Since the data contain more predictors than observations, ordinary least squares cannot compute a unique solution. The Partial Least Squares (PLS) method overcomes this shortcoming of OLS. PLS usually calls for centered and scaled predictors, but because the predictors here are binary this consideration does not apply. There are no missing values in the dataset. Because the target is skewed, a Box-Cox transformation (lambda = -0.1) was applied to it prior to splitting the data. An 80/20 train-test split was applied, and ten-fold cross-validation was performed on the training set to choose the tuning parameter, the number of components. Eight latent variables were optimal; the resampled (cross-validated) R2 at ncomp = 8 was 0.515, and the R2 of the final eight-component model on the training set was 0.7585. A graph showing how R-squared increases with the number of components and a plot of measured vs. fitted values are provided. The linear relationship suggests a good model fit.
# partial least squares package
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(DataExplorer)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# put into dataframe: col 1 is the target
perm_df <- cbind(permeability, fp_filter) %>%
as.data.frame()
# distribution of the target
hist(perm_df$permeability)
# apply Box Cox transformation to the target
BoxCox.lambda(perm_df$permeability)
## [1] -0.1079344
perm_df$permeability <- BoxCox(perm_df$permeability, lambda =-0.1)
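To confirm the transformation reduced the skew, the histogram of the target can be redrawn on the transformed scale (same perm_df as above):
# distribution of the target after the Box-Cox transformation
hist(perm_df$permeability)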
# Train/Test Split
set.seed(2021)
trainIndex <- createDataPartition(perm_df$permeability, p = .8) %>% unlist()
train <- perm_df[ trainIndex,]
test <- perm_df[-trainIndex,]
# find value for tuning parameter ncomp
ctrl <- trainControl(method="cv", number=10)
train(train[,2:389], train$permeability, method="pls", tuneLength=20, trControl=ctrl)
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 121, 120, 118, 118, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.2339687 0.2106623 1.0012125
## 2 1.1019224 0.4096729 0.8381896
## 3 1.0555680 0.4523629 0.8029745
## 4 1.0694140 0.4565116 0.8215200
## 5 1.0402596 0.4866160 0.8069072
## 6 1.0197411 0.4979235 0.7903989
## 7 1.0118420 0.5076031 0.7758057
## 8 0.9947809 0.5148210 0.7642556
## 9 1.0084382 0.5092348 0.7783193
## 10 1.0227309 0.5090948 0.7864260
## 11 1.0248214 0.5111347 0.7989744
## 12 1.0515071 0.4963563 0.8301784
## 13 1.0611571 0.4925760 0.8427364
## 14 1.0666597 0.4900671 0.8540898
## 15 1.0671198 0.4877931 0.8456419
## 16 1.0758890 0.4869658 0.8459993
## 17 1.0937201 0.4826484 0.8608161
## 18 1.1048165 0.4796958 0.8646640
## 19 1.1289543 0.4703343 0.8849121
## 20 1.1474809 0.4636733 0.9004218
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
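The call above prints the resampling results but does not store them. Assigning the result to an object makes it easy to pull the chosen ncomp and its resampled metrics programmatically; a sketch using the same ctrl object and training data (rerunning may resample differently unless a seed is set first):
# store the tuned model so the selected ncomp and resampled metrics can be queried
pls_tune <- train(train[, 2:389], train$permeability,
                  method = "pls", tuneLength = 20, trControl = ctrl)
pls_tune$bestTune$ncomp
# resampled RMSE / Rsquared / MAE at the selected ncomp
subset(pls_tune$results, ncomp == pls_tune$bestTune$ncomp)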
pls_fit <- plsr(permeability ~ ., data=train, ncomp=8)
R2(pls_fit)
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## 0.0000 0.2071 0.4561 0.5417 0.6212 0.6934
## 6 comps 7 comps 8 comps
## 0.7163 0.7407 0.7585
validationplot(pls_fit,val.type="R2")
plot(pls_fit)
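Note that R2(pls_fit) above is computed on the training data. The pls package can also carry out its own cross-validation, which yields a CV-based R2 curve directly; a minimal sketch on the same training set:
# refit with pls's built-in 10-fold cross-validation and report CV R2
pls_cv <- plsr(permeability ~ ., data = train, ncomp = 8, validation = "CV")
R2(pls_cv, estimate = "CV")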
What is the test set estimate of R2?
The R2 on the test set, R2 = 0.594, is lower than the training set estimate, but the fit still suggests a valid model.
predictions <- predict(pls_fit, test[2:389], ncomp=8)
pls_vals <- data.frame(obs=test$permeability, pred=predictions) %>%
dplyr::select(obs, `pred`=permeability.8.comps)
defaultSummary(pls_vals)
## RMSE Rsquared MAE
## 0.8530225 0.5942218 0.6362122
plot(pls_vals)
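Because the target was Box-Cox transformed before modeling, the predictions above are on the transformed scale. If values on the original permeability scale are needed, forecast::InvBoxCox can reverse the transform; a minimal sketch using the same lambda = -0.1 (the forecast package is already attached above):
# back-transform predictions to the original permeability scale
pred_original <- InvBoxCox(pls_vals$pred, lambda = -0.1)
head(pred_original)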
Do other models have better predictive performance?
The ridge and lasso regression models performed marginally better than the PLS model in terms of test set R2.
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
# optimize lambda
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=10))
set.seed(666)
ridgeRegFit <- train(train[,-1], train[,1],
method="ridge",
tuneGrid=ridgeGrid,
trControl=ctrl)
## Warning: model fit failed for Fold01: lambda=0.00000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
# build model
ridge <- enet(x=as.matrix(train[,-1]), y=train[,1], lambda=0.1)
# make predictions; s = 1 keeps the full solution path (pure ridge, no L1 shrinkage)
ridge_predictions <- predict(ridge, newx=as.matrix(test[-1]), s=1, mode="fraction")
# Get a value for R-squared
ridge_vals <- data.frame(obs=test$permeability, pred=ridge_predictions$fit)
defaultSummary(ridge_vals)
## RMSE Rsquared MAE
## 0.8710103 0.6347350 0.7000787
# build model (note: lambda = 0.1 keeps a small ridge penalty, so this is strictly an elastic net rather than a pure lasso)
lasso <- enet(x=as.matrix(train[,-1]), y=train[,1], lambda=0.1)
# make predictions; s = 0.1 retains only 10% of the full L1 norm (heavy shrinkage)
lasso_predictions <- predict(lasso, newx=as.matrix(test[-1]),s=.1, mode="fraction",type="fit")
# Calculate R-squared
lasso_vals <- data.frame(obs=test$permeability, pred=lasso_predictions$fit)
defaultSummary(lasso_vals)
## RMSE Rsquared MAE
## 0.7540995 0.6765546 0.5882158
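The lasso fraction s = 0.1 above was set by hand. It could also be tuned by cross-validation with caret, just as the ridge penalty was; a sketch assuming the same training data and ctrl object (method = "lasso" tunes the L1 fraction):
# tune the lasso fraction by 10-fold cross-validation instead of fixing s
set.seed(666)
lassoFit <- train(train[, -1], train[, 1],
                  method = "lasso",
                  tuneGrid = data.frame(.fraction = seq(0.05, 0.5, length = 10)),
                  trControl = ctrl)
lassoFit$bestTune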
Would you recommend any of these models to replace the permeability laboratory experiment?
No. Validating the models requires lab data, and more lab data would also inform better models.
The data contain 57 predictors for 176 manufacturing runs. The target is Yield. All variables are numeric.
data("ChemicalManufacturingProcess")
#str(ChemicalManufacturingProcess)
Missing values were present in some of the process predictors and were imputed with a k-nearest-neighbor imputation algorithm from the DMwR package.
library(DataExplorer)
library(DMwR)
#apply(is.na(ChemicalManufacturingProcess), 2, sum) %>% sort(decreasing=T)
ChemicalManufacturingProcess <- knnImputation(ChemicalManufacturingProcess, k=3)
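A quick check that the imputation left no missing values:
# confirm no missing values remain after kNN imputation
sum(is.na(ChemicalManufacturingProcess))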
Preprocess the data and tune a model of choice. What is the optimal value of the performance metric?
# convert to dataframe: target is the first variable
chem_df <- ChemicalManufacturingProcess %>%
as.data.frame()
# preprocess: center and scale
chem_df <- scale(chem_df,center=T,scale=T) %>%
as.data.frame()
chem_df$Yield %>% hist
# Train/Test Split
set.seed(2021)
trainIndex <- createDataPartition(chem_df$Yield , p = .8) %>%
unlist()
train <- chem_df[ trainIndex,]
test <- chem_df[-trainIndex,]
# find value for tuning parameter ncomp
ctrl <- trainControl(method="cv", number=10)
train(train[,2:58], train$Yield, method="pls", tuneLength=10, trControl=ctrl)
## Partial Least Squares
##
## 144 samples
## 57 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 131, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.7356683 0.4839612 0.5906832
## 2 0.6527268 0.5782169 0.5282370
## 3 0.6242984 0.6210263 0.5041648
## 4 0.6268474 0.6214670 0.5144139
## 5 0.6376797 0.6069507 0.5223590
## 6 0.6469258 0.5915780 0.5291802
## 7 0.6579156 0.5845381 0.5395922
## 8 0.6753553 0.5754387 0.5541001
## 9 0.6919960 0.5686118 0.5597279
## 10 0.7090433 0.5594914 0.5570071
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
# fit a model
pls_fit <- plsr(Yield ~ ., data=train, ncomp=3)
# Compute metrics
R2(pls_fit)
## (Intercept) 1 comps 2 comps 3 comps
## 0.0000 0.5076 0.6494 0.7084
validationplot(pls_fit,val.type="R2")
plot(pls_fit)
What is the value of the performance metric on the test set? How does it compare with the resampled performance metric on the training set?
The test set RMSE is 0.672 and R2 is 0.507, somewhat worse than the resampled training estimates (RMSE = 0.624, R2 = 0.621 at ncomp = 3).
predictions <- predict(pls_fit, test[2:58], ncomp=3)
pls_vals <- data.frame(obs=test$Yield, pred=predictions) %>%
dplyr::select(obs, `pred`=Yield.3.comps)
defaultSummary(pls_vals)
## RMSE Rsquared MAE
## 0.6718667 0.5065851 0.5756275
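Because the entire data frame, including Yield, was standardized before modeling, the RMSE above is in standard-deviation units. A sketch for expressing predictions back in the original Yield units, using the mean and standard deviation of the imputed ChemicalManufacturingProcess data:
# convert standardized predictions back to the original Yield scale
yield_mean <- mean(ChemicalManufacturingProcess$Yield)
yield_sd <- sd(ChemicalManufacturingProcess$Yield)
head(pls_vals$pred * yield_sd + yield_mean)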
Do biological or process predictors dominate?
By comparing the magnitude of the coefficients, we can determine which of the predictors are most important in determining Yield in the chemical manufacturing process. Because the predictors were centered and scaled, a direct comparison of absolute values of coefficients is possible. The manufacturing processes were the most important predictors.
coefficients <- coef(pls_fit) %>% as.data.frame()
colnames(coefficients) <- c('coefficient')
# keep the predictor names (dplyr::arrange drops row names), then rank by absolute coefficient
coefficients <- coefficients %>%
tibble::rownames_to_column("predictor") %>%
mutate(coefficient = abs(coefficient)) %>%
arrange(desc(coefficient))
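A quick look at the first rows of the sorted table confirms which predictors rank highest (using the coefficients data frame built above):
# top ten predictors by absolute PLS coefficient
head(coefficients, 10)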
Explore the relationships between the top two predictors and the response. How could this information help improve yield in future manufacturing runs?
ManufacturingProcess32 has a positive relationship with Yield, while ManufacturingProcess36 has a negative one. This information could be valuable in guiding future process runs so that yield can be improved.
library(corrplot)
## corrplot 0.84 loaded
##
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
##
## corrplot
top2 <- ChemicalManufacturingProcess %>%
dplyr::select(Yield, ManufacturingProcess32, ManufacturingProcess36)
top2 %>% ggplot(aes(x=ManufacturingProcess32,y=Yield)) + geom_point()
top2 %>% ggplot(aes(x=ManufacturingProcess36,y=Yield)) + geom_point()
corrplot(cor(top2),order="hclust")
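The numeric correlations behind the plot can also be printed directly from the same top2 data frame:
# correlation of Yield with the two top-ranked process predictors
cor(top2)[, "Yield"]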