6.2 Permeability

a. Load data

The dataset contains 165 observations (compounds) of 1,107 binary molecular fingerprint predictors. The target is permeability.

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.0.4
data("permeability")
#dim(fingerprints)

b. Filter sparse predictors

Filter low-frequency predictors. How many predictors remain?

Predictors with near-zero variance are removed from the dataset using caret's nearZeroVar function. After removal, 388 molecular predictors remain.

library(caret)
library(dplyr)

# remove near-zero-variance predictors, which have little predictive value
fp_filter <- fingerprints[,-nearZeroVar(fingerprints)]

# get new dimensions
dim(fp_filter)
## [1] 165 388
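
For reference, nearZeroVar can also return the frequency metrics behind each decision (its defaults are freqCut = 95/5 and uniqueCut = 10). A short sketch, not used in the rest of the analysis:

# inspect the metrics used to flag near-zero-variance predictors
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])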

c. Build a PLS model

Split the data, preprocess, and tune the model. How many latent variables are optimal? What is the corresponding resampled estimate of R2?

Since the data contain more predictors than observations, ordinary least squares cannot produce a unique solution. Partial least squares (PLS) overcomes this shortcoming of OLS. PLS usually calls for centered and scaled predictors, but since the predictors are binary this step was skipped. There are no missing values in the dataset. Because the target is skewed, a Box-Cox transformation (lambda = -0.1) was applied prior to splitting the data. An 80/20 train-test split was used, and ten-fold cross validation on the training set was used to tune the number of components. Eight latent variables were optimal; the cross-validated R2 at ncomp = 8 was 0.515, and the final eight-component model achieved R2 = 0.7585 on the training set. A plot of R2 against the number of components and a plot of measured versus fitted values are shown below; the linear relationship suggests a good model fit.

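As a quick check (not part of the original code), we can confirm the claim above that there are no missing values in the filtered predictors or the target:

# verify there are no missing values
anyNA(fp_filter)
anyNA(permeability)
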
# partial least squares package
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(DataExplorer)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# put into dataframe: col 1 is the target 
perm_df <- cbind(permeability, fp_filter) %>%  
  as.data.frame()

# distribution of the target
hist(perm_df$permeability)

# apply Box Cox transformation to the target
BoxCox.lambda(perm_df$permeability)
## [1] -0.1079344
perm_df$permeability <-  BoxCox(perm_df$permeability, lambda =-0.1)

# Train/Test Split
set.seed(2021)
trainIndex <- createDataPartition(perm_df$permeability, p = .8) %>% unlist()
train <- perm_df[ trainIndex,]
test  <- perm_df[-trainIndex,]

# find a value for the tuning parameter ncomp via 10-fold cross validation
ctrl <- trainControl(method="cv", number=10)
train(train[,2:389], train$permeability, method="pls", tuneLength=20, trControl=ctrl)
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 121, 120, 118, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     1.2339687  0.2106623  1.0012125
##    2     1.1019224  0.4096729  0.8381896
##    3     1.0555680  0.4523629  0.8029745
##    4     1.0694140  0.4565116  0.8215200
##    5     1.0402596  0.4866160  0.8069072
##    6     1.0197411  0.4979235  0.7903989
##    7     1.0118420  0.5076031  0.7758057
##    8     0.9947809  0.5148210  0.7642556
##    9     1.0084382  0.5092348  0.7783193
##   10     1.0227309  0.5090948  0.7864260
##   11     1.0248214  0.5111347  0.7989744
##   12     1.0515071  0.4963563  0.8301784
##   13     1.0611571  0.4925760  0.8427364
##   14     1.0666597  0.4900671  0.8540898
##   15     1.0671198  0.4877931  0.8456419
##   16     1.0758890  0.4869658  0.8459993
##   17     1.0937201  0.4826484  0.8608161
##   18     1.1048165  0.4796958  0.8646640
##   19     1.1289543  0.4703343  0.8849121
##   20     1.1474809  0.4636733  0.9004218
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
pls_fit <- plsr(permeability ~ ., data=train, ncomp=8)

R2(pls_fit)
## (Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps  
##      0.0000       0.2071       0.4561       0.5417       0.6212       0.6934  
##     6 comps      7 comps      8 comps  
##      0.7163       0.7407       0.7585
validationplot(pls_fit,val.type="R2")

plot(pls_fit)

d. Predict the test set response

What is the test set estimate of R2?

The R2 on the test set, R2 = 0.594, is lower than on the training set, but the fit suggests a valid model.

predictions <- predict(pls_fit, test[2:389], ncomp=8) 

pls_vals <- data.frame(obs=test$permeability, pred=predictions) %>% 
  dplyr::select(obs, `pred`=permeability.8.comps)
defaultSummary(pls_vals)
##      RMSE  Rsquared       MAE 
## 0.8530225 0.5942218 0.6362122
plot(pls_vals)
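
Note that these predictions are on the Box-Cox scale used for modeling. If permeability on the original scale is needed, forecast's InvBoxCox can undo the transformation; a minimal sketch, not part of the original analysis:

# back-transform predictions to the original permeability scale (lambda = -0.1)
pred_original <- InvBoxCox(pls_vals$pred, lambda = -0.1)
head(pred_original)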

e. Build other models

Do they have better predictive performance?

The ridge and lasso regression models performed marginally better than the PLS model in terms of test-set R2; a side-by-side summary is sketched after the lasso results below.

ridge

library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
# optimize lambda
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=10))
set.seed(666)
ridgeRegFit <- train(train[,-1], train[,1], 
                     method="ridge",
                     tuneGrid=ridgeGrid,
                     trControl=ctrl)
## Warning: model fit failed for Fold01: lambda=0.00000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
# build the ridge model (quadratic penalty lambda = 0.1)
ridge <- enet(x=as.matrix(train[,-1]), y=train[,1], lambda=0.1)

# make predictions
ridge_predictions <- predict(ridge, newx=as.matrix(test[-1]), s=1, mode="fraction") 

# Get a value for R-squared
ridge_vals <- data.frame(obs=test$permeability, pred=ridge_predictions$fit) 
defaultSummary(ridge_vals)
##      RMSE  Rsquared       MAE 
## 0.8710103 0.6347350 0.7000787

lasso

# build model (lambda = 0.1 adds a ridge penalty, so this is strictly an elastic net; a pure lasso would use lambda = 0)
lasso <- enet(x=as.matrix(train[,-1]), y=train[,1], lambda=0.1)

# make predictions
lasso_predictions <- predict(lasso, newx=as.matrix(test[-1]),s=.1, mode="fraction",type="fit") 

# Calculate R-squared
lasso_vals <- data.frame(obs=test$permeability, pred=lasso_predictions$fit) 
defaultSummary(lasso_vals)
##      RMSE  Rsquared       MAE 
## 0.7540995 0.6765546 0.5882158
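
To make the comparison concrete, the test-set metrics of the three models can be collected into a single table; a short sketch using the objects defined above:

# side-by-side test-set performance (all on the Box-Cox scale of the target)
rbind(PLS   = defaultSummary(pls_vals),
      Ridge = defaultSummary(ridge_vals),
      Lasso = defaultSummary(lasso_vals))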

f. Recommendations

Would you recommend any of the models to replace the permeability lab experiment?

No; validating a model requires lab data, and more lab data would also lead to better models.

6.3 Chemical Manufacturing

a. Import Data

The data contain 57 predictors for 176 manufacturing runs; the target is Yield. All variables are numeric.

data("ChemicalManufacturingProcess")
#str(ChemicalManufacturingProcess)

b. Missing Values

Missing values were present in some of the process predictors and were imputed with a k-nearest-neighbor algorithm from the DMwR package.

library(DataExplorer)
library(DMwR)
#apply(is.na(ChemicalManufacturingProcess), 2, sum) %>%  sort(decreasing=T)

ChemicalManufacturingProcess <-  knnImputation(ChemicalManufacturingProcess, k=3)
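
A quick sanity check, not in the original code, confirms the imputation left no missing values:

# confirm that no missing values remain after imputation (should return 0)
sum(is.na(ChemicalManufacturingProcess))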

c. Build model

Preprocess the data, tune a model of choice, and report the optimal value of the performance metric.

The predictors and response were centered and scaled, and an 80/20 train-test split was applied. A PLS model was tuned by ten-fold cross validation on the training set; three latent variables were optimal, with a resampled R2 of 0.621, and the final three-component model achieved R2 = 0.708 on the training set.

# convert to dataframe: target is the first variable
chem_df <- ChemicalManufacturingProcess %>% 
  as.data.frame()

# preprocess: center and scale
chem_df <- scale(chem_df,center=T,scale=T) %>% 
  as.data.frame()
chem_df$Yield %>% hist

# Train/Test Split
set.seed(2021)
trainIndex <- createDataPartition(chem_df$Yield , p = .8) %>% 
  unlist()
train <- chem_df[ trainIndex,]
test  <- chem_df[-trainIndex,]

# find a value for the tuning parameter ncomp via 10-fold cross validation
ctrl <- trainControl(method="cv", number=10)
train(train[,2:58], train$Yield, method="pls", tuneLength=10, trControl=ctrl)
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.7356683  0.4839612  0.5906832
##    2     0.6527268  0.5782169  0.5282370
##    3     0.6242984  0.6210263  0.5041648
##    4     0.6268474  0.6214670  0.5144139
##    5     0.6376797  0.6069507  0.5223590
##    6     0.6469258  0.5915780  0.5291802
##    7     0.6579156  0.5845381  0.5395922
##    8     0.6753553  0.5754387  0.5541001
##    9     0.6919960  0.5686118  0.5597279
##   10     0.7090433  0.5594914  0.5570071
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
# fit a model
pls_fit <- plsr(Yield ~ ., data=train, ncomp=3)

# Compute metrics
R2(pls_fit)
## (Intercept)      1 comps      2 comps      3 comps  
##      0.0000       0.5076       0.6494       0.7084
validationplot(pls_fit,val.type="R2")

plot(pls_fit)

d. Predict the response of the test set

What is the value of the performance metric? How does it compare with the resampled performance metric from the training set?

The test set R2 was 0.507, which is lower than the cross-validated R2 of 0.621 obtained on the training set.

predictions <- predict(pls_fit, test[2:58], ncomp=3) 

pls_vals <- data.frame(obs=test$Yield, pred=predictions) %>% 
  dplyr::select(obs, `pred`=Yield.3.comps)
defaultSummary(pls_vals)
##      RMSE  Rsquared       MAE 
## 0.6718667 0.5065851 0.5756275
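
Because Yield was centered and scaled before modeling, the metrics above are in standard-deviation units. A minimal sketch of returning predictions to the original Yield units, recomputing the center and scale directly from the imputed data:

# convert predicted Yield back to its original units
yield_center <- mean(ChemicalManufacturingProcess$Yield)
yield_scale  <- sd(ChemicalManufacturingProcess$Yield)
pred_yield   <- pls_vals$pred * yield_scale + yield_center
head(pred_yield)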

e. Which predictors are the most important?

Do biological or process predictors dominate?

By comparing the magnitudes of the coefficients, we can determine which predictors are most important in determining Yield in the chemical manufacturing process. Because the predictors were centered and scaled, a direct comparison of the absolute values of the coefficients is possible. The manufacturing process variables were the most important predictors.

coefficients <- coef(pls_fit) %>% as.data.frame()

colnames(coefficients) <- c('coefficient')

# rank predictors by the absolute size of their coefficients,
# keeping the predictor names (they would otherwise be lost when sorting)
coefficients <- coefficients %>% 
  tibble::rownames_to_column("predictor") %>% 
  mutate(coefficient = abs(coefficient)) %>% 
  arrange(desc(coefficient))
head(coefficients)
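
An alternative ranking via caret's varImp is sketched below; because the earlier train() object was not stored, the tuning is re-run here, and the object name pls_caret is introduced only for illustration:

# variable importance from a caret PLS fit (re-tuned, since the earlier result was not saved)
set.seed(2021)
pls_caret <- train(train[,2:58], train$Yield, method="pls", tuneLength=10, trControl=ctrl)
plot(varImp(pls_caret), top = 15)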

f. Explore relationships

Explore the relationships between the top two predictors and the response. How could this information help improve yield in future manufacturing runs?

ManufacturingProcess32 has a positive relationship with Yield, while ManufacturingProcess36 has a negative one. This information could be valuable in guiding future process runs so that yield can be improved.

library(corrplot)
## corrplot 0.84 loaded
## 
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
## 
##     corrplot
top2 <- ChemicalManufacturingProcess %>% 
  dplyr::select(Yield, ManufacturingProcess32, ManufacturingProcess36)

top2 %>% ggplot(aes(x=ManufacturingProcess32,y=Yield)) + geom_point()

top2 %>% ggplot(aes(x=ManufacturingProcess36,y=Yield)) + geom_point()

corrplot(cor(top2),order="hclust")
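
The claimed directions can also be checked numerically; a one-line sketch, not in the original code:

# correlations between Yield and the top two process predictors
round(cor(top2), 2)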