library(caret)
library(pls)
library(tidyverse)
library(AppliedPredictiveModeling)
library(corrplot)
library(psych)
library(e1071)
a.
Load the data
data(permeability)
Summarizing data
summary(permeability)
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
b.
As suggested in the question, we will use the nearZeroVar function, since it is useful for removing sparse and unbalanced variables.
fingerprints <- as.data.frame(fingerprints)
print(paste('Total predictors:', ncol(fingerprints)))
## [1] "Total predictors: 1107"
As we see, the total number of predictors is 1,107
print(paste('Non-Sparse predictors:', ncol(fingerprints[, -nearZeroVar(fingerprints)])))
## [1] "Non-Sparse predictors: 388"
nearZeroVar flags the sparse and unbalanced predictors, which are then removed, leaving 388 predictors.
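To see why columns get flagged, nearZeroVar can also return the metrics it thresholds on (freqRatio and percentUnique); nzv_stats below is just an illustrative name:
nzv_stats <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_stats[nzv_stats$nzv, ]) # a few of the flagged sparse/unbalanced predictors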
c.
Preprocess the data with nearZeroVar, which removes the sparse and unbalanced predictors.
fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
Split the dataset into training and test sets
set.seed(0)
smp_size <- floor(0.8 * nrow(fingerprints))
train_ind <- sample(seq_len(nrow(fingerprints)), size = smp_size)
Xtrain <- fingerprints[train_ind, ]
Xtest <- fingerprints[-train_ind, ]
ytrain <- permeability[train_ind, ]
ytest <- permeability[-train_ind, ]
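We use a simple random split here; as an aside, caret’s createDataPartition would give a split stratified on the response. A minimal sketch (in_train is an illustrative name; the assignments are commented out so the split above is the one actually used):
set.seed(0)
in_train <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
# Xtrain <- fingerprints[in_train, ]; ytrain <- permeability[in_train, ]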
Tuning a PLS model
#model
set.seed(0)
plsTune <- train(Xtrain, ytrain,
                 method = "pls",
                 tuneLength = 30,
                 preProc = c("center", "scale"),
                 trControl = trainControl(method = 'cv', 10))
Let’s plot the cross-validation profile of the model (RMSE versus the number of PLS components)
plot(plsTune)
The optimal number of latent variables:
(lv <- which.min(plsTune$results$RMSE))
## [1] 7
As the plot shows, the optimal number of latent variables is 7
Now, let’s find the corresponding resampled estimate of R2
print(plsTune$results[lv,3])
## [1] 0.4599706
As we see, the R-squared corresponding to 7 latent variables is 0.4599706.
d.
Predicting the response on the test set and calculating the test set estimate of R2
pls_pred <- predict(plsTune, Xtest)
plot(pls_pred, ytest, main=paste("Predicted vs Observed Permeability, PLS Model with", lv, "Components"), xlab="Predicted", ylab="Actual")
(cor(ytest, pls_pred) ^ 2)
## [1] 0.5366691
Therefore, the test set R-squared with the PLS model is 0.5366691
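Equivalently, caret’s postResample reports RMSE, R-squared, and MAE in one call; its R-squared is the squared correlation, so it should match the value above:
postResample(pred = pls_pred, obs = ytest)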
e.
Several other models discussed in this chapter can be fit to see whether they offer a better predictive response: the penalized regression models, namely ridge regression, the lasso, and the elastic net.
Ridge regression method
set.seed(0)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 20))
ridgeTune <- train(Xtrain, ytrain,
                   method = "ridge",
                   tuneGrid = ridgeGrid,
                   trControl = trainControl(method = 'cv', 10),
                   preProc = c("center", "scale"))
ridge_pred <- predict(ridgeTune, Xtest)
print(cor(ytest, ridge_pred) ^ 2)
## [1] 0.4191791
As we see, the test set R-squared with ridge regression is 0.4191791
Elastic Net method
set.seed(0)
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enetTune <- train(Xtrain, ytrain,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = trainControl(method = 'cv', 10),
                  preProc = c("center", "scale"))
enet_pred <- predict(enetTune, Xtest)
print(cor(ytest, enet_pred) ^ 2)
## [1] 0.4863764
The test set R-squared with the elastic net model is 0.4863764.
Lasso Regression method
set.seed(1)
lassoFit <- train(x = Xtrain, y = ytrain,
                  method = 'lasso',
                  metric = 'Rsquared',
                  tuneGrid = data.frame(.fraction = seq(0, 0.5, by = 0.05)),
                  trControl = trainControl(method = 'cv', 10),
                  preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = Xtrain, y = ytrain, method = "lasso", metric =
## "Rsquared", : missing values found in aggregated results
lasso_pred <- predict(lassoFit, Xtest)
print(cor(ytest, lasso_pred) ^ 2)
## [1] 0.5600347
The test set R-squared with the lasso model is 0.5600347. (The warnings above likely arise because the tuning grid includes fraction = 0, which yields constant predictions with an undefined R-squared.)
plot(lassoFit)
Therefore, among the models above, the lasso has the highest test set R-squared, ahead of PLS, the elastic net, and ridge regression.
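For reference, the tuning values caret selected for each penalized model can be read off the fitted objects:
ridgeTune$bestTune # selected lambda for ridge
enetTune$bestTune  # selected lambda and fraction for the elastic net
lassoFit$bestTune  # selected fraction for the lasso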
Now, let’s summarize the resampled performance of all the models
resamp <- resamples(list(PLS=plsTune, Ridge=ridgeTune, Lasso=lassoFit, enet=enetTune))
(resamp.s <- summary(resamp))
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, Ridge, Lasso, enet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 6.075969 7.411221 8.298590 8.973499 10.564444 13.16217 0
## Ridge 6.187568 7.729724 8.348537 9.191375 9.813148 14.30403 0
## Lasso 7.549591 7.949481 9.268991 9.488784 10.435625 13.66312 0
## enet 5.866445 6.941955 7.986778 8.566948 8.979818 13.28066 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 7.736314 9.568023 11.55162 12.18095 13.60852 18.13738 0
## Ridge 9.356785 10.052108 12.25675 12.99319 14.00416 19.55836 0
## Lasso 9.252343 10.505785 12.00145 12.60549 14.78490 16.93082 0
## enet 7.770362 9.339625 11.65216 11.95094 12.24416 19.09025 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.0917526390 0.2306658 0.5511770 0.4599706 0.6549881 0.7645322 0
## Ridge 0.1340257666 0.2124637 0.4323068 0.4278519 0.6249448 0.8067128 0
## Lasso 0.1579702758 0.4015255 0.5236534 0.5353705 0.6828748 0.9263229 0
## enet 0.0005609121 0.2450148 0.5847285 0.4813678 0.6864913 0.7904524 0
As we see here, the lasso has the highest mean resampled R-squared.
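The resampling distributions can also be compared visually, for example with a box-and-whisker plot of the 10 folds:
bwplot(resamp, metric = "Rsquared")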
Now, let’s evaluate the models on the test set
# Compute test set metrics (via postResample) for a list of fitted models
multiResample <- function(models, newdata, obs){
  res <- list()
  methods <- c()
  i <- 1
  for (model in models){
    pred <- predict(model, newdata = newdata)        # test set predictions
    metrics <- postResample(pred = pred, obs = obs)  # RMSE, Rsquared, MAE
    res[[i]] <- metrics
    methods[[i]] <- model$method
    i <- i + 1
  }
  names(res) <- methods
  return(res)
}
models <- list(plsTune, ridgeTune, lassoFit, enetTune)
(resampleResult <- multiResample(models, Xtest, ytest))
## $pls
## RMSE Rsquared MAE
## 10.5988938 0.5366691 7.8881514
##
## $ridge
## RMSE Rsquared MAE
## 12.8579086 0.4191791 8.4360630
##
## $lasso
## RMSE Rsquared MAE
## 10.2128685 0.5600347 6.8014518
##
## $enet
## RMSE Rsquared MAE
## 10.9811713 0.4863764 6.9035123
As the results above show, on the test set the lasso has the highest R-squared and the lowest RMSE.
This matches our 10-fold cross-validation results, which also favor the lasso over the other predictive methods.
f.
Based on the performance of all the models, I would not recommend replacing the permeability laboratory experiment with any of them. All of the models have an MAE between 6 and 8, which means their predictions are off by roughly 6 to 8 units on average.
To confirm that this error is large, let’s look at the distribution of the target variable permeability
hist(permeability)
As we can see, most permeability values fall in the 0-to-10 range, so an average error of 6 to 8 means the models are not accurate enough to replace the lab test.
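To make that concrete, we can set the best mean resampled MAE against the scale of the response (both objects were computed above):
median(permeability)                   # typical permeability is about 4.9
min(resamp.s$statistics$MAE[, "Mean"]) # best mean resampled MAE is about 8.6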
a.
Load the data
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
b.
As before, we will use the nearZeroVar function to remove sparse and unbalanced predictors:
cmp <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]
print(paste('Total predictors:', ncol(ChemicalManufacturingProcess)))
## [1] "Total predictors: 58"
As we see, the data frame has 58 columns in total.
print(paste('Non-Sparse predictors:', ncol(ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)])))
## [1] "Non-Sparse predictors: 57"
nearZeroVar flags one sparse, unbalanced predictor, which is removed, leaving 57 columns.
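Before imputing, it is worth quantifying the missingness; a quick check:
sum(is.na(cmp))                                    # total missing cells
head(sort(colSums(is.na(cmp)), decreasing = TRUE)) # columns with the most NAs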
Let’s use kNN imputation to fill in the missing values. The preProcess object must be applied with predict to obtain the imputed data:
pp <- preProcess(as.data.frame(cmp), method = "knnImpute", k = 10)
cmp <- predict(pp, as.data.frame(cmp)) # knnImpute also centers and scales the columns
c.
Split the dataset into test and training set.
# test train split
set.seed(0)
smp_size <- floor(0.8 * nrow(cmp))
train_ind <- sample(seq_len(nrow(cmp)), size = smp_size)
Xtrain <- cmp[train_ind, -1]
Xtest <- cmp[-train_ind, -1]
ytrain <- cmp[train_ind, 1]
ytest <- cmp[-train_ind, 1]
Now, let’s analyze the dataset for any skewness in the data points
multi.hist(Xtrain, main = '', bcol = 'blue')
The histograms of the training features show that many of them have skewed distributions.
Let’s confirm this using the skewness function
head(sort(apply(Xtrain, 2, skewness)), 8)
## ManufacturingProcess26 ManufacturingProcess25 ManufacturingProcess27
## -10.635856 -10.595888 -10.555903
## ManufacturingProcess31 ManufacturingProcess29 ManufacturingProcess39
## -10.107105 -9.556297 -7.230829
## ManufacturingProcess30 ManufacturingProcess02
## -4.947736 -2.181194
tail(sort(apply(Xtrain, 2, skewness)), 2)
## ManufacturingProcess06 ManufacturingProcess43
## 3.764174 8.986510
The skewness function confirms that several features are strongly skewed. The data will be centered and scaled before modeling, though note that centering and scaling change location and spread, not skewness.
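If we wanted to address the skewness directly, a power transformation could be folded into the preprocessing. A minimal sketch, not used for the models below (pp_yj is an illustrative name):
pp_yj <- preProcess(Xtrain, method = c("YeoJohnson", "center", "scale"))
Xtrain_yj <- predict(pp_yj, Xtrain) # transformed copy of the training predictors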
Tuning a PLS model
set.seed(0)
plsTune <- train(Xtrain, ytrain,
                 method = "pls",
                 tuneLength = 30,
                 preProc = c("center", "scale"),
                 trControl = trainControl(method = 'cv', 10))
Plot the pls model
plot(plsTune)
The optimal number of latent variables is:
lv <- which.min(plsTune$results$RMSE)
paste(lv)
## [1] "2"
According to the plot, the optimal number of latent variables is 2
R-squared value for the training data:
print( plsTune$results[lv,3])
## [1] 0.5411331
The resampled R-squared for the PLS model with 2 latent variables is 0.5411331.
d.
Predicting response on test set:
pls_pred <- predict(plsTune, Xtest)
print( cor(ytest, pls_pred) ^ 2)
## [1] 0.4985139
The test set R-squared with the PLS model is 0.4985139.
As we compare, the test set R-squared is lower than the resampled training R-squared.
e.
Let’s use varImp to analyze which predictors are most important:
plot(varImp(plsTune), top = 10)
As we see from the graph above, most important predictors are ManufacturingProcess09, ManufacturingProcess13, ManufacturingProcess32, ManufacturingProcess17, and ManufacturingProcess36.
The manufacturing process predictors are more important than the biological material predictors.
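The numeric scores behind the plot can be pulled out of the varImp object (imp is an illustrative name):
imp <- varImp(plsTune)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10) # top 10 predictors by importance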
f.
Let’s explore the relationship between each of the top predictors and the response using a correlation plot
cmp %>%
select(c('ManufacturingProcess09','ManufacturingProcess13','ManufacturingProcess32','ManufacturingProcess17','ManufacturingProcess36',
'BiologicalMaterial02', 'BiologicalMaterial03', 'BiologicalMaterial06', 'ManufacturingProcess06', 'BiologicalMaterial04', 'Yield')) %>%
cor() %>%
corrplot(method = 'circle')
As we see from the correlation plot above, the relationships between our top five predictors and yield are as follows:
ManufacturingProcess09 and yield: moderate and positive
ManufacturingProcess13 and yield: strong and negative
ManufacturingProcess32 and yield: moderate and positive
ManufacturingProcess17 and yield: moderate and negative
ManufacturingProcess36 and yield: moderate and negative
Also, the correlation plot shows substantial correlations among the predictors themselves, and this information can be useful for improving yield.
Based on this analysis, processes with negative correlations with yield can be reduced, while processes with positive correlations can be enhanced.
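To back the visual reading with numbers, the correlations of the top five processes with yield can be computed directly (top5 is an illustrative name; the signs should match the plot above):
top5 <- c('ManufacturingProcess09', 'ManufacturingProcess13', 'ManufacturingProcess32',
          'ManufacturingProcess17', 'ManufacturingProcess36')
cor(cmp[, top5], cmp$Yield)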