Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.

# Load packages
library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(elasticnet)
library(glmnet)
library(kableExtra)
library(corrplot)
library(VIM)
library(RANN)

Exercise 6.2

(a)

Start R and use these commands to load the data:

data(permeability)
summary(permeability)
##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
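A quick sanity check on the stated dimensions (a minimal sketch; both objects come from data(permeability)):

# Confirm 165 compounds and 1,107 binary predictors
dim(fingerprints)     # expected: 165 1107
length(permeability)  # expected: 165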

(b)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

# Identify and drop near-zero-variance predictors
predictors <- nearZeroVar(fingerprints)
fingerprints1 <- fingerprints[, -predictors]
ncol(fingerprints1)
## [1] 388

719 columns have been removed using the nearZeroVar function, leaving 388 predictors for modeling. Let us look at the pairwise correlations.

correlations <- cor(fingerprints1)
corrplot(correlations, order = "hclust", tl.pos = "n")  # 388 x 388 matrix; labels suppressed

We will remove highly correlated predictors using a cutoff of 0.90.

highCorr <- findCorrelation(correlations, cutoff = .9)
fingerprints1 <- fingerprints1[, -highCorr]
correlations <- cor(fingerprints1)
corrplot(correlations, order = "hclust")

(c)

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

# Add the permeability response as the last column
df <- as.data.frame(fingerprints1) %>% mutate(permeability = permeability)

#Split the data
set.seed(425)
in_train <- createDataPartition(df$permeability, times = 1, p = 0.7, list = FALSE)
train <- df[in_train, ]
test <- df[-in_train, ]
#run PLS model
set.seed(525)

pls_model <- train(permeability ~ ., data = train, method = "pls",
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 25, preProcess = c("center", "scale"))
pls_model
## Partial Least Squares 
## 
## 117 samples
## 110 predictors
## 
## Pre-processing: centered (110), scaled (110) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.49375  0.4597224   9.091507
##    2     12.24800  0.4945470   9.112055
##    3     12.36040  0.4797827   9.406771
##    4     12.16871  0.4952475   9.389245
##    5     11.89010  0.5057603   9.013300
##    6     11.94220  0.5014062   9.128401
##    7     12.16506  0.5011089   9.317626
##    8     12.47311  0.4897341   9.493437
##    9     12.79494  0.4728488   9.731745
##   10     12.96546  0.4680538   9.982773
##   11     13.31205  0.4411153  10.232183
##   12     13.79447  0.4090751  10.538288
##   13     14.20229  0.3927278  10.950167
##   14     14.41850  0.3926943  11.122089
##   15     14.56554  0.3893528  11.310679
##   16     14.76587  0.3792650  11.414448
##   17     14.95644  0.3712629  11.480098
##   18     15.15261  0.3643800  11.625230
##   19     15.33920  0.3616213  11.803363
##   20     15.25886  0.3631634  11.815923
##   21     15.32512  0.3658402  11.871329
##   22     15.43380  0.3662152  12.169136
##   23     15.56990  0.3650867  12.217720
##   24     15.69537  0.3635686  12.282632
##   25     15.98231  0.3573109  12.512090
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
plot(pls_model, main = "RMSE vs. Number of PLS Components")

pls_model$results %>%
  filter(ncomp == pls_model$bestTune$ncomp) 
##   ncomp    RMSE  Rsquared    MAE   RMSESD RsquaredSD    MAESD
## 1     5 11.8901 0.5057603 9.0133 2.380438  0.2206955 1.604525

The number of components resulting in the smallest root-mean-squared error is 5, with RMSE = 11.89, \(R^2\) = 0.51, and MAE = 9.01. The resampled estimate of \(R^2\) at the optimal number of latent variables is therefore about 0.51.

(d)

Predict the response for the test set. What is the test set estimate of \(R^2\)?

set.seed(526)
#Predictions
predictions <- predict(pls_model, test)
#Model performance 
results <- data.frame(Model = "PLS",
                      RMSE = caret::RMSE(predictions, test$permeability),
                      Rsquared = caret::R2(predictions, test$permeability),
                      MAE = caret::MAE(predictions, test$permeability))
results
##              Model     RMSE  Rsquared      MAE
## permeability   PLS 9.931766 0.4813456 7.404271

The test set estimate of \(R^2\) is 0.48.
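As a quick visual diagnostic (a sketch, not part of the original workflow), plotting observed against predicted permeability shows how the model behaves across the response range:

# Observed vs. predicted permeability on the test set
plot(test$permeability, predictions,
     xlab = "Observed permeability", ylab = "Predicted permeability",
     main = "PLS: Observed vs. Predicted (Test Set)")
abline(0, 1, lty = 2)  # reference line for perfect predictions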

(e)

Try building other models discussed in this chapter. Do any have better predictive performance?

PCR Model

set.seed(527)

pcr_model <- train(permeability ~ ., data = train, method = "pcr",
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 25, preProcess = c("center", "scale"))
pcr_model
## Principal Component Analysis 
## 
## 117 samples
## 110 predictors
## 
## Pre-processing: centered (110), scaled (110) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 105, 105, 106, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     14.53036  0.2719165  11.569587
##    2     13.77096  0.3868836  10.527510
##    3     12.77983  0.5293771   9.659629
##    4     12.67719  0.5235212   9.492291
##    5     12.57255  0.5308346   9.358080
##    6     12.80383  0.5159124   9.586236
##    7     12.73028  0.5194885   9.680261
##    8     12.61531  0.5272950   9.455674
##    9     12.84418  0.5002749   9.528027
##   10     12.94629  0.4904117   9.729650
##   11     13.00556  0.4871528   9.803775
##   12     13.23303  0.4547214  10.021566
##   13     13.49666  0.4450683  10.265017
##   14     13.20284  0.4698343   9.678929
##   15     12.97044  0.4760409   9.979021
##   16     12.75276  0.4904039   9.763029
##   17     12.92765  0.4702478   9.935250
##   18     12.88617  0.4718028   9.811889
##   19     12.93372  0.4762997   9.843803
##   20     12.70200  0.4859912   9.732408
##   21     12.53706  0.4996257   9.540718
##   22     12.38637  0.4998647   9.356427
##   23     12.09613  0.5351635   9.098953
##   24     12.07304  0.5213616   9.049872
##   25     12.43117  0.4889082   9.351105
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 24.
plot(pcr_model, main = "RMSE vs. Number of PCR Components")

set.seed(528)
#Predictions
predictions_pcr <- predict(pcr_model, test)
#Model performance 
results_pcr <- data.frame(Model = "PCR",
                      RMSE = caret::RMSE(predictions_pcr, test$permeability),
                      Rsquared = caret::R2(predictions_pcr, test$permeability),
                      MAE = caret::MAE(predictions_pcr, test$permeability))
results_pcr
##              Model    RMSE  Rsquared      MAE
## permeability   PCR 9.78416 0.4838728 7.612136

Ridge Regression

ridgeGrid <- data.frame(.lambda = seq(0, 1, length = 15))
set.seed(529)

#Ridge Method Fit
ridge_model <- train(permeability ~ ., data = train, method='ridge', metric='Rsquared',
                   tuneGrid = ridgeGrid,
                   trControl = trainControl(method = 'cv'), preProcess = c('center','scale'))
ridge_model
## Ridge Regression 
## 
## 117 samples
## 110 predictors
## 
## Pre-processing: centered (110), scaled (110) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 105, 105, 105, 106, 105, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE          Rsquared   MAE         
##   0.00000000  4.498531e+08  0.1449095  2.376653e+08
##   0.07142857  1.232739e+01  0.5178299  9.146404e+00
##   0.14285714  1.210243e+01  0.5406364  8.984906e+00
##   0.21428571  1.212827e+01  0.5476154  8.980539e+00
##   0.28571429  1.225627e+01  0.5499758  9.071299e+00
##   0.35714286  1.244352e+01  0.5504269  9.205432e+00
##   0.42857143  1.267159e+01  0.5499507  9.352651e+00
##   0.50000000  1.293066e+01  0.5489891  9.533762e+00
##   0.57142857  1.321465e+01  0.5477679  9.749646e+00
##   0.64285714  1.351938e+01  0.5464119  9.978588e+00
##   0.71428571  1.384169e+01  0.5449936  1.022930e+01
##   0.78571429  1.417913e+01  0.5435565  1.049217e+01
##   0.85714286  1.452964e+01  0.5421271  1.075541e+01
##   0.92857143  1.489147e+01  0.5407216  1.101782e+01
##   1.00000000  1.526313e+01  0.5393498  1.127768e+01
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.3571429.
plot(ridge_model, main = "Resampled R-squared vs. Lambda (Ridge)")

set.seed(530)
#Predictions
predictions_ridge <- predict(ridge_model, test)
#Model performance 
results_ridge <- data.frame(Model = "Ridge",
                      RMSE = caret::RMSE(predictions_ridge, test$permeability),
                      Rsquared = caret::R2(predictions_ridge, test$permeability),
                      MAE = caret::MAE(predictions_ridge, test$permeability))
results_ridge
##              Model     RMSE  Rsquared      MAE
## permeability Ridge 11.27315 0.4742014 8.297782

Lasso Regression

set.seed(531)
lasso_model <- train(permeability ~ ., data = train, method='lasso', metric='Rsquared', 
                   tuneGrid = data.frame(.fraction = seq(0,0.5, by=0.05)),
                   trControl = trainControl(method='cv'),
                   preProcess = c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
lasso_model
## The lasso 
## 
## 117 samples
## 110 predictors
## 
## Pre-processing: centered (110), scaled (110) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE          Rsquared   MAE         
##   0.00      1.612720e+01        NaN  1.308597e+01
##   0.05      6.297561e+05  0.2256917  3.665642e+05
##   0.10      1.209421e+06  0.2118223  6.980189e+05
##   0.15      1.750901e+06  0.2115700  1.001342e+06
##   0.20      2.293185e+06  0.2019776  1.304688e+06
##   0.25      2.835730e+06  0.1876124  1.607993e+06
##   0.30      3.378434e+06  0.1765733  1.911290e+06
##   0.35      6.441781e+06  0.1518032  3.588910e+06
##   0.40      9.834419e+06  0.1509985  5.446040e+06
##   0.45      1.320198e+07  0.1507521  7.287634e+06
##   0.50      1.652143e+07  0.1494106  9.099398e+06
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
plot(lasso_model, main = "Resampled R-squared vs. Fraction (Lasso)")

set.seed(532)
#Predictions
predictions_lasso <- predict(lasso_model, test)
#Model performance 
results_lasso <- data.frame(Model = "Lasso",
                      RMSE = caret::RMSE(predictions_lasso, test$permeability),
                      Rsquared = caret::R2(predictions_lasso, test$permeability),
                      MAE = caret::MAE(predictions_lasso, test$permeability))
results_lasso
##              Model     RMSE    Rsquared     MAE
## permeability Lasso 35991793 0.007920709 5194982

Elastic Net

set.seed(533)

elasticnet_model <- train(permeability ~ ., data = train, method ='enet', metric='Rsquared',
                        tuneGrid = expand.grid(.fraction=seq(0,1,by=0.1),
                        .lambda=seq(0,1,by=0.1)),
                        trControl=trainControl(method='cv'),
                        preProcess=c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
elasticnet_model
## Elasticnet 
## 
## 117 samples
## 110 predictors
## 
## Pre-processing: centered (110), scaled (110) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 106, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE          Rsquared   MAE         
##   0.0     0.0       1.605289e+01        NaN  1.311468e+01
##   0.0     0.1       1.068953e+16  0.2185678  6.047534e+15
##   0.0     0.2       2.137905e+16  0.1855136  1.209506e+16
##   0.0     0.3       3.206857e+16  0.1751223  1.814259e+16
##   0.0     0.4       4.275808e+16  0.1600843  2.419012e+16
##   0.0     0.5       5.344760e+16  0.1484753  3.023765e+16
##   0.0     0.6       6.413712e+16  0.1397991  3.628518e+16
##   0.0     0.7       7.482664e+16  0.1347267  4.233272e+16
##   0.0     0.8       8.551615e+16  0.1318993  4.838025e+16
##   0.0     0.9       9.620567e+16  0.1302927  5.442778e+16
##   0.0     1.0       1.068952e+17  0.1293711  6.047531e+16
##   0.1     0.0       1.605289e+01        NaN  1.311468e+01
##   0.1     0.1       1.209484e+01  0.5003136  9.232768e+00
##   0.1     0.2       1.167021e+01  0.5207194  8.424887e+00
##   0.1     0.3       1.172327e+01  0.5214684  8.532537e+00
##   0.1     0.4       1.174074e+01  0.5220290  8.565115e+00
##   0.1     0.5       1.189062e+01  0.5137933  8.596010e+00
##   0.1     0.6       1.203865e+01  0.5067960  8.728549e+00
##   0.1     0.7       1.213169e+01  0.5047378  8.822719e+00
##   0.1     0.8       1.228958e+01  0.5018736  8.985004e+00
##   0.1     0.9       1.247119e+01  0.4980152  9.124576e+00
##   0.1     1.0       1.267160e+01  0.4923753  9.284064e+00
##   0.2     0.0       1.605289e+01        NaN  1.311468e+01
##   0.2     0.1       1.221857e+01  0.5010274  9.533200e+00
##   0.2     0.2       1.171156e+01  0.5162873  8.369793e+00
##   0.2     0.3       1.169990e+01  0.5247079  8.370508e+00
##   0.2     0.4       1.170722e+01  0.5314202  8.412648e+00
##   0.2     0.5       1.178959e+01  0.5294089  8.497834e+00
##   0.2     0.6       1.195086e+01  0.5232267  8.610123e+00
##   0.2     0.7       1.205052e+01  0.5225170  8.715108e+00
##   0.2     0.8       1.217582e+01  0.5214250  8.836916e+00
##   0.2     0.9       1.233128e+01  0.5190818  9.000577e+00
##   0.2     1.0       1.247769e+01  0.5164552  9.169102e+00
##   0.3     0.0       1.605289e+01        NaN  1.311468e+01
##   0.3     0.1       1.234262e+01  0.5012940  9.724087e+00
##   0.3     0.2       1.171478e+01  0.5162028  8.331104e+00
##   0.3     0.3       1.170155e+01  0.5282976  8.241005e+00
##   0.3     0.4       1.180700e+01  0.5330533  8.340881e+00
##   0.3     0.5       1.189075e+01  0.5339338  8.452902e+00
##   0.3     0.6       1.204760e+01  0.5308615  8.587209e+00
##   0.3     0.7       1.216317e+01  0.5309007  8.702374e+00
##   0.3     0.8       1.228967e+01  0.5310289  8.836379e+00
##   0.3     0.9       1.243380e+01  0.5299548  9.027190e+00
##   0.3     1.0       1.257521e+01  0.5284496  9.200748e+00
##   0.4     0.0       1.605289e+01        NaN  1.311468e+01
##   0.4     0.1       1.240368e+01  0.5009499  9.816886e+00
##   0.4     0.2       1.172083e+01  0.5153425  8.278039e+00
##   0.4     0.3       1.174592e+01  0.5296434  8.161662e+00
##   0.4     0.4       1.194355e+01  0.5333826  8.325030e+00
##   0.4     0.5       1.208066e+01  0.5346198  8.470189e+00
##   0.4     0.6       1.223528e+01  0.5342476  8.613265e+00
##   0.4     0.7       1.237649e+01  0.5355645  8.741814e+00
##   0.4     0.8       1.250892e+01  0.5365687  8.923495e+00
##   0.4     0.9       1.265996e+01  0.5362266  9.127898e+00
##   0.4     1.0       1.281559e+01  0.5351700  9.316409e+00
##   0.5     0.0       1.605289e+01        NaN  1.311468e+01
##   0.5     0.1       1.243088e+01  0.5005166  9.858802e+00
##   0.5     0.2       1.175785e+01  0.5131168  8.240715e+00
##   0.5     0.3       1.183430e+01  0.5293797  8.110441e+00
##   0.5     0.4       1.210811e+01  0.5333377  8.360932e+00
##   0.5     0.5       1.232607e+01  0.5340980  8.555698e+00
##   0.5     0.6       1.248934e+01  0.5357311  8.686468e+00
##   0.5     0.7       1.265149e+01  0.5380442  8.858571e+00
##   0.5     0.8       1.280805e+01  0.5394694  9.104317e+00
##   0.5     0.9       1.296726e+01  0.5402103  9.329111e+00
##   0.5     1.0       1.314822e+01  0.5390668  9.571011e+00
##   0.6     0.0       1.605289e+01        NaN  1.311468e+01
##   0.6     0.1       1.243886e+01  0.5000112  9.872817e+00
##   0.6     0.2       1.179800e+01  0.5112173  8.207406e+00
##   0.6     0.3       1.194946e+01  0.5285314  8.087585e+00
##   0.6     0.4       1.230161e+01  0.5329802  8.412493e+00
##   0.6     0.5       1.260468e+01  0.5331262  8.658001e+00
##   0.6     0.6       1.278347e+01  0.5363459  8.795060e+00
##   0.6     0.7       1.297368e+01  0.5393004  9.047542e+00
##   0.6     0.8       1.316466e+01  0.5410254  9.369195e+00
##   0.6     0.9       1.334632e+01  0.5422274  9.637833e+00
##   0.6     1.0       1.354741e+01  0.5412955  9.897148e+00
##   0.7     0.0       1.605289e+01        NaN  1.311468e+01
##   0.7     0.1       1.243609e+01  0.4995259  9.869835e+00
##   0.7     0.2       1.183484e+01  0.5097874  8.162437e+00
##   0.7     0.3       1.208311e+01  0.5276243  8.088323e+00
##   0.7     0.4       1.251388e+01  0.5323363  8.476564e+00
##   0.7     0.5       1.289290e+01  0.5326997  8.757181e+00
##   0.7     0.6       1.310979e+01  0.5364246  8.943365e+00
##   0.7     0.7       1.333937e+01  0.5396325  9.277415e+00
##   0.7     0.8       1.357027e+01  0.5415626  9.661536e+00
##   0.7     0.9       1.377747e+01  0.5429873  9.964952e+00
##   0.7     1.0       1.399695e+01  0.5424790  1.024426e+01
##   0.8     0.0       1.605289e+01        NaN  1.311468e+01
##   0.8     0.1       1.242505e+01  0.4990176  9.856433e+00
##   0.8     0.2       1.187427e+01  0.5084516  8.114485e+00
##   0.8     0.3       1.223352e+01  0.5267246  8.099047e+00
##   0.8     0.4       1.274079e+01  0.5317895  8.552355e+00
##   0.8     0.5       1.319213e+01  0.5323682  8.899139e+00
##   0.8     0.6       1.347109e+01  0.5359483  9.143150e+00
##   0.8     0.7       1.374226e+01  0.5391840  9.547237e+00
##   0.8     0.8       1.401282e+01  0.5413733  9.976749e+00
##   0.8     0.9       1.424731e+01  0.5429638  1.033446e+01
##   0.8     1.0       1.448535e+01  0.5429824  1.064620e+01
##   0.9     0.0       1.605289e+01        NaN  1.311468e+01
##   0.9     0.1       1.241084e+01  0.4984631  9.837369e+00
##   0.9     0.2       1.191265e+01  0.5076839  8.066954e+00
##   0.9     0.3       1.239418e+01  0.5257997  8.125098e+00
##   0.9     0.4       1.297862e+01  0.5310715  8.654536e+00
##   0.9     0.5       1.350613e+01  0.5317670  9.105192e+00
##   0.9     0.6       1.386112e+01  0.5350888  9.454870e+00
##   0.9     0.7       1.417284e+01  0.5384760  9.873698e+00
##   0.9     0.8       1.448001e+01  0.5408636  1.032868e+01
##   0.9     0.9       1.474556e+01  0.5425441  1.072039e+01
##   0.9     1.0       1.500394e+01  0.5430317  1.104976e+01
##   1.0     0.0       1.605289e+01        NaN  1.311468e+01
##   1.0     0.1       1.239543e+01  0.4979113  9.813566e+00
##   1.0     0.2       1.195463e+01  0.5072525  8.019172e+00
##   1.0     0.3       1.256500e+01  0.5249396  8.175630e+00
##   1.0     0.4       1.322902e+01  0.5302971  8.782117e+00
##   1.0     0.5       1.384047e+01  0.5309961  9.369314e+00
##   1.0     0.6       1.427143e+01  0.5340739  9.791762e+00
##   1.0     0.7       1.462591e+01  0.5375252  1.025180e+01
##   1.0     0.8       1.496755e+01  0.5401210  1.072109e+01
##   1.0     0.9       1.526609e+01  0.5419639  1.113243e+01
##   1.0     1.0       1.554593e+01  0.5427731  1.146808e+01
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 1 and lambda = 0.9.
plot(elasticnet_model, main = "Resampled R-squared (Elastic Net)")

set.seed(534)
#Predictions
predictions_elasticnet <- predict(elasticnet_model, test)
#Model performance 
results_elasticnet <- data.frame(Model = "Elastic Net",
                      RMSE = caret::RMSE(predictions_elasticnet, test$permeability),
                      Rsquared = caret::R2(predictions_elasticnet, test$permeability),
                      MAE = caret::MAE(predictions_elasticnet, test$permeability))
results_elasticnet
##                    Model     RMSE Rsquared      MAE
## permeability Elastic Net 13.60161 0.473866 10.31296

Comparison of the models

compmod <- rbind(results, results_pcr, results_ridge, results_lasso, results_elasticnet)
compmod %>% kable(row.names = F) %>% kable_styling(full_width = FALSE)
Model        RMSE          Rsquared   MAE
PLS          9.931766      0.4813456  7.404271
PCR          9.784160      0.4838728  7.612136
Ridge        11.273150     0.4742014  8.297782
Lasso        3.599179e+07  0.0079207  5.194982e+06
Elastic Net  13.601610     0.4738660  10.312960

From the table, PLS and PCR clearly outperform the penalized models. PCR has a slightly lower test RMSE (9.78 vs. 9.93) and a marginally higher \(R^2\), while PLS has the lowest MAE (7.40). The lasso fit is numerically unstable here (note its enormous RMSE), so it is not competitive. Because PLS reaches essentially the same performance with only 5 latent variables, versus 24 components for PCR, PLS is the better model to choose here.
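As a complementary check, caret's resamples() can compare the cross-validated performance of the fitted models directly. A minimal sketch follows; note that a properly paired comparison assumes all models were trained on identical resampling folds (here each model used its own seed, so the fold assignments differ):

# Sketch: compare resampled performance across the fitted models
resamps <- resamples(list(PLS = pls_model, PCR = pcr_model, Ridge = ridge_model,
                          Lasso = lasso_model, ENet = elasticnet_model))
summary(resamps)                      # resampled RMSE / R-squared / MAE summaries
bwplot(resamps, metric = "Rsquared")  # box plots of resampled R-squared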

(f)

Would you recommend any of your models to replace the permeability laboratory experiment?

No. Even the best model explains less than half of the test-set variability in permeability (\(R^2\) of roughly 0.48), which is too low for a physical process, so I would not recommend replacing the permeability laboratory experiment with any of these models.

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, the manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a)

Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

library(Amelia)
missmap(ChemicalManufacturingProcess)

The missing values make up only a small percentage of the cells and appear to be missing at random. I will use k-nearest neighbors (kNN) to impute them.

pre_process <-preProcess(ChemicalManufacturingProcess[, -c(1)], method = "knnImpute")
chemical_imp <- predict(pre_process, ChemicalManufacturingProcess[, -c(1)])
# Count remaining missing values after imputation
sum(is.na(chemical_imp))
## [1] 0
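Since the VIM package is already loaded, its kNN() imputer is an alternative that returns the completed data directly. A sketch, assuming the default k = 5 (imp_var = FALSE suppresses the extra indicator columns VIM otherwise appends):

# Alternative: VIM's kNN imputation on the predictors (column 1 is Yield)
chemical_vim <- VIM::kNN(as.data.frame(ChemicalManufacturingProcess[, -1]),
                         k = 5, imp_var = FALSE)
sum(is.na(chemical_vim))  # should be 0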

(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

It is better to filter out highly correlated predictor pairs first; we use a cutoff of 0.90.

correlations <- cor(chemical_imp)
highCorr <- findCorrelation(correlations, cutoff = .9)
chemical_imp <- chemical_imp[, -highCorr]

We will center and scale the data. (A Box-Cox transformation is requested as well, but knnImpute has already standardized the predictors, so no variables qualify and only centering and scaling are applied, as the output below confirms.)

(pre.process = preProcess(chemical_imp, method = c("BoxCox", "center", "scale")))
## Created from 176 samples and 47 variables
## 
## Pre-processing:
##   - centered (47)
##   - ignored (0)
##   - scaled (47)

Split the data

set.seed(675)
#Splitting data into training and test datasets
splitt <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.7, list=FALSE)
X_train <- chemical_imp[splitt, ]
y_train <- ChemicalManufacturingProcess$Yield[splitt]

X_test <- chemical_imp[-splitt, ]
y_test <- ChemicalManufacturingProcess$Yield[-splitt]

We will use the PLS model here.

model_pls <- train(X_train, y_train, method='pls', metric='RMSE',
                   tuneLength=20, trControl = trainControl(method='cv'))             

model_pls
## Partial Least Squares 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 111, 112, 111, 112, 111, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.287871  0.5025434  1.0625987
##    2     1.199313  0.5513777  0.9759255
##    3     1.188637  0.5731375  0.9588816
##    4     1.186537  0.5800785  0.9644993
##    5     1.200505  0.5753360  0.9697254
##    6     1.222620  0.5638443  0.9804865
##    7     1.247016  0.5588769  1.0110444
##    8     1.290894  0.5437929  1.0572265
##    9     1.333730  0.5307537  1.0891686
##   10     1.378266  0.5198481  1.1076451
##   11     1.412787  0.5103675  1.1317620
##   12     1.450614  0.4995396  1.1558610
##   13     1.475706  0.5014033  1.1639757
##   14     1.505213  0.4957886  1.1723211
##   15     1.531410  0.4934102  1.1786340
##   16     1.554586  0.4913442  1.1898845
##   17     1.587931  0.4859518  1.2041464
##   18     1.635204  0.4805996  1.2266924
##   19     1.703957  0.4761138  1.2556180
##   20     1.814303  0.4688571  1.3096496
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
plot(model_pls)

The optimal value of ncomp is 4, which gives the lowest RMSE (1.187) and an \(R^2\) of 0.58.

(d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

pls_predict <- predict(model_pls, X_test)
postResample(pls_predict, y_test)
##      RMSE  Rsquared       MAE 
## 1.2094706 0.5892557 0.9925435

The test-set RMSE (1.209) is slightly higher than the resampled estimate (1.187), while \(R^2\) (0.589) is essentially unchanged, so the model generalizes well to the held-out data.
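For a direct side-by-side view of the resampled and test-set metrics (a small sketch reusing the objects above):

# Compare CV metrics at the chosen ncomp with the test-set metrics
cv_row <- model_pls$results %>% filter(ncomp == model_pls$bestTune$ncomp)
rbind(CV   = c(RMSE = cv_row$RMSE, Rsquared = cv_row$Rsquared, MAE = cv_row$MAE),
      Test = postResample(pls_predict, y_test))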

(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

plot(varImp(model_pls), top = 10)

Manufacturing process predictors dominate the top 10 in variable importance, with a few biological materials also appearing. The top two are ManufacturingProcess32 and ManufacturingProcess13.
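The same ranking can be read off as a table (a sketch; varImp() on a train object returns an importance data frame scaled 0-100):

# Top 10 predictors by scaled variable importance
imp <- varImp(model_pls)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)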

(f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

model_pls$finalModel$coefficients
## , , 1 comps
## 
##                            .outcome
## BiologicalMaterial01    0.100044110
## BiologicalMaterial03    0.117074112
## BiologicalMaterial05    0.040799770
## BiologicalMaterial06    0.128834635
## BiologicalMaterial07    0.000000000
## BiologicalMaterial08    0.114337069
## BiologicalMaterial09    0.030432658
## BiologicalMaterial10    0.061667893
## BiologicalMaterial11    0.107852468
## ManufacturingProcess01 -0.036455446
## ManufacturingProcess02 -0.053011666
## ManufacturingProcess03 -0.023009466
## ManufacturingProcess04 -0.087867874
## ManufacturingProcess05  0.028525895
## ManufacturingProcess06  0.101885833
## ManufacturingProcess07  0.002408217
## ManufacturingProcess08  0.018957463
## ManufacturingProcess09  0.121702177
## ManufacturingProcess10  0.071796874
## ManufacturingProcess11  0.080990064
## ManufacturingProcess12  0.103662554
## ManufacturingProcess13 -0.138025156
## ManufacturingProcess14 -0.028408217
## ManufacturingProcess15  0.028122045
## ManufacturingProcess16 -0.015900173
## ManufacturingProcess17 -0.096513371
## ManufacturingProcess19  0.037533864
## ManufacturingProcess20 -0.021440596
## ManufacturingProcess21  0.025267641
## ManufacturingProcess22  0.001851531
## ManufacturingProcess23 -0.037334823
## ManufacturingProcess24 -0.061518814
## ManufacturingProcess26  0.011409920
## ManufacturingProcess28  0.076658387
## ManufacturingProcess30  0.050915236
## ManufacturingProcess32  0.152264786
## ManufacturingProcess33  0.097762636
## ManufacturingProcess34  0.065201061
## ManufacturingProcess35 -0.027222280
## ManufacturingProcess36 -0.119421577
## ManufacturingProcess37 -0.025731448
## ManufacturingProcess38 -0.016182959
## ManufacturingProcess39  0.002845507
## ManufacturingProcess41 -0.014353698
## ManufacturingProcess43  0.049582863
## ManufacturingProcess44  0.006956481
## ManufacturingProcess45  0.006259375
## 
## , , 2 comps
## 
##                            .outcome
## BiologicalMaterial01    0.047351099
## BiologicalMaterial03    0.124763334
## BiologicalMaterial05    0.019550329
## BiologicalMaterial06    0.117951499
## BiologicalMaterial07    0.000000000
## BiologicalMaterial08    0.073754046
## BiologicalMaterial09   -0.003844060
## BiologicalMaterial10   -0.007022295
## BiologicalMaterial11    0.074751085
## ManufacturingProcess01  0.006098761
## ManufacturingProcess02  0.019246222
## ManufacturingProcess03 -0.039744769
## ManufacturingProcess04 -0.059907697
## ManufacturingProcess05  0.003770578
## ManufacturingProcess06  0.137628637
## ManufacturingProcess07  0.012898871
## ManufacturingProcess08  0.064643104
## ManufacturingProcess09  0.219879504
## ManufacturingProcess10  0.098863980
## ManufacturingProcess11  0.116494989
## ManufacturingProcess12  0.167363514
## ManufacturingProcess13 -0.264810622
## ManufacturingProcess14 -0.061211261
## ManufacturingProcess15  0.015817166
## ManufacturingProcess16 -0.014537542
## ManufacturingProcess17 -0.231120929
## ManufacturingProcess19  0.011323635
## ManufacturingProcess20  0.012665854
## ManufacturingProcess21 -0.025221306
## ManufacturingProcess22  0.041494556
## ManufacturingProcess23 -0.019856992
## ManufacturingProcess24 -0.066237188
## ManufacturingProcess26 -0.006392833
## ManufacturingProcess28  0.020537751
## ManufacturingProcess30  0.066578397
## ManufacturingProcess32  0.219260350
## ManufacturingProcess33  0.100903841
## ManufacturingProcess34  0.161186184
## ManufacturingProcess35 -0.022006379
## ManufacturingProcess36 -0.152897693
## ManufacturingProcess37 -0.091013809
## ManufacturingProcess38 -0.013093473
## ManufacturingProcess39  0.035821571
## ManufacturingProcess41 -0.038435744
## ManufacturingProcess43  0.057609380
## ManufacturingProcess44  0.036834531
## ManufacturingProcess45  0.046146908
## 
## , , 3 comps
## 
##                            .outcome
## BiologicalMaterial01    0.016572795
## BiologicalMaterial03    0.150083196
## BiologicalMaterial05    0.043454468
## BiologicalMaterial06    0.140454724
## BiologicalMaterial07    0.000000000
## BiologicalMaterial08    0.061453931
## BiologicalMaterial09   -0.057811504
## BiologicalMaterial10   -0.048016749
## BiologicalMaterial11    0.079335601
## ManufacturingProcess01  0.043354791
## ManufacturingProcess02  0.048686569
## ManufacturingProcess03 -0.015350798
## ManufacturingProcess04  0.024607847
## ManufacturingProcess05  0.005939617
## ManufacturingProcess06  0.101086717
## ManufacturingProcess07 -0.015277141
## ManufacturingProcess08  0.101060679
## ManufacturingProcess09  0.247814706
## ManufacturingProcess10  0.043881434
## ManufacturingProcess11  0.059989215
## ManufacturingProcess12  0.133617942
## ManufacturingProcess13 -0.300641487
## ManufacturingProcess14  0.009633348
## ManufacturingProcess15  0.101229928
## ManufacturingProcess16 -0.017147063
## ManufacturingProcess17 -0.289032251
## ManufacturingProcess19  0.110816434
## ManufacturingProcess20  0.084072234
## ManufacturingProcess21 -0.071356748
## ManufacturingProcess22  0.098172284
## ManufacturingProcess23 -0.008934295
## ManufacturingProcess24 -0.067210642
## ManufacturingProcess26 -0.008322115
## ManufacturingProcess28 -0.058122914
## ManufacturingProcess30  0.023832260
## ManufacturingProcess32  0.363607345
## ManufacturingProcess33  0.160931887
## ManufacturingProcess34  0.254342163
## ManufacturingProcess35 -0.024596007
## ManufacturingProcess36 -0.229737554
## ManufacturingProcess37 -0.180972255
## ManufacturingProcess38 -0.026622142
## ManufacturingProcess39  0.101741214
## ManufacturingProcess41 -0.054764475
## ManufacturingProcess43  0.051824888
## ManufacturingProcess44  0.081864819
## ManufacturingProcess45  0.129008785
## 
## , , 4 comps
## 
##                            .outcome
## BiologicalMaterial01    0.004936713
## BiologicalMaterial03    0.143246348
## BiologicalMaterial05    0.062738627
## BiologicalMaterial06    0.152371830
## BiologicalMaterial07    0.000000000
## BiologicalMaterial08    0.063700161
## BiologicalMaterial09   -0.128720687
## BiologicalMaterial10   -0.076435291
## BiologicalMaterial11    0.056360030
## ManufacturingProcess01  0.085843807
## ManufacturingProcess02  0.042783747
## ManufacturingProcess03  0.021176812
## ManufacturingProcess04  0.131433333
## ManufacturingProcess05  0.048267317
## ManufacturingProcess06  0.059392645
## ManufacturingProcess07 -0.079238893
## ManufacturingProcess08  0.091052719
## ManufacturingProcess09  0.309875891
## ManufacturingProcess10  0.086401272
## ManufacturingProcess11  0.052338856
## ManufacturingProcess12  0.060362044
## ManufacturingProcess13 -0.352444434
## ManufacturingProcess14 -0.007647458
## ManufacturingProcess15  0.126463685
## ManufacturingProcess16 -0.050849753
## ManufacturingProcess17 -0.299693048
## ManufacturingProcess19  0.212194069
## ManufacturingProcess20  0.099737541
## ManufacturingProcess21 -0.020877521
## ManufacturingProcess22  0.081721654
## ManufacturingProcess23 -0.050317947
## ManufacturingProcess24 -0.045885155
## ManufacturingProcess26  0.022289091
## ManufacturingProcess28 -0.131814818
## ManufacturingProcess30  0.021207184
## ManufacturingProcess32  0.444560183
## ManufacturingProcess33  0.160418879
## ManufacturingProcess34  0.325246102
## ManufacturingProcess35 -0.030904398
## ManufacturingProcess36 -0.245350049
## ManufacturingProcess37 -0.209494711
## ManufacturingProcess38 -0.091421304
## ManufacturingProcess39  0.137449887
## ManufacturingProcess41 -0.066613707
## ManufacturingProcess43  0.064631344
## ManufacturingProcess44  0.064103457
## ManufacturingProcess45  0.157706627

According to the 4-component coefficients above, ManufacturingProcess32 has the largest positive coefficient and ManufacturingProcess13 the largest negative one.

ManufacturingProcess32 has the strongest positive association with yield of any predictor in the model. Its coefficient of 0.445 means that yield increases by 0.445 units for every one-unit increase in ManufacturingProcess32 on the centered-and-scaled predictor scale.

cor(ChemicalManufacturingProcess$Yield,
    ChemicalManufacturingProcess$ManufacturingProcess32)
## [1] 0.6083321

Among the negative coefficients, ManufacturingProcess13 has the strongest effect. Its coefficient of -0.35 means that yield decreases by 0.35 units for every one-unit increase in ManufacturingProcess13 (again on the scaled predictor scale).

cor(ChemicalManufacturingProcess$Yield,
    ChemicalManufacturingProcess$ManufacturingProcess13)
## [1] -0.5036797
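
Scatter plots of yield against these two dominant predictors make the relationships concrete (a sketch using base graphics on the raw, unimputed data):

# Yield vs. the two most influential process predictors
par(mfrow = c(1, 2))
plot(ChemicalManufacturingProcess$ManufacturingProcess32,
     ChemicalManufacturingProcess$Yield,
     xlab = "ManufacturingProcess32", ylab = "Yield")
plot(ChemicalManufacturingProcess$ManufacturingProcess13,
     ChemicalManufacturingProcess$Yield,
     xlab = "ManufacturingProcess13", ylab = "Yield")
par(mfrow = c(1, 1))

In future runs, increasing ManufacturingProcess32 and reducing ManufacturingProcess13 (within the ranges the process allows) are the natural levers suggested by the model for improving yield, since both are controllable process variables rather than fixed biological inputs.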