Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

# Packages used across these exercises
library(dplyr)
library(varImp)
library(elasticnet)
  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
str(permeability)
##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

  2. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:varImp':
## 
##     varImp
## The following objects are masked from 'package:measures':
## 
##     MAE, RMSE
dim(fingerprints)
## [1]  165 1107

Before filtering, there are 1,107 predictors.

fp <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fp)
## [1] 165 388

After applying the near-zero-variance filter, 719 sparse columns are removed and 388 predictors remain for modeling.
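To see why columns were dropped, nearZeroVar can also return its diagnostics; a minimal sketch, assuming caret's default freqCut = 95/5 and uniqueCut = 10 thresholds:

# Inspect the frequency-ratio and unique-value diagnostics behind the filter
nzv <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv[nzv$nzv, ])  # a few of the 719 flagged columns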

  3. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

Split the data and build the PLS model:

set.seed(1975)
trainingRows <- createDataPartition(permeability, p = .80, list= FALSE)
x_train <- fp[trainingRows, ]
y_train <- permeability[trainingRows]
x_test <- fp[-trainingRows, ]
y_test <- permeability[-trainingRows] 
Pls_Fit <- train(x=x_train,
                y=y_train, 
                method='pls',
                metric='Rsquared',
                tuneLength=20,
                trControl=trainControl(method='cv'),
                preProcess=c('center', 'scale')
                )
Pls_Result <- Pls_Fit$results
Pls_Fit
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 119, 120, 118, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.82437  0.2838024   9.717487
##    2     11.74004  0.4121155   8.380243
##    3     11.75878  0.4183513   8.819616
##    4     11.72542  0.4267349   8.822729
##    5     11.41756  0.4501351   8.535854
##    6     11.36243  0.4558707   8.447026
##    7     11.42753  0.4565745   8.602487
##    8     11.38306  0.4594429   8.520737
##    9     11.43938  0.4652513   8.557379
##   10     11.59106  0.4600797   8.684356
##   11     11.78819  0.4460928   8.779366
##   12     11.79820  0.4460045   8.931621
##   13     12.05356  0.4299189   9.075983
##   14     12.33810  0.4165925   9.265023
##   15     12.66426  0.3920415   9.555767
##   16     13.04760  0.3737891   9.696449
##   17     13.35336  0.3700379   9.785863
##   18     13.56518  0.3693013   9.896085
##   19     13.85977  0.3560775  10.031240
##   20     13.97978  0.3618402  10.046974
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.

The optimal ncomp value is 9, with a resampled R2 of 0.4652513.
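The same numbers can be read programmatically from the stored results table (Pls_Result, created above) rather than off the printout:

# Row of the resampling table with the largest cross-validated R^2
Pls_Result[which.max(Pls_Result$Rsquared), c('ncomp', 'RMSE', 'Rsquared')]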

plot(Pls_Fit, col ="blue")

  4. Predict the response for the test set. What is the test set estimate of R2?
plsPred <- predict(Pls_Fit, newdata=x_test)
postResample(pred=plsPred, obs=y_test)
##       RMSE   Rsquared        MAE 
## 10.6879771  0.6455296  7.9698362

The R2 for the test set predictions is 0.6455296.

  5. Try building other models discussed in this chapter. Do any have better predictive performance?

Build a ridge regression model, tuning lambda from 0 to 1 by 0.1:

set.seed(1978)
ridgeFit <- train(x=x_train,
                  y=y_train,
                  method='ridge',
                  metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
ridgeFit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 120, 120, 118, 119, 121, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     13.30262  0.3699179   9.717486
##   0.1     12.51231  0.3946092   9.073459
##   0.2     12.51692  0.4144288   9.160555
##   0.3     12.70677  0.4281878   9.399541
##   0.4     12.99653  0.4383417   9.732002
##   0.5     13.35574  0.4462718  10.070762
##   0.6     13.77281  0.4525722  10.449211
##   0.7     14.23676  0.4577417  10.843842
##   0.8     14.74042  0.4620608  11.282368
##   0.9     15.28080  0.4656637  11.735541
##   1.0     15.84399  0.4689070  12.184461
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
plot(ridgeFit, col = "blue")

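Note that lambda = 1 was selected at the boundary of the tuning grid, so R2 may still be improving beyond it; a sketch of a follow-up fit with a wider grid (the seq(0, 2, by = 0.2) range is an arbitrary choice):

# Hypothetical follow-up: extend the lambda grid past the selected boundary value
set.seed(1978)
ridgeFit2 <- train(x=x_train,
                   y=y_train,
                   method='ridge',
                   metric='Rsquared',
                   tuneGrid=data.frame(.lambda = seq(0, 2, by=0.2)),
                   trControl=trainControl(method='cv'),
                   preProcess=c('center','scale')
                   )
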
Build a lasso model, tuning fraction from 0 to 0.5 by 0.05:

set.seed(1979)
lassoFit <- train(x=x_train,
                  y=y_train,
                  method='lasso',
                  metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
lassoFit
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 120, 120, 119, 121, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      15.30333        NaN  12.101925
##   0.05      12.72322  0.4513424   9.198406
##   0.10      12.45164  0.4791129   8.533772
##   0.15      12.44457  0.4678709   8.615643
##   0.20      12.40386  0.4583045   8.515930
##   0.25      12.38774  0.4502046   8.539393
##   0.30      12.32033  0.4561066   8.509060
##   0.35      12.31641  0.4478454   8.580255
##   0.40      12.49048  0.4265954   8.774272
##   0.45      12.76620  0.4017907   8.951147
##   0.50      12.94915  0.3869022   9.029649
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.1.
plot(lassoFit, col = "blue")

Build an elastic net model, tuning fraction and lambda over a two-dimensional grid with each dimension running from 0 to 1 by 0.1:

set.seed(1980)
enetFit <- train(x=x_train,
                 y=y_train,
                 method='enet',
                 metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale')
                  )
enetFit
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 120, 121, 121, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.0     0.0       14.92814        NaN  11.809472
##   0.0     0.1       12.06302  0.3868374   8.709104
##   0.0     0.2       12.51837  0.3755235   9.059791
##   0.0     0.3       12.58383  0.3840898   9.136017
##   0.0     0.4       12.46541  0.3955282   8.993267
##   0.0     0.5       12.47292  0.4053663   8.981291
##   0.0     0.6       12.77650  0.4028777   9.114889
##   0.0     0.7       13.23335  0.3963114   9.410939
##   0.0     0.8       13.79899  0.3825910   9.795211
##   0.0     0.9       14.24226  0.3746923  10.105872
##   0.0     1.0       14.66576  0.3672121  10.353615
##   0.1     0.0       14.92814        NaN  11.809472
##   0.1     0.1       11.82105  0.4070263   8.352050
##   0.1     0.2       11.77844  0.4072313   8.614019
##   0.1     0.3       11.63624  0.4244399   8.539275
##   0.1     0.4       11.69945  0.4265356   8.455723
##   0.1     0.5       11.80332  0.4293193   8.333748
##   0.1     0.6       11.93407  0.4289248   8.382518
##   0.1     0.7       12.02487  0.4308984   8.544490
##   0.1     0.8       12.12739  0.4322087   8.699894
##   0.1     0.9       12.23057  0.4332321   8.809751
##   0.1     1.0       12.35601  0.4309401   8.906825
##   0.2     0.0       14.92814        NaN  11.809472
##   0.2     0.1       11.81124  0.4111448   8.399745
##   0.2     0.2       11.85087  0.4120686   8.602259
##   0.2     0.3       11.70084  0.4310323   8.624337
##   0.2     0.4       11.71323  0.4360912   8.547983
##   0.2     0.5       11.78824  0.4395082   8.438153
##   0.2     0.6       11.86567  0.4447895   8.473867
##   0.2     0.7       11.99120  0.4452297   8.663189
##   0.2     0.8       12.09507  0.4467855   8.800416
##   0.2     0.9       12.18409  0.4491048   8.922343
##   0.2     1.0       12.27557  0.4498422   9.033042
##   0.3     0.0       14.92814        NaN  11.809472
##   0.3     0.1       11.81084  0.4115808   8.409769
##   0.3     0.2       11.91123  0.4172757   8.543041
##   0.3     0.3       11.82585  0.4344429   8.715412
##   0.3     0.4       11.81445  0.4435193   8.664466
##   0.3     0.5       11.90026  0.4464677   8.626396
##   0.3     0.6       11.98878  0.4516578   8.664102
##   0.3     0.7       12.11464  0.4540042   8.854863
##   0.3     0.8       12.26144  0.4540805   9.060447
##   0.3     0.9       12.36255  0.4564871   9.200626
##   0.3     1.0       12.45419  0.4587318   9.309379
##   0.4     0.0       14.92814        NaN  11.809472
##   0.4     0.1       11.81583  0.4107636   8.409205
##   0.4     0.2       11.98713  0.4209486   8.528233
##   0.4     0.3       11.99155  0.4360406   8.803195
##   0.4     0.4       12.00741  0.4467751   8.825949
##   0.4     0.5       12.10092  0.4506162   8.832692
##   0.4     0.6       12.21504  0.4553178   8.907371
##   0.4     0.7       12.35393  0.4583580   9.093223
##   0.4     0.8       12.52861  0.4585143   9.327192
##   0.4     0.9       12.64886  0.4612142   9.503352
##   0.4     1.0       12.75461  0.4640106   9.642399
##   0.5     0.0       14.92814        NaN  11.809472
##   0.5     0.1       11.82933  0.4090958   8.390258
##   0.5     0.2       12.09390  0.4214220   8.519929
##   0.5     0.3       12.19263  0.4370645   8.923711
##   0.5     0.4       12.24304  0.4488127   8.993902
##   0.5     0.5       12.35957  0.4536972   9.070980
##   0.5     0.6       12.50942  0.4576811   9.182375
##   0.5     0.7       12.67240  0.4607841   9.365318
##   0.5     0.8       12.86390  0.4617598   9.652265
##   0.5     0.9       13.00840  0.4644079   9.862298
##   0.5     1.0       13.13305  0.4676140  10.035603
##   0.6     0.0       14.92814        NaN  11.809472
##   0.6     0.1       11.85094  0.4071436   8.371064
##   0.6     0.2       12.20654  0.4219231   8.534682
##   0.6     0.3       12.41786  0.4378925   9.054915
##   0.6     0.4       12.52176  0.4501736   9.201354
##   0.6     0.5       12.66871  0.4559463   9.357358
##   0.6     0.6       12.85748  0.4594138   9.502046
##   0.6     0.7       13.05411  0.4623182   9.725584
##   0.6     0.8       13.25974  0.4641673  10.025985
##   0.6     0.9       13.42804  0.4668672  10.276843
##   0.6     1.0       13.57348  0.4702076  10.482250
##   0.7     0.0       14.92814        NaN  11.809472
##   0.7     0.1       11.88425  0.4048350   8.367710
##   0.7     0.2       12.33872  0.4221091   8.574851
##   0.7     0.3       12.66816  0.4385333   9.204526
##   0.7     0.4       12.83923  0.4511248   9.432889
##   0.7     0.5       13.02277  0.4572653   9.650004
##   0.7     0.6       13.24790  0.4607659   9.833800
##   0.7     0.7       13.48167  0.4634726  10.101674
##   0.7     0.8       13.70447  0.4659228  10.430506
##   0.7     0.9       13.89593  0.4687864  10.715262
##   0.7     1.0       14.06326  0.4721350  10.935175
##   0.8     0.0       14.92814        NaN  11.809472
##   0.8     0.1       11.92548  0.4024975   8.375759
##   0.8     0.2       12.48844  0.4223267   8.621213
##   0.8     0.3       12.94072  0.4390992   9.362509
##   0.8     0.4       13.19033  0.4515667   9.683356
##   0.8     0.5       13.41449  0.4581204   9.964285
##   0.8     0.6       13.67308  0.4619623  10.181490
##   0.8     0.7       13.94345  0.4644661  10.495549
##   0.8     0.8       14.18686  0.4673771  10.844149
##   0.8     0.9       14.40497  0.4702601  11.149337
##   0.8     1.0       14.59394  0.4735999  11.387036
##   0.9     0.0       14.92814        NaN  11.809472
##   0.9     0.1       11.97323  0.4002734   8.382751
##   0.9     0.2       12.65161  0.4221899   8.678556
##   0.9     0.3       13.23662  0.4393919   9.525255
##   0.9     0.4       13.56997  0.4515703   9.952608
##   0.9     0.5       13.83662  0.4587367  10.278642
##   0.9     0.6       14.13356  0.4627831  10.554721
##   0.9     0.7       14.43922  0.4651515  10.910095
##   0.9     0.8       14.70715  0.4682803  11.259464
##   0.9     0.9       14.94868  0.4713706  11.579854
##   0.9     1.0       15.15859  0.4747033  11.833890
##   1.0     0.0       14.92814        NaN  11.809472
##   1.0     0.1       12.02696  0.3982054   8.387693
##   1.0     0.2       12.82758  0.4219231   8.747192
##   1.0     0.3       13.54789  0.4397650   9.704109
##   1.0     0.4       13.97145  0.4513976  10.225134
##   1.0     0.5       14.28742  0.4590379  10.602224
##   1.0     0.6       14.62330  0.4633605  10.946498
##   1.0     0.7       14.96271  0.4656276  11.333849
##   1.0     0.8       15.25936  0.4688519  11.676644
##   1.0     0.9       15.52007  0.4722818  12.007571
##   1.0     1.0       15.75182  0.4755504  12.276304
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 1 and lambda = 1.
plot(enetFit)

Compare the models on the test set:

# Compute test set metrics (RMSE, Rsquared, MAE) for each fitted model
multiResample <- function(models, newdata, obs){
  res <- list()
  methods <- c()
  i <- 1
  for (model in models){
    pred <- predict(model, newdata=newdata)
    metrics <- postResample(pred=pred, obs=obs)
    res[[i]] <- metrics
    methods[[i]] <- model$method  # label each result with its caret method name
    i <- i + 1
  }
  names(res) <- methods
  return(res)
}
models <- list(ridgeFit, lassoFit, enetFit)
(resampleResult <- multiResample(models, x_test, y_test))
## $ridge
##       RMSE   Rsquared        MAE 
## 15.6805328  0.6088296 12.3338009 
## 
## $lasso
##      RMSE  Rsquared       MAE 
## 10.510243  0.684831  7.236435 
## 
## $enet
##       RMSE   Rsquared        MAE 
## 15.6805328  0.6088296 12.3338009

Based on this evaluation, the best model is the lasso, with a test set R2 of 0.684831.
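caret's resamples() gives a complementary view by comparing the cross-validated distributions directly; a sketch (the fits above used different seeds, so the folds differ and the comparison is only indicative):

# Compare resampled performance across the fitted models
resamps <- resamples(list(pls=Pls_Fit, ridge=ridgeFit, lasso=lassoFit, enet=enetFit))
summary(resamps)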

  6. Would you recommend any of your models to replace the permeability laboratory experiment?

Plot a histogram to see how the target variable permeability is distributed:

hist(permeability, col="lightblue")

The histogram shows that most permeability values are below 10 and many are below 5, so the response is strongly right-skewed. Given the moderate test set R2 values, I would not recommend any of these models as a replacement for the permeability laboratory experiment.
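A quick observed-versus-predicted plot for the lasso, the best of the models above, helps judge how far the predictions sit from the laboratory values; a minimal sketch:

# Observed vs. predicted permeability on the test set for the lasso fit
lassoPred <- predict(lassoFit, newdata=x_test)
plot(lassoPred, y_test, col="blue",
     xlab="Predicted permeability", ylab="Observed permeability")
abline(0, 1, lty=2)  # perfect-prediction reference line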

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
str(ChemicalManufacturingProcess)
## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.6 14.6 14.6 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  2. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

We will use the missmap function from the Amelia package to visualize the missing values:

library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(ChemicalManufacturingProcess, col = c("purple", "lightblue"))

Use the bagImpute method in preProcess to impute the missing values:

cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
cmpImpute
## Created from 152 samples and 57 variables
## 
## Pre-processing:
##   - bagged tree imputation (57)
##   - ignored (0)
  3. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])
set.seed(1977)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
x_train <- cmp[trainRow, ]
y_train <- ChemicalManufacturingProcess$Yield[trainRow]
x_test <- cmp[-trainRow, ]
y_test <- ChemicalManufacturingProcess$Yield[-trainRow]

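As a quick sanity check (a minimal sketch), confirm that the imputation left no missing values in either split:

# The bagged-tree imputation should have removed every NA
stopifnot(!anyNA(x_train), !anyNA(x_test))
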
Build an elastic net model, tuning fraction and lambda each from 0 to 1 by 0.1, with RMSE as the selection metric:

set.seed(1981)
enetFit <- train(x=x_train,
                 y=y_train,
                 method='enet',
                 metric='RMSE',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale')
                  )
enetFit
## Elasticnet 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 129, 129, 129, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.0     0.0       1.864708        NaN  1.5124430
##   0.0     0.1       1.239645  0.6135148  1.0033249
##   0.0     0.2       1.849843  0.5972989  1.1663513
##   0.0     0.3       3.057038  0.4832820  1.5325351
##   0.0     0.4       3.508008  0.4603980  1.6644773
##   0.0     0.5       3.704933  0.4488960  1.7383326
##   0.0     0.6       3.793850  0.4390087  1.7894433
##   0.0     0.7       3.737190  0.4318544  1.8004578
##   0.0     0.8       4.606455  0.4194939  2.0526604
##   0.0     0.9       5.364310  0.4097940  2.2733560
##   0.0     1.0       6.025003  0.4075752  2.4531389
##   0.1     0.0       1.864708        NaN  1.5124430
##   0.1     0.1       1.502226  0.5920297  1.2346175
##   0.1     0.2       1.268324  0.5999394  1.0521655
##   0.1     0.3       1.212464  0.6099627  0.9970262
##   0.1     0.4       1.243106  0.6122161  0.9914471
##   0.1     0.5       1.346872  0.5995365  1.0207451
##   0.1     0.6       1.405440  0.5970689  1.0455283
##   0.1     0.7       1.560328  0.5692091  1.0974347
##   0.1     0.8       1.757807  0.5502589  1.1592417
##   0.1     0.9       2.008009  0.5284495  1.2322544
##   0.1     1.0       2.185558  0.5169575  1.2832988
##   0.2     0.0       1.864708        NaN  1.5124430
##   0.2     0.1       1.535985  0.5877786  1.2596079
##   0.2     0.2       1.302940  0.5986808  1.0758782
##   0.2     0.3       1.215772  0.6016854  1.0141343
##   0.2     0.4       1.215240  0.6117582  0.9881412
##   0.2     0.5       1.332598  0.5972761  1.0203729
##   0.2     0.6       1.485125  0.5863286  1.0618717
##   0.2     0.7       1.565605  0.5859770  1.0916767
##   0.2     0.8       1.664744  0.5632208  1.1357395
##   0.2     0.9       1.740270  0.5534231  1.1604963
##   0.2     1.0       1.891957  0.5478582  1.2030571
##   0.3     0.0       1.864708        NaN  1.5124430
##   0.3     0.1       1.545179  0.5857402  1.2661119
##   0.3     0.2       1.312776  0.5982781  1.0838782
##   0.3     0.3       1.217187  0.5995322  1.0185969
##   0.3     0.4       1.201215  0.6125325  0.9843902
##   0.3     0.5       1.291184  0.6001112  1.0112346
##   0.3     0.6       1.484359  0.5816377  1.0708700
##   0.3     0.7       1.576937  0.5815186  1.0953902
##   0.3     0.8       1.617727  0.5773968  1.1245459
##   0.3     0.9       1.693710  0.5604714  1.1595983
##   0.3     1.0       1.773722  0.5583247  1.1842533
##   0.4     0.0       1.864708        NaN  1.5124430
##   0.4     0.1       1.546977  0.5851944  1.2669847
##   0.4     0.2       1.314960  0.5978081  1.0860271
##   0.4     0.3       1.217042  0.5983458  1.0195366
##   0.4     0.4       1.201600  0.6099615  0.9832249
##   0.4     0.5       1.271859  0.6019177  1.0087425
##   0.4     0.6       1.476389  0.5807753  1.0760080
##   0.4     0.7       1.574377  0.5779499  1.1089203
##   0.4     0.8       1.593516  0.5814813  1.1245041
##   0.4     0.9       1.672430  0.5632543  1.1656748
##   0.4     1.0       1.722069  0.5626953  1.1854640
##   0.5     0.0       1.864708        NaN  1.5124430
##   0.5     0.1       1.545827  0.5852066  1.2659469
##   0.5     0.2       1.314156  0.5972491  1.0859746
##   0.5     0.3       1.215507  0.5982284  1.0182419
##   0.5     0.4       1.207684  0.6062406  0.9835650
##   0.5     0.5       1.264389  0.6039307  1.0105478
##   0.5     0.6       1.471034  0.5807430  1.0838435
##   0.5     0.7       1.569032  0.5769134  1.1211745
##   0.5     0.8       1.590807  0.5824616  1.1354444
##   0.5     0.9       1.672189  0.5654378  1.1772818
##   0.5     1.0       1.707850  0.5640827  1.1952201
##   0.6     0.0       1.864708        NaN  1.5124430
##   0.6     0.1       1.543498  0.5854435  1.2640469
##   0.6     0.2       1.311875  0.5968162  1.0846070
##   0.6     0.3       1.214763  0.5975884  1.0168327
##   0.6     0.4       1.216154  0.6026549  0.9856500
##   0.6     0.5       1.266337  0.6052448  1.0164073
##   0.6     0.6       1.468271  0.5809156  1.0925265
##   0.6     0.7       1.571173  0.5768353  1.1356662
##   0.6     0.8       1.598625  0.5826159  1.1516533
##   0.6     0.9       1.676701  0.5688602  1.1939424
##   0.6     1.0       1.717129  0.5642442  1.2123446
##   0.7     0.0       1.864708        NaN  1.5124430
##   0.7     0.1       1.540392  0.5854592  1.2615660
##   0.7     0.2       1.308661  0.5964326  1.0823773
##   0.7     0.3       1.214875  0.5963475  1.0154703
##   0.7     0.4       1.225498  0.5997091  0.9888900
##   0.7     0.5       1.275268  0.6054795  1.0225005
##   0.7     0.6       1.469162  0.5812881  1.1036176
##   0.7     0.7       1.579302  0.5772512  1.1538967
##   0.7     0.8       1.614867  0.5819056  1.1753395
##   0.7     0.9       1.684216  0.5714734  1.2163170
##   0.7     1.0       1.741943  0.5641199  1.2408560
##   0.8     0.0       1.864708        NaN  1.5124430
##   0.8     0.1       1.537101  0.5851944  1.2589236
##   0.8     0.2       1.304940  0.5960604  1.0797883
##   0.8     0.3       1.215243  0.5950762  1.0139589
##   0.8     0.4       1.234480  0.5977314  0.9909266
##   0.8     0.5       1.289135  0.6047042  1.0298212
##   0.8     0.6       1.475169  0.5818443  1.1174357
##   0.8     0.7       1.591459  0.5781491  1.1734493
##   0.8     0.8       1.638014  0.5811556  1.2011245
##   0.8     0.9       1.702300  0.5731856  1.2420994
##   0.8     1.0       1.777503  0.5641104  1.2748777
##   0.9     0.0       1.864708        NaN  1.5124430
##   0.9     0.1       1.533588  0.5848772  1.2561131
##   0.9     0.2       1.301250  0.5954486  1.0778693
##   0.9     0.3       1.216052  0.5936714  1.0125888
##   0.9     0.4       1.244299  0.5959415  0.9937499
##   0.9     0.5       1.306452  0.6032804  1.0404248
##   0.9     0.6       1.487025  0.5821498  1.1345525
##   0.9     0.7       1.610996  0.5787033  1.1956919
##   0.9     0.8       1.666228  0.5805678  1.2279459
##   0.9     0.9       1.729621  0.5742054  1.2722547
##   0.9     1.0       1.820801  0.5643297  1.3123049
##   1.0     0.0       1.864708        NaN  1.5124430
##   1.0     0.1       1.529873  0.5845239  1.2531391
##   1.0     0.2       1.297481  0.5948839  1.0759408
##   1.0     0.3       1.217124  0.5923127  1.0114886
##   1.0     0.4       1.254227  0.5944094  0.9967872
##   1.0     0.5       1.325435  0.6016949  1.0543388
##   1.0     0.6       1.503499  0.5821198  1.1523663
##   1.0     0.7       1.633108  0.5790515  1.2187505
##   1.0     0.8       1.699105  0.5799548  1.2575073
##   1.0     0.9       1.774791  0.5727015  1.3102632
##   1.0     1.0       1.869861  0.5647588  1.3554041
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.3.
plot(enetFit)

The best parameter combination is fraction = 0.4 and lambda = 0.3, with a resampled RMSE of 1.201215.
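The winning row can also be pulled from the results table directly:

# Row of the tuning grid with the smallest cross-validated RMSE
enetFit$results[which.min(enetFit$results$RMSE), c('lambda', 'fraction', 'RMSE')]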

  4. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
enet_Pred <- predict(enetFit, newdata=x_test)
(predResult <- postResample(pred=enet_Pred, obs=y_test))
##      RMSE  Rsquared       MAE 
## 1.0235202 0.6221641 0.7542969

The test set RMSE is 1.0235202, which is lower than the resampled training RMSE of 1.201215, so the model performs at least as well on the test set as resampling suggested.

  5. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
# Extract the elastic net coefficients at the selected fraction
coeffs <- predict.enet(enetFit$finalModel, s=enetFit$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients

Display the coefficients for all 57 predictors:

coeffs
##   BiologicalMaterial01   BiologicalMaterial02   BiologicalMaterial03 
##             0.00000000             0.14523788             0.02967554 
##   BiologicalMaterial04   BiologicalMaterial05   BiologicalMaterial06 
##             0.00000000             0.00000000             0.06689327 
##   BiologicalMaterial07   BiologicalMaterial08   BiologicalMaterial09 
##             0.00000000             0.01045532             0.00000000 
##   BiologicalMaterial10   BiologicalMaterial11   BiologicalMaterial12 
##             0.00000000             0.02390760             0.00000000 
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 
##             0.00000000             0.00000000             0.14352750 
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09 
##             0.00000000             0.00000000             0.39165718 
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 
##             0.00000000             0.05717589             0.00000000 
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15 
##            -0.29560835             0.00000000             0.07651259 
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18 
##             0.00000000            -0.24378641             0.00000000 
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33 
##             0.00000000             0.57871912             0.00000000 
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 
##             0.12555693             0.00000000            -0.28454206 
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39 
##            -0.03954640             0.00000000             0.01710914 
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45 
##             0.00000000             0.03429687             0.06347790

Based on the results above, the elastic net has shrunk many of the coefficients to exactly zero.

Let's rank the nonzero coefficients by absolute magnitude to find the important predictors:

coeffs.sorted <- abs(coeffs)
coeffs.sorted <- coeffs.sorted[coeffs.sorted>0]
(coeffs.sorted <- sort(coeffs.sorted, decreasing = T))
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess13 
##             0.57871912             0.39165718             0.29560835 
## ManufacturingProcess36 ManufacturingProcess17   BiologicalMaterial02 
##             0.28454206             0.24378641             0.14523788 
## ManufacturingProcess06 ManufacturingProcess34 ManufacturingProcess15 
##             0.14352750             0.12555693             0.07651259 
##   BiologicalMaterial06 ManufacturingProcess45 ManufacturingProcess11 
##             0.06689327             0.06347790             0.05717589 
## ManufacturingProcess37 ManufacturingProcess44   BiologicalMaterial03 
##             0.03954640             0.03429687             0.02967554 
##   BiologicalMaterial11 ManufacturingProcess39   BiologicalMaterial08 
##             0.02390760             0.01710914             0.01045532
(temp <- varImp(enetFit))
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess32   96.86
## ManufacturingProcess17   92.26
## BiologicalMaterial06     85.32
## ManufacturingProcess09   84.61
## BiologicalMaterial12     78.03
## ManufacturingProcess36   76.70
## ManufacturingProcess06   74.87
## BiologicalMaterial03     74.63
## ManufacturingProcess31   68.87
## BiologicalMaterial02     68.73
## BiologicalMaterial11     59.06
## ManufacturingProcess29   54.91
## ManufacturingProcess11   54.00
## BiologicalMaterial08     47.54
## BiologicalMaterial04     45.77
## BiologicalMaterial01     44.54
## ManufacturingProcess33   44.36
## BiologicalMaterial09     42.68
## ManufacturingProcess30   41.52

The varImp output shows the top 20 of the 57 predictors. Of these 20, 11 are manufacturing process predictors and 9 are biological material predictors, and the three most important are all manufacturing process predictors, so the process predictors dominate the list.
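The importance object also has a plot method, which makes the split between process and biological predictors easy to see; a one-line sketch:

# Dotplot of the 20 most important predictors
plot(temp, top = 20)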

  6. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The coefficients directly describe how the predictors affect the response: predictors with positive coefficients raise the predicted yield as they increase, while those with negative coefficients lower it.

Positive coefficients for Manufacturing Process:

# Pull the manufacturing process coefficients, keeping the importance ordering
coeffs_mp <- coeffs.sorted[grep('ManufacturingProcess', names(coeffs.sorted))] %>% names() %>% coeffs[.]
coeffs_mp[coeffs_mp>0]
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess06 
##             0.57871912             0.39165718             0.14352750 
## ManufacturingProcess34 ManufacturingProcess15 ManufacturingProcess45 
##             0.12555693             0.07651259             0.06347790 
## ManufacturingProcess11 ManufacturingProcess44 ManufacturingProcess39 
##             0.05717589             0.03429687             0.01710914

Negative coefficients for Manufacturing Process:

coeffs_mp[coeffs_mp<0]
## ManufacturingProcess13 ManufacturingProcess36 ManufacturingProcess17 
##             -0.2956083             -0.2845421             -0.2437864 
## ManufacturingProcess37 
##             -0.0395464
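Scatterplots of yield against the strongest predictors are a natural way to explore these relationships; a sketch using caret's featurePlot (limiting it to the four largest-magnitude coefficients is an arbitrary choice):

# Yield vs. the four predictors with the largest-magnitude coefficients
topNames <- names(coeffs.sorted)[1:4]
featurePlot(x = cmp[, topNames],
            y = ChemicalManufacturingProcess$Yield,
            plot = 'scatter',
            layout = c(4, 1))

Because the process predictors, unlike the biological ones, can be controlled, future runs could raise the settings behind the positive coefficients (such as ManufacturingProcess32 and ManufacturingProcess09) and lower those behind the negative coefficients (such as ManufacturingProcess13 and ManufacturingProcess17) to push yield upward.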