Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

A. Loading the data

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

head(ChemicalManufacturingProcess)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                     NA                     NA                     NA
## 2                    0.0                      0                     NA
## 3                    0.0                      0                     NA
## 4                    0.0                      0                     NA
## 5                   10.7                      0                     NA
## 6                   12.0                      0                     NA
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                     NA                     NA                     NA
## 2                    917                 1032.2                  210.0
## 3                    912                 1003.6                  207.1
## 4                    911                 1014.6                  213.3
## 5                    918                 1027.5                  205.7
## 6                    924                 1016.8                  208.9
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                     NA                     NA                  43.00
## 2                    177                    178                  46.57
## 3                    178                    178                  45.07
## 4                    177                    177                  44.92
## 5                    178                    178                  44.96
## 6                    178                    178                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                     NA                     NA                     NA
## 2                     NA                     NA                      0
## 3                     NA                     NA                      0
## 4                     NA                     NA                      0
## 5                     NA                     NA                      0
## 6                     NA                     NA                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                     NA                     NA                     NA
## 2                      3                      0                      3
## 3                      4                      1                      4
## 4                      5                      2                      5
## 5                      8                      4                     18
## 6                      9                      1                      1
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                     NA                     NA                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0

The data frame ChemicalManufacturingProcess contains the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs. The Yield column contains the percent yield for each run.
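
Before imputing, it can help to see which columns actually contain missing values. The sketch below uses a hypothetical helper, missing_summary() (it is not part of any package loaded here), and runs it on a small toy data frame rather than the real data.

```r
# Hypothetical helper: per-column count and percentage of missing values,
# keeping only the columns that have at least one NA.
missing_summary <- function(df) {
  n_missing <- colSums(is.na(df))
  out <- data.frame(
    column      = names(n_missing),
    n_missing   = as.integer(n_missing),
    pct_missing = round(100 * as.numeric(n_missing) / nrow(df), 2),
    stringsAsFactors = FALSE
  )
  out <- out[out$n_missing > 0, , drop = FALSE]
  # Most incomplete columns first
  out[order(-out$n_missing), , drop = FALSE]
}

# Toy data (made up for illustration only)
toy <- data.frame(a = c(1, NA, 3, NA), b = c(NA, 2, 3, 4), c = 1:4)
toy_missing <- missing_summary(toy)
print(toy_missing)
```

The same call, missing_summary(ChemicalManufacturingProcess), would list the affected columns of the real data set.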

B. Missing Values Imputation

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Missing values are filled with k-nearest-neighbor imputation, using the defaults of knnImputation: k = 10 neighbors and a distance-weighted average of the neighbors' values (meth = "weighAvg").

library(caret)
library(DMwR)

df.chem <- knnImputation(ChemicalManufacturingProcess)

Check that no missing values remain:

any(is.na(df.chem))
## [1] FALSE
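
Checking for leftover NAs confirms that the gaps were filled, but not that the observed cells were left alone. A minimal sketch of a stricter check follows; imputation_is_consistent() is a hypothetical helper, and the toy data frames are made up for illustration.

```r
# Hypothetical helper: TRUE when `imputed` has no NAs and agrees with
# `original` in every cell that was observed (non-missing) in `original`.
imputation_is_consistent <- function(original, imputed) {
  orig <- as.matrix(original)
  imp  <- as.matrix(imputed)
  observed <- !is.na(orig)
  !anyNA(imp) && isTRUE(all.equal(orig[observed], imp[observed]))
}

# Toy data (illustration only)
original <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
good     <- data.frame(a = c(1, 2, 3),   b = c(4, 5, 4.5))  # only NAs filled
bad      <- data.frame(a = c(1, 2, 3.5), b = c(4, 5, 4.5))  # observed cell changed

check_good <- imputation_is_consistent(original, good)
check_bad  <- imputation_is_consistent(original, bad)
```

On the exercise data the equivalent call would be imputation_is_consistent(ChemicalManufacturingProcess, df.chem).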

C. Splitting data & Tuning model

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

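We use 75% of the data as the training set and the remaining 25% as the test set. Three penalized models (ridge, lasso, and elastic net) are fitted to the training set. Based on the training-set metrics shown below, elastic net has the lowest RMSE (0.894) and the highest R-squared (0.752), followed by lasso and then ridge. Note that these metrics come from predicting the training set itself rather than from the cross-validation resamples, so they are optimistic.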

library(dplyr)

set.seed(123)

X <- as.data.frame(df.chem %>% select(-Yield))
Y <- df.chem %>% select(Yield)

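# Sample 75% of the row indices for training (132 of the 176 runs)
train_index <- sample(1:nrow(df.chem), nrow(df.chem)*.75)

# x.train keeps the Yield column because train() is called with Yield ~ .
x.train <- df.chem[train_index,]
y.train <- Y$Yield[train_index]

# Test predictors (Yield dropped) and the held-out response
x.test <- X[-train_index,]
y.test <- Y$Yield[-train_index]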

lambda <- 10^seq(-3, 3, length = 100)
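
The penalty grid above is evenly spaced on the log10 scale, not on the raw scale. The short sketch below rebuilds the same grid and inspects it; grid points 25 and 45 are the lambda values reported as the lasso and ridge best tunes in the outputs of this section.

```r
# Same grid as in the text: 100 values, evenly spaced in log10(lambda)
lambda <- 10^seq(-3, 3, length = 100)

# Smallest and largest penalty in the grid
lambda_range <- range(lambda)

# Neighboring grid points differ by a constant ratio, not a constant step
step_ratio <- lambda[2] / lambda[1]

# The 25th and 45th grid points
lambda[c(25, 45)]
```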

RIDGE

ridge <- train(
  Yield ~., data = x.train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda)
  )

Ridge Best Tune

ridge$bestTune
##    alpha    lambda
## 45     0 0.4641589

Ridge Training Set Metrics

ridge.predictions.train <- ridge %>% predict(x.train)
ridge.eval.train = data.frame(obs = y.train, pred=ridge.predictions.train)
defaultSummary(ridge.eval.train)
##      RMSE  Rsquared       MAE 
## 0.9731235 0.7111462 0.7714148

LASSO

lasso <- train(
  Yield ~., data = x.train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda)
  )

Lasso Best Tune

lasso$bestTune
##    alpha     lambda
## 25     1 0.02848036

Lasso Training Set Metrics

lasso.predictions.train <- lasso %>% predict(x.train)
lasso.eval.train = data.frame(obs = y.train, pred=lasso.predictions.train)
defaultSummary(lasso.eval.train)
##      RMSE  Rsquared       MAE 
## 0.9152127 0.7406965 0.7096063

ELASTIC NET

elastic <- train(
  Yield ~., data = x.train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )

Elastic Net Best Tune

elastic$bestTune
##    alpha     lambda
## 46   0.5 0.03226858

Elastic Net Training Set Metrics

enet.predictions.train <- elastic %>% predict(x.train)
enet.eval.train = data.frame(obs = y.train, pred=enet.predictions.train)
defaultSummary(enet.eval.train)
##      RMSE  Rsquared       MAE 
## 0.8936889 0.7516997 0.6920042
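
The exercise asks for the resampled performance, whereas the metrics above are computed on the training set itself. A caret train object keeps the cross-validated metrics for every candidate in its results element and the selected parameters in bestTune, so the resampled row can be pulled out by joining the two. The sketch below is a minimal illustration: best_resampled() is a hypothetical helper, and mock_model is a made-up stand-in with invented numbers so the example runs without fitting anything.

```r
# Hypothetical helper: return the row of cross-validated metrics that
# corresponds to the selected tuning parameters.
best_resampled <- function(model) {
  # Joins on the shared tuning-parameter columns (here alpha and lambda)
  merge(model$results, model$bestTune)
}

# Stand-in for a fitted train object (numbers invented for illustration)
mock_results <- data.frame(
  alpha    = c(1, 1, 1),
  lambda   = c(0.01, 0.1, 1),
  RMSE     = c(1.30, 1.20, 1.50),
  Rsquared = c(0.50, 0.60, 0.40),
  MAE      = c(1.00, 0.90, 1.10)
)
mock_model <- list(
  results  = mock_results,
  bestTune = mock_results[2, c("alpha", "lambda")]
)

cv_row <- best_resampled(mock_model)
print(cv_row)
```

With the fitted models from this section, best_resampled(ridge), best_resampled(lasso), and best_resampled(elastic) would give the cross-validated RMSE to compare against the test-set RMSE.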

D. Make Predictions

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

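Although elastic net fit the training set best, lasso performs best on the test set, with the lowest RMSE (1.66) and the highest R-squared (0.43). Elastic net is the weakest of the three on the test set (RMSE 2.03). For every model the test RMSE is roughly 1.8 to 2.3 times the training-set RMSE, which indicates that the training-set metrics overstate performance.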

RIDGE

# Make predictions
ridge.predictions <- ridge %>% predict(x.test)

# Model prediction performance
data.frame(
  RMSE = RMSE(ridge.predictions, y.test),
  Rsquare = R2(ridge.predictions, y.test),
  MAE = mean(abs(y.test - ridge.predictions)),
  MSE = mean((y.test - ridge.predictions ) ^ 2)
)
##       RMSE   Rsquare      MAE      MSE
## 1 1.758692 0.3738959 1.048689 3.092998
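
The same four metrics are computed for each model, so the calculation can be wrapped in one function. The sketch below is a base R version under the assumption that R-squared is taken as the squared correlation between predictions and observations; regression_metrics() is a hypothetical helper, and the toy vectors are made up.

```r
# Hypothetical helper: test-set metrics for a vector of predictions.
# R-squared here is the squared Pearson correlation of pred and obs.
regression_metrics <- function(pred, obs) {
  err <- obs - pred
  data.frame(
    RMSE    = sqrt(mean(err^2)),
    Rsquare = cor(pred, obs)^2,
    MAE     = mean(abs(err)),
    MSE     = mean(err^2)
  )
}

# Toy example (illustration only)
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4.5)
toy_metrics <- regression_metrics(pred, obs)
print(toy_metrics)
```

For the models in this section the call would be, for example, regression_metrics(ridge.predictions, y.test).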

LASSO

# Make predictions
lasso.predictions <- lasso %>% predict(x.test)

# Model prediction performance
data.frame(
  RMSE = RMSE(lasso.predictions, y.test),
  Rsquare = R2(lasso.predictions, y.test),
  MAE = mean(abs(y.test - lasso.predictions)),
  MSE = mean((y.test - lasso.predictions ) ^ 2)
)
##       RMSE   Rsquare     MAE      MSE
## 1 1.663724 0.4304365 0.98838 2.767978

ELASTIC NET

# Make predictions
enet.predictions <- elastic %>% predict(x.test)

# Model prediction performance
data.frame(
  RMSE = RMSE(enet.predictions, y.test),
  Rsquare = R2(enet.predictions, y.test),
  MAE = mean(abs(y.test - enet.predictions)),
  MSE = mean((y.test - enet.predictions ) ^ 2)
)
##       RMSE   Rsquare      MAE      MSE
## 1 2.027448 0.3284095 1.052241 4.110546
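
To compare the three models side by side, the RMSE values printed in sections C and D can be collected in one table. The sketch below only copies the reported figures and computes the ratio of test to training RMSE; no model is refitted.

```r
# Figures copied from the training-set and test-set outputs in this exercise
perf <- data.frame(
  model      = c("Ridge", "Lasso", "Elastic Net"),
  train_rmse = c(0.9731235, 0.9152127, 0.8936889),
  test_rmse  = c(1.758692, 1.663724, 2.027448),
  stringsAsFactors = FALSE
)

# How much larger the test error is than the training-set error
perf$ratio <- round(perf$test_rmse / perf$train_rmse, 2)

# Model with the lowest RMSE on each data set
best_train <- perf$model[which.min(perf$train_rmse)]
best_test  <- perf$model[which.min(perf$test_rmse)]

print(perf)
```

The model with the lowest training-set RMSE is not the one with the lowest test RMSE, and the ratio column shows how far each model's test error sits above its training-set error.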

E. Variable Importance

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Based on the number of predictors with non-zero coefficients in the final elastic net model, the manufacturing process predictors dominate the list: of the 38 predictors retained, 31 are process predictors and 7 are biological.

# Coefficients of the final glmnet fit at the selected lambda
elastic.coeff <- as.matrix(coef(elastic$finalModel, elastic$bestTune$lambda))

# Keep only the non-zero coefficients (the intercept is included)
elastic.coeff.nonzero <- elastic.coeff[elastic.coeff != 0,1]

# Sort from most negative to most positive
sort(elastic.coeff.nonzero)
##   BiologicalMaterial07 ManufacturingProcess37 ManufacturingProcess13 
##          -7.790635e-01          -6.428391e-01          -3.935570e-01 
## ManufacturingProcess08 ManufacturingProcess38 ManufacturingProcess33 
##          -2.984492e-01          -2.857054e-01          -1.601561e-01 
##   BiologicalMaterial10 ManufacturingProcess07 ManufacturingProcess28 
##          -1.202950e-01          -1.116944e-01          -6.386898e-02 
## ManufacturingProcess22 ManufacturingProcess24   BiologicalMaterial02 
##          -3.819019e-02          -1.320256e-02          -8.973563e-03 
## ManufacturingProcess23 ManufacturingProcess05 ManufacturingProcess27 
##          -2.282432e-03          -1.304770e-03          -6.214595e-04 
##   BiologicalMaterial11 ManufacturingProcess10   BiologicalMaterial04 
##          -1.143227e-04          -1.549958e-08          -5.754739e-09 
##   BiologicalMaterial05 ManufacturingProcess18 ManufacturingProcess12 
##           1.357049e-09           1.669206e-06           8.534724e-05 
## ManufacturingProcess26 ManufacturingProcess14 ManufacturingProcess19 
##           5.534893e-04           7.545018e-04           9.045948e-04 
## ManufacturingProcess15 ManufacturingProcess41 ManufacturingProcess01 
##           2.950200e-03           8.960178e-03           1.900204e-02 
## ManufacturingProcess06 ManufacturingProcess04   BiologicalMaterial03 
##           2.107960e-02           5.487058e-02           6.597561e-02 
## ManufacturingProcess39 ManufacturingProcess43 ManufacturingProcess30 
##           8.281875e-02           1.283035e-01           1.807123e-01 
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess29 
##           2.193904e-01           2.594002e-01           4.786747e-01 
## ManufacturingProcess45 ManufacturingProcess34            (Intercept) 
##           6.907874e-01           2.822075e+00           6.413222e+01
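
Whether one group of predictors dominates can be checked by counting the non-zero coefficients per group from their names. The sketch below assumes the naming pattern seen in the output above (names starting with "Biological" or "Manufacturing"); count_by_group() is a hypothetical helper, and the toy coefficient vector is made up.

```r
# Hypothetical helper: number of non-zero coefficients per predictor group,
# ignoring the intercept.
count_by_group <- function(coefs) {
  keep  <- names(coefs) != "(Intercept)" & coefs != 0
  coefs <- coefs[keep]
  group <- ifelse(grepl("^Biological", names(coefs)),
                  "Biological", "Manufacturing")
  # Fixed levels so both groups appear even when one has a zero count
  table(factor(group, levels = c("Biological", "Manufacturing")))
}

# Toy coefficient vector (illustration only)
toy_coefs <- c("(Intercept)"          = 40,
               BiologicalMaterial07   = -0.78,
               BiologicalMaterial01   = 0,
               ManufacturingProcess37 = -0.64,
               ManufacturingProcess34 = 2.82)
toy_counts <- count_by_group(toy_coefs)
print(toy_counts)
```

With the objects from this section, the call would be count_by_group(elastic.coeff.nonzero).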

F. Yield Improvement

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

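Judging by the sign and size of the coefficients, yield should improve if manufacturing processes with negative coefficients, such as #37, #13, and #08, are reduced, and those with positive coefficients, such as #34, #45, and #29, are increased. Biological predictors (for example BiologicalMaterial07, which has the largest negative coefficient) cannot be changed, but they can be used to screen raw material before processing. Because the coefficients are reported on each predictor's original scale, their magnitudes are not directly comparable across predictors, so these rankings should be confirmed with plots or standardized coefficients before process settings are changed.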
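
One way to put coefficients on a comparable footing is to multiply each one by the standard deviation of its predictor, which gives the change in predicted yield for a one-standard-deviation change in that predictor. The sketch below is a minimal illustration with made-up numbers; standardized_effects() is a hypothetical helper, and the column names small and big are invented.

```r
# Hypothetical helper: coefficient times predictor standard deviation,
# sorted by absolute size (largest effect first). The intercept is dropped.
standardized_effects <- function(coefs, X) {
  coefs <- coefs[names(coefs) != "(Intercept)"]
  sds   <- vapply(X[names(coefs)], sd, numeric(1))
  eff   <- coefs * sds
  eff[order(-abs(eff))]
}

# Toy predictors on very different scales (illustration only)
X_toy <- data.frame(
  small = c(0.017, 0.018, 0.019, 0.018),
  big   = c(4800, 4900, 5000, 5100)
)
coefs_toy <- c("(Intercept)" = 40, small = -100, big = 0.002)

toy_effects <- standardized_effects(coefs_toy, X_toy)
print(toy_effects)
```

In this toy case the predictor with the much smaller raw coefficient (big) has the larger effect per standard deviation. With the objects from this exercise, a comparable call would be standardized_effects(elastic.coeff.nonzero, x.train).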