A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Missing values will be filled in by k-nearest-neighbor imputation using the defaults of knnImputation(): k = 10 neighbors and the weighted-average method (meth = "weighAvg").
library(caret)
library(DMwR)
df.chem <- knnImputation(ChemicalManufacturingProcess)
Check that no missing values remain:
any(is.na(df.chem))
## [1] FALSE
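Before imputing, it can help to quantify how much is actually missing. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `ProcessA` and `ProcessB`) rather than the real data; the same two calls should work unchanged on ChemicalManufacturingProcess.

```r
# Toy stand-in for the predictor table; the values are made up for illustration.
toy <- data.frame(
  Yield    = c(38.0, 42.4, 42.0, 41.4),
  ProcessA = c(NA, 0.0, 0.0, 10.7),
  ProcessB = c(NA, NA, 917, 912)
)

# Percentage of all cells that are missing
overall_pct <- 100 * mean(is.na(toy))

# Number of missing values per column, keeping only columns that have any
per_column <- colSums(is.na(toy))
missing_cols <- per_column[per_column > 0]

overall_pct
missing_cols
```

Columns with a large share of missing values may deserve a closer look before relying on imputed values for them.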
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
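We will use 75% of the data as the training set and the remaining 25% as the test set. Three penalized models (ridge, lasso, and elastic net) will be fitted to the training set, each tuned with 10-fold cross-validation. Based on the training-set metrics shown below, elastic net fits best (RMSE 0.894), followed by lasso (RMSE 0.915) and ridge (RMSE 0.973). Note that these metrics come from predicting the same rows the models were fitted on, so they are optimistic compared with resampled estimates.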
library(dplyr)
set.seed(123)
X <- as.data.frame(df.chem %>% select(-Yield))
Y <- df.chem %>% select(Yield)
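# Randomly assign 75% of the rows (132 of 176) to the training set
train_index <- sample(1:nrow(df.chem), nrow(df.chem)*.75)
# x.train keeps the Yield column because train() below uses the formula Yield ~ .
x.train <- df.chem[train_index,]
y.train <- Y$Yield[train_index]
x.test <- X[-train_index,]
y.test <- Y$Yield[-train_index]
# Candidate penalties for ridge and lasso: 100 values from 1e-3 to 1e3 on a log scale
lambda <- 10^seq(-3, 3, length = 100)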
RIDGE
ridge <- train(
Yield ~., data = x.train, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
Ridge Best Tune
ridge$bestTune
## alpha lambda
## 45 0 0.4641589
Ridge Training Set Metrics
ridge.predictions.train <- ridge %>% predict(x.train)
ridge.eval.train = data.frame(obs = y.train, pred=ridge.predictions.train)
defaultSummary(ridge.eval.train)
## RMSE Rsquared MAE
## 0.9731235 0.7111462 0.7714148
LASSO
lasso <- train(
Yield ~., data = x.train, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 1, lambda = lambda)
)
Lasso Best Tune
lasso$bestTune
## alpha lambda
## 25 1 0.02848036
Lasso Training Set Metrics
lasso.predictions.train <- lasso %>% predict(x.train)
lasso.eval.train = data.frame(obs = y.train, pred=lasso.predictions.train)
defaultSummary(lasso.eval.train)
## RMSE Rsquared MAE
## 0.9152127 0.7406965 0.7096063
ELASTIC NET
elastic <- train(
Yield ~., data = x.train, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
Elastic Net Best Tune
elastic$bestTune
## alpha lambda
## 46 0.5 0.03226858
Elastic Net Training Set Metrics
enet.predictions.train <- elastic %>% predict(x.train)
enet.eval.train = data.frame(obs = y.train, pred=enet.predictions.train)
defaultSummary(enet.eval.train)
## RMSE Rsquared MAE
## 0.8936889 0.7516997 0.6920042
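The training-set metrics above are computed on the same rows used for fitting, which is why they tend to look better than resampled or test-set metrics. A minimal sketch of that gap, using simulated data and an ordinary linear model as a stand-in (not the chemical data and not glmnet; every name below is made up for the illustration):

```r
# Simulated stand-in data: 120 rows, 20 predictors, only two of them informative.
set.seed(1)
n <- 120
p <- 20
sim <- as.data.frame(matrix(rnorm(n * p), nrow = n))
sim$y <- 2 * sim$V1 - sim$V2 + rnorm(n)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Resubstitution error: fit and evaluate on the same rows.
fit_all <- lm(y ~ ., data = sim)
rmse_resub <- rmse(sim$y, predict(fit_all, sim))

# 10-fold cross-validated error: each row is predicted by a model
# that was fitted without it.
folds <- sample(rep(1:10, length.out = n))
cv_pred <- numeric(n)
for (k in 1:10) {
  held_out <- folds == k
  fit_k <- lm(y ~ ., data = sim[!held_out, ])
  cv_pred[held_out] <- predict(fit_k, sim[held_out, ])
}
rmse_cv <- rmse(sim$y, cv_pred)

c(resubstitution = rmse_resub, cross_validated = rmse_cv)
```

For the models fitted here, the cross-validated counterpart is already stored by caret: the `results` element of each train object (for example `ridge$results`) holds the resampled RMSE, Rsquared, and MAE for every tuning value, and the row matching `bestTune` is the resampled metric the exercise asks about.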
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
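Although elastic net had the lowest error on the training set, lasso performs best on the test set (RMSE 1.66, R-squared 0.43), followed by ridge (RMSE 1.76, R-squared 0.37); elastic net performs worst (RMSE 2.03, R-squared 0.33). For all three models the test RMSE is roughly 1.8 to 2.3 times the training RMSE, which confirms that the training-set metrics were optimistic.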
RIDGE
# Make predictions
ridge.predictions <- ridge %>% predict(x.test)
# Model prediction performance
data.frame(
RMSE = RMSE(ridge.predictions, y.test),
Rsquare = R2(ridge.predictions, y.test),
MAE = mean(abs(y.test - ridge.predictions)),
MSE = mean((y.test - ridge.predictions ) ^ 2)
)
## RMSE Rsquare MAE MSE
## 1 1.758692 0.3738959 1.048689 3.092998
LASSO
# Make predictions
lasso.predictions <- lasso %>% predict(x.test)
# Model prediction performance
data.frame(
RMSE = RMSE(lasso.predictions, y.test),
Rsquare = R2(lasso.predictions, y.test),
MAE = mean(abs(y.test - lasso.predictions)),
MSE = mean((y.test - lasso.predictions ) ^ 2)
)
## RMSE Rsquare MAE MSE
## 1 1.663724 0.4304365 0.98838 2.767978
ELASTIC NET
# Make predictions
enet.predictions <- elastic %>% predict(x.test)
# Model prediction performance
data.frame(
RMSE = RMSE(enet.predictions, y.test),
Rsquare = R2(enet.predictions, y.test),
MAE = mean(abs(y.test - enet.predictions)),
MSE = mean((y.test - enet.predictions ) ^ 2)
)
## RMSE Rsquare MAE MSE
## 1 2.027448 0.3284095 1.052241 4.110546
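The three evaluation blocks above repeat the same four formulas. One way to avoid the repetition is a small helper; `test_metrics` below is a hypothetical name, not a caret function, and it uses the squared correlation for R-squared (which, as far as we know, matches the default of caret's R2()). The observed and predicted values are made up so the block runs on its own.

```r
# Hypothetical helper (not part of caret): RMSE, squared correlation,
# MAE, and MSE for one set of predictions.
test_metrics <- function(obs, pred) {
  err <- obs - pred
  c(
    RMSE    = sqrt(mean(err^2)),
    Rsquare = cor(obs, pred)^2,
    MAE     = mean(abs(err)),
    MSE     = mean(err^2)
  )
}

# Made-up observed and predicted values, for illustration only.
obs  <- c(40, 41, 42, 43)
pred <- c(40, 41, 42, 44)
m <- test_metrics(obs, pred)
round(m, 3)
```

With the objects from this write-up, something like `sapply(list(ridge = ridge.predictions, lasso = lasso.predictions, enet = enet.predictions), test_metrics, obs = y.test)` should put all three models side by side in one table.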
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
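Based on the number of predictors with non-zero coefficients in the final elastic net model, the manufacturing process predictors dominate the list: of the 38 non-zero coefficients (excluding the intercept), 31 belong to manufacturing process predictors and 7 to biological predictors.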
#varImp(elastic)
set.seed(123)
elastic.coeff <- as.matrix(coef(elastic$finalModel, elastic$bestTune$lambda))
#nrow(elastic.coeff)
elastic.coeff.nonzero <- elastic.coeff[elastic.coeff != 0,1]
#length(elastic.coeff.nonzero)
sort(elastic.coeff.nonzero)
## BiologicalMaterial07 ManufacturingProcess37 ManufacturingProcess13
## -7.790635e-01 -6.428391e-01 -3.935570e-01
## ManufacturingProcess08 ManufacturingProcess38 ManufacturingProcess33
## -2.984492e-01 -2.857054e-01 -1.601561e-01
## BiologicalMaterial10 ManufacturingProcess07 ManufacturingProcess28
## -1.202950e-01 -1.116944e-01 -6.386898e-02
## ManufacturingProcess22 ManufacturingProcess24 BiologicalMaterial02
## -3.819019e-02 -1.320256e-02 -8.973563e-03
## ManufacturingProcess23 ManufacturingProcess05 ManufacturingProcess27
## -2.282432e-03 -1.304770e-03 -6.214595e-04
## BiologicalMaterial11 ManufacturingProcess10 BiologicalMaterial04
## -1.143227e-04 -1.549958e-08 -5.754739e-09
## BiologicalMaterial05 ManufacturingProcess18 ManufacturingProcess12
## 1.357049e-09 1.669206e-06 8.534724e-05
## ManufacturingProcess26 ManufacturingProcess14 ManufacturingProcess19
## 5.534893e-04 7.545018e-04 9.045948e-04
## ManufacturingProcess15 ManufacturingProcess41 ManufacturingProcess01
## 2.950200e-03 8.960178e-03 1.900204e-02
## ManufacturingProcess06 ManufacturingProcess04 BiologicalMaterial03
## 2.107960e-02 5.487058e-02 6.597561e-02
## ManufacturingProcess39 ManufacturingProcess43 ManufacturingProcess30
## 8.281875e-02 1.283035e-01 1.807123e-01
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess29
## 2.193904e-01 2.594002e-01 4.786747e-01
## ManufacturingProcess45 ManufacturingProcess34 (Intercept)
## 6.907874e-01 2.822075e+00 6.413222e+01
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
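Based on the sign and magnitude of the coefficients, lowering manufacturing process predictors with negative coefficients, such as #37, #13, and #08, and raising those with positive coefficients, such as #34, #45, and #29, would be expected to increase yield. Two caveats apply: the coefficients are reported on each predictor's original scale, so their magnitudes are not directly comparable across predictors, and they describe associations in these 176 runs rather than demonstrated causal effects.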
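Since glmnet reports coefficients on each predictor's original scale, a predictor measured in small units can show a large coefficient without mattering more. A rough way to compare predictors is to multiply each coefficient by its predictor's standard deviation, which gives the expected change in the response per one-standard-deviation change. The sketch below uses simulated data and lm() as a stand-in for the penalized fit; `x_small` and `x_large` are made-up predictors.

```r
# Simulated stand-in data: two predictors on very different scales.
set.seed(2)
n <- 200
x_small <- rnorm(n, sd = 0.01)   # measured in small units
x_large <- rnorm(n, sd = 100)    # measured in large units
y <- 50 * x_small + 0.02 * x_large + rnorm(n, sd = 0.1)
toy <- data.frame(y, x_small, x_large)

fit <- lm(y ~ x_small + x_large, data = toy)

# Raw coefficients (original units), intercept dropped
raw <- coef(fit)[-1]

# Effect of a one-standard-deviation change in each predictor
per_sd <- raw * sapply(toy[names(raw)], sd)

rbind(raw = raw, per_sd = per_sd)
```

Here `x_small` has by far the larger raw coefficient, yet `x_large` has the larger per-standard-deviation effect. Applying the same rescaling to the non-zero elastic net coefficients, using the standard deviations of the training predictors, may give a different ranking than the raw values suggest.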