A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
## Rows: 176
## Columns: 58
## $ Yield <dbl> 38.00, 42.44, 42.03, 41.42, 42.49, 43.57, 43.12…
## $ BiologicalMaterial01 <dbl> 6.25, 8.01, 8.01, 8.01, 7.47, 6.12, 7.48, 6.94,…
## $ BiologicalMaterial02 <dbl> 49.58, 60.97, 60.97, 60.97, 63.33, 58.36, 64.47…
## $ BiologicalMaterial03 <dbl> 56.97, 67.48, 67.48, 67.48, 72.25, 65.31, 72.41…
## $ BiologicalMaterial04 <dbl> 12.74, 14.65, 14.65, 14.65, 14.02, 15.17, 13.82…
## $ BiologicalMaterial05 <dbl> 19.51, 19.36, 19.36, 19.36, 17.91, 21.79, 17.71…
## $ BiologicalMaterial06 <dbl> 43.73, 53.14, 53.14, 53.14, 54.66, 51.23, 54.45…
## $ BiologicalMaterial07 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ BiologicalMaterial08 <dbl> 16.66, 19.04, 19.04, 19.04, 18.22, 18.30, 18.72…
## $ BiologicalMaterial09 <dbl> 11.44, 12.55, 12.55, 12.55, 12.80, 12.13, 12.95…
## $ BiologicalMaterial10 <dbl> 3.46, 3.46, 3.46, 3.46, 3.05, 3.78, 3.04, 3.85,…
## $ BiologicalMaterial11 <dbl> 138.09, 153.67, 153.67, 153.67, 147.61, 151.88,…
## $ BiologicalMaterial12 <dbl> 18.83, 21.05, 21.05, 21.05, 21.05, 20.76, 20.75…
## $ ManufacturingProcess01 <dbl> NA, 0.0, 0.0, 0.0, 10.7, 12.0, 11.5, 12.0, 12.0…
## $ ManufacturingProcess02 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ManufacturingProcess03 <dbl> NA, NA, NA, NA, NA, NA, 1.56, 1.55, 1.56, 1.55,…
## $ ManufacturingProcess04 <dbl> NA, 917, 912, 911, 918, 924, 933, 929, 928, 938…
## $ ManufacturingProcess05 <dbl> NA, 1032.2, 1003.6, 1014.6, 1027.5, 1016.8, 988…
## $ ManufacturingProcess06 <dbl> NA, 210.0, 207.1, 213.3, 205.7, 208.9, 210.0, 2…
## $ ManufacturingProcess07 <dbl> NA, 177, 178, 177, 178, 178, 177, 178, 177, 177…
## $ ManufacturingProcess08 <dbl> NA, 178, 178, 177, 178, 178, 178, 178, 177, 177…
## $ ManufacturingProcess09 <dbl> 43.00, 46.57, 45.07, 44.92, 44.96, 45.32, 49.36…
## $ ManufacturingProcess10 <dbl> NA, NA, NA, NA, NA, NA, 11.6, 10.2, 9.7, 10.1, …
## $ ManufacturingProcess11 <dbl> NA, NA, NA, NA, NA, NA, 11.5, 11.3, 11.1, 10.2,…
## $ ManufacturingProcess12 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ManufacturingProcess13 <dbl> 35.5, 34.0, 34.8, 34.8, 34.6, 34.0, 32.4, 33.6,…
## $ ManufacturingProcess14 <dbl> 4898, 4869, 4878, 4897, 4992, 4985, 4745, 4854,…
## $ ManufacturingProcess15 <dbl> 6108, 6095, 6087, 6102, 6233, 6222, 5999, 6105,…
## $ ManufacturingProcess16 <dbl> 4682, 4617, 4617, 4635, 4733, 4786, 4486, 4626,…
## $ ManufacturingProcess17 <dbl> 35.5, 34.0, 34.8, 34.8, 33.9, 33.4, 33.8, 33.6,…
## $ ManufacturingProcess18 <dbl> 4865, 4867, 4877, 4872, 4886, 4862, 4758, 4766,…
## $ ManufacturingProcess19 <dbl> 6049, 6097, 6078, 6073, 6102, 6115, 6013, 6022,…
## $ ManufacturingProcess20 <dbl> 4665, 4621, 4621, 4611, 4659, 4696, 4522, 4552,…
## $ ManufacturingProcess21 <dbl> 0.0, 0.0, 0.0, 0.0, -0.7, -0.6, 1.4, 0.0, 0.0, …
## $ ManufacturingProcess22 <dbl> NA, 3, 4, 5, 8, 9, 1, 2, 3, 4, 6, 7, 8, 10, 11,…
## $ ManufacturingProcess23 <dbl> NA, 0, 1, 2, 4, 1, 1, 2, 3, 1, 3, 4, 1, 2, 3, 4…
## $ ManufacturingProcess24 <dbl> NA, 3, 4, 5, 18, 1, 1, 2, 3, 4, 6, 7, 8, 2, 15,…
## $ ManufacturingProcess25 <dbl> 4873, 4869, 4897, 4892, 4930, 4871, 4795, 4806,…
## $ ManufacturingProcess26 <dbl> 6074, 6107, 6116, 6111, 6151, 6128, 6057, 6059,…
## $ ManufacturingProcess27 <dbl> 4685, 4630, 4637, 4630, 4684, 4687, 4572, 4586,…
## $ ManufacturingProcess28 <dbl> 10.7, 11.2, 11.1, 11.1, 11.3, 11.4, 11.2, 11.1,…
## $ ManufacturingProcess29 <dbl> 21.0, 21.4, 21.3, 21.3, 21.6, 21.7, 21.2, 21.2,…
## $ ManufacturingProcess30 <dbl> 9.9, 9.9, 9.4, 9.4, 9.0, 10.1, 11.2, 10.9, 10.5…
## $ ManufacturingProcess31 <dbl> 69.1, 68.7, 69.3, 69.3, 69.4, 68.2, 67.6, 67.9,…
## $ ManufacturingProcess32 <dbl> 156, 169, 173, 171, 171, 173, 159, 161, 160, 16…
## $ ManufacturingProcess33 <dbl> 66, 66, 66, 68, 70, 70, 65, 65, 65, 66, 67, 67,…
## $ ManufacturingProcess34 <dbl> 2.4, 2.6, 2.6, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.…
## $ ManufacturingProcess35 <dbl> 486, 508, 509, 496, 468, 490, 475, 478, 491, 48…
## $ ManufacturingProcess36 <dbl> 0.019, 0.019, 0.018, 0.018, 0.017, 0.018, 0.019…
## $ ManufacturingProcess37 <dbl> 0.5, 2.0, 0.7, 1.2, 0.2, 0.4, 0.8, 1.0, 1.2, 1.…
## $ ManufacturingProcess38 <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 3,…
## $ ManufacturingProcess39 <dbl> 7.2, 7.2, 7.2, 7.2, 7.3, 7.2, 7.3, 7.3, 7.4, 7.…
## $ ManufacturingProcess40 <dbl> NA, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ ManufacturingProcess41 <dbl> NA, 0.15, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ ManufacturingProcess42 <dbl> 11.6, 11.1, 12.0, 10.6, 11.0, 11.5, 11.7, 11.4,…
## $ ManufacturingProcess43 <dbl> 3.0, 0.9, 1.0, 1.1, 1.1, 2.2, 0.7, 0.8, 0.9, 0.…
## $ ManufacturingProcess44 <dbl> 1.8, 1.9, 1.8, 1.8, 1.7, 1.8, 2.0, 2.0, 1.9, 1.…
## $ ManufacturingProcess45 <dbl> 2.4, 2.2, 2.3, 2.1, 2.1, 2.0, 2.2, 2.2, 2.1, 2.…
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.
df <- ChemicalManufacturingProcess
df %>% select(everything()) %>% summarise_all(~sum(is.na(.))) %>% gather(var, val) %>% filter(val > 0) %>% ggplot(aes(x=var, y=val)) + geom_col() + coord_flip()
From section 3.8, using preProcess from the caret package and knnImpute to fillna values.
fillna <- preProcess(df, method = c("knnImpute"))
df <- predict(fillna, df)
df$Yield = ChemicalManufacturingProcess$Yield
Using tidymodels we split the data into training and testing, the training set will be used for model tuning.
library(tidymodels)
df_split <- initial_split(df, strata=Yield)
df_train <- training(df_split)
df_test <- testing(df_split)
When mixture = 1, it is a pure lasso model while mixture = 0 indicates that ridge regression is being used.
Build the model and set up the workflow
ctrl <- trainControl(method="cv", number=5)
lmodel <- train(
x = df_train %>% select(-c('Yield')),
y = df_train$Yield,
method = "pls",
tuneLength = 30,
trControl = ctrl
)
Plot all metrics
lmodel$results %>% gather(var, val, -ncomp) %>% ggplot(aes(x=ncomp, y=val)) + geom_line() + facet_wrap(~var, scales='free')
From the plots it looks like a small amount of components gives the highest Rsquare and lowerst RMSE. Therefore we will go with 3 components.
postResample(predict(lmodel, newdata = df_test, ncomp = mdl.ncomp), df_test$Yield)
## RMSE Rsquared MAE
## 1.2923083 0.4805505 1.0720709
getTrainPerf(lmodel)
## TrainRMSE TrainRsquared TrainMAE method
## 1 1.13266 0.654124 0.924489 pls
As we would expect, the model does better on the training set than the testing set, from 1.24 to 1.34 RMSE.
The top 5 most important predictor is are process followed by Biological. The most important process is 32.
important_predictors <- c('ManufacturingProcess32', 'ManufacturingProcess09', 'ManufacturingProcess13', 'ManufacturingProcess36', 'ManufacturingProcess17', 'ManufacturingProcess06', 'Yield')
df %>% select(important_predictors) %>% gather(var, val, -Yield) %>% ggplot(aes(x=val, y=Yield)) + geom_point() + facet_wrap(~var)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(important_predictors)` instead of `important_predictors` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
From the plots above we can see 6,9,13,17 and 32 have a correlation while 36 doesn’t look like it has a strong correlation.