Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.


(a) Start R and use these commands to load the data:
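The commands themselves did not survive in this write-up. A minimal sketch, assuming the data come from the AppliedPredictiveModeling package that accompanies the text, where the data set is named `ChemicalManufacturingProcess`:

```r
# Assumes the AppliedPredictiveModeling package is installed
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# One row per batch: Yield plus the biological and process predictors
dim(ChemicalManufacturingProcess)
```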

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

We first create the training and test splits; the pre-processing follows below.
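The splitting code is not shown in the original. The sketch below assumes a simple 75/25 random split with rsample, which is consistent with the 132 training rows reported in the summary; the seed and the name `chem_test` are hypothetical, so the exact rows will differ from the author's.

```r
library(AppliedPredictiveModeling)
library(rsample)

data(ChemicalManufacturingProcess)

set.seed(100)  # hypothetical seed; the original seed is not shown
chem_split <- initial_split(ChemicalManufacturingProcess, prop = 0.75)
chem_train <- training(chem_split)
chem_test  <- testing(chem_split)   # hypothetical name for the held-out set

# Count missing values per column in the training set
n_missing <- colSums(is.na(chem_train))
n_missing[n_missing > 0]
```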

A quick review of the skim data summary below confirms that many of the manufacturing process variables have missing values, while the biological material variables are complete.

Data summary
Name: chem_train
Number of rows: 132
Number of columns: 58
Column type frequency: numeric 58
Group variables: None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Yield 0 1.00 40.37 1.86 35.25 38.98 40.25 41.71 46.34 ▁▆▇▃▁
BiologicalMaterial01 0 1.00 6.44 0.71 4.58 6.00 6.30 6.88 8.81 ▁▇▇▂▁
BiologicalMaterial02 0 1.00 55.76 4.14 46.87 52.70 55.52 58.74 64.75 ▂▇▆▅▃
BiologicalMaterial03 0 1.00 67.67 4.12 56.97 64.54 67.22 70.43 78.25 ▂▆▇▇▁
BiologicalMaterial04 0 1.00 12.43 1.82 9.38 11.25 12.11 13.22 23.09 ▇▆▁▁▁
BiologicalMaterial05 0 1.00 18.60 1.84 13.24 17.25 18.62 19.45 24.85 ▁▅▇▂▁
BiologicalMaterial06 0 1.00 48.90 3.82 40.60 46.01 48.55 51.44 59.38 ▂▇▆▅▁
BiologicalMaterial07 0 1.00 100.01 0.10 100.00 100.00 100.00 100.00 100.83 ▇▁▁▁▁
BiologicalMaterial08 0 1.00 17.52 0.67 15.88 17.09 17.48 17.88 19.14 ▁▆▇▃▂
BiologicalMaterial09 0 1.00 12.85 0.42 11.44 12.60 12.83 13.13 14.08 ▁▂▇▅▁
BiologicalMaterial10 0 1.00 2.83 0.60 1.87 2.46 2.75 3.05 6.87 ▇▅▁▁▁
BiologicalMaterial11 0 1.00 146.95 4.78 135.81 143.84 146.24 149.68 158.73 ▂▆▇▃▂
BiologicalMaterial12 0 1.00 20.20 0.79 18.35 19.74 20.21 20.66 22.18 ▂▆▇▃▂
ManufacturingProcess01 1 0.99 11.06 1.99 0.00 10.70 11.40 12.00 13.40 ▁▁▁▃▇
ManufacturingProcess02 3 0.98 16.33 8.68 0.00 19.00 20.90 21.40 22.50 ▂▁▁▁▇
ManufacturingProcess03 14 0.89 1.54 0.02 1.48 1.53 1.54 1.55 1.60 ▁▂▇▁▁
ManufacturingProcess04 1 0.99 931.35 6.51 911.00 926.00 934.00 936.00 942.00 ▁▃▆▇▇
ManufacturingProcess05 1 0.99 1002.26 31.37 923.00 987.15 999.70 1008.10 1175.30 ▁▇▁▁▁
ManufacturingProcess06 1 0.99 207.58 2.90 203.00 205.90 207.10 208.70 227.40 ▇▃▁▁▁
ManufacturingProcess07 1 0.99 177.47 0.50 177.00 177.00 177.00 178.00 178.00 ▇▁▁▁▇
ManufacturingProcess08 1 0.99 177.56 0.50 177.00 177.00 178.00 178.00 178.00 ▆▁▁▁▇
ManufacturingProcess09 0 1.00 45.70 1.51 39.02 44.95 45.73 46.57 49.04 ▁▁▃▇▃
ManufacturingProcess10 8 0.94 9.17 0.75 7.50 8.70 9.10 9.50 11.20 ▂▆▇▂▁
ManufacturingProcess11 9 0.93 9.41 0.73 7.50 9.00 9.40 9.90 11.30 ▁▅▇▆▁
ManufacturingProcess12 1 0.99 972.31 1872.00 0.00 0.00 0.00 0.00 4549.00 ▇▁▁▁▂
ManufacturingProcess13 0 1.00 34.53 1.05 32.10 33.90 34.60 35.20 38.60 ▃▇▇▁▁
ManufacturingProcess14 0 1.00 4856.66 55.63 4713.00 4828.75 4855.50 4891.50 5055.00 ▁▆▇▂▁
ManufacturingProcess15 0 1.00 6043.14 61.71 5904.00 6014.00 6034.50 6062.25 6233.00 ▂▇▇▂▁
ManufacturingProcess16 0 1.00 4594.56 66.49 4441.00 4559.00 4594.50 4623.25 4852.00 ▂▇▆▁▁
ManufacturingProcess17 0 1.00 34.33 1.23 31.30 33.50 34.35 35.10 40.00 ▂▇▆▁▁
ManufacturingProcess18 0 1.00 4799.12 423.56 0.00 4812.75 4834.00 4861.25 4971.00 ▁▁▁▁▇
ManufacturingProcess19 0 1.00 6027.97 47.31 5890.00 6000.50 6024.00 6048.25 6146.00 ▁▂▇▂▂
ManufacturingProcess20 0 1.00 4547.70 402.35 0.00 4551.75 4582.00 4613.25 4759.00 ▁▁▁▁▇
ManufacturingProcess21 0 1.00 -0.20 0.72 -1.80 -0.60 -0.20 0.00 3.00 ▂▇▁▁▁
ManufacturingProcess22 1 0.99 5.35 3.12 0.00 3.00 5.00 8.00 12.00 ▆▇▇▃▃
ManufacturingProcess23 1 0.99 3.04 1.63 0.00 2.00 3.00 4.00 6.00 ▇▆▇▇▇
ManufacturingProcess24 1 0.99 8.83 5.71 0.00 4.00 8.00 13.50 23.00 ▇▇▅▆▁
ManufacturingProcess25 2 0.98 4817.28 427.61 0.00 4829.75 4853.00 4873.75 4966.00 ▁▁▁▁▇
ManufacturingProcess26 2 0.98 6004.25 532.65 0.00 6019.50 6047.50 6073.00 6161.00 ▁▁▁▁▇
ManufacturingProcess27 2 0.98 4553.88 405.28 0.00 4558.50 4588.00 4612.75 4696.00 ▁▁▁▁▇
ManufacturingProcess28 2 0.98 6.70 5.24 0.00 0.00 10.40 10.70 11.50 ▅▁▁▁▇
ManufacturingProcess29 2 0.98 20.00 1.88 0.00 19.70 20.00 20.40 22.00 ▁▁▁▁▇
ManufacturingProcess30 2 0.98 9.19 1.06 0.00 8.83 9.20 9.78 10.90 ▁▁▁▂▇
ManufacturingProcess31 2 0.98 69.96 6.34 0.00 69.90 70.70 71.40 72.50 ▁▁▁▁▇
ManufacturingProcess32 0 1.00 158.65 5.73 143.00 155.00 158.00 162.00 173.00 ▁▅▇▃▁
ManufacturingProcess33 2 0.98 63.53 2.66 56.00 62.00 64.00 65.00 70.00 ▁▅▇▆▂
ManufacturingProcess34 2 0.98 2.50 0.06 2.30 2.50 2.50 2.50 2.60 ▁▂▁▇▂
ManufacturingProcess35 2 0.98 495.38 11.42 463.00 490.00 495.00 502.00 522.00 ▁▃▇▆▂
ManufacturingProcess36 2 0.98 0.02 0.00 0.02 0.02 0.02 0.02 0.02 ▂▇▆▁▃
ManufacturingProcess37 0 1.00 0.97 0.43 0.00 0.70 1.00 1.22 2.00 ▂▆▇▃▂
ManufacturingProcess38 0 1.00 2.54 0.58 0.00 2.00 3.00 3.00 3.00 ▁▁▁▆▇
ManufacturingProcess39 0 1.00 7.00 1.08 0.00 7.10 7.20 7.30 7.40 ▁▁▁▁▇
ManufacturingProcess40 1 0.99 0.02 0.04 0.00 0.00 0.00 0.00 0.10 ▇▁▁▁▂
ManufacturingProcess41 1 0.99 0.02 0.06 0.00 0.00 0.00 0.00 0.20 ▇▁▁▁▁
ManufacturingProcess42 0 1.00 11.33 1.44 0.00 11.40 11.60 11.70 12.10 ▁▁▁▁▇
ManufacturingProcess43 0 1.00 0.96 0.98 0.00 0.60 0.80 1.10 11.00 ▇▁▁▁▁
ManufacturingProcess44 0 1.00 1.83 0.24 0.00 1.80 1.90 1.90 2.10 ▁▁▁▁▇
ManufacturingProcess45 0 1.00 2.16 0.32 0.00 2.10 2.20 2.30 2.60 ▁▁▁▂▇

Here we create a tidymodels recipe that imputes missing data, applies a Box-Cox transformation, centers and scales all predictors, removes near-zero-variance predictors, and removes variables that have a large absolute correlation with other variables.
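The recipe code is not shown in the original. A sketch of a recipe with these five steps follows, with several assumptions: the seed and split are hypothetical, the correlation threshold is left at the recipes default because the author's value is not shown, and `step_impute_knn()` is the current name of the step that older recipes versions call `step_knnimpute()`.

```r
library(AppliedPredictiveModeling)
library(rsample)
library(recipes)

data(ChemicalManufacturingProcess)
set.seed(100)  # hypothetical seed
chem_train <- training(initial_split(ChemicalManufacturingProcess, prop = 0.75))

chem_recipe <- recipe(Yield ~ ., data = chem_train) |>
  step_impute_knn(all_predictors()) |>   # fill missing cells from nearest neighbors
  step_BoxCox(all_predictors()) |>       # only strictly positive columns are transformed
  step_normalize(all_predictors()) |>    # center and scale
  step_nzv(all_predictors()) |>          # drop near-zero-variance predictors
  step_corr(all_predictors())            # default threshold; the original is not shown

# Estimate the steps on the training set and inspect the processed data
chem_prepped <- prep(chem_recipe, training = chem_train)
chem_baked   <- bake(chem_prepped, new_data = NULL)
dim(chem_baked)
```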

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         57
## 
## Training data contained 132 data points and 19 incomplete rows. 
## 
## Operations:
## 
## K-nearest neighbor imputation for BiologicalMaterial02, ... [trained]
## Box-Cox transformation on BiologicalMaterial01, ... [trained]
## Centering and scaling for BiologicalMaterial01, ... [trained]
## Sparse, unbalanced variable filter removed BiologicalMaterial07 [trained]
## Correlation filter removed 9 items [trained]

We fit the model here.
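The fitting code is not shown. A sketch under stated assumptions: a glmnet lasso (`mixture = 1`, as in the workflow printed later) with the penalty fixed at 0.1, wrapped in a workflow with the recipe described above. The seed, split, and object names are hypothetical, so the coefficients will not match the author's exactly.

```r
library(AppliedPredictiveModeling)
library(tidymodels)

data(ChemicalManufacturingProcess)
set.seed(100)  # hypothetical seed
chem_train <- training(initial_split(ChemicalManufacturingProcess, prop = 0.75))

chem_recipe <- recipe(Yield ~ ., data = chem_train) |>
  step_impute_knn(all_predictors()) |>
  step_BoxCox(all_predictors()) |>
  step_normalize(all_predictors()) |>
  step_nzv(all_predictors()) |>
  step_corr(all_predictors())

# Lasso regression with a fixed penalty of 0.1
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) |>
  set_engine("glmnet")

lasso_fit <- workflow() |>
  add_recipe(chem_recipe) |>
  add_model(lasso_spec) |>
  fit(data = chem_train)

# Keep only the terms the lasso did not shrink to zero
lasso_coefs <- extract_fit_parsnip(lasso_fit) |>
  tidy() |>
  filter(estimate != 0) |>
  select(term, estimate)
lasso_coefs
```

Because the recipe centers every predictor, the intercept of such a fit equals the mean training yield.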

The table below lists the nonzero coefficients of the lasso regression model fitted with a penalty of 0.1.

term estimate
(Intercept) 40.371277995
BiologicalMaterial05 0.099299825
BiologicalMaterial06 0.037348595
ManufacturingProcess01 0.005829599
ManufacturingProcess04 0.118885915
ManufacturingProcess06 0.076358794
ManufacturingProcess09 0.535828930
ManufacturingProcess13 -0.098408877
ManufacturingProcess17 -0.345025939
ManufacturingProcess19 0.034451656
ManufacturingProcess22 0.012403652
ManufacturingProcess32 0.871837857
ManufacturingProcess34 0.065892780
ManufacturingProcess36 -0.158636957
ManufacturingProcess37 -0.024701225
ManufacturingProcess39 0.051818908

Next we use resampling to tune the penalty and plot the RMSE and R-squared across the candidate values.
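The tuning code is not shown. A sketch, assuming 10-fold cross-validation and a regular grid of 50 penalty values; the fold count, the grid range, the seeds, and all object names are hypothetical choices, not the author's.

```r
library(AppliedPredictiveModeling)
library(tidymodels)

data(ChemicalManufacturingProcess)
set.seed(100)  # hypothetical seed
chem_train <- training(initial_split(ChemicalManufacturingProcess, prop = 0.75))

chem_recipe <- recipe(Yield ~ ., data = chem_train) |>
  step_impute_knn(all_predictors()) |>
  step_BoxCox(all_predictors()) |>
  step_normalize(all_predictors()) |>
  step_nzv(all_predictors()) |>
  step_corr(all_predictors())

# Mark the penalty as the parameter to tune
lasso_wf <- workflow() |>
  add_recipe(chem_recipe) |>
  add_model(linear_reg(penalty = tune(), mixture = 1) |> set_engine("glmnet"))

set.seed(101)  # hypothetical seed for the folds
chem_folds <- vfold_cv(chem_train, v = 10)

# penalty() ranges are on the log10 scale: 10^-3 to 10^0 (assumed range)
penalty_grid <- grid_regular(penalty(range = c(-3, 0)), levels = 50)

lasso_tuned <- tune_grid(
  lasso_wf,
  resamples = chem_folds,
  grid      = penalty_grid,
  metrics   = metric_set(rmse, rsq)
)

autoplot(lasso_tuned)                       # RMSE and R-squared versus penalty
select_best(lasso_tuned, metric = "rmse")   # penalty with the lowest resampled RMSE
```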

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

## # A tibble: 1 x 2
##   penalty .config
##     <dbl> <chr>  
## 1   0.391 Model48
## == Workflow ==============================================================================================================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ------------------------------------------------------------------------------------------------------------------------------------------
## 5 Recipe Steps
## 
## * step_knnimpute()
## * step_BoxCox()
## * step_normalize()
## * step_nzv()
## * step_corr()
## 
## -- Model -------------------------------------------------------------------------------------------------------------------------------------------------
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.390693993705462
##   mixture = 1
## 
## Computational engine: glmnet
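The output above reports the selected penalty and the finalized workflow, but not the test-set metric. The sketch below shows one way it could be computed, plugging in the reported penalty (about 0.3907) rather than re-running the tuning. The seed and split are hypothetical, so the resulting numbers will not reproduce the author's.

```r
library(AppliedPredictiveModeling)
library(tidymodels)

data(ChemicalManufacturingProcess)
set.seed(100)  # hypothetical seed
chem_split <- initial_split(ChemicalManufacturingProcess, prop = 0.75)
chem_train <- training(chem_split)

chem_recipe <- recipe(Yield ~ ., data = chem_train) |>
  step_impute_knn(all_predictors()) |>
  step_BoxCox(all_predictors()) |>
  step_normalize(all_predictors()) |>
  step_nzv(all_predictors()) |>
  step_corr(all_predictors())

# Penalty taken from the select_best() output shown above
final_wf <- workflow() |>
  add_recipe(chem_recipe) |>
  add_model(linear_reg(penalty = 0.3907, mixture = 1) |> set_engine("glmnet"))

# Refit on the training set, then evaluate once on the held-out test set
final_fit    <- last_fit(final_wf, split = chem_split, metrics = metric_set(rmse, rsq))
test_metrics <- collect_metrics(final_fit)
test_metrics
```

Setting the test RMSE next to the resampled RMSE at the same penalty answers the comparison asked for in (d).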

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The plot below shows the absolute values of the model coefficients. It is dominated by manufacturing process predictors.
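The plot itself is not reproduced here. A sketch of how such a ranking could be drawn, assuming importance is measured by the absolute lasso coefficient on the standardized predictors at the reported penalty; the seed, split, and object names are hypothetical.

```r
library(AppliedPredictiveModeling)
library(tidymodels)

data(ChemicalManufacturingProcess)
set.seed(100)  # hypothetical seed
chem_train <- training(initial_split(ChemicalManufacturingProcess, prop = 0.75))

chem_recipe <- recipe(Yield ~ ., data = chem_train) |>
  step_impute_knn(all_predictors()) |>
  step_BoxCox(all_predictors()) |>
  step_normalize(all_predictors()) |>
  step_nzv(all_predictors()) |>
  step_corr(all_predictors())

final_fit <- workflow() |>
  add_recipe(chem_recipe) |>
  add_model(linear_reg(penalty = 0.3907, mixture = 1) |> set_engine("glmnet")) |>
  fit(data = chem_train)

# Nonzero coefficients, labeled by predictor family
coef_tbl <- extract_fit_parsnip(final_fit) |>
  tidy() |>
  filter(term != "(Intercept)", estimate != 0) |>
  mutate(
    abs_estimate = abs(estimate),
    family = if_else(grepl("^Biological", term), "Biological", "Manufacturing process")
  )

importance_plot <- ggplot(coef_tbl,
                          aes(x = abs_estimate, y = reorder(term, abs_estimate), fill = family)) +
  geom_col() +
  labs(x = "Absolute coefficient", y = NULL, fill = "Predictor family")
importance_plot
```

Coefficients are comparable across predictors here only because the recipe puts them all on the same scale.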

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

As in efficient portfolio theory, this covariance information should help process engineers optimize the yield of the chemical process more easily.
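One way to examine these relationships is sketched below for the four predictors with the largest absolute coefficients in the penalty-0.1 table above. It uses the full data set for simplicity, which is an assumption; the author's own exploration is not shown.

```r
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# Largest absolute coefficients in the fitted-model table
top_predictors <- c("ManufacturingProcess32", "ManufacturingProcess09",
                    "ManufacturingProcess17", "ManufacturingProcess36")

# Correlation of each top predictor with yield, skipping missing cells
yield_cor <- sapply(top_predictors, function(v) {
  cor(ChemicalManufacturingProcess[[v]],
      ChemicalManufacturingProcess$Yield,
      use = "complete.obs")
})
round(yield_cor, 2)

# Scatter plots of yield against each predictor with a least-squares line
op <- par(mfrow = c(2, 2))
for (v in top_predictors) {
  x <- ChemicalManufacturingProcess[[v]]
  y <- ChemicalManufacturingProcess$Yield
  plot(x, y, xlab = v, ylab = "Yield")
  abline(lm(y ~ x), col = "red")
}
par(op)
```

Since all four are manufacturing process predictors, which the problem statement says can be changed, the sign of each relationship suggests the direction in which a setting could be adjusted in future runs.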