A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
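The commands themselves are not reproduced in this write-up. A minimal sketch, assuming the data come from the AppliedPredictiveModeling package (its `ChemicalManufacturingProcess` object is the one used in the code at the end of this document), could look like this:

```r
# Assumption: the AppliedPredictiveModeling package is installed.
library(AppliedPredictiveModeling)

# Loads a data frame named ChemicalManufacturingProcess into the workspace.
data(ChemicalManufacturingProcess)

# Quick check of the dimensions and the first few column names.
dim(ChemicalManufacturingProcess)
head(names(ChemicalManufacturingProcess))
```

The first column is the response `Yield`; the remaining columns are the biological and manufacturing process predictors.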
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
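A quick review of the skim data summary below confirms that missing data are confined to the manufacturing process predictors: Yield and all 12 biological predictors are complete, while 27 of the 45 process predictors have at least one missing value, most notably ManufacturingProcess03 (14 missing).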
Data summary
| | |
|---|---|
| Name | chem_train |
| Number of rows | 132 |
| Number of columns | 58 |
| Column type frequency: | |
| numeric | 58 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Yield | 0 | 1.00 | 40.37 | 1.86 | 35.25 | 38.98 | 40.25 | 41.71 | 46.34 | ▁▆▇▃▁ |
| BiologicalMaterial01 | 0 | 1.00 | 6.44 | 0.71 | 4.58 | 6.00 | 6.30 | 6.88 | 8.81 | ▁▇▇▂▁ |
| BiologicalMaterial02 | 0 | 1.00 | 55.76 | 4.14 | 46.87 | 52.70 | 55.52 | 58.74 | 64.75 | ▂▇▆▅▃ |
| BiologicalMaterial03 | 0 | 1.00 | 67.67 | 4.12 | 56.97 | 64.54 | 67.22 | 70.43 | 78.25 | ▂▆▇▇▁ |
| BiologicalMaterial04 | 0 | 1.00 | 12.43 | 1.82 | 9.38 | 11.25 | 12.11 | 13.22 | 23.09 | ▇▆▁▁▁ |
| BiologicalMaterial05 | 0 | 1.00 | 18.60 | 1.84 | 13.24 | 17.25 | 18.62 | 19.45 | 24.85 | ▁▅▇▂▁ |
| BiologicalMaterial06 | 0 | 1.00 | 48.90 | 3.82 | 40.60 | 46.01 | 48.55 | 51.44 | 59.38 | ▂▇▆▅▁ |
| BiologicalMaterial07 | 0 | 1.00 | 100.01 | 0.10 | 100.00 | 100.00 | 100.00 | 100.00 | 100.83 | ▇▁▁▁▁ |
| BiologicalMaterial08 | 0 | 1.00 | 17.52 | 0.67 | 15.88 | 17.09 | 17.48 | 17.88 | 19.14 | ▁▆▇▃▂ |
| BiologicalMaterial09 | 0 | 1.00 | 12.85 | 0.42 | 11.44 | 12.60 | 12.83 | 13.13 | 14.08 | ▁▂▇▅▁ |
| BiologicalMaterial10 | 0 | 1.00 | 2.83 | 0.60 | 1.87 | 2.46 | 2.75 | 3.05 | 6.87 | ▇▅▁▁▁ |
| BiologicalMaterial11 | 0 | 1.00 | 146.95 | 4.78 | 135.81 | 143.84 | 146.24 | 149.68 | 158.73 | ▂▆▇▃▂ |
| BiologicalMaterial12 | 0 | 1.00 | 20.20 | 0.79 | 18.35 | 19.74 | 20.21 | 20.66 | 22.18 | ▂▆▇▃▂ |
| ManufacturingProcess01 | 1 | 0.99 | 11.06 | 1.99 | 0.00 | 10.70 | 11.40 | 12.00 | 13.40 | ▁▁▁▃▇ |
| ManufacturingProcess02 | 3 | 0.98 | 16.33 | 8.68 | 0.00 | 19.00 | 20.90 | 21.40 | 22.50 | ▂▁▁▁▇ |
| ManufacturingProcess03 | 14 | 0.89 | 1.54 | 0.02 | 1.48 | 1.53 | 1.54 | 1.55 | 1.60 | ▁▂▇▁▁ |
| ManufacturingProcess04 | 1 | 0.99 | 931.35 | 6.51 | 911.00 | 926.00 | 934.00 | 936.00 | 942.00 | ▁▃▆▇▇ |
| ManufacturingProcess05 | 1 | 0.99 | 1002.26 | 31.37 | 923.00 | 987.15 | 999.70 | 1008.10 | 1175.30 | ▁▇▁▁▁ |
| ManufacturingProcess06 | 1 | 0.99 | 207.58 | 2.90 | 203.00 | 205.90 | 207.10 | 208.70 | 227.40 | ▇▃▁▁▁ |
| ManufacturingProcess07 | 1 | 0.99 | 177.47 | 0.50 | 177.00 | 177.00 | 177.00 | 178.00 | 178.00 | ▇▁▁▁▇ |
| ManufacturingProcess08 | 1 | 0.99 | 177.56 | 0.50 | 177.00 | 177.00 | 178.00 | 178.00 | 178.00 | ▆▁▁▁▇ |
| ManufacturingProcess09 | 0 | 1.00 | 45.70 | 1.51 | 39.02 | 44.95 | 45.73 | 46.57 | 49.04 | ▁▁▃▇▃ |
| ManufacturingProcess10 | 8 | 0.94 | 9.17 | 0.75 | 7.50 | 8.70 | 9.10 | 9.50 | 11.20 | ▂▆▇▂▁ |
| ManufacturingProcess11 | 9 | 0.93 | 9.41 | 0.73 | 7.50 | 9.00 | 9.40 | 9.90 | 11.30 | ▁▅▇▆▁ |
| ManufacturingProcess12 | 1 | 0.99 | 972.31 | 1872.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4549.00 | ▇▁▁▁▂ |
| ManufacturingProcess13 | 0 | 1.00 | 34.53 | 1.05 | 32.10 | 33.90 | 34.60 | 35.20 | 38.60 | ▃▇▇▁▁ |
| ManufacturingProcess14 | 0 | 1.00 | 4856.66 | 55.63 | 4713.00 | 4828.75 | 4855.50 | 4891.50 | 5055.00 | ▁▆▇▂▁ |
| ManufacturingProcess15 | 0 | 1.00 | 6043.14 | 61.71 | 5904.00 | 6014.00 | 6034.50 | 6062.25 | 6233.00 | ▂▇▇▂▁ |
| ManufacturingProcess16 | 0 | 1.00 | 4594.56 | 66.49 | 4441.00 | 4559.00 | 4594.50 | 4623.25 | 4852.00 | ▂▇▆▁▁ |
| ManufacturingProcess17 | 0 | 1.00 | 34.33 | 1.23 | 31.30 | 33.50 | 34.35 | 35.10 | 40.00 | ▂▇▆▁▁ |
| ManufacturingProcess18 | 0 | 1.00 | 4799.12 | 423.56 | 0.00 | 4812.75 | 4834.00 | 4861.25 | 4971.00 | ▁▁▁▁▇ |
| ManufacturingProcess19 | 0 | 1.00 | 6027.97 | 47.31 | 5890.00 | 6000.50 | 6024.00 | 6048.25 | 6146.00 | ▁▂▇▂▂ |
| ManufacturingProcess20 | 0 | 1.00 | 4547.70 | 402.35 | 0.00 | 4551.75 | 4582.00 | 4613.25 | 4759.00 | ▁▁▁▁▇ |
| ManufacturingProcess21 | 0 | 1.00 | -0.20 | 0.72 | -1.80 | -0.60 | -0.20 | 0.00 | 3.00 | ▂▇▁▁▁ |
| ManufacturingProcess22 | 1 | 0.99 | 5.35 | 3.12 | 0.00 | 3.00 | 5.00 | 8.00 | 12.00 | ▆▇▇▃▃ |
| ManufacturingProcess23 | 1 | 0.99 | 3.04 | 1.63 | 0.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▆▇▇▇ |
| ManufacturingProcess24 | 1 | 0.99 | 8.83 | 5.71 | 0.00 | 4.00 | 8.00 | 13.50 | 23.00 | ▇▇▅▆▁ |
| ManufacturingProcess25 | 2 | 0.98 | 4817.28 | 427.61 | 0.00 | 4829.75 | 4853.00 | 4873.75 | 4966.00 | ▁▁▁▁▇ |
| ManufacturingProcess26 | 2 | 0.98 | 6004.25 | 532.65 | 0.00 | 6019.50 | 6047.50 | 6073.00 | 6161.00 | ▁▁▁▁▇ |
| ManufacturingProcess27 | 2 | 0.98 | 4553.88 | 405.28 | 0.00 | 4558.50 | 4588.00 | 4612.75 | 4696.00 | ▁▁▁▁▇ |
| ManufacturingProcess28 | 2 | 0.98 | 6.70 | 5.24 | 0.00 | 0.00 | 10.40 | 10.70 | 11.50 | ▅▁▁▁▇ |
| ManufacturingProcess29 | 2 | 0.98 | 20.00 | 1.88 | 0.00 | 19.70 | 20.00 | 20.40 | 22.00 | ▁▁▁▁▇ |
| ManufacturingProcess30 | 2 | 0.98 | 9.19 | 1.06 | 0.00 | 8.83 | 9.20 | 9.78 | 10.90 | ▁▁▁▂▇ |
| ManufacturingProcess31 | 2 | 0.98 | 69.96 | 6.34 | 0.00 | 69.90 | 70.70 | 71.40 | 72.50 | ▁▁▁▁▇ |
| ManufacturingProcess32 | 0 | 1.00 | 158.65 | 5.73 | 143.00 | 155.00 | 158.00 | 162.00 | 173.00 | ▁▅▇▃▁ |
| ManufacturingProcess33 | 2 | 0.98 | 63.53 | 2.66 | 56.00 | 62.00 | 64.00 | 65.00 | 70.00 | ▁▅▇▆▂ |
| ManufacturingProcess34 | 2 | 0.98 | 2.50 | 0.06 | 2.30 | 2.50 | 2.50 | 2.50 | 2.60 | ▁▂▁▇▂ |
| ManufacturingProcess35 | 2 | 0.98 | 495.38 | 11.42 | 463.00 | 490.00 | 495.00 | 502.00 | 522.00 | ▁▃▇▆▂ |
| ManufacturingProcess36 | 2 | 0.98 | 0.02 | 0.00 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | ▂▇▆▁▃ |
| ManufacturingProcess37 | 0 | 1.00 | 0.97 | 0.43 | 0.00 | 0.70 | 1.00 | 1.22 | 2.00 | ▂▆▇▃▂ |
| ManufacturingProcess38 | 0 | 1.00 | 2.54 | 0.58 | 0.00 | 2.00 | 3.00 | 3.00 | 3.00 | ▁▁▁▆▇ |
| ManufacturingProcess39 | 0 | 1.00 | 7.00 | 1.08 | 0.00 | 7.10 | 7.20 | 7.30 | 7.40 | ▁▁▁▁▇ |
| ManufacturingProcess40 | 1 | 0.99 | 0.02 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | ▇▁▁▁▂ |
| ManufacturingProcess41 | 1 | 0.99 | 0.02 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | ▇▁▁▁▁ |
| ManufacturingProcess42 | 0 | 1.00 | 11.33 | 1.44 | 0.00 | 11.40 | 11.60 | 11.70 | 12.10 | ▁▁▁▁▇ |
| ManufacturingProcess43 | 0 | 1.00 | 0.96 | 0.98 | 0.00 | 0.60 | 0.80 | 1.10 | 11.00 | ▇▁▁▁▁ |
| ManufacturingProcess44 | 0 | 1.00 | 1.83 | 0.24 | 0.00 | 1.80 | 1.90 | 1.90 | 2.10 | ▁▁▁▁▇ |
| ManufacturingProcess45 | 0 | 1.00 | 2.16 | 0.32 | 0.00 | 2.10 | 2.20 | 2.30 | 2.60 | ▁▁▁▂▇ |
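The missing-value counts and complete rates reported in the summary above can also be computed in base R. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) rather than `chem_train`, purely to show the calculation:

```r
# Hypothetical stand-in for chem_train; the values are illustrative only.
toy <- data.frame(
  Yield                  = c(38.0, 42.4, 42.0, 41.4),
  ManufacturingProcess01 = c(NA, 0.0, 12.4, 12.5),
  ManufacturingProcess03 = c(NA, NA, 1.56, 1.55)
)

# Missing cells per column and the share of observed cells per column.
n_missing     <- colSums(is.na(toy))
complete_rate <- 1 - n_missing / nrow(toy)

# Rows that contain at least one missing value.
incomplete_rows <- sum(!complete.cases(toy))

print(data.frame(n_missing, complete_rate))
print(incomplete_rows)
```

Applied to the training set, the last quantity corresponds to the count of incomplete rows that the recipe reports.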
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 57
##
## Training data contained 132 data points and 19 incomplete rows.
##
## Operations:
##
## K-nearest neighbor imputation for BiologicalMaterial02, ... [trained]
## Box-Cox transformation on BiologicalMaterial01, ... [trained]
## Centering and scaling for BiologicalMaterial01, ... [trained]
## Sparse, unbalanced variable filter removed BiologicalMaterial07 [trained]
## Correlation filter removed 9 items [trained]
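The code that produced this recipe is not shown. A sketch that would yield the same sequence of steps, assuming the tidymodels and AppliedPredictiveModeling packages, is given below. The seed, the 75/25 split (chosen because it gives 132 of 176 rows), and the use of default step arguments are assumptions, not the author's confirmed settings.

```r
library(tidymodels)
library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)

set.seed(123)  # assumed seed; the original is not shown
chem_split <- initial_split(ChemicalManufacturingProcess, prop = 0.75)
chem_train <- training(chem_split)
chem_test  <- testing(chem_split)

chem_rec <- recipe(Yield ~ ., data = chem_train) %>%
  # Printed as step_knnimpute() by older versions of recipes.
  step_impute_knn(all_predictors()) %>%
  # Box-Cox is skipped (with a warning) for columns with non-positive values.
  step_BoxCox(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_predictors())

# Estimate the steps on the training set and return the processed data.
chem_prep  <- prep(chem_rec, training = chem_train)
chem_baked <- bake(chem_prep, new_data = NULL)
dim(chem_baked)
```

Because the split depends on the seed, the exact columns removed by the filters may differ from the output above.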
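We first fit the model with a fixed penalty of 0.1. The table below lists the resulting coefficient estimates. Only 15 predictors appear, which is consistent with a lasso fit (mixture = 1) shrinking the remaining coefficients to zero.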
| term | estimate |
|---|---|
| (Intercept) | 40.371277995 |
| BiologicalMaterial05 | 0.099299825 |
| BiologicalMaterial06 | 0.037348595 |
| ManufacturingProcess01 | 0.005829599 |
| ManufacturingProcess04 | 0.118885915 |
| ManufacturingProcess06 | 0.076358794 |
| ManufacturingProcess09 | 0.535828930 |
| ManufacturingProcess13 | -0.098408877 |
| ManufacturingProcess17 | -0.345025939 |
| ManufacturingProcess19 | 0.034451656 |
| ManufacturingProcess22 | 0.012403652 |
| ManufacturingProcess32 | 0.871837857 |
| ManufacturingProcess34 | 0.065892780 |
| ManufacturingProcess36 | -0.158636957 |
| ManufacturingProcess37 | -0.024701225 |
| ManufacturingProcess39 | 0.051818908 |
Next, we use resampling to tune the penalty and plot the RMSE and R-squared across the candidate values. The selected penalty and the finalized workflow are printed below.

## # A tibble: 1 x 2
## penalty .config
## <dbl> <chr>
## 1 0.391 Model48
## == Workflow ==============================================================================================================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ------------------------------------------------------------------------------------------------------------------------------------------
## 5 Recipe Steps
##
## * step_knnimpute()
## * step_BoxCox()
## * step_normalize()
## * step_nzv()
## * step_corr()
##
## -- Model -------------------------------------------------------------------------------------------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = 0.390693993705462
## mixture = 1
##
## Computational engine: glmnet
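The tuning code is not shown either. The sketch below is one way to arrive at a workflow of this shape with tidymodels. The seed, the 10-fold cross-validation, and the selection by RMSE are assumptions. The regular grid of 50 penalty values is also an assumption, although it is consistent with the output above: on the default log10 range from -10 to 0, the 48th of 50 levels is 10^(-10 + 47 * 10 / 49), about 0.391.

```r
library(tidymodels)
library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)

set.seed(123)  # assumed seed
chem_train <- training(initial_split(ChemicalManufacturingProcess, prop = 0.75))
folds <- vfold_cv(chem_train, v = 10)  # assumed resampling scheme

chem_rec <- recipe(Yield ~ ., data = chem_train) %>%
  step_impute_knn(all_predictors()) %>%
  step_BoxCox(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_predictors())

# Lasso: mixture = 1, with the penalty left open for tuning.
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

chem_wf <- workflow() %>%
  add_recipe(chem_rec) %>%
  add_model(lasso_spec)

penalty_grid <- grid_regular(penalty(), levels = 50)  # assumed grid

tuned <- tune_grid(
  chem_wf,
  resamples = folds,
  grid = penalty_grid,
  metrics = metric_set(rmse, rsq)
)

autoplot(tuned)  # RMSE and R-squared against the penalty

best_penalty <- select_best(tuned, metric = "rmse")
final_wf <- finalize_workflow(chem_wf, best_penalty)
best_penalty
```

With a different seed or resampling scheme, the selected penalty will generally differ from 0.391.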
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The plot below shows the absolute value of each predictor's coefficient in the fitted model. It is dominated by manufacturing process predictors.
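As a rough cross-check, the same kind of ranking can be computed by hand. The sketch below uses the coefficient estimates listed earlier for the penalty 0.1 fit, since the coefficients of the tuned model are not printed in this document. Because the predictors were centered and scaled, the coefficient magnitudes are comparable across predictors.

```r
# Coefficients copied from the penalty = 0.1 table (intercept omitted).
coefs <- c(
  BiologicalMaterial05   =  0.099299825,
  BiologicalMaterial06   =  0.037348595,
  ManufacturingProcess01 =  0.005829599,
  ManufacturingProcess04 =  0.118885915,
  ManufacturingProcess06 =  0.076358794,
  ManufacturingProcess09 =  0.535828930,
  ManufacturingProcess13 = -0.098408877,
  ManufacturingProcess17 = -0.345025939,
  ManufacturingProcess19 =  0.034451656,
  ManufacturingProcess22 =  0.012403652,
  ManufacturingProcess32 =  0.871837857,
  ManufacturingProcess34 =  0.065892780,
  ManufacturingProcess36 = -0.158636957,
  ManufacturingProcess37 = -0.024701225,
  ManufacturingProcess39 =  0.051818908
)

# Rank predictors by absolute coefficient size.
importance <- sort(abs(coefs), decreasing = TRUE)
print(head(importance, 5))

# Count how many retained predictors belong to each group.
group <- ifelse(grepl("^ManufacturingProcess", names(coefs)),
                "process", "biological")
print(table(group))
```

In this fit, 13 of the 15 retained predictors are process variables, and the five largest coefficients all belong to process variables, led by ManufacturingProcess32.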

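library(AppliedPredictiveModeling)  # provides ChemicalManufacturingProcess
library(caret)     # preProcess(), nearZeroVar()
library(dplyr)     # %>%, select()
library(RANN)      # nearest-neighbour search used by caret's knnImpute
library(corrplot)
data(ChemicalManufacturingProcess)
# Impute missing values with k-nearest neighbours (this also centers and scales the data)
knn_model <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(knn_model, ChemicalManufacturingProcess)
# Drop near-zero-variance columns
df <- df %>%
  select(-all_of(nearZeroVar(., names = TRUE)))
# Correlation plot of selected predictors and Yield
df %>%
  select(all_of(c('ManufacturingProcess32', 'ManufacturingProcess30', 'ManufacturingProcess33', 'ManufacturingProcess25', 'BiologicalMaterial03',
                  'BiologicalMaterial10', 'BiologicalMaterial06', 'BiologicalMaterial01', 'ManufacturingProcess28', 'ManufacturingProcess14', 'Yield'))) %>%
  cor() %>%
  corrplot(method = 'square')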
