6.3.

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

(a) Start R and use these commands to load the data:

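library(AppliedPredictiveModeling)  # provides ChemicalManufacturingProcess
library(caret)                      # preProcess, train, varImp, etc. used below
data(ChemicalManufacturingProcess)
psych::describe(ChemicalManufacturingProcess)  # per-column summary statistics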
##                        vars   n    mean      sd  median trimmed   mad     min
## Yield                     1 176   40.18    1.85   39.97   40.12  1.97   35.25
## BiologicalMaterial01      2 176    6.41    0.71    6.30    6.39  0.67    4.58
## BiologicalMaterial02      3 176   55.69    4.03   55.09   55.58  4.58   46.87
## BiologicalMaterial03      4 176   67.70    4.00   67.22   67.68  4.28   56.97
## BiologicalMaterial04      5 176   12.35    1.77   12.10   12.19  1.37    9.38
## BiologicalMaterial05      6 176   18.60    1.84   18.49   18.55  1.88   13.24
## BiologicalMaterial06      7 176   48.91    3.75   48.46   48.74  3.94   40.60
## BiologicalMaterial07      8 176  100.01    0.11  100.00  100.00  0.00  100.00
## BiologicalMaterial08      9 176   17.49    0.68   17.51   17.47  0.59   15.88
## BiologicalMaterial09     10 176   12.85    0.42   12.84   12.86  0.42   11.44
## BiologicalMaterial10     11 176    2.80    0.60    2.71    2.73  0.40    1.77
## BiologicalMaterial11     12 176  146.95    4.82  146.08  146.79  4.11  135.81
## BiologicalMaterial12     13 176   20.20    0.77   20.12   20.18  0.67   18.35
## ManufacturingProcess01   14 175   11.21    1.82   11.40   11.41  1.04    0.00
## ManufacturingProcess02   15 173   16.68    8.47   21.00   18.06  1.48    0.00
## ManufacturingProcess03   16 161    1.54    0.02    1.54    1.54  0.01    1.47
## ManufacturingProcess04   17 175  931.85    6.27  934.00  932.28  5.93  911.00
## ManufacturingProcess05   18 175 1001.69   30.53  999.20  998.62 17.35  923.00
## ManufacturingProcess06   19 174  207.40    2.70  206.80  207.09  1.93  203.00
## ManufacturingProcess07   20 175  177.48    0.50  177.00  177.48  0.00  177.00
## ManufacturingProcess08   21 175  177.55    0.50  178.00  177.57  0.00  177.00
## ManufacturingProcess09   22 176   45.66    1.55   45.73   45.72  1.22   38.89
## ManufacturingProcess10   23 167    9.18    0.77    9.10    9.13  0.59    7.50
## ManufacturingProcess11   24 166    9.39    0.72    9.40    9.39  0.67    7.50
## ManufacturingProcess12   25 175  857.81 1784.53    0.00  516.20  0.00    0.00
## ManufacturingProcess13   26 176   34.51    1.02   34.60   34.51  0.89   32.10
## ManufacturingProcess14   27 175 4853.87   54.52 4856.00 4854.57 40.03 4701.00
## ManufacturingProcess15   28 176 6038.92   58.31 6031.50 6035.52 40.77 5904.00
## ManufacturingProcess16   29 176 4565.80  351.70 4588.00 4588.36 43.00    0.00
## ManufacturingProcess17   30 176   34.34    1.25   34.40   34.31  1.19   31.30
## ManufacturingProcess18   31 176 4809.68  367.48 4835.00 4837.07 34.84    0.00
## ManufacturingProcess19   32 176 6028.20   45.58 6022.00 6026.15 36.32 5890.00
## ManufacturingProcess20   33 176 4556.46  349.01 4582.00 4580.98 43.00    0.00
## ManufacturingProcess21   34 176   -0.16    0.78   -0.30   -0.26  0.44   -1.80
## ManufacturingProcess22   35 175    5.41    3.33    5.00    5.25  4.45    0.00
## ManufacturingProcess23   36 175    3.02    1.66    3.00    2.94  1.48    0.00
## ManufacturingProcess24   37 175    8.83    5.80    8.00    8.57  7.41    0.00
## ManufacturingProcess25   38 171 4828.18  373.48 4855.00 4855.56 34.10    0.00
## ManufacturingProcess26   39 171 6015.60  464.87 6047.00 6048.55 38.55    0.00
## ManufacturingProcess27   40 171 4562.51  353.98 4587.00 4587.45 35.58    0.00
## ManufacturingProcess28   41 171    6.59    5.25   10.40    6.82  1.04    0.00
## ManufacturingProcess29   42 171   20.01    1.66   19.90   20.04  0.44    0.00
## ManufacturingProcess30   43 171    9.16    0.98    9.10    9.21  0.74    0.00
## ManufacturingProcess31   44 171   70.18    5.56   70.80   70.72  0.89    0.00
## ManufacturingProcess32   45 176  158.47    5.40  158.00  158.34  4.45  143.00
## ManufacturingProcess33   46 171   63.54    2.48   64.00   63.55  1.48   56.00
## ManufacturingProcess34   47 171    2.49    0.05    2.50    2.49  0.00    2.30
## ManufacturingProcess35   48 171  495.60   10.82  495.00  495.74  8.90  463.00
## ManufacturingProcess36   49 171    0.02    0.00    0.02    0.02  0.00    0.02
## ManufacturingProcess37   50 176    1.01    0.45    1.00    1.00  0.44    0.00
## ManufacturingProcess38   51 176    2.53    0.65    3.00    2.61  0.00    0.00
## ManufacturingProcess39   52 176    6.85    1.51    7.20    7.17  0.15    0.00
## ManufacturingProcess40   53 175    0.02    0.04    0.00    0.01  0.00    0.00
## ManufacturingProcess41   54 175    0.02    0.05    0.00    0.01  0.00    0.00
## ManufacturingProcess42   55 176   11.21    1.94   11.60   11.54  0.30    0.00
## ManufacturingProcess43   56 176    0.91    0.87    0.80    0.81  0.30    0.00
## ManufacturingProcess44   57 176    1.81    0.32    1.90    1.85  0.15    0.00
## ManufacturingProcess45   58 176    2.14    0.41    2.20    2.20  0.15    0.00
##                            max   range   skew kurtosis     se
## Yield                    46.34   11.09   0.31    -0.11   0.14
## BiologicalMaterial01      8.81    4.23   0.27     0.46   0.05
## BiologicalMaterial02     64.75   17.88   0.24    -0.71   0.30
## BiologicalMaterial03     78.25   21.28   0.03    -0.12   0.30
## BiologicalMaterial04     23.09   13.71   1.73     7.06   0.13
## BiologicalMaterial05     24.85   11.61   0.30     0.22   0.14
## BiologicalMaterial06     59.38   18.78   0.37    -0.37   0.28
## BiologicalMaterial07    100.83    0.83   7.40    53.04   0.01
## BiologicalMaterial08     19.14    3.26   0.22     0.06   0.05
## BiologicalMaterial09     14.08    2.64  -0.27     0.29   0.03
## BiologicalMaterial10      6.87    5.10   2.40    11.65   0.05
## BiologicalMaterial11    158.73   22.92   0.36     0.02   0.36
## BiologicalMaterial12     22.21    3.86   0.30     0.01   0.06
## ManufacturingProcess01   14.10   14.10  -3.92    21.87   0.14
## ManufacturingProcess02   22.50   22.50  -1.43     0.11   0.64
## ManufacturingProcess03    1.60    0.13  -0.48     1.73   0.00
## ManufacturingProcess04  946.00   35.00  -0.70     0.06   0.47
## ManufacturingProcess05 1175.30  252.30   2.59    11.74   2.31
## ManufacturingProcess06  227.40   24.40   3.04    17.38   0.20
## ManufacturingProcess07  178.00    1.00   0.08    -2.01   0.04
## ManufacturingProcess08  178.00    1.00  -0.22    -1.96   0.04
## ManufacturingProcess09   49.36   10.47  -0.94     3.27   0.12
## ManufacturingProcess10   11.60    4.10   0.65     0.63   0.06
## ManufacturingProcess11   11.50    4.00  -0.02     0.32   0.06
## ManufacturingProcess12 4549.00 4549.00   1.58     0.50 134.90
## ManufacturingProcess13   38.60    6.50   0.48     1.96   0.08
## ManufacturingProcess14 5055.00  354.00  -0.01     1.08   4.12
## ManufacturingProcess15 6233.00  329.00   0.67     1.22   4.40
## ManufacturingProcess16 4852.00 4852.00 -12.42   158.40  26.51
## ManufacturingProcess17   40.00    8.70   1.16     4.66   0.09
## ManufacturingProcess18 4971.00 4971.00 -12.74   163.74  27.70
## ManufacturingProcess19 6146.00  256.00   0.30     0.30   3.44
## ManufacturingProcess20 4759.00 4759.00 -12.64   162.07  26.31
## ManufacturingProcess21    3.60    5.40   1.73     5.03   0.06
## ManufacturingProcess22   12.00   12.00   0.31    -1.02   0.25
## ManufacturingProcess23    6.00    6.00   0.20    -1.00   0.13
## ManufacturingProcess24   23.00   23.00   0.36    -1.02   0.44
## ManufacturingProcess25 4990.00 4990.00 -12.63   160.33  28.56
## ManufacturingProcess26 6161.00 6161.00 -12.67   160.98  35.55
## ManufacturingProcess27 4710.00 4710.00 -12.52   158.39  27.07
## ManufacturingProcess28   11.50   11.50  -0.46    -1.79   0.40
## ManufacturingProcess29   22.00   22.00 -10.08   119.44   0.13
## ManufacturingProcess30   11.20   11.20  -4.76    43.08   0.07
## ManufacturingProcess31   72.50   72.50 -11.82   146.01   0.42
## ManufacturingProcess32  173.00   30.00   0.21     0.06   0.41
## ManufacturingProcess33   70.00   14.00  -0.13     0.27   0.19
## ManufacturingProcess34    2.60    0.30  -0.26     1.00   0.00
## ManufacturingProcess35  522.00   59.00  -0.16     0.41   0.83
## ManufacturingProcess36    0.02    0.00   0.15    -0.06   0.00
## ManufacturingProcess37    2.30    2.30   0.38     0.07   0.03
## ManufacturingProcess38    3.00    3.00  -1.68     3.92   0.05
## ManufacturingProcess39    7.50    7.50  -4.27    16.50   0.11
## ManufacturingProcess40    0.10    0.10   1.68     0.82   0.00
## ManufacturingProcess41    0.20    0.20   2.17     3.63   0.00
## ManufacturingProcess42   12.10   12.10  -5.45    28.53   0.15
## ManufacturingProcess43   11.00   11.00   9.05   101.03   0.07
## ManufacturingProcess44    2.10    2.10  -4.97    25.09   0.02
## ManufacturingProcess45    2.60    2.60  -4.08    18.76   0.03

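The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run. In the data as loaded here, both arrive in the single data frame ChemicalManufacturingProcess, so we split it into features and yield ourselves before plotting the correlations.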

features <- subset(ChemicalManufacturingProcess,select= -Yield)
yield <- subset(ChemicalManufacturingProcess,select=Yield)
correlations <- cor(cbind(yield,features),use="pairwise.complete.obs")
corrplot::corrplot(correlations, type="lower", tl.cex = 0.5) 
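
With 58 variables the correlation plot is dense, so it can help to pull out just the Yield column and sort it. The following is a minimal base R sketch on a made-up toy data frame (`toy` and its columns `A`, `B`, `C` are hypothetical stand-ins for `cbind(yield, features)`), not output from the real data.

```r
# Toy stand-in for cbind(yield, features); all values are made up.
toy <- data.frame(
  Yield = c(38, 39, 40, 41, 42, 43),
  A     = c(1, 2, 3, 4, 5, 6),   # rises with Yield
  B     = c(6, 5, 4, 3, 2, 1),   # falls with Yield
  C     = c(2, NA, 2, 3, 2, 3)   # weaker relation, one missing value
)

# Same call as in the solution: pairwise deletion copes with the NA in C.
cors <- cor(toy, use = "pairwise.complete.obs")

# Keep only the Yield column, drop Yield itself, sort by absolute value.
yield_cors <- cors[rownames(cors) != "Yield", "Yield"]
ranked <- yield_cors[order(abs(yield_cors), decreasing = TRUE)]
print(round(ranked, 2))
```

Applied to the real `correlations` matrix, the same indexing would list the predictors most strongly associated with Yield, in either direction.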

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

We will use caret’s preProcess functionality to center and scale the predictors and impute the missing values using K-nearest neighbors (bagged trees, via the 'bagImpute' method, are an alternative).

prep <- preProcess(features, method=c('scale','center','knnImpute'))
prep_features <- predict(prep,features)
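
To make the idea behind K-nearest-neighbor imputation concrete, here is a simplified base R sketch on a made-up matrix. The helper `knn_impute_row` and the choice `k = 2` are assumptions for illustration only; caret's own implementation differs in its details and defaults.

```r
# Toy matrix (values are made up); the last row has a missing value in column 3.
x <- rbind(
  c(0.00, 0.10, 1.0),
  c(0.10, 0.00, 1.2),
  c(5.00, 5.10, 9.0),
  c(5.10, 5.00, 9.4),
  c(0.05, 0.05, NA)
)

# Hypothetical helper: fill the NAs of one row from its k nearest complete rows.
knn_impute_row <- function(row, complete_rows, k = 2) {
  miss <- is.na(row)
  if (!any(miss)) return(row)
  # Euclidean distance to each complete row, using the observed columns only
  diffs <- sweep(complete_rows[, !miss, drop = FALSE], 2, row[!miss])
  d <- sqrt(rowSums(diffs^2))
  nn <- order(d)[seq_len(k)]
  # Replace each missing entry by the mean of the neighbours' values
  row[miss] <- colMeans(complete_rows[nn, miss, drop = FALSE])
  row
}

complete_rows <- x[complete.cases(x), , drop = FALSE]
x_imputed <- t(apply(x, 1, knn_impute_row, complete_rows = complete_rows))
print(x_imputed)
```

The incomplete row sits closest to the first two rows, so its missing entry is filled with the average of their third-column values. Distances only make sense when columns are on comparable scales, which is why centering and scaling accompany this kind of imputation.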

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

# Train and test
set.seed(1)
split <- createDataPartition(yield$Yield,p=0.75,list=FALSE)
x_train <- prep_features[split,]
y_train <- yield[split,]
x_test <- prep_features[-split,]
y_test <- yield[-split,]

# Additional preprocessing
## remove near zero variance predictors that carry no information
pred_to_remove <- nearZeroVar(features)
x_train <- x_train[-pred_to_remove]
x_test <- x_test[-pred_to_remove]

## Remove highly correlated features
corThresh <- 0.9
tooHigh <- findCorrelation(cor(x_train),corThresh)
x_train <- x_train[,-tooHigh]
x_test <- x_test[,-tooHigh]
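
One fragility of negative indexing is worth noting: if a filter flags nothing and returns an empty index, `df[-idx]` selects zero columns instead of keeping them all. That does not happen with this data set, since the output below shows predictors were removed, but a guard makes the step safe to reuse. The helper `drop_cols` is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: drop columns by index, but keep everything
# when the index vector is empty.
drop_cols <- function(df, idx) {
  if (length(idx) == 0) return(df)
  df[, -idx, drop = FALSE]
}

toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)

ncol(toy[-integer(0)])             # 0: every column is silently lost
ncol(drop_cols(toy, integer(0)))   # 3: nothing flagged, nothing dropped
ncol(drop_cols(toy, 2))            # 2: column b removed
```

In the solution above, the same guard could wrap the removal of `pred_to_remove` and `tooHigh` from both the training and test sets.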

set.seed(1)
ctrl <- trainControl(method='cv',number=10)
# PLS
# tuneLength = 10 asks train() to evaluate 10 candidate values of the tuning parameter (here ncomp = 1 to 10)
pls_model <- train(x=x_train,y=y_train, method='pls', trControl=ctrl, tuneLength = 10) 
pls_model
## Partial Least Squares 
## 
## 132 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.305196  0.4779344  1.0553969
##    2     1.211998  0.5735842  0.9687460
##    3     1.163304  0.6225187  0.9547229
##    4     1.162563  0.6166847  0.9541342
##    5     1.165913  0.6261225  0.9513625
##    6     1.190823  0.6117254  0.9626405
##    7     1.216524  0.5933394  0.9864804
##    8     1.226136  0.5824105  0.9946055
##    9     1.230344  0.5740984  0.9938555
##   10     1.248397  0.5592218  1.0050279
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
plot(pls_model,metric="Rsquared")

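Cross-validated RMSE, the metric used for selection, is minimized with 4 components (RMSE = 1.163, \(R^2\) = 0.617). \(R^2\) itself peaks at 5 components (0.626), but 3 components give approximately the same performance (RMSE = 1.163, \(R^2\) = 0.623), so 3 components could also be considered sufficient to reduce model complexity.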
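
The trade-off between fit and complexity can be made explicit with a tolerance rule: pick the simplest model whose RMSE is within some percentage of the best. The sketch below applies that rule in base R to the RMSE values printed above. The 1% tolerance (`tol_pct`) is an assumption chosen for illustration; caret offers comparable behaviour through the `selectionFunction` argument of `trainControl`.

```r
# Cross-validated RMSE values copied from the train() output above.
results <- data.frame(
  ncomp = 1:10,
  RMSE  = c(1.305196, 1.211998, 1.163304, 1.162563, 1.165913,
            1.190823, 1.216524, 1.226136, 1.230344, 1.248397)
)

# Model with the numerically smallest RMSE.
best_ncomp <- results$ncomp[which.min(results$RMSE)]

# Simplest model whose RMSE is within tol_pct percent of the best.
tol_pct <- 1   # assumed tolerance, not taken from the solution
within_tol <- results$RMSE <= min(results$RMSE) * (1 + tol_pct / 100)
simplest_ncomp <- min(results$ncomp[within_tol])

print(c(best = best_ncomp, simplest = simplest_ncomp))
```

Under this assumed tolerance the rule picks 3 components, while the plain minimum picks 4.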

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

set.seed(1)
predictions <- predict(pls_model,x_test)
values <- data.frame(obs = y_test, pred = predictions)
defaultSummary(values)
##      RMSE  Rsquared       MAE 
## 1.2080628 0.6388046 0.9626881

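The test RMSE (1.208) is close to the 10-fold cross-validated RMSE of the selected model (1.163), and the test \(R^2\) (0.639) is slightly higher than the resampled value (0.617). Test and resampled performance therefore agree well, and there is no sign that the model is overfitting.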
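
For reference, the three reported metrics are easy to reproduce by hand. The base R sketch below uses made-up observed and predicted values and a hypothetical helper `summary_metrics`; it assumes, as is my understanding of caret's behaviour, that the reported \(R^2\) is the squared correlation between observed and predicted values.

```r
# Hypothetical helper mirroring the three metrics reported above.
summary_metrics <- function(obs, pred) {
  c(RMSE     = sqrt(mean((obs - pred)^2)),  # root mean squared error
    Rsquared = cor(obs, pred)^2,            # squared correlation (assumed definition)
    MAE      = mean(abs(obs - pred)))       # mean absolute error
}

# Made-up values: every prediction is off by exactly 0.5.
obs  <- c(39.0, 40.5, 41.0, 42.5)
pred <- c(39.5, 40.0, 41.5, 42.0)

m <- summary_metrics(obs, pred)
print(round(m, 3))
```

Because every error has magnitude 0.5, both RMSE and MAE equal 0.5 here; on real data RMSE exceeds MAE whenever the errors vary in size.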

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

plot(varImp(pls_model, scale = FALSE), top=20,scales = list(y = list(cex = 0.8)))

ManufacturingProcess32, ManufacturingProcess13, and ManufacturingProcess09 are the three most important predictors. In general, the majority of the top 20 predictors are manufacturing process predictors rather than biological material predictors.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

feature_imp <- varImp(pls_model, scale = FALSE)
# importance is a data frame; sort on its Overall column
feature_imp_order <- order(feature_imp$importance$Overall, decreasing = TRUE)
top5 <- rownames(feature_imp$importance)[feature_imp_order[1:5]]

featurePlot(x_train[, top5],y_train,plot = "scatter")

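Of the five most important predictors, ManufacturingProcess32 and ManufacturingProcess09 have a positive relationship with Yield, whereas the remaining three have a negative relationship. For those top predictors that are process settings, and can therefore be changed, this information can guide future runs: raising the positively related settings and lowering the negatively related ones should move yield upward, provided the relationships are causal rather than merely correlational.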
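
The direction read off the scatter plots can also be checked numerically by fitting a one-predictor linear model per feature and inspecting the slope. This is a minimal base R sketch on made-up data; `toy_x`, `toy_y`, `P1`, and `P2` are hypothetical stand-ins for `x_train[, top5]` and `y_train`.

```r
# Made-up predictors and response (stand-ins for x_train[, top5] and y_train).
toy_x <- data.frame(P1 = c(-1, 0, 1, 2),   # rises with the response
                    P2 = c(2, 1, 0, -1))   # falls with the response
toy_y <- c(39, 40, 41, 42)

# Slope of a simple linear regression of the response on each predictor.
slopes <- sapply(toy_x, function(p) unname(coef(lm(toy_y ~ p))[2]))
direction <- ifelse(slopes > 0, "positive", "negative")

print(data.frame(slope = slopes, direction = direction))
```

Since the predictors were centered and scaled, slopes computed this way on the real data would be in yield units per standard deviation of each predictor, which makes them comparable across predictors.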