A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response, product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. Manufacturing process predictors, on the other hand, can be changed. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
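library(AppliedPredictiveModeling)  # provides the ChemicalManufacturingProcess data
library(psych)                      # provides describe()
library(caret)                      # preProcess(), createDataPartition(), train(), varImp()
data(ChemicalManufacturingProcess)
describe(ChemicalManufacturingProcess)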
## vars n mean sd median trimmed mad min
## Yield 1 176 40.18 1.85 39.97 40.12 1.97 35.25
## BiologicalMaterial01 2 176 6.41 0.71 6.30 6.39 0.67 4.58
## BiologicalMaterial02 3 176 55.69 4.03 55.09 55.58 4.58 46.87
## BiologicalMaterial03 4 176 67.70 4.00 67.22 67.68 4.28 56.97
## BiologicalMaterial04 5 176 12.35 1.77 12.10 12.19 1.37 9.38
## BiologicalMaterial05 6 176 18.60 1.84 18.49 18.55 1.88 13.24
## BiologicalMaterial06 7 176 48.91 3.75 48.46 48.74 3.94 40.60
## BiologicalMaterial07 8 176 100.01 0.11 100.00 100.00 0.00 100.00
## BiologicalMaterial08 9 176 17.49 0.68 17.51 17.47 0.59 15.88
## BiologicalMaterial09 10 176 12.85 0.42 12.84 12.86 0.42 11.44
## BiologicalMaterial10 11 176 2.80 0.60 2.71 2.73 0.40 1.77
## BiologicalMaterial11 12 176 146.95 4.82 146.08 146.79 4.11 135.81
## BiologicalMaterial12 13 176 20.20 0.77 20.12 20.18 0.67 18.35
## ManufacturingProcess01 14 175 11.21 1.82 11.40 11.41 1.04 0.00
## ManufacturingProcess02 15 173 16.68 8.47 21.00 18.06 1.48 0.00
## ManufacturingProcess03 16 161 1.54 0.02 1.54 1.54 0.01 1.47
## ManufacturingProcess04 17 175 931.85 6.27 934.00 932.28 5.93 911.00
## ManufacturingProcess05 18 175 1001.69 30.53 999.20 998.62 17.35 923.00
## ManufacturingProcess06 19 174 207.40 2.70 206.80 207.09 1.93 203.00
## ManufacturingProcess07 20 175 177.48 0.50 177.00 177.48 0.00 177.00
## ManufacturingProcess08 21 175 177.55 0.50 178.00 177.57 0.00 177.00
## ManufacturingProcess09 22 176 45.66 1.55 45.73 45.72 1.22 38.89
## ManufacturingProcess10 23 167 9.18 0.77 9.10 9.13 0.59 7.50
## ManufacturingProcess11 24 166 9.39 0.72 9.40 9.39 0.67 7.50
## ManufacturingProcess12 25 175 857.81 1784.53 0.00 516.20 0.00 0.00
## ManufacturingProcess13 26 176 34.51 1.02 34.60 34.51 0.89 32.10
## ManufacturingProcess14 27 175 4853.87 54.52 4856.00 4854.57 40.03 4701.00
## ManufacturingProcess15 28 176 6038.92 58.31 6031.50 6035.52 40.77 5904.00
## ManufacturingProcess16 29 176 4565.80 351.70 4588.00 4588.36 43.00 0.00
## ManufacturingProcess17 30 176 34.34 1.25 34.40 34.31 1.19 31.30
## ManufacturingProcess18 31 176 4809.68 367.48 4835.00 4837.07 34.84 0.00
## ManufacturingProcess19 32 176 6028.20 45.58 6022.00 6026.15 36.32 5890.00
## ManufacturingProcess20 33 176 4556.46 349.01 4582.00 4580.98 43.00 0.00
## ManufacturingProcess21 34 176 -0.16 0.78 -0.30 -0.26 0.44 -1.80
## ManufacturingProcess22 35 175 5.41 3.33 5.00 5.25 4.45 0.00
## ManufacturingProcess23 36 175 3.02 1.66 3.00 2.94 1.48 0.00
## ManufacturingProcess24 37 175 8.83 5.80 8.00 8.57 7.41 0.00
## ManufacturingProcess25 38 171 4828.18 373.48 4855.00 4855.56 34.10 0.00
## ManufacturingProcess26 39 171 6015.60 464.87 6047.00 6048.55 38.55 0.00
## ManufacturingProcess27 40 171 4562.51 353.98 4587.00 4587.45 35.58 0.00
## ManufacturingProcess28 41 171 6.59 5.25 10.40 6.82 1.04 0.00
## ManufacturingProcess29 42 171 20.01 1.66 19.90 20.04 0.44 0.00
## ManufacturingProcess30 43 171 9.16 0.98 9.10 9.21 0.74 0.00
## ManufacturingProcess31 44 171 70.18 5.56 70.80 70.72 0.89 0.00
## ManufacturingProcess32 45 176 158.47 5.40 158.00 158.34 4.45 143.00
## ManufacturingProcess33 46 171 63.54 2.48 64.00 63.55 1.48 56.00
## ManufacturingProcess34 47 171 2.49 0.05 2.50 2.49 0.00 2.30
## ManufacturingProcess35 48 171 495.60 10.82 495.00 495.74 8.90 463.00
## ManufacturingProcess36 49 171 0.02 0.00 0.02 0.02 0.00 0.02
## ManufacturingProcess37 50 176 1.01 0.45 1.00 1.00 0.44 0.00
## ManufacturingProcess38 51 176 2.53 0.65 3.00 2.61 0.00 0.00
## ManufacturingProcess39 52 176 6.85 1.51 7.20 7.17 0.15 0.00
## ManufacturingProcess40 53 175 0.02 0.04 0.00 0.01 0.00 0.00
## ManufacturingProcess41 54 175 0.02 0.05 0.00 0.01 0.00 0.00
## ManufacturingProcess42 55 176 11.21 1.94 11.60 11.54 0.30 0.00
## ManufacturingProcess43 56 176 0.91 0.87 0.80 0.81 0.30 0.00
## ManufacturingProcess44 57 176 1.81 0.32 1.90 1.85 0.15 0.00
## ManufacturingProcess45 58 176 2.14 0.41 2.20 2.20 0.15 0.00
## max range skew kurtosis se
## Yield 46.34 11.09 0.31 -0.11 0.14
## BiologicalMaterial01 8.81 4.23 0.27 0.46 0.05
## BiologicalMaterial02 64.75 17.88 0.24 -0.71 0.30
## BiologicalMaterial03 78.25 21.28 0.03 -0.12 0.30
## BiologicalMaterial04 23.09 13.71 1.73 7.06 0.13
## BiologicalMaterial05 24.85 11.61 0.30 0.22 0.14
## BiologicalMaterial06 59.38 18.78 0.37 -0.37 0.28
## BiologicalMaterial07 100.83 0.83 7.40 53.04 0.01
## BiologicalMaterial08 19.14 3.26 0.22 0.06 0.05
## BiologicalMaterial09 14.08 2.64 -0.27 0.29 0.03
## BiologicalMaterial10 6.87 5.10 2.40 11.65 0.05
## BiologicalMaterial11 158.73 22.92 0.36 0.02 0.36
## BiologicalMaterial12 22.21 3.86 0.30 0.01 0.06
## ManufacturingProcess01 14.10 14.10 -3.92 21.87 0.14
## ManufacturingProcess02 22.50 22.50 -1.43 0.11 0.64
## ManufacturingProcess03 1.60 0.13 -0.48 1.73 0.00
## ManufacturingProcess04 946.00 35.00 -0.70 0.06 0.47
## ManufacturingProcess05 1175.30 252.30 2.59 11.74 2.31
## ManufacturingProcess06 227.40 24.40 3.04 17.38 0.20
## ManufacturingProcess07 178.00 1.00 0.08 -2.01 0.04
## ManufacturingProcess08 178.00 1.00 -0.22 -1.96 0.04
## ManufacturingProcess09 49.36 10.47 -0.94 3.27 0.12
## ManufacturingProcess10 11.60 4.10 0.65 0.63 0.06
## ManufacturingProcess11 11.50 4.00 -0.02 0.32 0.06
## ManufacturingProcess12 4549.00 4549.00 1.58 0.50 134.90
## ManufacturingProcess13 38.60 6.50 0.48 1.96 0.08
## ManufacturingProcess14 5055.00 354.00 -0.01 1.08 4.12
## ManufacturingProcess15 6233.00 329.00 0.67 1.22 4.40
## ManufacturingProcess16 4852.00 4852.00 -12.42 158.40 26.51
## ManufacturingProcess17 40.00 8.70 1.16 4.66 0.09
## ManufacturingProcess18 4971.00 4971.00 -12.74 163.74 27.70
## ManufacturingProcess19 6146.00 256.00 0.30 0.30 3.44
## ManufacturingProcess20 4759.00 4759.00 -12.64 162.07 26.31
## ManufacturingProcess21 3.60 5.40 1.73 5.03 0.06
## ManufacturingProcess22 12.00 12.00 0.31 -1.02 0.25
## ManufacturingProcess23 6.00 6.00 0.20 -1.00 0.13
## ManufacturingProcess24 23.00 23.00 0.36 -1.02 0.44
## ManufacturingProcess25 4990.00 4990.00 -12.63 160.33 28.56
## ManufacturingProcess26 6161.00 6161.00 -12.67 160.98 35.55
## ManufacturingProcess27 4710.00 4710.00 -12.52 158.39 27.07
## ManufacturingProcess28 11.50 11.50 -0.46 -1.79 0.40
## ManufacturingProcess29 22.00 22.00 -10.08 119.44 0.13
## ManufacturingProcess30 11.20 11.20 -4.76 43.08 0.07
## ManufacturingProcess31 72.50 72.50 -11.82 146.01 0.42
## ManufacturingProcess32 173.00 30.00 0.21 0.06 0.41
## ManufacturingProcess33 70.00 14.00 -0.13 0.27 0.19
## ManufacturingProcess34 2.60 0.30 -0.26 1.00 0.00
## ManufacturingProcess35 522.00 59.00 -0.16 0.41 0.83
## ManufacturingProcess36 0.02 0.00 0.15 -0.06 0.00
## ManufacturingProcess37 2.30 2.30 0.38 0.07 0.03
## ManufacturingProcess38 3.00 3.00 -1.68 3.92 0.05
## ManufacturingProcess39 7.50 7.50 -4.27 16.50 0.11
## ManufacturingProcess40 0.10 0.10 1.68 0.82 0.00
## ManufacturingProcess41 0.20 0.20 2.17 3.63 0.00
## ManufacturingProcess42 12.10 12.10 -5.45 28.53 0.15
## ManufacturingProcess43 11.00 11.00 9.05 101.03 0.07
## ManufacturingProcess44 2.10 2.10 -4.97 25.09 0.02
## ManufacturingProcess45 2.60 2.60 -4.08 18.76 0.03
The data frame ChemicalManufacturingProcess contains the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) and the percent yield for each of the 176 manufacturing runs. Below, features holds the predictors and yield holds the percent yield for each run.
features <- subset(ChemicalManufacturingProcess,select= -Yield)
yield <- subset(ChemicalManufacturingProcess,select=Yield)
correlations <- cor(cbind(yield,features),use="pairwise.complete.obs")
corrplot::corrplot(correlations, type="lower", tl.cex = 0.5)
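The correlation plot shows every pairwise correlation at once. To read off which predictors track the response most closely, the Yield column of the correlation matrix can be ranked by absolute value. The following is a minimal base-R sketch on made-up data (the data frame `toy` and its columns `A`, `B`, `C` are assumptions for illustration, not part of the data set); it also shows that `use = "pairwise.complete.obs"` computes each correlation from the rows where both columns are observed.

```r
# Toy data standing in for cbind(yield, features); values are made up
toy <- data.frame(
  Yield = c(40, 41, 39, 42, 38, 43),
  A     = c(1.0, 1.2, 0.9, NA, 0.8, 1.5),  # one missing value
  B     = c(5, 4, 6, 3, 7, 2),
  C     = c(2, 9, 4, 7, 1, 6)
)

# Each pairwise correlation uses only the rows where both columns are observed
cors <- cor(toy, use = "pairwise.complete.obs")

# Correlation of every predictor with the response, strongest first
yield_cor <- cors[-1, "Yield"]
ranked <- yield_cor[order(abs(yield_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Applied to the real data, the same two lines after `cor()` would work on `correlations` unchanged, since Yield is its first row and column.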
We will use caret’s preProcess functionality to center and scale the predictors and to impute missing values using K-nearest neighbors (bagged trees are the alternative imputation method).
prep <- preProcess(features, method=c('scale','center','knnImpute'))
prep_features <- predict(prep,features)
# Train and test
set.seed(1)
split <- createDataPartition(yield$Yield,p=0.75,list=FALSE)
x_train <- prep_features[split,]
y_train <- yield[split,]
x_test <- prep_features[-split,]
y_test <- yield[-split,]
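Note that the preprocessing above is estimated on all 176 runs before the split, so the test rows contribute to the centering, scaling and imputation. A common alternative is to estimate the preprocessing on the training rows only and then apply it to both sets; with caret this would mean calling preProcess() on the training predictors and predict() on each subset. The idea can be sketched in base R as follows, using made-up data and centering and scaling only (the names `x_tr`, `x_te`, `mu` and `sdv` are assumptions for illustration).

```r
set.seed(1)
# Toy predictor matrix; the values and the 15/5 split are illustrative only
x <- matrix(rnorm(40, mean = 10, sd = 3), ncol = 2,
            dimnames = list(NULL, c("p1", "p2")))
train_idx <- 1:15
x_tr <- x[train_idx, ]
x_te <- x[-train_idx, ]

# Estimate centering and scaling statistics on the training rows only
mu  <- colMeans(x_tr)
sdv <- apply(x_tr, 2, sd)

# Apply the training statistics to both sets
x_tr_s <- scale(x_tr, center = mu, scale = sdv)
x_te_s <- scale(x_te, center = mu, scale = sdv)

# Training columns now have mean 0 and sd 1; test columns generally do not
round(colMeans(x_tr_s), 10)
round(colMeans(x_te_s), 2)
```

Whether this changes the results noticeably for a data set of this size is not something the sketch can answer; it only illustrates the order of operations.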
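# Additional preprocessing
## Remove near-zero-variance predictors, which carry little information
pred_to_remove <- nearZeroVar(features)
## Guard: negative indexing with an empty vector would drop every column
if (length(pred_to_remove) > 0) {
  x_train <- x_train[, -pred_to_remove]
  x_test <- x_test[, -pred_to_remove]
}
## Remove highly correlated features (pairwise absolute correlation above 0.9)
corThresh <- 0.9
tooHigh <- findCorrelation(cor(x_train), corThresh)
if (length(tooHigh) > 0) {
  x_train <- x_train[, -tooHigh]
  x_test <- x_test[, -tooHigh]
}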
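Negative indexing in R has a pitfall worth knowing when filtering columns this way: if the index vector is empty, `-idx` is also empty and every column is dropped, so a length check before subsetting is a sensible safeguard. A minimal base-R sketch, where `drop_cols` is a hypothetical helper name and not part of caret:

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
none_flagged <- integer(0)  # an empty index, as when a filter flags nothing

ncol(df[, -none_flagged])   # 0: every column is dropped

# Hypothetical helper: drop columns only when there is something to drop
drop_cols <- function(data, idx) {
  if (length(idx) > 0) data[, -idx, drop = FALSE] else data
}

ncol(drop_cols(df, none_flagged))  # 3: nothing removed
ncol(drop_cols(df, 2L))            # 2: column b removed
```

In this analysis both filters do flag at least one column, so the plain negative index happens to work; the check matters when the same code is reused on other data.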
set.seed(1)
ctrl <- trainControl(method='cv',number=10)
# PLS
# tuneLength = 10 makes train() evaluate 10 candidate values of the tuning parameter (here ncomp = 1 to 10)
pls_model <- train(x=x_train,y=y_train, method='pls', trControl=ctrl, tuneLength = 10)
pls_model
## Partial Least Squares
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.305196 0.4779344 1.0553969
## 2 1.211998 0.5735842 0.9687460
## 3 1.163304 0.6225187 0.9547229
## 4 1.162563 0.6166847 0.9541342
## 5 1.165913 0.6261225 0.9513625
## 6 1.190823 0.6117254 0.9626405
## 7 1.216524 0.5933394 0.9864804
## 8 1.226136 0.5824105 0.9946055
## 9 1.230344 0.5740984 0.9938555
## 10 1.248397 0.5592218 1.0050279
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
plot(pls_model,metric="Rsquared")
\(R^2\) is maximized with 5 components, whereas caret selected 4 components because it minimizes RMSE. However, 3 components yield approximately the same metrics (RMSE 1.163, \(R^2\) 0.623), so 3 could be considered sufficient to reduce model complexity.
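The preference for fewer components can be made explicit with a tolerance rule: choose the smallest number of components whose metric is within a small percentage of the best value. The sketch below applies this to the resampling results printed above; the 1% tolerance is an assumption chosen for illustration, and the rule is written out by hand rather than taken from caret (caret offers related built-in rules, for example `selectionFunction = "oneSE"` in trainControl()).

```r
# Cross-validation results copied from the train() output above
res <- data.frame(
  ncomp    = 1:10,
  RMSE     = c(1.305196, 1.211998, 1.163304, 1.162563, 1.165913,
               1.190823, 1.216524, 1.226136, 1.230344, 1.248397),
  Rsquared = c(0.4779344, 0.5735842, 0.6225187, 0.6166847, 0.6261225,
               0.6117254, 0.5933394, 0.5824105, 0.5740984, 0.5592218)
)

tol <- 0.01  # assumed tolerance: accept models within 1% of the best value

# Smallest ncomp whose RMSE is at most 1% above the minimum
ok_rmse <- res$RMSE <= min(res$RMSE) * (1 + tol)
ncomp_rmse <- min(res$ncomp[ok_rmse])

# Smallest ncomp whose R^2 is at most 1% below the maximum
ok_r2 <- res$Rsquared >= max(res$Rsquared) * (1 - tol)
ncomp_r2 <- min(res$ncomp[ok_r2])

c(by_RMSE = ncomp_rmse, by_Rsquared = ncomp_r2)  # both equal 3
```

Under this rule both metrics point to 3 components, which agrees with the reading of the plot.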
set.seed(1)
predictions <- predict(pls_model,x_test)
values <- data.frame(obs = y_test, pred = predictions)
defaultSummary(values)
## RMSE Rsquared MAE
## 1.2080628 0.6388046 0.9626881
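The test RMSE (1.208) is close to the cross-validated RMSE of the selected 4-component model (1.163) and falls within the range of RMSE values seen across the tuning grid (1.163 to 1.305). This suggests the model generalizes reasonably well and shows no obvious sign of overfitting.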
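For reference, the three numbers reported by defaultSummary can be reproduced by hand. The sketch below uses made-up observed and predicted values (not taken from the yield data) and assumes, in line with caret's documented default, that R-squared is the squared correlation between observed and predicted values.

```r
# Toy observed and predicted values; not taken from the yield data
obs  <- c(38.5, 40.1, 41.7, 39.2, 42.3)
pred <- c(39.0, 40.4, 41.0, 39.9, 41.6)

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error
mae  <- mean(abs(obs - pred))       # mean absolute error
r2   <- cor(obs, pred)^2            # squared correlation

round(c(RMSE = rmse, Rsquared = r2, MAE = mae), 3)
```

RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE on the same predictions.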
plot(varImp(pls_model, scale = FALSE), top=20,scales = list(y = list(cex = 0.8)))
ManufacturingProcess32, ManufacturingProcess13 and ManufacturingProcess09 are the three most important predictors. In general, the majority of the top 20 features relate to the manufacturing process rather than the biological material.
feature_imp <- varImp(pls_model, scale = FALSE)
# Order by the importance scores (first column of the importance data frame)
feature_imp_order <- order(feature_imp$importance[[1]], decreasing = TRUE)
top5 <- rownames(feature_imp$importance)[feature_imp_order[1:5]]
featurePlot(x_train[, top5], y_train, plot = "scatter")
Of the top 5 most important features, ManufacturingProcess32 and ManufacturingProcess09 have a positive relationship with Yield, whereas the remaining three have a negative relationship. For the manufacturing process predictors among them, which can be adjusted, this information can be used to tune the process toward a higher yield.
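The direction read from the scatter plots can be cross-checked numerically with the sign of each feature's correlation with the response. The sketch below uses simulated data in place of x_train[, top5] and y_train (the columns `up` and `down` and the simulated coefficients are assumptions for illustration only).

```r
set.seed(1)
# Simulated stand-ins for x_train[, top5] and y_train; names and values are made up
n <- 50
feat <- data.frame(up = rnorm(n), down = rnorm(n))
resp <- 40 + 1.5 * feat$up - 1.0 * feat$down + rnorm(n, sd = 0.5)

# Marginal correlation of each feature with the response
directions <- sapply(feat, function(col) cor(col, resp))

# +1 marks a positive relationship, -1 a negative one
sign(directions)
```

These are marginal relationships, one feature at a time. When predictors are correlated with each other, the sign of a marginal correlation can differ from the sign of the corresponding model coefficient, so it is best treated as a first check rather than a guarantee of what happens when a process setting is changed.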