Kuhn and Johnson
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Explore the relationships between each of the top predictors and the re- sponse. How could this information be helpful in improving yield in future runs of the manufacturing process?
#clear the workspace
rm(list = ls())
library(AppliedPredictiveModeling)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
data(package="AppliedPredictiveModeling")
data(ChemicalManufacturingProcess)
# summary(ChemicalManufacturingProcess)
The data set is in the library of applied predictive modeling. It contains 176 observations and 58 variables .The data set has the outcome variable as yield ,and 12 variables describing biological materials, and 45 variables describing manufacturing process.
I am using the mice package for the imputation, with the PMM method, and two rounds of imputations are done.
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
## Median :39.97 Median :6.305 Median :55.09 Median :67.22
## Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
## Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## Min. : 9.38 Min. :13.24 Min. :40.60
## 1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
## Median :12.10 Median :18.49 Median :48.46
## Mean :12.35 Mean :18.60 Mean :48.91
## 3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
## Max. :23.09 Max. :24.85 Max. :59.38
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## Min. :100.0 Min. :15.88 Min. :11.44
## 1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
## Median :100.0 Median :17.51 Median :12.84
## Mean :100.0 Mean :17.49 Mean :12.85
## 3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
## Max. :100.8 Max. :19.14 Max. :14.08
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :2.710 Median :146.1 Median :20.12
## Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :6.870 Max. :158.7 Max. :22.21
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.23 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.55
## Mean :11.21 Mean :16.64 Mean :1.54
## 3rd Qu.:12.12 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:927.8 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.0 Median :206.8
## Mean :931.8 Mean :1001.6 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.7 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.000 Median : 9.400 Median : 0.0
## Mean : 9.164 Mean : 9.407 Mean : 852.9
## 3rd Qu.: 9.525 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4827 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4853 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.386 Mean :3.017 Mean : 8.898
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4828 1st Qu.:6019 1st Qu.:4563
## Median :4854 Median :6046 Median :4586
## Mean :4827 Mean :6014 Mean :4564
## 3rd Qu.:4876 3rd Qu.:6069 3rd Qu.:4610
## Max. :4990 Max. :6161 Max. :4710
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.80
## Median :10.400 Median :20.00 Median : 9.15
## Mean : 6.405 Mean :20.06 Mean : 9.17
## 3rd Qu.:10.700 3rd Qu.:20.50 3rd Qu.: 9.70
## Max. :11.500 Max. :22.00 Max. :11.20
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:69.90 1st Qu.:155.0 1st Qu.:62.00
## Median :70.75 Median :158.0 Median :64.00
## Mean :70.04 Mean :158.5 Mean :63.61
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :496.0 Median :0.01900
## Mean :2.491 Mean :495.9 Mean :0.01952
## 3rd Qu.:2.500 3rd Qu.:502.0 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01818 Mean :0.02415 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
Now the data is completed and ready to be analyzed.
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
train_chem <- C_M_P2[1:130,]
test_chem <- C_M_P2 [131:176,]
The data is split into training dataset of the 1st 130 observations, and testing dataset with the rest of 46 observations. This is 80 /20 split .They both contain the original 58 variables. But without missing data.
## [1] 19
## [1] 39
The caret package has a function to find correlations among variables. I used that, with the cutoff points of 0.7 as the threshold, which results in 19 variables that are highly correlated. After filtering out these 19 variables, there are remaining 39 explainable variables to be used in analysis.
library(corrplot)
## corrplot 0.84 loaded
correlation_chem <- cor(filtered_train_chem, method = "pearson")
corrplot(correlation_chem, type="upper", method="color")
This correlation plot is a visualization of the remaining 39 variables, which is already taken out the highly correlated ones.
library(pls)
set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)
plsTune2 <- train(filtered_train_chem[,2:39],filtered_train_chem[,1],
method="pls",
tuneLength=20,
trControl=ctrl,
preProc=c("center","scale"))
plsTune2$bestTune
## ncomp
## 19 19
plsTune2$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 2.073545 0.7570869 1.6158044 0.8764292 0.14509107 0.5065408
## 2 2 1.915597 0.7915543 1.3506882 0.8390431 0.17535559 0.3746553
## 3 3 1.525990 0.8696215 1.1587579 0.4349058 0.06143772 0.2579549
## 4 4 1.897189 0.8187616 1.2378116 1.1832356 0.18303063 0.5389155
## 5 5 1.963573 0.8128172 1.2331389 1.6160088 0.23467799 0.6560934
## 6 6 2.146138 0.7952321 1.2536089 1.9703495 0.26040360 0.7387702
## 7 7 2.390184 0.7857642 1.3286480 2.4596378 0.27353801 0.8801379
## 8 8 2.326514 0.7874180 1.2523691 2.3026289 0.26839864 0.7798256
## 9 9 2.365424 0.7912818 1.2339710 2.4896982 0.27286811 0.8200504
## 10 10 2.488107 0.7855064 1.2607733 2.7546992 0.28258928 0.8794479
## 11 11 2.357707 0.7954019 1.1954375 2.5781113 0.26711059 0.7918826
## 12 12 2.237697 0.8054063 1.1397488 2.4442107 0.25380639 0.7450187
## 13 13 2.144837 0.8148693 1.0941577 2.3682589 0.24474339 0.7056096
## 14 14 1.930114 0.8356747 1.0230721 2.0719448 0.21511426 0.6229877
## 15 15 1.654861 0.8690194 0.9347058 1.6431573 0.15297974 0.4970126
## 16 16 1.497130 0.8910031 0.8895384 1.3525178 0.10659545 0.3909605
## 17 17 1.227332 0.9170523 0.8061245 0.6593939 0.05145399 0.1997950
## 18 18 1.122657 0.9229582 0.7798127 0.3576277 0.04987600 0.1549173
## 19 19 1.067462 0.9207771 0.7512899 0.4563599 0.07186039 0.2082580
## 20 20 1.242128 0.8905825 0.8054621 0.6001110 0.10614256 0.2311430
summary(plsTune2)
## Data: X dimension: 130 38
## Y dimension: 130 1
## Fit method: oscorespls
## Number of components considered: 19
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 15.71 20.71 24.92 29.84 35.35 38.76 42.11
## .outcome 80.35 88.61 91.55 92.73 93.47 94.57 95.28
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 45.45 48.98 52.19 55.39 58.72 60.92 63.62
## .outcome 95.74 96.07 96.30 96.47 96.58 96.65 96.69
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 66.76 69.86 72.80 76.07 78.69
## .outcome 96.71 96.73 96.75 96.77 96.78
plot(plsTune2)
The detailed Performance evaluation in terms RMSE, R ^2, MAE etc are listed as results . The plot of num of components to RMSE shows that the RMSE is hightest at 10 components, drop down gradually as components are added. But interesting , with three components, the first 3, RMSE is also relatively low. The optimum number of components is 10.
CMP_Prediction2 <- predict(plsTune2, newdata=test_chem[,2:58], ncomp=1:10)
cor(as.numeric(CMP_Prediction2), test_chem[,1]) ^ 2
## [1] 0.1257143
Next we used this model with the 10 components to predict the test data. And the correlation of outcome yield on the test data is 0.125.
library(caret)
varImp(plsTune2)
## pls variable importance
##
## only 20 most important variables shown (out of 38)
##
## Overall
## BiologicalMaterial06 100.00
## BiologicalMaterial11 71.11
## BiologicalMaterial12 67.92
## BiologicalMaterial04 67.89
## BiologicalMaterial08 67.15
## ManufacturingProcess02 57.44
## ManufacturingProcess33 46.93
## ManufacturingProcess36 39.02
## ManufacturingProcess15 38.49
## ManufacturingProcess05 33.30
## ManufacturingProcess04 31.81
## ManufacturingProcess19 30.23
## ManufacturingProcess42 29.25
## ManufacturingProcess09 28.81
## ManufacturingProcess06 28.40
## BiologicalMaterial09 23.99
## ManufacturingProcess11 23.95
## ManufacturingProcess23 17.67
## ManufacturingProcess01 16.24
## ManufacturingProcess34 14.58
The caret package has the function varImp, which is what we used to determine the most important predictors in our training model. The first 5 most important predictors are biological ones, which we understand cannot be modified. But the good news is the next nine variables ranked belong to manufacturing process, which can be modified. And remaining 20 variables are split between manufacturing process and the biological ones.
Imp <- varImp(plsTune2)
plot(Imp, top =20)
The top 20 most important variables are plotted out as well below.
The correlation among these variables our shown in this correlation plot.
par(mfrow = c(2, 3))
## why I do not have the line in red color
plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess02)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess02), col="red")
plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess33)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess33), col="red")
plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess36)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess36), col="red")
plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess15)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess15), col="red")
plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess05)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess05), col="red")
plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess04)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess04), col="red")
I have selected the top six most important predictors to their relationship of the outcome yield. As we can see some are positively correlated with outcome, while others are negatively correlated with the outcome. I intended to plot out a linear line in red color for the outcome, but for some unknown reason the linear model line failed to show up do too some bucks in the program which I will fix later.