Question: A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
library(VIM)
library(caret)
data(ChemicalManufacturingProcess)
We will use the kNN imputation method from the VIM package to impute the missing values.
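As a rough illustration of the idea behind kNN imputation, the sketch below fills one missing value with the median of its k nearest complete rows. It is a simplified base-R stand-in and not VIM's implementation (which, by default, relies on a Gower-type distance); the helper name `knn_impute_one` and the toy data are hypothetical.

```r
# Hypothetical helper: impute a single NA in column `target` of a numeric
# data frame using the median of the k nearest complete rows.
knn_impute_one <- function(df, row, target, k = 3) {
  others   <- setdiff(names(df), target)
  complete <- which(complete.cases(df))            # candidate donor rows
  scaled   <- scale(df[, others, drop = FALSE])    # put columns on one scale
  # Euclidean distance from the incomplete row to every donor row
  ref <- matrix(scaled[row, ], nrow = length(complete),
                ncol = length(others), byrow = TRUE)
  d <- sqrt(rowSums((scaled[complete, , drop = FALSE] - ref)^2))
  donors <- complete[order(d)[seq_len(k)]]
  median(df[donors, target])
}

toy <- data.frame(a = c(1, 2, 3, 10, 11),
                  b = c(1.1, 2.1, NA, 10.2, 11.3),
                  c = c(5, 6, 7, 20, 21))
toy$b[3] <- knn_impute_one(toy, row = 3, target = "b", k = 2)
toy
```

Row 3 is closest to rows 1 and 2, so the imputed value is the median of their `b` values.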
summary(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
## Median :39.97 Median :6.305 Median :55.09 Median :67.22
## Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
## Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
##
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## Min. : 9.38 Min. :13.24 Min. :40.60
## 1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
## Median :12.10 Median :18.49 Median :48.46
## Mean :12.35 Mean :18.60 Mean :48.91
## 3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
## Max. :23.09 Max. :24.85 Max. :59.38
##
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## Min. :100.0 Min. :15.88 Min. :11.44
## 1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
## Median :100.0 Median :17.51 Median :12.84
## Mean :100.0 Mean :17.49 Mean :12.85
## 3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
## Max. :100.8 Max. :19.14 Max. :14.08
##
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :2.710 Median :146.1 Median :20.12
## Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :6.870 Max. :158.7 Max. :22.21
##
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## NA's :1 NA's :3 NA's :15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.2 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## NA's :1 NA's :1 NA's :2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## NA's :1 NA's :1
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.400 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## NA's :1
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
##
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
##
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
## Median :4855 Median :6047 Median :4587
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :19.90 Median : 9.100
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.80 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## NA's :5 NA's :5
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.0 Median :0.02000
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
##
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## NA's :1 NA's :1
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
##
impu_data <- kNN(ChemicalManufacturingProcess, imp_var = FALSE)
summary(ChemicalManufacturingProcess$ManufacturingProcess02)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 19.30 21.00 16.68 21.50 22.50 3
summary(impu_data$ManufacturingProcess02)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 19.30 21.00 16.76 21.50 22.50
data.frame("OLD"=ChemicalManufacturingProcess$ManufacturingProcess02,
"Imputed"=impu_data$ManufacturingProcess02)
## OLD Imputed
## 1 NA 21.0
## 2 0.0 0.0
## 3 0.0 0.0
## 4 0.0 0.0
## 5 0.0 0.0
## 6 0.0 0.0
## 7 0.0 0.0
## 8 0.0 0.0
## 9 0.0 0.0
## 10 0.0 0.0
## 11 0.0 0.0
## 12 0.0 0.0
## 13 0.0 0.0
## 14 0.0 0.0
## 15 0.0 0.0
## 16 0.0 0.0
## 17 0.0 0.0
## 18 0.0 0.0
## 19 0.0 0.0
## 20 0.0 0.0
## 21 0.0 0.0
## 22 0.0 0.0
## 23 0.0 0.0
## 24 0.0 0.0
## 25 0.0 0.0
## 26 0.0 0.0
## 27 0.0 0.0
## 28 0.0 0.0
## 29 0.0 0.0
## 30 0.0 0.0
## 31 0.0 0.0
## 32 0.0 0.0
## 33 0.0 0.0
## 34 0.0 0.0
## 35 0.0 0.0
## 36 0.0 0.0
## 37 19.7 19.7
## 38 19.9 19.9
## 39 19.3 19.3
## 40 19.5 19.5
## 41 19.3 19.3
## 42 22.5 22.5
## 43 20.5 20.5
## 44 21.5 21.5
## 45 20.5 20.5
## 46 20.5 20.5
## 47 20.5 20.5
## 48 20.0 20.0
## 49 18.0 18.0
## 50 19.0 19.0
## 51 18.0 18.0
## 52 19.5 19.5
## 53 19.5 19.5
## 54 19.5 19.5
## 55 19.5 19.5
## 56 19.5 19.5
## 57 19.5 19.5
## 58 19.5 19.5
## 59 18.0 18.0
## 60 20.0 20.0
## 61 19.0 19.0
## 62 20.0 20.0
## 63 19.5 19.5
## 64 19.5 19.5
## 65 20.0 20.0
## 66 19.5 19.5
## 67 19.5 19.5
## 68 19.5 19.5
## 69 20.0 20.0
## 70 19.0 19.0
## 71 19.0 19.0
## 72 19.0 19.0
## 73 19.5 19.5
## 74 21.5 21.5
## 75 22.2 22.2
## 76 22.0 22.0
## 77 22.5 22.5
## 78 21.5 21.5
## 79 21.5 21.5
## 80 22.0 22.0
## 81 22.0 22.0
## 82 22.0 22.0
## 83 20.5 20.5
## 84 21.0 21.0
## 85 22.0 22.0
## 86 21.0 21.0
## 87 21.5 21.5
## 88 21.5 21.5
## 89 21.5 21.5
## 90 21.5 21.5
## 91 21.7 21.7
## 92 22.0 22.0
## 93 21.5 21.5
## 94 21.5 21.5
## 95 21.5 21.5
## 96 22.0 22.0
## 97 22.0 22.0
## 98 20.9 20.9
## 99 22.0 22.0
## 100 21.0 21.0
## 101 21.5 21.5
## 102 21.9 21.9
## 103 21.7 21.7
## 104 21.6 21.6
## 105 21.8 21.8
## 106 20.8 20.8
## 107 22.0 22.0
## 108 21.9 21.9
## 109 22.4 22.4
## 110 22.0 22.0
## 111 20.5 20.5
## 112 22.2 22.2
## 113 22.3 22.3
## 114 22.0 22.0
## 115 21.2 21.2
## 116 21.1 21.1
## 117 21.0 21.0
## 118 21.0 21.0
## 119 20.9 20.9
## 120 21.1 21.1
## 121 21.2 21.2
## 122 21.5 21.5
## 123 21.2 21.2
## 124 20.8 20.8
## 125 20.9 20.9
## 126 21.2 21.2
## 127 21.3 21.3
## 128 21.3 21.3
## 129 21.4 21.4
## 130 21.5 21.5
## 131 21.4 21.4
## 132 21.5 21.5
## 133 21.2 21.2
## 134 NA 21.4
## 135 21.4 21.4
## 136 21.3 21.3
## 137 21.3 21.3
## 138 21.6 21.6
## 139 NA 20.9
## 140 21.3 21.3
## 141 21.2 21.2
## 142 21.2 21.2
## 143 21.4 21.4
## 144 21.4 21.4
## 145 21.4 21.4
## 146 21.6 21.6
## 147 21.6 21.6
## 148 21.4 21.4
## 149 21.4 21.4
## 150 21.4 21.4
## 151 21.1 21.1
## 152 21.5 21.5
## 153 21.7 21.7
## 154 21.3 21.3
## 155 21.2 21.2
## 156 21.3 21.3
## 157 21.0 21.0
## 158 21.2 21.2
## 159 21.4 21.4
## 160 21.3 21.3
## 161 21.5 21.5
## 162 21.1 21.1
## 163 21.0 21.0
## 164 21.2 21.2
## 165 21.2 21.2
## 166 21.2 21.2
## 167 21.2 21.2
## 168 20.0 20.0
## 169 20.8 20.8
## 170 19.9 19.9
## 171 20.0 20.0
## 172 21.5 21.5
## 173 21.5 21.5
## 174 20.4 20.4
## 175 21.6 21.6
## 176 20.8 20.8
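# Hold out 20% of the imputed data as a test set. No seed is set, so the split
# (and every number below) changes from run to run.
n <- nrow(impu_data)
i.training <- sort(sample(n, round(n * 0.8)))
L.training <- impu_data[i.training, ]
L.test <- impu_data[-i.training, ]
# Column 1 is the response (Yield); the remaining columns are the predictors.
X_train <- L.training[, -1]
Y_train <- L.training[, 1]
X_test <- L.test[, -1]
Y_test <- L.test[, 1]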
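Because `sample` is random, fixing a seed before the split makes it repeatable. A minimal sketch on simulated data follows; the seed value and the toy data frame are arbitrary assumptions, not part of the original analysis.

```r
set.seed(123)                       # arbitrary seed, chosen only for illustration
toy <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))

n       <- nrow(toy)
i_train <- sort(sample(n, round(n * 0.8)))   # 80% of the row indices
train   <- toy[i_train, ]
test    <- toy[-i_train, ]                   # the remaining 20%

c(train = nrow(train), test = nrow(test))
```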
ctrl <- trainControl(method = "cv", number = 10)
model_lm <- lm(Yield~.,data=L.training )
# model_lm <- train(x = X_train, y = Y_train,
# method = "lm",
# trControl = ctrl)
summary(model_lm)
##
## Call:
## lm(formula = Yield ~ ., data = L.training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81632 -0.49431 -0.06162 0.48675 2.19256
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.855e+01 2.099e+02 0.327 0.744751
## BiologicalMaterial01 3.407e-01 3.979e-01 0.856 0.394245
## BiologicalMaterial02 -1.860e-01 1.499e-01 -1.241 0.218073
## BiologicalMaterial03 5.356e-01 3.095e-01 1.730 0.087269 .
## BiologicalMaterial04 -4.979e-01 6.468e-01 -0.770 0.443579
## BiologicalMaterial05 2.422e-01 1.262e-01 1.918 0.058449 .
## BiologicalMaterial06 -3.594e-01 3.957e-01 -0.908 0.366402
## BiologicalMaterial07 -1.803e+00 1.187e+00 -1.518 0.132694
## BiologicalMaterial08 9.236e-01 7.973e-01 1.158 0.249993
## BiologicalMaterial09 -2.204e+00 1.769e+00 -1.246 0.216266
## BiologicalMaterial10 9.662e-01 1.679e+00 0.576 0.566451
## BiologicalMaterial11 -1.208e-01 9.490e-02 -1.272 0.206709
## BiologicalMaterial12 8.466e-01 7.290e-01 1.161 0.248803
## ManufacturingProcess01 8.844e-02 1.088e-01 0.813 0.418652
## ManufacturingProcess02 -9.604e-03 6.142e-02 -0.156 0.876123
## ManufacturingProcess03 -4.287e+00 6.737e+00 -0.636 0.526299
## ManufacturingProcess04 6.468e-02 3.600e-02 1.797 0.075967 .
## ManufacturingProcess05 8.987e-04 4.405e-03 0.204 0.838853
## ManufacturingProcess06 -1.584e-03 4.955e-02 -0.032 0.974574
## ManufacturingProcess07 -2.312e-01 2.471e-01 -0.936 0.352127
## ManufacturingProcess08 -9.281e-02 3.378e-01 -0.275 0.784210
## ManufacturingProcess09 2.221e-01 2.073e-01 1.072 0.286940
## ManufacturingProcess10 2.259e-01 6.229e-01 0.363 0.717743
## ManufacturingProcess11 5.030e-01 8.072e-01 0.623 0.534882
## ManufacturingProcess12 1.593e-04 1.395e-04 1.142 0.256645
## ManufacturingProcess13 -3.586e-01 4.603e-01 -0.779 0.438077
## ManufacturingProcess14 4.716e-03 1.175e-02 0.401 0.689193
## ManufacturingProcess15 2.474e-04 1.163e-02 0.021 0.983085
## ManufacturingProcess16 2.627e-04 5.335e-04 0.493 0.623648
## ManufacturingProcess17 1.099e-02 3.860e-01 0.028 0.977360
## ManufacturingProcess18 8.869e-03 6.270e-03 1.415 0.160897
## ManufacturingProcess19 -9.901e-03 1.246e-02 -0.795 0.428930
## ManufacturingProcess20 -2.624e-03 1.039e-02 -0.252 0.801302
## ManufacturingProcess21 NA NA NA NA
## ManufacturingProcess22 -3.883e-03 4.977e-02 -0.078 0.937990
## ManufacturingProcess23 -2.561e-03 1.018e-01 -0.025 0.979988
## ManufacturingProcess24 -3.015e-02 2.726e-02 -1.106 0.271886
## ManufacturingProcess25 2.042e-02 2.590e-02 0.788 0.432643
## ManufacturingProcess26 1.031e-02 1.416e-02 0.728 0.468573
## ManufacturingProcess27 -1.413e-02 1.298e-02 -1.088 0.279585
## ManufacturingProcess28 -1.317e-01 3.981e-02 -3.307 0.001387 **
## ManufacturingProcess29 8.536e-01 1.602e+00 0.533 0.595511
## ManufacturingProcess30 7.183e-01 1.476e+00 0.487 0.627846
## ManufacturingProcess31 8.014e-02 1.296e-01 0.619 0.537869
## ManufacturingProcess32 2.875e-01 7.361e-02 3.906 0.000189 ***
## ManufacturingProcess33 -2.954e-01 1.452e-01 -2.035 0.045020 *
## ManufacturingProcess34 -3.541e-01 3.172e+00 -0.112 0.911370
## ManufacturingProcess35 -9.016e-03 1.922e-02 -0.469 0.640182
## ManufacturingProcess36 2.794e+02 3.451e+02 0.810 0.420408
## ManufacturingProcess37 -4.519e-01 3.496e-01 -1.293 0.199666
## ManufacturingProcess38 -5.445e-01 3.022e-01 -1.802 0.075181 .
## ManufacturingProcess39 2.857e-02 1.609e-01 0.178 0.859478
## ManufacturingProcess40 2.308e+00 7.962e+00 0.290 0.772601
## ManufacturingProcess41 -6.233e-01 5.650e+00 -0.110 0.912414
## ManufacturingProcess42 1.229e-01 2.531e-01 0.486 0.628398
## ManufacturingProcess43 2.330e-01 4.020e-01 0.580 0.563721
## ManufacturingProcess44 -1.908e-01 1.412e+00 -0.135 0.892840
## ManufacturingProcess45 1.073e+00 6.444e-01 1.665 0.099599 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 84 degrees of freedom
## Multiple R-squared: 0.7948, Adjusted R-squared: 0.6581
## F-statistic: 5.812 on 56 and 84 DF, p-value: 3.462e-13
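The note "1 not defined because of singularities" means that one predictor is an exact linear combination of others, so `lm` cannot estimate its coefficient and reports `NA`. A small sketch on made-up data (all variable names hypothetical) shows how this arises and how to detect it:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$x3 <- d$x1 + d$x2                 # exact linear combination of x1 and x2
d$y  <- 1 + 2 * d$x1 - d$x2 + rnorm(20, sd = 0.1)

fit <- lm(y ~ ., data = d)
coef(fit)                           # the coefficient for x3 is NA

# The design matrix has 4 columns (intercept + 3 predictors) but rank 3
X <- model.matrix(fit)
c(columns = ncol(X), rank = qr(X)$rank)
```

The same rank deficiency is what triggers the "prediction from a rank-deficient fit may be misleading" warnings later on.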
# The train function generates a resampling estimate of performance. Because
# the training set size is not small, 10-fold cross-validation should produce
# reasonable estimates of model performance. The function trainControl specifies
# the type of resampling:
ctrl <- trainControl(method = "cv", number = 10)
model_lm1 <- train(x = X_train, y = Y_train, method = "lm", trControl = ctrl)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
model_lm1
## Linear Regression
##
## 141 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 126, 126, 126, 128, 126, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.598888 0.4552606 1.222505
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
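For intuition, 10-fold cross-validation can be written out by hand. The sketch below uses base R on simulated data and is not caret's internal code; the fold assignment and variable names are illustrative assumptions.

```r
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 3 + 2 * d$x + rnorm(100)

k     <- 10
folds <- sample(rep(seq_len(k), length.out = nrow(d)))  # random fold labels

rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = d[folds != i, ])     # fit on the other 9 folds
  pred <- predict(fit, newdata = d[folds == i, ])
  sqrt(mean((d$y[folds == i] - pred)^2))        # RMSE on the held-out fold
})

mean(rmse)   # cross-validated RMSE estimate
```

Averaging the per-fold errors gives a single resampled estimate comparable to the RMSE column reported by `train`.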
xyplot(Y_train ~ predict(model_lm1),
## plot the points (type = 'p') and a background grid ('g')
type = c("p", "g"),
xlab = "Predicted", ylab = "Observed")
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
xyplot(resid(model_lm1) ~ predict(model_lm1),
type = c("p", "g"),
xlab = "Predicted", ylab = "Residuals")
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
# To build a smaller model, remove predictors whose pairwise correlation with another predictor exceeds 0.9:
corThresh <- .9
tooHigh <- findCorrelation(cor(X_train), corThresh)
print(paste0(names(X_train)[tooHigh]))
## [1] "BiologicalMaterial02" "ManufacturingProcess26" "BiologicalMaterial11"
## [4] "BiologicalMaterial04" "ManufacturingProcess11" "ManufacturingProcess20"
## [7] "ManufacturingProcess42" "ManufacturingProcess40"
corrPred <- names(X_train)[tooHigh]
X_train_no_cor <- X_train[, -tooHigh]
X_test_no_cor <- X_test[, -tooHigh]
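The filtering step can be pictured with a much simpler greedy rule: for every pair of columns whose absolute correlation exceeds the cutoff, drop the later one. This is only a sketch of the idea on simulated data; `findCorrelation` uses its own rule for deciding which member of a pair to remove, and the helper name `drop_correlated` is hypothetical.

```r
set.seed(7)
x1 <- rnorm(100)
X  <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(100, sd = 0.05),   # nearly a copy of x1
                 x3 = rnorm(100))

# Hypothetical helper: names of columns that are highly correlated with
# an earlier column
drop_correlated <- function(X, cutoff = 0.9) {
  r <- abs(cor(X))
  r[lower.tri(r, diag = TRUE)] <- 0       # keep each pair once, ignore diagonal
  names(X)[apply(r, 2, function(col) any(col > cutoff))]
}

drop_correlated(X)
```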
model_lm1_no_cor <- train(X_train_no_cor, Y_train, method = "lm",
trControl = ctrl)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
model_lm1_no_cor
## Linear Regression
##
## 141 samples
## 49 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 127, 125, 126, 127, 128, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.430235 0.4855718 1.112113
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
xyplot(Y_train ~ predict(model_lm1_no_cor),
## plot the points (type = 'p') and a background grid ('g')
type = c("p", "g"),
xlab = "Predicted", ylab = "Observed")
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
xyplot(resid(model_lm1_no_cor) ~ predict(model_lm1_no_cor),
type = c("p", "g"),
xlab = "Predicted", ylab = "Residuals")
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
# PLS
# Use train to perform pre-processing and tuning together. The function first pre-processes the training set by centering and scaling it, then uses 10-fold cross-validation to evaluate PLS models with 1 to 20 components (latent variables).
model_pls_no_cor <- train(x=X_train_no_cor, y=Y_train,
method = "pls",
tuneLength = 20,
metric='Rsquared',
trControl = ctrl,
preProc = c("center", "scale"))
model_pls_no_cor
## Partial Least Squares
##
## 141 samples
## 49 predictor
##
## Pre-processing: centered (49), scaled (49)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 127, 128, 127, 128, 126, 128, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.352559 0.4237399 1.119060
## 2 1.306717 0.4916857 1.038064
## 3 1.280462 0.5205842 1.026928
## 4 1.358211 0.5082162 1.066627
## 5 1.496745 0.4939937 1.098108
## 6 1.535209 0.4863527 1.133805
## 7 1.595225 0.4877445 1.140873
## 8 1.665308 0.4821861 1.160681
## 9 1.728322 0.4855999 1.166969
## 10 1.760620 0.4812682 1.168557
## 11 1.772001 0.4801923 1.175224
## 12 1.803225 0.4745867 1.182951
## 13 1.795151 0.4771398 1.177729
## 14 1.802661 0.4816548 1.176691
## 15 1.806402 0.4852451 1.181149
## 16 1.802705 0.4839794 1.178522
## 17 1.785123 0.4848508 1.170325
## 18 1.748027 0.4905052 1.154816
## 19 1.706958 0.4938117 1.139354
## 20 1.667764 0.4959363 1.127674
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.
summary(model_pls_no_cor)
## Data: X dimension: 141 49
## Y dimension: 141 1
## Fit method: oscorespls
## Number of components considered: 3
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 15.42 27.72 35.52
## .outcome 48.31 59.06 65.74
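With `preProc = c("center", "scale")`, the centering and scaling statistics are estimated on the training data and then reused for new samples. A base-R sketch of that idea on toy numbers (not caret's own code) looks like this:

```r
train_x <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))
test_x  <- data.frame(a = c(3, 6),          b = c(30, 60))

mu  <- colMeans(train_x)            # training means
sdv <- apply(train_x, 2, sd)        # training standard deviations

train_s <- scale(train_x, center = mu, scale = sdv)
test_s  <- scale(test_x,  center = mu, scale = sdv)   # reuse training statistics

round(test_s, 3)
```

The first test row equals the training mean in both columns, so it maps to zero after scaling.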
# Elastic net
# The optimal model (selected by RMSE) had fraction = 0.2 and lambda = 0, which corresponds to a pure lasso fit.
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20))
model_ener_no_cor <- train(x=X_train_no_cor, y=Y_train,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
model_ener_no_cor
## Elasticnet
##
## 141 samples
## 49 predictor
##
## Pre-processing: centered (49), scaled (49)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 126, 126, 129, 127, 126, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 1.360451 0.5649771 1.0987539
## 0.00 0.10 1.203622 0.5969312 0.9915067
## 0.00 0.15 1.150417 0.6336913 0.9491414
## 0.00 0.20 1.121107 0.6619976 0.9166624
## 0.00 0.25 1.202878 0.6444083 0.9416305
## 0.00 0.30 1.299302 0.6280623 0.9653624
## 0.00 0.35 1.317518 0.6218992 0.9787508
## 0.00 0.40 1.348375 0.6134200 0.9992157
## 0.00 0.45 1.389790 0.6013857 1.0241425
## 0.00 0.50 1.425620 0.5882141 1.0481056
## 0.00 0.55 1.455330 0.5771522 1.0682668
## 0.00 0.60 1.467587 0.5691349 1.0815281
## 0.00 0.65 1.458131 0.5622400 1.0837084
## 0.00 0.70 1.478584 0.5498588 1.1011325
## 0.00 0.75 1.482400 0.5425422 1.1109586
## 0.00 0.80 1.492231 0.5364563 1.1215561
## 0.00 0.85 1.503322 0.5306192 1.1329810
## 0.00 0.90 1.513973 0.5257701 1.1440097
## 0.00 0.95 1.521371 0.5227101 1.1525218
## 0.00 1.00 1.529539 0.5202044 1.1606609
## 0.01 0.05 1.487815 0.5512335 1.2082114
## 0.01 0.10 1.293331 0.5909135 1.0516703
## 0.01 0.15 1.202649 0.6016451 0.9958021
## 0.01 0.20 1.170199 0.6136118 0.9677014
## 0.01 0.25 1.147817 0.6299576 0.9508650
## 0.01 0.30 1.127903 0.6473051 0.9294303
## 0.01 0.35 1.127934 0.6547241 0.9247375
## 0.01 0.40 1.144058 0.6538066 0.9351020
## 0.01 0.45 1.238617 0.6216153 0.9651042
## 0.01 0.50 1.301059 0.6085674 0.9815664
## 0.01 0.55 1.342435 0.6023693 0.9931510
## 0.01 0.60 1.362799 0.5984926 1.0056976
## 0.01 0.65 1.381257 0.5926075 1.0205206
## 0.01 0.70 1.418299 0.5834401 1.0430805
## 0.01 0.75 1.448181 0.5745608 1.0631203
## 0.01 0.80 1.468924 0.5674403 1.0776766
## 0.01 0.85 1.486189 0.5616798 1.0891677
## 0.01 0.90 1.502679 0.5562293 1.1000668
## 0.01 0.95 1.517548 0.5517330 1.1109006
## 0.01 1.00 1.515870 0.5492250 1.1163279
## 0.10 0.05 1.591250 0.5022393 1.2909029
## 0.10 0.10 1.442471 0.5613759 1.1695561
## 0.10 0.15 1.320348 0.5891172 1.0713270
## 0.10 0.20 1.232800 0.5988620 1.0139788
## 0.10 0.25 1.199567 0.6007628 0.9959314
## 0.10 0.30 1.177890 0.6075256 0.9764046
## 0.10 0.35 1.164760 0.6157855 0.9624020
## 0.10 0.40 1.151994 0.6264089 0.9506656
## 0.10 0.45 1.139487 0.6377487 0.9395184
## 0.10 0.50 1.136233 0.6440727 0.9360874
## 0.10 0.55 1.149283 0.6438029 0.9482037
## 0.10 0.60 1.203148 0.6250287 0.9717133
## 0.10 0.65 1.260161 0.6103432 0.9906007
## 0.10 0.70 1.297422 0.6024558 1.0034303
## 0.10 0.75 1.326758 0.5963442 1.0144447
## 0.10 0.80 1.359087 0.5904091 1.0271307
## 0.10 0.85 1.390280 0.5850707 1.0400624
## 0.10 0.90 1.418278 0.5814148 1.0512445
## 0.10 0.95 1.446559 0.5781043 1.0619710
## 0.10 1.00 1.473148 0.5748002 1.0718274
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.
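The tuning grid crosses 3 values of `lambda` with 20 values of `fraction`, giving 60 candidate models. The sketch below rebuilds such a grid and shows how the row with the smallest RMSE would be picked from a results table. The RMSE column is filled with made-up numbers purely to demonstrate the selection step.

```r
enetGrid <- expand.grid(lambda   = c(0, 0.01, 0.1),
                        fraction = seq(0.05, 1, length.out = 20))
nrow(enetGrid)                       # 3 * 20 = 60 candidates

# Made-up RMSE values, only to demonstrate the selection step
set.seed(3)
results <- cbind(enetGrid, RMSE = runif(nrow(enetGrid), 1.1, 1.6))
best    <- results[which.min(results$RMSE), ]
best
```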
test_model <- function(modelName, predData) {
  # Predict on the held-out data; suppress the rank-deficiency warnings from lm
  # locally instead of changing the global warning option
  predicted_result <- suppressWarnings(predict(modelName, predData))
  # Collect the observed (global Y_test) and predicted values into a data frame,
  # then use the caret function defaultSummary to estimate test set performance
  DT_model_lm_pred <- data.frame(obs = Y_test, pred = predicted_result)
  defaultSummary(DT_model_lm_pred)
}
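For reference, the three numbers reported for a regression problem can be reproduced by hand. The sketch assumes the usual definitions (root mean squared error, squared correlation between observed and predicted values, and mean absolute error) and uses made-up values.

```r
obs  <- c(40.1, 38.5, 41.2, 39.8, 42.0)
pred <- c(39.6, 38.9, 40.7, 40.5, 41.4)

metrics <- c(RMSE     = sqrt(mean((obs - pred)^2)),
             Rsquared = cor(obs, pred)^2,
             MAE      = mean(abs(obs - pred)))
round(metrics, 4)
```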
model_lm1$results[,2:4]
## RMSE Rsquared MAE
## 1 1.598888 0.4552606 1.222505
test_model(model_lm1,X_test)
## RMSE Rsquared MAE
## 21.796607337 0.002641053 5.433700967
model_lm1_no_cor$results[,2:4]
## RMSE Rsquared MAE
## 1 1.430235 0.4855718 1.112113
test_model(model_lm1_no_cor,X_test_no_cor)
## RMSE Rsquared MAE
## 11.478834110 0.006228559 3.648167475
model_pls_no_cor$results[3,2:4]
## RMSE Rsquared MAE
## 3 1.280462 0.5205842 1.026928
test_model(model_pls_no_cor,X_test_no_cor)
## RMSE Rsquared MAE
## 1.3619794 0.5796786 1.0054176
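# Cross-validated performance of the selected model (fraction = 0.2, lambda = 0)
merge(model_ener_no_cor$bestTune, model_ener_no_cor$results)[, c("RMSE", "Rsquared", "MAE")]
## RMSE Rsquared MAE
## 1 1.121107 0.6619976 0.9166624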
test_model(model_ener_no_cor,X_test_no_cor)
## RMSE Rsquared MAE
## 1.4990011 0.4716679 1.0660539
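Collecting the test-set numbers printed above into one table makes the comparison easier to read. The values are copied (rounded) from the output; the data frame and the model labels are just a convenience.

```r
test_perf <- data.frame(
  model    = c("lm (all predictors)", "lm (filtered)",
               "PLS (filtered)", "enet (filtered)"),
  RMSE     = c(21.796607, 11.478834, 1.361979, 1.499001),
  Rsquared = c(0.002641, 0.006229, 0.579679, 0.471668),
  MAE      = c(5.433701, 3.648167, 1.005418, 1.066054)
)
test_perf[order(test_perf$RMSE), ]   # best (lowest) test RMSE first
```

On this particular split, the two ordinary linear models generalize poorly (test RMSE above 11 against a cross-validated RMSE near 1.5), while PLS and the elastic net have test errors of the same order as their resampled estimates.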
model_pls_no_cor$finalModel$coefficients
## , , 1 comps
##
## .outcome
## BiologicalMaterial01 0.0811917468
## BiologicalMaterial03 0.1106421140
## BiologicalMaterial05 0.0622645198
## BiologicalMaterial06 0.1171179762
## BiologicalMaterial07 -0.0257500462
## BiologicalMaterial08 0.0875351619
## BiologicalMaterial09 0.0147886029
## BiologicalMaterial10 0.0415764598
## BiologicalMaterial12 0.0869532281
## ManufacturingProcess01 -0.0222539211
## ManufacturingProcess02 -0.0387653149
## ManufacturingProcess03 -0.0266235258
## ManufacturingProcess04 -0.0641538334
## ManufacturingProcess05 0.0281483292
## ManufacturingProcess06 0.0979870252
## ManufacturingProcess07 -0.0091753827
## ManufacturingProcess08 -0.0002137590
## ManufacturingProcess09 0.1193612782
## ManufacturingProcess10 0.0608774544
## ManufacturingProcess12 0.1007042305
## ManufacturingProcess13 -0.1353524933
## ManufacturingProcess14 -0.0107695509
## ManufacturingProcess15 0.0440265789
## ManufacturingProcess16 -0.0109265287
## ManufacturingProcess17 -0.1141015834
## ManufacturingProcess18 -0.0326011136
## ManufacturingProcess19 0.0287368824
## ManufacturingProcess21 -0.0074632652
## ManufacturingProcess22 0.0110370864
## ManufacturingProcess23 -0.0247043182
## ManufacturingProcess24 -0.0570364481
## ManufacturingProcess25 -0.0200951042
## ManufacturingProcess27 -0.0367147669
## ManufacturingProcess28 0.0571427772
## ManufacturingProcess29 0.0866245715
## ManufacturingProcess30 0.0720875065
## ManufacturingProcess31 -0.0820753867
## ManufacturingProcess32 0.1681311671
## ManufacturingProcess33 0.1143910170
## ManufacturingProcess34 0.0466278258
## ManufacturingProcess35 -0.0486929273
## ManufacturingProcess36 -0.1534944068
## ManufacturingProcess37 -0.0379425395
## ManufacturingProcess38 -0.0218278231
## ManufacturingProcess39 0.0135977527
## ManufacturingProcess41 0.0008307844
## ManufacturingProcess43 0.0459291539
## ManufacturingProcess44 0.0196029935
## ManufacturingProcess45 0.0099836621
##
## , , 2 comps
##
## .outcome
## BiologicalMaterial01 0.038728848
## BiologicalMaterial03 0.125294125
## BiologicalMaterial05 0.045689593
## BiologicalMaterial06 0.101309761
## BiologicalMaterial07 -0.068979671
## BiologicalMaterial08 0.039333790
## BiologicalMaterial09 -0.011806305
## BiologicalMaterial10 -0.011027118
## BiologicalMaterial12 0.045596186
## ManufacturingProcess01 0.011953281
## ManufacturingProcess02 0.026584953
## ManufacturingProcess03 -0.028776611
## ManufacturingProcess04 -0.005191144
## ManufacturingProcess05 -0.003271607
## ManufacturingProcess06 0.140708135
## ManufacturingProcess07 -0.012400682
## ManufacturingProcess08 0.023415046
## ManufacturingProcess09 0.188809339
## ManufacturingProcess10 0.060623045
## ManufacturingProcess12 0.132390227
## ManufacturingProcess13 -0.225619811
## ManufacturingProcess14 -0.021704397
## ManufacturingProcess15 0.042825857
## ManufacturingProcess16 -0.016027994
## ManufacturingProcess17 -0.221037575
## ManufacturingProcess18 -0.073063809
## ManufacturingProcess19 -0.005623751
## ManufacturingProcess21 -0.064868419
## ManufacturingProcess22 0.030744952
## ManufacturingProcess23 -0.019827781
## ManufacturingProcess24 -0.051491311
## ManufacturingProcess25 -0.057138217
## ManufacturingProcess27 -0.093870931
## ManufacturingProcess28 -0.010603997
## ManufacturingProcess29 0.069599750
## ManufacturingProcess30 0.097386951
## ManufacturingProcess31 -0.075013646
## ManufacturingProcess32 0.254288589
## ManufacturingProcess33 0.144151279
## ManufacturingProcess34 0.110326436
## ManufacturingProcess35 -0.054750386
## ManufacturingProcess36 -0.222137286
## ManufacturingProcess37 -0.102647668
## ManufacturingProcess38 -0.018903746
## ManufacturingProcess39 0.061484189
## ManufacturingProcess41 -0.016901673
## ManufacturingProcess43 0.028799213
## ManufacturingProcess44 0.069634004
## ManufacturingProcess45 0.051078205
##
## , , 3 comps
##
## .outcome
## BiologicalMaterial01 0.016519671
## BiologicalMaterial03 0.138197492
## BiologicalMaterial05 0.061528275
## BiologicalMaterial06 0.097940280
## BiologicalMaterial07 -0.129465235
## BiologicalMaterial08 0.006614475
## BiologicalMaterial09 -0.061456843
## BiologicalMaterial10 -0.041986656
## BiologicalMaterial12 0.013111877
## ManufacturingProcess01 0.033499363
## ManufacturingProcess02 0.044413703
## ManufacturingProcess03 -0.011359615
## ManufacturingProcess04 0.086305210
## ManufacturingProcess05 -0.029540856
## ManufacturingProcess06 0.149073712
## ManufacturingProcess07 -0.033639851
## ManufacturingProcess08 0.037474073
## ManufacturingProcess09 0.196218539
## ManufacturingProcess10 0.017759283
## ManufacturingProcess12 0.091342167
## ManufacturingProcess13 -0.244466778
## ManufacturingProcess14 0.030057896
## ManufacturingProcess15 0.097093565
## ManufacturingProcess16 -0.032197078
## ManufacturingProcess17 -0.249101118
## ManufacturingProcess18 -0.019991511
## ManufacturingProcess19 0.052289701
## ManufacturingProcess21 -0.086605459
## ManufacturingProcess22 0.042136105
## ManufacturingProcess23 -0.021173787
## ManufacturingProcess24 -0.043474516
## ManufacturingProcess25 0.008451286
## ManufacturingProcess27 -0.072953203
## ManufacturingProcess28 -0.093560592
## ManufacturingProcess29 0.118542636
## ManufacturingProcess30 0.061625205
## ManufacturingProcess31 -0.043179763
## ManufacturingProcess32 0.389278924
## ManufacturingProcess33 0.197038942
## ManufacturingProcess34 0.186939703
## ManufacturingProcess35 -0.048284180
## ManufacturingProcess36 -0.316363155
## ManufacturingProcess37 -0.179202964
## ManufacturingProcess38 -0.014733380
## ManufacturingProcess39 0.135910129
## ManufacturingProcess41 -0.046145686
## ManufacturingProcess43 0.027099839
## ManufacturingProcess44 0.135790131
## ManufacturingProcess45 0.113996708
# It appears that the ManufacturingProcess predictors are more important. Alternatively, the varImp function can be used to rank the importance of predictors:
varImp(model_ener_no_cor)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 49)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 88.93
## ManufacturingProcess36 83.35
## ManufacturingProcess17 76.17
## BiologicalMaterial06 69.06
## BiologicalMaterial03 61.18
## BiologicalMaterial12 55.41
## ManufacturingProcess09 54.30
## ManufacturingProcess31 48.38
## ManufacturingProcess06 46.80
## ManufacturingProcess33 46.29
## ManufacturingProcess12 35.88
## BiologicalMaterial08 33.97
## BiologicalMaterial09 33.38
## ManufacturingProcess27 32.82
## ManufacturingProcess18 30.36
## ManufacturingProcess29 26.55
## ManufacturingProcess25 25.90
## BiologicalMaterial01 25.46
## ManufacturingProcess01 24.56
varImp(model_pls_no_cor)
## pls variable importance
##
## only 20 most important variables shown (out of 49)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 87.83
## ManufacturingProcess13 75.86
## ManufacturingProcess17 67.67
## ManufacturingProcess09 64.68
## ManufacturingProcess33 60.89
## BiologicalMaterial06 57.25
## BiologicalMaterial03 54.53
## ManufacturingProcess12 53.39
## ManufacturingProcess06 50.99
## BiologicalMaterial08 48.03
## BiologicalMaterial12 46.95
## ManufacturingProcess29 45.21
## BiologicalMaterial01 43.46
## ManufacturingProcess04 41.56
## ManufacturingProcess31 40.61
## ManufacturingProcess28 38.42
## ManufacturingProcess30 37.90
## ManufacturingProcess34 32.22
## BiologicalMaterial05 30.58
Looking only at the 3-component coefficients, the ManufacturingProcess predictors appear to be the most important, since their coefficients are generally larger in absolute value than those of the BiologicalMaterial predictors. ManufacturingProcess32 has the largest coefficient, at 0.389278924.
The evaluation on the test set suggests that the PLS model is best, with R^2 = 0.7202954. When all the models are fit on the data with correlated predictors removed, the PLS model has better RMSE and R-squared than the other models on both the training and test sets. Train: RMSE = 1.666406, Rsquared = 0.4722788. Test: RMSE = 1.0391511, Rsquared = 0.7202954.
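For reference, the two metrics quoted above can be computed by hand. The sketch below assumes R-squared is the squared correlation between observed and predicted values, which is how caret's postResample reports it. The vectors `obs` and `pred` are made-up illustrative numbers, not values from this dataset, and `rmse` and `rsq` are helper names introduced here.

```r
# Hypothetical observed and predicted yields (illustrative only, not from the data)
obs  <- c(38.0, 40.5, 41.2, 39.3, 42.1)
pred <- c(38.6, 40.1, 40.7, 39.9, 41.5)

# Root mean squared error: typical size of a prediction error, in yield units
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R-squared taken as the squared Pearson correlation (assumption stated above)
rsq <- function(obs, pred) cor(obs, pred)^2

c(RMSE = rmse(obs, pred), Rsquared = rsq(obs, pred))
```

A lower RMSE and a higher R-squared both point to a better fit, which is the basis for preferring the PLS model here.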
14 of the 20 predictors in each list are ManufacturingProcess predictors, which suggests that this group is more important than the BiologicalMaterial group.
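The split between the two predictor families can be checked directly. This sketch hard-codes the 20 names from the PLS varImp output above instead of reading them from the fitted model, so that it runs on its own. `top20_pls` and `family` are names introduced here for illustration.

```r
# Top-20 predictors copied from the PLS varImp output above
top20_pls <- c(
  "ManufacturingProcess32", "ManufacturingProcess36", "ManufacturingProcess13",
  "ManufacturingProcess17", "ManufacturingProcess09", "ManufacturingProcess33",
  "BiologicalMaterial06",   "BiologicalMaterial03",   "ManufacturingProcess12",
  "ManufacturingProcess06", "BiologicalMaterial08",   "BiologicalMaterial12",
  "ManufacturingProcess29", "BiologicalMaterial01",   "ManufacturingProcess04",
  "ManufacturingProcess31", "ManufacturingProcess28", "ManufacturingProcess30",
  "ManufacturingProcess34", "BiologicalMaterial05"
)

# Drop the trailing digits to recover the predictor family
family <- sub("[0-9]+$", "", top20_pls)

# Count how many of the top 20 fall in each family
table(family)
```

The same tally could be run on the elastic net list by swapping in its 20 names.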
We can compare the non-zero coefficients. Elastic net is a linear regression model, so its coefficients directly describe how each predictor relates to the target: a positive coefficient means the predicted yield rises as the predictor increases, while a negative coefficient means the predicted yield falls.
coeffs <- elasticnet::predict.enet(model_ener_no_cor$finalModel, s=model_ener_no_cor$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
# We can compare the non-zero coefficients by taking their absolute value, and then sorting them:
coeffs.sorted <- abs(coeffs)
coeffs.sorted <- coeffs.sorted[coeffs.sorted>0]
(coeffs.sorted <- sort(coeffs.sorted, decreasing = TRUE))
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess13
## 0.9154689661 0.3505546312 0.2680540630
## ManufacturingProcess17 ManufacturingProcess28 ManufacturingProcess29
## 0.2504668291 0.2135184795 0.2047236372
## ManufacturingProcess39 BiologicalMaterial05 ManufacturingProcess04
## 0.2039668257 0.1604384955 0.1570178065
## ManufacturingProcess37 ManufacturingProcess34 BiologicalMaterial03
## 0.1483763891 0.1451908440 0.1318903648
## ManufacturingProcess45 ManufacturingProcess36 ManufacturingProcess07
## 0.0991057249 0.0748144128 0.0618134692
## ManufacturingProcess35 ManufacturingProcess03 ManufacturingProcess06
## 0.0594179081 0.0408640219 0.0226108232
## ManufacturingProcess15 ManufacturingProcess01 BiologicalMaterial12
## 0.0217963478 0.0167725117 0.0128089987
## BiologicalMaterial10 BiologicalMaterial07 ManufacturingProcess44
## 0.0101762033 0.0066443935 0.0004514611
coeffs.mp <- coeffs[names(coeffs.sorted[grep('ManufacturingProcess', names(coeffs.sorted))])]
coeffs.mp[coeffs.mp>0]
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess29
## 0.9154689661 0.3505546312 0.2047236372
## ManufacturingProcess39 ManufacturingProcess04 ManufacturingProcess34
## 0.2039668257 0.1570178065 0.1451908440
## ManufacturingProcess45 ManufacturingProcess06 ManufacturingProcess15
## 0.0991057249 0.0226108232 0.0217963478
## ManufacturingProcess01 ManufacturingProcess44
## 0.0167725117 0.0004514611
coeffs.mp[coeffs.mp<0]
## ManufacturingProcess13 ManufacturingProcess17 ManufacturingProcess28
## -0.26805406 -0.25046683 -0.21351848
## ManufacturingProcess37 ManufacturingProcess36 ManufacturingProcess07
## -0.14837639 -0.07481441 -0.06181347
## ManufacturingProcess35 ManufacturingProcess03
## -0.05941791 -0.04086402
For the ManufacturingProcess predictors with negative coefficients, we would lower the corresponding process settings, since increasing them decreases the predicted yield. Similarly, raising the settings of ManufacturingProcess predictors with positive coefficients should help increase the yield.
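A minimal sketch of how the signs translate into suggested adjustments, using four signed coefficients copied from the output above. It assumes the fitted linear relationship still holds when a setting is deliberately changed, which the model alone cannot confirm. `coefs` and `direction` are illustrative names.

```r
# Four signed elastic net coefficients copied from the output above
coefs <- c(
  ManufacturingProcess32 =  0.9154689661,
  ManufacturingProcess09 =  0.3505546312,
  ManufacturingProcess13 = -0.26805406,
  ManufacturingProcess17 = -0.25046683
)

# Order by absolute size so the most influential settings come first
coefs <- coefs[order(abs(coefs), decreasing = TRUE)]

# Positive coefficient: raise the setting; negative coefficient: lower it
direction <- ifelse(coefs > 0, "increase setting", "decrease setting")

data.frame(coefficient = coefs, to_raise_yield = direction)
```

Any change suggested this way would still need to be confirmed on the actual process before it is adopted.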