Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
library(caret)  # provides nearZeroVar, createDataPartition, train, and postResample
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The dimensions of fingerprints confirm 165 rows (compounds) and 1,107 columns (predictors).
dim(fingerprints)
## [1] 165 1107
Only 388 predictors remain after applying the nearZeroVar function, which removes predictors with near-zero variance.
fingerprints_reduced <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fingerprints_reduced)
## [1] 165 388
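To make the filtering step concrete, the following is a minimal base-R sketch of the kind of rule nearZeroVar applies: a column is flagged when the ratio of its most common value to its second most common value is large and it has few distinct values. The helper name near_zero_var_sketch and the toy matrix are illustrative assumptions, and the cutoffs mirror what I understand to be caret's defaults (freqCut = 95/5, uniqueCut = 10); it is not caret's implementation.

```r
# Hypothetical helper: flags columns using a frequency-ratio cutoff
# and a percent-unique cutoff (assumed to match caret's defaults).
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(seq_len(ncol(x)), function(j) {
    counts <- sort(table(x[, j]), decreasing = TRUE)
    # A constant column has a single distinct value and is always flagged
    if (length(counts) < 2) return(TRUE)
    freq_ratio <- counts[[1]] / counts[[2]]
    pct_unique <- 100 * length(counts) / nrow(x)
    freq_ratio > freq_cut && pct_unique <= unique_cut
  }, logical(1))
  which(flagged)
}

# Toy binary matrix (made-up data, not the fingerprints matrix)
toy <- cbind(
  balanced = rep(c(0, 1), 50),   # 50/50 split, kept
  rare     = c(rep(0, 99), 1),   # 99:1 split, flagged
  constant = rep(0, 100)         # zero variance, flagged
)
drop_idx <- near_zero_var_sketch(toy)

# Guard against an empty index before using negative indexing
toy_reduced <- if (length(drop_idx) > 0) toy[, -drop_idx, drop = FALSE] else toy
colnames(toy_reduced)
```

The guard matters because negative indexing with an empty vector selects no columns at all. In the analysis above nearZeroVar did return indices, so the direct form worked.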
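The data are split so that 80% of the compounds are used for training and the remainder are held out as a test set.
A PLS model is tuned with 10-fold cross-validation. With R^2 as the selection metric, ncomp = 7 is optimal, with a cross-validated R^2 of 0.5394.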
set.seed(1)
train_selection <- createDataPartition(permeability, p = .80, list = FALSE)
train_x <- fingerprints_reduced[train_selection, ]
test_x <- fingerprints_reduced[-train_selection, ]
train_y <- permeability[train_selection, ]
test_y <- permeability[-train_selection, ]
PLS_fit <- train(x = train_x, y = train_y,
method = "pls",
metric = "Rsquared",
tuneLength = 20,
trControl = trainControl(method = "cv"),
preProcess = c('center', 'scale')
)
PLS_result <- PLS_fit$results
PLS_fit
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 121, 119, 120, 120, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.35980 0.3077288 9.992025
## 2 12.02165 0.4264926 8.622295
## 3 11.61784 0.4770111 8.806476
## 4 11.48283 0.4836406 8.724123
## 5 11.26663 0.5082593 8.723191
## 6 11.27614 0.5240074 8.677272
## 7 11.27779 0.5394212 8.699414
## 8 11.75960 0.5085816 9.070277
## 9 12.09836 0.4868781 9.217395
## 10 12.17397 0.4923159 9.270026
## 11 12.26365 0.4945505 9.284900
## 12 12.30174 0.4931934 9.280955
## 13 12.54734 0.4838175 9.507611
## 14 12.79749 0.4702011 9.705002
## 15 12.86872 0.4653057 9.643805
## 16 13.10561 0.4553447 9.865946
## 17 13.49633 0.4417598 10.152908
## 18 13.81425 0.4309356 10.373471
## 19 13.95274 0.4321740 10.471727
## 20 14.32902 0.4230356 10.692783
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 7.
plot(PLS_fit)
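The choice of ncomp depends on the selection metric. As a small check using the first ten rows of the resampling table above (values copied from the output; the remaining rows are worse on both metrics), selecting on the smallest RMSE instead of the largest R^2 would pick a different number of components. The data frame name pls_cv is an assumption made for this illustration.

```r
# Cross-validated results for ncomp = 1..10, copied from the printed table
pls_cv <- data.frame(
  ncomp    = 1:10,
  RMSE     = c(13.35980, 12.02165, 11.61784, 11.48283, 11.26663,
               11.27614, 11.27779, 11.75960, 12.09836, 12.17397),
  Rsquared = c(0.3077288, 0.4264926, 0.4770111, 0.4836406, 0.5082593,
               0.5240074, 0.5394212, 0.5085816, 0.4868781, 0.4923159)
)

# Pick the number of components under each selection rule
best_by_r2   <- pls_cv$ncomp[which.max(pls_cv$Rsquared)]
best_by_rmse <- pls_cv$ncomp[which.min(pls_cv$RMSE)]
c(by_Rsquared = best_by_r2, by_RMSE = best_by_rmse)
```

R^2 selects seven components and RMSE selects five. The RMSE values for five to seven components differ by only about 0.01, so either choice is defensible.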
The test-set R^2 is 0.4457, lower than the cross-validated estimate of 0.5394.
PLS_predict <- predict(PLS_fit, newdata=test_x)
postResample(pred=PLS_predict, obs=test_y)
## RMSE Rsquared MAE
## 10.8979377 0.4457228 8.4569465
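For reference, the three numbers reported by postResample can be reproduced by hand. The sketch below assumes, as I understand caret's behavior, that Rsquared is the squared Pearson correlation between predictions and observations rather than 1 - SSE/SST. The function name post_resample_sketch and the toy vectors are hypothetical.

```r
# Hypothetical re-implementation of the three regression metrics
post_resample_sketch <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared Pearson correlation
    MAE      = mean(abs(pred - obs))        # mean absolute error
  )
}

# Made-up observations and predictions; every error has magnitude 0.5
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2.5, 2.5, 4.5)
metrics <- post_resample_sketch(pred, obs)
metrics
```

Because a squared correlation ignores bias and scale, a model can have a high Rsquared and still a large RMSE, which is one reason to read the two metrics together.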
Ridge regression.
R^2 was used to select the optimal model, which corresponds to lambda = 0.4.
set.seed(1)
ridge_fit <- train(x=train_x, y=train_y,
method='ridge', metric='Rsquared',
tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
ridge_fit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 24.02881 0.2627497 16.031438
## 0.1 11.83292 0.5081518 9.092937
## 0.2 11.58981 0.5330773 8.857077
## 0.3 11.79414 0.5390176 9.009380
## 0.4 12.15851 0.5405512 9.284284
## 0.5 12.62564 0.5397839 9.662863
## 0.6 13.17255 0.5384263 10.131791
## 0.7 13.76370 0.5366428 10.603431
## 0.8 14.40875 0.5347296 11.117149
## 0.9 15.08011 0.5328512 11.635077
## 1.0 15.78570 0.5309111 12.226047
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.4.
plot(ridge_fit)
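The very poor fit at lambda = 0 likely reflects the fact that there are more predictors (388) than training samples (133), which makes the unpenalized least squares problem ill-posed; any positive penalty shrinks the coefficients. The sketch below shows that shrinkage with the textbook closed form beta = (X'X + lambda I)^(-1) X'y on simulated data. It is a generic illustration with assumed names (ridge_coef, X, y), and its lambda scale is not necessarily the one used by caret's ridge method.

```r
set.seed(1)

# Simulated, standardized predictors and a centered response
n <- 50
p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(3, -2, 0, 0, 1) + rnorm(n))
y <- y - mean(y)

# Textbook ridge solution: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# Euclidean norm of the coefficient vector for increasing penalties
lambdas <- c(0, 1, 10, 100)
norms <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
data.frame(lambda = lambdas, coef_norm = norms)
```

With lambda = 0 the function reduces to ordinary least squares, and the coefficient norm decreases as the penalty grows.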
Lasso regression.
R^2 was used to select the optimal model, which corresponds to fraction = 0.15.
set.seed(1)
lasso_fit <- train(x=train_x, y=train_y,
method='lasso',metric='Rsquared',
tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x = train_x, y = train_y, method = "lasso", metric
## = "Rsquared", : missing values found in aggregated results
lasso_fit
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 15.69625 NaN 12.538724
## 0.05 12.24969 0.4843311 9.116047
## 0.10 11.56814 0.4836731 8.205552
## 0.15 11.56574 0.4898894 8.221542
## 0.20 11.79666 0.4782172 8.428298
## 0.25 12.13109 0.4736244 8.810463
## 0.30 12.59829 0.4660510 9.277051
## 0.35 13.32724 0.4463834 9.791144
## 0.40 14.14116 0.4245198 10.327665
## 0.45 14.99053 0.4048185 10.809800
## 0.50 15.82414 0.3884961 11.289059
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.15.
plot(lasso_fit)
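The NaN in the fraction = 0 row, and the accompanying warnings, most likely arise because a fully shrunk lasso has all coefficients at zero and so predicts the same value for every compound; a squared correlation cannot be computed against a constant. A minimal sketch with made-up numbers (obs and constant_pred are hypothetical):

```r
# Made-up observed values and a constant prediction (the mean of obs)
obs <- c(2.1, 0.5, 7.3, 12.8, 1.4)
constant_pred <- rep(mean(obs), length(obs))

# Correlation with a zero-variance vector is undefined, so R returns NA
r <- suppressWarnings(cor(constant_pred, obs))
is.na(r)

# RMSE is still well defined for the same predictions
rmse <- sqrt(mean((constant_pred - obs)^2))
is.finite(rmse)
```

This matches the table above, where the fraction = 0 row has finite RMSE and MAE but no Rsquared.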
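Ridge has the highest mean cross-validated R^2 (0.5406) among the three models, but only marginally ahead of PLS (0.5394); PLS has the lowest mean RMSE (11.28) and lasso the lowest mean MAE (8.22). Note that ridge and lasso were resampled on the same folds, while the PLS folds differ (compare the "Summary of sample sizes" lines), so the comparison with PLS is not strictly paired.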
summary(resamples(list(PLS=PLS_fit, Ridge=ridge_fit, Lasso=lasso_fit)))
##
## Call:
## summary.resamples(object = resamples(list(PLS = PLS_fit, Ridge
## = ridge_fit, Lasso = lasso_fit)))
##
## Models: PLS, Ridge, Lasso
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 6.543535 7.220561 8.824647 8.699414 9.785602 10.93386 0
## Ridge 6.802083 7.680192 9.047882 9.284284 11.031392 11.95240 0
## Lasso 6.660690 7.295599 7.918300 8.221542 8.329301 11.98442 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 8.018162 9.396614 10.99087 11.27779 13.14769 15.24661 0
## Ridge 9.902231 10.446352 11.48805 12.15851 13.75999 15.48795 0
## Lasso 8.975670 10.637842 11.16021 11.56574 12.78378 14.84629 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.04977656 0.4365052 0.5847232 0.5394212 0.7029962 0.8759470 0
## Ridge 0.18314118 0.4019344 0.5990213 0.5405512 0.6504959 0.8222652 0
## Lasso 0.18010290 0.3056181 0.4640828 0.4898894 0.6740984 0.8337504 0
The mean absolute error of the three models is roughly 8 to 9 permeability units. The histogram of permeability below shows that the distribution is right skewed, with most values under 10, so an error of that size is as large as a typical measurement. For that reason, I would not recommend replacing the lab experiment with any of these models.
hist(permeability)
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
The data frame ChemicalManufacturingProcess contains the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs, along with Yield, the percent yield for each run.
Below is a summary of ChemicalManufacturingProcess. As you can see, several of the manufacturing process predictors contain NA's.
summary(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02
## Min. :35.25 Min. :4.580 Min. :46.87
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68
## Median :39.97 Median :6.305 Median :55.09
## Mean :40.18 Mean :6.411 Mean :55.69
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74
## Max. :46.34 Max. :8.810 Max. :64.75
##
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## Min. :56.97 Min. : 9.38 Min. :13.24
## 1st Qu.:64.98 1st Qu.:11.24 1st Qu.:17.23
## Median :67.22 Median :12.10 Median :18.49
## Mean :67.70 Mean :12.35 Mean :18.60
## 3rd Qu.:70.43 3rd Qu.:13.22 3rd Qu.:19.90
## Max. :78.25 Max. :23.09 Max. :24.85
##
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## Min. :40.60 Min. :100.0 Min. :15.88
## 1st Qu.:46.05 1st Qu.:100.0 1st Qu.:17.06
## Median :48.46 Median :100.0 Median :17.51
## Mean :48.91 Mean :100.0 Mean :17.49
## 3rd Qu.:51.34 3rd Qu.:100.0 3rd Qu.:17.88
## Max. :59.38 Max. :100.8 Max. :19.14
##
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## Min. :11.44 Min. :1.770 Min. :135.8
## 1st Qu.:12.60 1st Qu.:2.460 1st Qu.:143.8
## Median :12.84 Median :2.710 Median :146.1
## Mean :12.85 Mean :2.801 Mean :147.0
## 3rd Qu.:13.13 3rd Qu.:2.990 3rd Qu.:149.6
## Max. :14.08 Max. :6.870 Max. :158.7
##
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## Min. :18.35 Min. : 0.00 Min. : 0.00
## 1st Qu.:19.73 1st Qu.:10.80 1st Qu.:19.30
## Median :20.12 Median :11.40 Median :21.00
## Mean :20.20 Mean :11.21 Mean :16.68
## 3rd Qu.:20.75 3rd Qu.:12.15 3rd Qu.:21.50
## Max. :22.21 Max. :14.10 Max. :22.50
## NA's :1 NA's :3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## Min. :1.47 Min. :911.0 Min. : 923.0
## 1st Qu.:1.53 1st Qu.:928.0 1st Qu.: 986.8
## Median :1.54 Median :934.0 Median : 999.2
## Mean :1.54 Mean :931.9 Mean :1001.7
## 3rd Qu.:1.55 3rd Qu.:936.0 3rd Qu.:1008.9
## Max. :1.60 Max. :946.0 Max. :1175.3
## NA's :15 NA's :1 NA's :1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## Min. :203.0 Min. :177.0 Min. :177.0
## 1st Qu.:205.7 1st Qu.:177.0 1st Qu.:177.0
## Median :206.8 Median :177.0 Median :178.0
## Mean :207.4 Mean :177.5 Mean :177.6
## 3rd Qu.:208.7 3rd Qu.:178.0 3rd Qu.:178.0
## Max. :227.4 Max. :178.0 Max. :178.0
## NA's :2 NA's :1 NA's :1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## Min. :38.89 Min. : 7.500 Min. : 7.500
## 1st Qu.:44.89 1st Qu.: 8.700 1st Qu.: 9.000
## Median :45.73 Median : 9.100 Median : 9.400
## Mean :45.66 Mean : 9.179 Mean : 9.386
## 3rd Qu.:46.52 3rd Qu.: 9.550 3rd Qu.: 9.900
## Max. :49.36 Max. :11.600 Max. :11.500
## NA's :9 NA's :10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## Min. : 0.0 Min. :32.10 Min. :4701
## 1st Qu.: 0.0 1st Qu.:33.90 1st Qu.:4828
## Median : 0.0 Median :34.60 Median :4856
## Mean : 857.8 Mean :34.51 Mean :4854
## 3rd Qu.: 0.0 3rd Qu.:35.20 3rd Qu.:4882
## Max. :4549.0 Max. :38.60 Max. :5055
## NA's :1 NA's :1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## Min. :5904 Min. : 0 Min. :31.30
## 1st Qu.:6010 1st Qu.:4561 1st Qu.:33.50
## Median :6032 Median :4588 Median :34.40
## Mean :6039 Mean :4566 Mean :34.34
## 3rd Qu.:6061 3rd Qu.:4619 3rd Qu.:35.10
## Max. :6233 Max. :4852 Max. :40.00
##
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## Min. : 0 Min. :5890 Min. : 0
## 1st Qu.:4813 1st Qu.:6001 1st Qu.:4553
## Median :4835 Median :6022 Median :4582
## Mean :4810 Mean :6028 Mean :4556
## 3rd Qu.:4862 3rd Qu.:6050 3rd Qu.:4610
## Max. :4971 Max. :6146 Max. :4759
##
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## Min. :-1.8000 Min. : 0.000 Min. :0.000
## 1st Qu.:-0.6000 1st Qu.: 3.000 1st Qu.:2.000
## Median :-0.3000 Median : 5.000 Median :3.000
## Mean :-0.1642 Mean : 5.406 Mean :3.017
## 3rd Qu.: 0.0000 3rd Qu.: 8.000 3rd Qu.:4.000
## Max. : 3.6000 Max. :12.000 Max. :6.000
## NA's :1 NA's :1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## Min. : 0.000 Min. : 0 Min. : 0
## 1st Qu.: 4.000 1st Qu.:4832 1st Qu.:6020
## Median : 8.000 Median :4855 Median :6047
## Mean : 8.834 Mean :4828 Mean :6016
## 3rd Qu.:14.000 3rd Qu.:4877 3rd Qu.:6070
## Max. :23.000 Max. :4990 Max. :6161
## NA's :1 NA's :5 NA's :5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.:4560 1st Qu.: 0.000 1st Qu.:19.70
## Median :4587 Median :10.400 Median :19.90
## Mean :4563 Mean : 6.592 Mean :20.01
## 3rd Qu.:4609 3rd Qu.:10.750 3rd Qu.:20.40
## Max. :4710 Max. :11.500 Max. :22.00
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## Min. : 0.000 Min. : 0.00 Min. :143.0
## 1st Qu.: 8.800 1st Qu.:70.10 1st Qu.:155.0
## Median : 9.100 Median :70.80 Median :158.0
## Mean : 9.161 Mean :70.18 Mean :158.5
## 3rd Qu.: 9.700 3rd Qu.:71.40 3rd Qu.:162.0
## Max. :11.200 Max. :72.50 Max. :173.0
## NA's :5 NA's :5
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## Min. :56.00 Min. :2.300 Min. :463.0
## 1st Qu.:62.00 1st Qu.:2.500 1st Qu.:490.0
## Median :64.00 Median :2.500 Median :495.0
## Mean :63.54 Mean :2.494 Mean :495.6
## 3rd Qu.:65.00 3rd Qu.:2.500 3rd Qu.:501.5
## Max. :70.00 Max. :2.600 Max. :522.0
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## Min. :0.01700 Min. :0.000 Min. :0.000
## 1st Qu.:0.01900 1st Qu.:0.700 1st Qu.:2.000
## Median :0.02000 Median :1.000 Median :3.000
## Mean :0.01957 Mean :1.014 Mean :2.534
## 3rd Qu.:0.02000 3rd Qu.:1.300 3rd Qu.:3.000
## Max. :0.02200 Max. :2.300 Max. :3.000
## NA's :5
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## Min. :0.000 Min. :0.00000 Min. :0.00000
## 1st Qu.:7.100 1st Qu.:0.00000 1st Qu.:0.00000
## Median :7.200 Median :0.00000 Median :0.00000
## Mean :6.851 Mean :0.01771 Mean :0.02371
## 3rd Qu.:7.300 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7.500 Max. :0.10000 Max. :0.20000
## NA's :1 NA's :1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## Min. : 0.00 Min. : 0.0000 Min. :0.000
## 1st Qu.:11.40 1st Qu.: 0.6000 1st Qu.:1.800
## Median :11.60 Median : 0.8000 Median :1.900
## Mean :11.21 Mean : 0.9119 Mean :1.805
## 3rd Qu.:11.70 3rd Qu.: 1.0250 3rd Qu.:1.900
## Max. :12.10 Max. :11.0000 Max. :2.100
##
## ManufacturingProcess45
## Min. :0.000
## 1st Qu.:2.100
## Median :2.200
## Mean :2.138
## 3rd Qu.:2.300
## Max. :2.600
##
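K-nearest neighbors imputation, via the preProcess function, is used to fill in the missing values. Note that knnImpute also centers and scales every column, including Yield, so all later error metrics are in standardized units.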
(cmp_knn_impute <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute')))
## Created from 152 samples and 58 variables
##
## Pre-processing:
## - centered (58)
## - ignored (0)
## - 5 nearest neighbor imputation (58)
## - scaled (58)
cmp_df <- predict(cmp_knn_impute, ChemicalManufacturingProcess)
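The idea behind this step can be sketched in base R: standardize the columns, then fill each missing entry with the average of that column over the k nearest complete rows, where distance is computed on the columns the incomplete row does have. This is a simplified illustration under those assumptions, with a hypothetical function name knn_impute_sketch and simulated data; caret's own implementation may differ in details.

```r
# Hypothetical k-nearest-neighbor imputation on a numeric matrix
knn_impute_sketch <- function(x, k = 5) {
  x <- scale(x)  # center and scale each column; NAs are skipped
  complete <- x[complete.cases(x), , drop = FALSE]
  for (i in which(!complete.cases(x))) {
    miss <- is.na(x[i, ])
    # Euclidean distance to each complete row, using observed columns only
    diffs <- sweep(complete[, !miss, drop = FALSE], 2, x[i, !miss])
    d <- sqrt(rowSums(diffs^2))
    nn <- order(d)[seq_len(min(k, nrow(complete)))]
    # Replace missing entries with the neighbors' column means
    x[i, miss] <- colMeans(complete[nn, miss, drop = FALSE])
  }
  x
}

# Simulated data with two missing entries
set.seed(1)
m <- matrix(rnorm(60), nrow = 20, ncol = 3)
m[3, 2] <- NA
m[7, 1] <- NA
imputed <- knn_impute_sketch(m, k = 5)
anyNA(imputed)
```

Only rows with no missing values serve as neighbor candidates here, which is consistent with the "Created from 152 samples" line in the output above.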
We no longer see any NA’s.
summary(cmp_df)
## Yield BiologicalMaterial01 BiologicalMaterial02
## Min. :-2.6692 Min. :-2.5653 Min. :-2.1858
## 1st Qu.:-0.7716 1st Qu.:-0.6078 1st Qu.:-0.7457
## Median :-0.1119 Median :-0.1491 Median :-0.1484
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7035 3rd Qu.: 0.6423 3rd Qu.: 0.7557
## Max. : 3.3394 Max. : 3.3597 Max. : 2.2459
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## Min. :-2.6830 Min. :-1.6731 Min. :-2.90576
## 1st Qu.:-0.6811 1st Qu.:-0.6222 1st Qu.:-0.73944
## Median :-0.1212 Median :-0.1405 Median :-0.05891
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6804 3rd Qu.: 0.4907 3rd Qu.: 0.70568
## Max. : 2.6355 Max. : 6.0523 Max. : 3.38985
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## Min. :-2.2184 Min. :-0.1313 Min. :-2.38535
## 1st Qu.:-0.7622 1st Qu.:-0.1313 1st Qu.:-0.64225
## Median :-0.1202 Median :-0.1313 Median : 0.02249
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6499 3rd Qu.:-0.1313 3rd Qu.: 0.56906
## Max. : 2.7948 Max. : 7.5723 Max. : 2.43034
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## Min. :-3.39629 Min. :-1.7202 Min. :-2.3116
## 1st Qu.:-0.59627 1st Qu.:-0.5685 1st Qu.:-0.6505
## Median :-0.03627 Median :-0.1513 Median :-0.1811
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67428 3rd Qu.: 0.3161 3rd Qu.: 0.5491
## Max. : 2.96246 Max. : 6.7920 Max. : 2.4431
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## Min. :-2.3914 Min. :-6.149703 Min. :-1.969253
## 1st Qu.:-0.6074 1st Qu.:-0.223563 1st Qu.: 0.308956
## Median :-0.1033 Median : 0.105667 Median : 0.509627
## Mean : 0.0000 Mean : 0.001224 Mean : 0.009518
## 3rd Qu.: 0.7112 3rd Qu.: 0.503487 3rd Qu.: 0.568648
## Max. : 2.5986 Max. : 1.587202 Max. : 0.686690
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## Min. :-3.10582 Min. :-3.323233 Min. :-2.577803
## 1st Qu.:-0.42705 1st Qu.:-0.613828 1st Qu.:-0.487046
## Median : 0.37658 Median : 0.342432 Median :-0.086583
## Mean : 0.04123 Mean : 0.003213 Mean :-0.002534
## 3rd Qu.: 0.46587 3rd Qu.: 0.661186 3rd Qu.: 0.230347
## Max. : 2.69818 Max. : 2.254953 Max. : 5.686954
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## Min. :-1.630631 Min. :-0.9580199 Min. :-1.111973
## 1st Qu.:-0.630408 1st Qu.:-0.9580199 1st Qu.:-1.111973
## Median :-0.222910 Median :-0.9580199 Median : 0.894164
## Mean :-0.006574 Mean :-0.0009072 Mean :-0.001759
## 3rd Qu.: 0.480950 3rd Qu.: 1.0378549 3rd Qu.: 0.894164
## Max. : 7.408415 Max. : 1.0378549 Max. : 0.894164
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## Min. :-4.37787 Min. :-2.18999 Min. :-2.63442
## 1st Qu.:-0.49799 1st Qu.:-0.62482 1st Qu.:-0.53867
## Median : 0.04519 Median :-0.10310 Median : 0.02020
## Mean : 0.00000 Mean : 0.02156 Mean : 0.03163
## 3rd Qu.: 0.55281 3rd Qu.: 0.54906 3rd Qu.: 0.71878
## Max. : 2.39252 Max. : 3.15768 Max. : 2.95425
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## Min. :-0.480694 Min. :-2.37172 Min. :-2.803712
## 1st Qu.:-0.480694 1st Qu.:-0.59881 1st Qu.:-0.488202
## Median :-0.480694 Median : 0.09066 Median : 0.029921
## Mean :-0.002731 Mean : 0.00000 Mean :-0.004071
## 3rd Qu.:-0.480694 3rd Qu.: 0.68163 3rd Qu.: 0.520534
## Max. : 2.068439 Max. : 4.03046 Max. : 3.688885
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## Min. :-2.3137 Min. :-12.98219 Min. :-2.43850
## 1st Qu.:-0.4960 1st Qu.: -0.01436 1st Qu.:-0.67597
## Median :-0.1273 Median : 0.06312 Median : 0.04507
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.3786 3rd Qu.: 0.15126 3rd Qu.: 0.60587
## Max. : 3.3283 Max. : 0.81376 Max. : 4.53150
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## Min. :-13.08836 Min. :-3.0321 Min. :-13.05542
## 1st Qu.: 0.00903 1st Qu.:-0.6022 1st Qu.: -0.01063
## Median : 0.06890 Median :-0.1360 Median : 0.07318
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.14237 3rd Qu.: 0.4838 3rd Qu.: 0.15197
## Max. : 0.43899 Max. : 2.5846 Max. : 0.58033
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## Min. :-2.1018 Min. :-1.6230324 Min. :-1.814768
## 1st Qu.:-0.5599 1st Qu.:-0.7223009 1st Qu.:-0.611797
## Median :-0.1745 Median :-0.1218132 Median :-0.010311
## Mean : 0.0000 Mean : 0.0003314 Mean : 0.004726
## 3rd Qu.: 0.2110 3rd Qu.: 0.7789183 3rd Qu.: 0.591175
## Max. : 4.8365 Max. : 1.9798937 Max. : 1.794146
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## Min. :-1.523304 Min. :-12.927496 Min. :-12.940454
## 1st Qu.:-0.833580 1st Qu.: 0.002208 1st Qu.: 0.008505
## Median :-0.143857 Median : 0.069146 Median : 0.063251
## Mean : 0.005061 Mean : 0.000105 Mean : 0.000404
## 3rd Qu.: 0.890729 3rd Qu.: 0.128720 3rd Qu.: 0.115417
## Max. : 2.442608 Max. : 0.433287 Max. : 0.312785
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## Min. :-12.888994 Min. :-1.25583 Min. :-12.026718
## 1st Qu.: 0.000681 1st Qu.:-1.25583 1st Qu.: -0.186978
## Median : 0.066362 Median : 0.72551 Median : -0.066778
## Mean : 0.001121 Mean :-0.02868 Mean : -0.003263
## 3rd Qu.: 0.131337 3rd Qu.: 0.78266 3rd Qu.: 0.233723
## Max. : 0.416660 Max. : 0.93507 Max. : 1.195326
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## Min. :-9.38589 Min. :-12.632749 Min. :-2.86552
## 1st Qu.:-0.37026 1st Qu.: -0.015263 1st Qu.:-0.64216
## Median : 0.03954 Median : 0.110732 Median :-0.08632
## Mean : 0.01114 Mean : -0.001722 Mean : 0.00000
## 3rd Qu.: 0.55179 3rd Qu.: 0.218728 3rd Qu.: 0.65480
## Max. : 2.08855 Max. : 0.416720 Max. : 2.69287
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## Min. :-3.037737 Min. :-3.558813 Min. :-3.01270
## 1st Qu.:-0.621676 1st Qu.: 0.118269 1st Qu.:-0.51725
## Median : 0.183677 Median : 0.118269 Median :-0.05513
## Mean :-0.005764 Mean :-0.009176 Mean :-0.02362
## 3rd Qu.: 0.586354 3rd Qu.: 0.118269 3rd Qu.: 0.49941
## Max. : 2.599738 Max. : 1.956810 Max. : 2.44032
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## Min. :-2.944307 Min. :-2.27741 Min. :-3.9024
## 1st Qu.:-0.655777 1st Qu.:-0.70467 1st Qu.:-0.8225
## Median :-0.083645 Median :-0.03064 Median : 0.7175
## Mean :-0.008228 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.488487 3rd Qu.: 0.64339 3rd Qu.: 0.7175
## Max. : 2.777017 Max. : 2.89017 Max. : 0.7175
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## Min. :-4.5508 Min. :-0.4626528 Min. :-0.440588
## 1st Qu.: 0.1653 1st Qu.:-0.4626528 1st Qu.:-0.440588
## Median : 0.2317 Median :-0.4626528 Median :-0.440588
## Mean : 0.0000 Mean : 0.0003392 Mean :-0.000392
## 3rd Qu.: 0.2982 3rd Qu.:-0.4626528 3rd Qu.:-0.440588
## Max. : 0.4310 Max. : 2.1490969 Max. : 3.275213
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## Min. :-5.77163 Min. :-1.0506 Min. :-5.60583
## 1st Qu.: 0.09979 1st Qu.:-0.3594 1st Qu.:-0.01588
## Median : 0.20280 Median :-0.1290 Median : 0.29467
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.25430 3rd Qu.: 0.1303 3rd Qu.: 0.29467
## Max. : 0.46031 Max. :11.6224 Max. : 0.91578
## ManufacturingProcess45
## Min. :-5.25447
## 1st Qu.:-0.09356
## Median : 0.15220
## Mean : 0.00000
## 3rd Qu.: 0.39796
## Max. : 1.13523
dim(cmp_df)
## [1] 176 58
Next, nearZeroVar is applied. One predictor is dropped, leaving 56 predictors plus Yield.
cmp_df2 <- cmp_df[, -nearZeroVar(cmp_df)]
dim(cmp_df2)
## [1] 176 57
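The data are split 80/20 into training and test sets, and a PLS model is tuned with 10-fold cross-validation. The highest cross-validated R^2 is 0.5754, at ncomp = 3, with an RMSE of 0.7948 and an MAE of 0.6188 (in standardized Yield units). Note that ncomp = 1 has a marginally lower RMSE (0.7836).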
set.seed(1)
train_selection <- createDataPartition(cmp_df2$Yield, times = 1, p = .80, list = FALSE)
train_x2 <- cmp_df2[train_selection, ][, -c(1)] #remove Yield
test_x2 <- cmp_df2[-train_selection, ][, -c(1)] #remove Yield
train_y2 <- cmp_df2[train_selection, ]$Yield
test_y2 <- cmp_df2[-train_selection, ]$Yield
(PLS_fit2 <- train(x = train_x2, y = train_y2,
method = "pls",
metric = "Rsquared",
tuneLength = 25,
trControl = trainControl(method = "cv", number=10),
preProcess = c('center', 'scale')
))
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 129, 130, 132, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.7835955 0.4798923 0.6435414
## 2 0.9037000 0.5302901 0.6625127
## 3 0.7947649 0.5753564 0.6188437
## 4 0.9309008 0.5337241 0.6538771
## 5 1.0876695 0.5205717 0.6974711
## 6 1.1939811 0.5064324 0.7262966
## 7 1.3384657 0.4934207 0.7748374
## 8 1.4551234 0.4911846 0.8123308
## 9 1.6262222 0.4866754 0.8584908
## 10 1.8605613 0.4732849 0.9097498
## 11 2.0180454 0.4569387 0.9452939
## 12 2.2476387 0.4465673 1.0009581
## 13 2.3997689 0.4432034 1.0361214
## 14 2.5216608 0.4399495 1.0769015
## 15 2.6136424 0.4415797 1.1110390
## 16 2.6609321 0.4413105 1.1272948
## 17 2.6973758 0.4430938 1.1321804
## 18 2.8105843 0.4439771 1.1693947
## 19 2.8235256 0.4417006 1.1698933
## 20 2.8828360 0.4395277 1.1837202
## 21 2.9284335 0.4322490 1.2039932
## 22 2.9829845 0.4253542 1.2243262
## 23 3.1163856 0.4173888 1.2704676
## 24 3.2500373 0.4069647 1.3146505
## 25 3.3925822 0.4039123 1.3556809
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.
plot(PLS_fit2)
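The test-set R^2 is 0.6516, which is higher than the cross-validated R^2 of 0.5754 on the training set; the test-set RMSE (0.5981) is likewise lower than the cross-validated RMSE (0.7948). With only 32 test samples, this estimate is noisy.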
PLS_predict2 <- predict(PLS_fit2, newdata=test_x2)
(postResample(pred=PLS_predict2, obs=test_y2))
## RMSE Rsquared MAE
## 0.5980589 0.6516400 0.4721775
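Because the imputation step centered and scaled Yield along with the predictors (the preProcess output reports 58 centered and scaled variables), these errors are in standard-deviation units. A rough conversion back to percent yield can be made from the two summaries above: the raw minimum of 35.25 and raw mean of 40.18 correspond to a standardized minimum of -2.6692. The variable names below are assumptions, and the result is approximate because the printed summaries are rounded.

```r
# Values read from the raw and standardized summaries of Yield
yield_min_raw  <- 35.25
yield_mean_raw <- 40.18
yield_min_std  <- -2.6692

# Standardized value = (raw - mean) / sd, so sd = (raw - mean) / standardized
yield_sd <- (yield_min_raw - yield_mean_raw) / yield_min_std

# Convert the test-set RMSE from standardized units to percent yield
test_rmse_std <- 0.5980589
test_rmse_raw <- test_rmse_std * yield_sd
round(c(yield_sd = yield_sd, test_rmse_raw = test_rmse_raw), 2)
```

Under these assumptions the test-set RMSE is roughly 1.1 percentage points of yield, which is easier to weigh against the stated value of a 1% yield improvement.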
As the variable importance plot below shows, manufacturing process predictors dominate the top 20.
plot(varImp(PLS_fit2, scale = FALSE), top=20, scales = list(y = list(cex = 0.8)))
The top 3 predictors are ManufacturingProcess32, ManufacturingProcess36, and ManufacturingProcess13.
ManufacturingProcess32 has a somewhat strong positive correlation with Yield, so increasing this process setting will likely increase the yield. ManufacturingProcess32 and ManufacturingProcess36 appear to be negatively correlated with each other. ManufacturingProcess36 and ManufacturingProcess13 are moderately correlated with Yield.
top_vars <- c('ManufacturingProcess32', 'ManufacturingProcess36', 'ManufacturingProcess13', 'Yield')
correlation <- cor(cmp_df2[, top_vars])  # base indexing avoids needing dplyr::select
corrplot::corrplot(correlation, method='pie', type="upper")