Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Start R and use these commands to load the data:
data(permeability)
summary(permeability)
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
The matrix fingerprints contains the 1,107 binary
molecular predictors for the 165 compounds, while
permeability contains permeability response.
str(fingerprints)
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
The fingerprint predictors indicate the presence or absence of
substructures of a molecule and are often sparse meaning that relatively
few of the molecules contain each substructure. Filter out the
predictors that have low frequencies using the nearZeroVar
function from the caret package. How many predictors are
left for modeling?
After removing the predictors with low frequencies there appear to be 388 predictors remaining out of the original 1107.
# identify predictors with low frequencies
lowFreqPredictors <- nearZeroVar(fingerprints)
# remove the above set from the data and store the result in a dataframe
fingerprintsPred <- fingerprints[,-lowFreqPredictors]
# see what is left
str(fingerprintsPred)
## num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...
Split the data into a training and a test set, pre-process the data,
and tune a PLS model. How many latent variables are optimal
and what is the corresponding resampled estimate of \(R^2\)?
Based on the PLS model created using the training set 3 components seem optimal with a corresponding estimate of \(R^2 = 0.4657791\).
set.seed(132435)
# pre-process the data to center and scale it
pred_prep <- preProcess(fingerprintsPred, method = c("center","scale"))
resp_preo <- predict(pred_prep, fingerprintsPred)
# create a set of randomly selected training rows containing 75% of the data
trainingRows <- createDataPartition(permeability, p = 0.75, list = FALSE)
# use the training rows to split the data into training and test sets for the predictor and the response
pred_train <- resp_preo[trainingRows,]
pred_test <- resp_preo[-trainingRows,]
resp_train <- permeability[trainingRows,]
resp_test <- permeability[-trainingRows,]
# create the PLS model with the training data to maximize covariance between the
# predictor and target variables using 10-fold cross validation and measure using R2
pls_model <- train(pred_train, resp_train
, method = "pls"
, metric = "Rsquared"
, tuneLength = 20
, trControl = trainControl(method = "cv", number = 10))
pls_model
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 113, 113, 111, 112, 113, 112, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.26533 0.2980803 10.009040
## 2 12.22901 0.4302612 8.707710
## 3 12.15172 0.4657791 9.036021
## 4 12.40468 0.4433031 9.404771
## 5 12.19534 0.4458150 9.080123
## 6 12.22558 0.4285594 9.156984
## 7 12.21694 0.4349349 9.241427
## 8 12.14031 0.4492858 9.415098
## 9 12.51810 0.4403712 9.670983
## 10 12.77960 0.4338531 9.857639
## 11 13.29300 0.4067931 10.252781
## 12 13.83929 0.3754990 10.843743
## 13 14.32021 0.3516469 11.032974
## 14 14.53883 0.3494497 11.059203
## 15 14.75606 0.3444324 11.176290
## 16 15.29865 0.3228011 11.459704
## 17 15.85285 0.3035451 11.792657
## 18 16.11621 0.3059340 11.914719
## 19 16.25239 0.2954203 12.019663
## 20 16.63190 0.2699180 12.227163
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.
We can see visually that the model with three variables returns the highest R-squared value.
ggplot(pls_model)
Predict the response for the test set. What is the test set estimate of \(R^2\)?
The \(R^2\) estimate for the test set is 0.5520256, which is even better than the training result.
# predict the test set response for the PLS model
pls_test <- predict(pls_model, newdata = pred_test)
postResample(pred=pls_test, obs = resp_test)
## RMSE Rsquared MAE
## 11.6075535 0.5520256 9.0000542
Try building other models discussed in this chapter. Do any have better predictive performance?
Trying principal component analysis the best fit is 8 components with \(R^2 = 0.4032165\).
set.seed(45678)
pcrFit <- train(x=pred_train,
y=resp_train,
method='pcr',
metric = "Rsquared",
tuneLength = 10,
trControl = trainControl('cv'),
preProc=c('center','scale')
)
pcrFit
## Principal Component Analysis
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 113, 113, 111, 112, 113, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 14.54915 0.1259773 11.224252
## 2 14.61353 0.1240144 11.302737
## 3 13.25980 0.3219973 10.009028
## 4 12.30918 0.4003002 9.298525
## 5 12.68988 0.3837063 9.577919
## 6 12.75886 0.3819379 9.421907
## 7 12.71902 0.3736417 9.484800
## 8 12.53080 0.4032165 9.171209
## 9 12.72833 0.3631219 9.337410
## 10 12.46369 0.3861440 9.127543
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 8.
ggplot(pcrFit)
Using ridge regression the best fit is where labmda = 0.1 with an \(R^2 = 0.4438373\).
set.seed(45689)
ridgeGrid <- data.frame(lambda = seq(0, .1, length = 10))
ridgeFit <- train(x=pred_train,
y=resp_train,
method='ridge',
metric = "Rsquared",
tuneGrid = ridgeGrid,
trControl = trainControl('cv'),
preProc=c('center','scale')
)
ridgeFit
## Ridge Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 113, 113, 113, 113, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 28.08335 0.3914377 15.374091
## 0.01111111 14.88808 0.3610279 11.094464
## 0.02222222 14.02495 0.3962475 10.345275
## 0.03333333 13.57048 0.4129958 10.032558
## 0.04444444 13.33585 0.4227866 9.867680
## 0.05555556 13.17754 0.4299039 9.770660
## 0.06666667 13.07850 0.4343141 9.718247
## 0.07777778 13.00843 0.4381398 9.675278
## 0.08888889 12.94566 0.4412164 9.636646
## 0.10000000 12.89890 0.4438373 9.606251
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.1.
ggplot(ridgeFit)
Would you recommend any of your models to replace the permeability laboratory experiment?
The PLS model appears to have the best performance where \(R^2 = 0.4657791\).
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
data("ChemicalManufacturingProcess")
chem_mfg <- ChemicalManufacturingProcess
The matrix processPredictors contains the 57 predictors
(12 describing the input biological material and 45 describing the
process predictors) for the 176 manufacturing runs. yield
contains the percent yield for each run.
Looking at the described strucutre of the data.
str(chem_mfg)
## 'data.frame': 176 obs. of 58 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.6 14.6 14.6 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
## $ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
## $ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
## $ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num NA 210 207 213 206 ...
## $ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
## $ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
## $ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
## $ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
## $ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
## $ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
## $ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Since biological predictors cannot be changed, it is assumed that we can only impute missing values on the manufacturing process predictors. Checking the columns that contain the word ‘Biological’ in the label, there are none that have missing values.
# check the biologial predictors for missingness
bio <- chem_mfg %>%
select(starts_with("Biological"))
names(which(colSums(is.na(bio))>0))
## character(0)
Checking the ‘Manufacturing’ variables for missingness we find 28
variables that have NA values.
# check the Manufacturing predictors for missingness
mfg <- chem_mfg %>%
select(starts_with("Manufacturing"))
names(which(colSums(is.na(mfg))>0))
## [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
## [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"
## [7] "ManufacturingProcess07" "ManufacturingProcess08" "ManufacturingProcess10"
## [10] "ManufacturingProcess11" "ManufacturingProcess12" "ManufacturingProcess14"
## [13] "ManufacturingProcess22" "ManufacturingProcess23" "ManufacturingProcess24"
## [16] "ManufacturingProcess25" "ManufacturingProcess26" "ManufacturingProcess27"
## [19] "ManufacturingProcess28" "ManufacturingProcess29" "ManufacturingProcess30"
## [22] "ManufacturingProcess31" "ManufacturingProcess33" "ManufacturingProcess34"
## [25] "ManufacturingProcess35" "ManufacturingProcess36" "ManufacturingProcess40"
## [28] "ManufacturingProcess41"
For most of the variables with NA’s the median and mean are very close, so imputing the mean would be the best approach. Here are the stats prior to imputation.
summary(mfg)
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## NA's :1 NA's :3 NA's :15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.2 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## NA's :1 NA's :1 NA's :2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## NA's :1 NA's :1
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.400 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## NA's :1
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
##
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
##
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
## Median :4855 Median :6047 Median :4587
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :19.90 Median : 9.100
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.80 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## NA's :5 NA's :5
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.0 Median :0.02000
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
##
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## NA's :1 NA's :1
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
##
and here are the stats after imputation. The distributions hardly
changed at all, and now there are no NA values.
mfg <- mfg %>%
mutate(ManufacturingProcess01 = ifelse(is.na(ManufacturingProcess01), mean(ManufacturingProcess01, na.rm = TRUE), ManufacturingProcess01)) %>%
mutate(ManufacturingProcess02 = ifelse(is.na(ManufacturingProcess02), mean(ManufacturingProcess02, na.rm = TRUE), ManufacturingProcess02)) %>%
mutate(ManufacturingProcess03 = ifelse(is.na(ManufacturingProcess03), mean(ManufacturingProcess03, na.rm = TRUE), ManufacturingProcess03)) %>%
mutate(ManufacturingProcess04 = ifelse(is.na(ManufacturingProcess04), mean(ManufacturingProcess04, na.rm = TRUE), ManufacturingProcess04)) %>%
mutate(ManufacturingProcess05 = ifelse(is.na(ManufacturingProcess05), mean(ManufacturingProcess05, na.rm = TRUE), ManufacturingProcess05)) %>%
mutate(ManufacturingProcess06 = ifelse(is.na(ManufacturingProcess06), mean(ManufacturingProcess06, na.rm = TRUE), ManufacturingProcess06)) %>%
mutate(ManufacturingProcess07 = ifelse(is.na(ManufacturingProcess07), mean(ManufacturingProcess07, na.rm = TRUE), ManufacturingProcess07)) %>%
mutate(ManufacturingProcess08 = ifelse(is.na(ManufacturingProcess08), mean(ManufacturingProcess08, na.rm = TRUE), ManufacturingProcess08)) %>%
mutate(ManufacturingProcess10 = ifelse(is.na(ManufacturingProcess10), mean(ManufacturingProcess10, na.rm = TRUE), ManufacturingProcess10)) %>%
mutate(ManufacturingProcess11 = ifelse(is.na(ManufacturingProcess11), mean(ManufacturingProcess11, na.rm = TRUE), ManufacturingProcess11)) %>%
mutate(ManufacturingProcess12 = ifelse(is.na(ManufacturingProcess12), mean(ManufacturingProcess12, na.rm = TRUE), ManufacturingProcess12)) %>%
mutate(ManufacturingProcess14 = ifelse(is.na(ManufacturingProcess14), mean(ManufacturingProcess14, na.rm = TRUE), ManufacturingProcess14)) %>%
mutate(ManufacturingProcess22 = ifelse(is.na(ManufacturingProcess22), mean(ManufacturingProcess22, na.rm = TRUE), ManufacturingProcess22)) %>%
mutate(ManufacturingProcess23 = ifelse(is.na(ManufacturingProcess23), mean(ManufacturingProcess23, na.rm = TRUE), ManufacturingProcess23)) %>%
mutate(ManufacturingProcess24 = ifelse(is.na(ManufacturingProcess24), mean(ManufacturingProcess24, na.rm = TRUE), ManufacturingProcess24)) %>%
mutate(ManufacturingProcess25 = ifelse(is.na(ManufacturingProcess25), mean(ManufacturingProcess25, na.rm = TRUE), ManufacturingProcess25)) %>%
mutate(ManufacturingProcess26 = ifelse(is.na(ManufacturingProcess26), mean(ManufacturingProcess26, na.rm = TRUE), ManufacturingProcess26)) %>%
mutate(ManufacturingProcess27 = ifelse(is.na(ManufacturingProcess27), mean(ManufacturingProcess27, na.rm = TRUE), ManufacturingProcess27)) %>%
mutate(ManufacturingProcess28 = ifelse(is.na(ManufacturingProcess28), mean(ManufacturingProcess28, na.rm = TRUE), ManufacturingProcess28)) %>%
mutate(ManufacturingProcess29 = ifelse(is.na(ManufacturingProcess29), mean(ManufacturingProcess29, na.rm = TRUE), ManufacturingProcess29)) %>%
mutate(ManufacturingProcess30 = ifelse(is.na(ManufacturingProcess30), mean(ManufacturingProcess30, na.rm = TRUE), ManufacturingProcess30)) %>%
mutate(ManufacturingProcess31 = ifelse(is.na(ManufacturingProcess31), mean(ManufacturingProcess31, na.rm = TRUE), ManufacturingProcess31)) %>%
mutate(ManufacturingProcess33 = ifelse(is.na(ManufacturingProcess33), mean(ManufacturingProcess33, na.rm = TRUE), ManufacturingProcess33)) %>%
mutate(ManufacturingProcess34 = ifelse(is.na(ManufacturingProcess34), mean(ManufacturingProcess34, na.rm = TRUE), ManufacturingProcess34)) %>%
mutate(ManufacturingProcess35 = ifelse(is.na(ManufacturingProcess35), mean(ManufacturingProcess35, na.rm = TRUE), ManufacturingProcess35)) %>%
mutate(ManufacturingProcess36 = ifelse(is.na(ManufacturingProcess36), mean(ManufacturingProcess36, na.rm = TRUE), ManufacturingProcess36)) %>%
mutate(ManufacturingProcess40 = ifelse(is.na(ManufacturingProcess40), mean(ManufacturingProcess40, na.rm = TRUE), ManufacturingProcess40)) %>%
mutate(ManufacturingProcess41 = ifelse(is.na(ManufacturingProcess41), mean(ManufacturingProcess41, na.rm = TRUE), ManufacturingProcess41))
summary(mfg)
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.00 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.12 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.4 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.7 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.386 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.500 3rd Qu.: 9.825 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4829 1st Qu.:6019 1st Qu.:4562
## Median :4854 Median :6045 Median :4586
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4876 3rd Qu.:6069 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :20.00 Median : 9.161
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.700 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.75 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.6 Median :0.01957
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.0 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
Put the data all back together.
chem_mfg <- cbind(chem_mfg$Yield,bio,mfg)
names(chem_mfg)[1] <- "Yield"
Check for low frequency values and remove them. Only one variable was removed
# identify predictors with low frequencies
lowFreqPredictors <- nearZeroVar(chem_mfg)
# remove the above set from the data and store the result in a dataframe
chem_mfgPred <- chem_mfg[,-lowFreqPredictors]
# see what is left
str(chem_mfgPred)
## 'data.frame': 176 obs. of 57 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.6 14.6 14.6 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num 11.2 0 0 0 10.7 ...
## $ ManufacturingProcess02: num 16.7 0 0 0 0 ...
## $ ManufacturingProcess03: num 1.54 1.54 1.54 1.54 1.54 ...
## $ ManufacturingProcess04: num 932 917 912 911 918 ...
## $ ManufacturingProcess05: num 1002 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num 207 210 207 213 206 ...
## $ ManufacturingProcess07: num 177 177 178 177 178 ...
## $ ManufacturingProcess08: num 178 178 178 177 178 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num 9.18 9.18 9.18 9.18 9.18 ...
## $ ManufacturingProcess11: num 9.39 9.39 9.39 9.39 9.39 ...
## $ ManufacturingProcess12: num 858 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num 5.41 3 4 5 8 ...
## $ ManufacturingProcess23: num 3.02 0 1 2 4 ...
## $ ManufacturingProcess24: num 8.83 3 4 5 18 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num 0.0177 0.1 0 0 0 ...
## $ ManufacturingProcess41: num 0.0237 0.15 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
## $ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
The optimal value of the performance metric is \(R^2 = 0.4919181\).
set.seed(86756453)
trainingRows <- sample(c(rep(0, 0.75 * nrow(chem_mfgPred)),
rep(1, 0.25 * nrow(chem_mfgPred))))
pred_train2 <- chem_mfgPred[trainingRows == 0,]
pred_test2 <- chem_mfgPred[trainingRows == 1,]
# create the PLS model with the training data to maximize covariance between the
# predictor and target variables using 10-fold cross validation and measure using R2
pls_model2 <- train(Yield ~ .
, pred_train2
, method = "pls"
, metric = "Rsquared"
, tuneLength = 20
, trControl = trainControl(method = "cv")
, preProc=c('center','scale')
)
pls_model2
## Partial Least Squares
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 119, 119, 120, 117, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.525766 0.4118582 1.196626
## 2 2.055335 0.4045343 1.286997
## 3 1.545976 0.4919181 1.101313
## 4 1.869902 0.4657163 1.214377
## 5 2.231519 0.4544057 1.316088
## 6 2.420712 0.4466847 1.360377
## 7 2.465001 0.4324138 1.403582
## 8 2.796541 0.4195232 1.514503
## 9 2.922994 0.4077897 1.584853
## 10 3.056973 0.4085545 1.642520
## 11 3.149211 0.3918894 1.672440
## 12 3.095849 0.3891789 1.647143
## 13 3.064822 0.3830482 1.657215
## 14 3.018326 0.3847448 1.648539
## 15 2.827756 0.3942972 1.596415
## 16 2.584247 0.4051255 1.518665
## 17 2.294588 0.4204892 1.430888
## 18 2.169118 0.4373245 1.377378
## 19 2.070201 0.4486404 1.345298
## 20 2.303367 0.3917599 1.443707
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.
…which can be seen graphically in the plot below.
ggplot(pls_model2)
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
The \(R^2\) estimate for the test set is 0.5412972. Once again the test set performed better than the training set.
# predict the test set response for the PLS model
pls_test2 <- predict(pls_model2, newdata = pred_test2)
postResample(pred=pls_test2, obs = pred_test2$Yield)
## RMSE Rsquared MAE
## 1.1952589 0.5412972 0.9793152
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
It appears that ManufacturingProcess 32 is the most important predictor in the model, and it appears that Manufacturing process predictors slighty edge out Biological predictors 12:8.
varImp(pls_model2)
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 84.70
## ManufacturingProcess17 84.28
## ManufacturingProcess36 83.77
## ManufacturingProcess09 79.59
## BiologicalMaterial02 56.38
## ManufacturingProcess06 54.10
## ManufacturingProcess12 53.67
## BiologicalMaterial06 53.18
## ManufacturingProcess11 52.93
## BiologicalMaterial08 52.76
## ManufacturingProcess33 51.15
## ManufacturingProcess34 50.89
## BiologicalMaterial04 50.23
## BiologicalMaterial12 49.49
## BiologicalMaterial11 47.99
## BiologicalMaterial01 47.12
## BiologicalMaterial03 46.01
## ManufacturingProcess28 43.95
## ManufacturingProcess04 41.07
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
The Yield response has a high correlation with quite a
few of the predictors, BiologicalMaterial12 and ManufacturingProcess33
being among the strongest. Given the correlations in the plot below,
manufacturing could experiment with adjusting the mix of processes to
optimize production.
predictors <- chem_mfgPred %>%
select('Yield', 'ManufacturingProcess32',
'ManufacturingProcess36','ManufacturingProcess13',
'ManufacturingProcess17','ManufacturingProcess09',
'ManufacturingProcess12','ManufacturingProcess11',
'ManufacturingProcess33','BiologicalMaterial02',
'BiologicalMaterial08','BiologicalMaterial06',
'BiologicalMaterial12')
corr_pred <- cor(predictors)
ggcorr(corr_pred,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
color = "royalblue",
layout.exp = 5,
low = "darkorange",
mid = "gray95",
high = "lightgreen",
name = "Correlation")