6.2, 6.3
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
## 'data.frame': 165 obs. of 1 variable:
## $ permeability: num 12.52 1.12 19.41 1.73 1.68 ...
6.2b Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
The number of predictors drops from 1107 to 388.
## [1] 1107
## [1] 388
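The filtering rule can be illustrated with a simplified base R sketch. It is not the caret implementation: the function name `near_zero_var` and the toy matrix are hypothetical, and the thresholds are assumed to mirror the `nearZeroVar` defaults (`freqCut = 95/5`, `uniqueCut = 10`).

```r
# Simplified sketch of a near-zero-variance filter (base R only).
# Assumption: thresholds mirror the caret::nearZeroVar defaults.
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(seq_len(ncol(x)), function(j) {
    counts <- sort(table(x[, j]), decreasing = TRUE)
    # Ratio of the most common value to the second most common value;
    # a constant column has no second value, so treat the ratio as infinite
    freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    # Number of distinct values as a percentage of the sample size
    pct_unique <- 100 * length(counts) / nrow(x)
    freq_ratio > freq_cut && pct_unique <= unique_cut
  }, logical(1))
  which(flagged)
}

set.seed(1)
n <- 200
toy <- cbind(
  balanced = rbinom(n, 1, 0.5),       # roughly even split of 0s and 1s
  rare     = c(rep(0, n - 2), 1, 1),  # almost always 0
  constant = rep(1, n)                # zero variance
)

drop_cols <- near_zero_var(toy)
keep_cols <- setdiff(seq_len(ncol(toy)), drop_cols)
filtered  <- toy[, keep_cols, drop = FALSE]
colnames(filtered)
```

On the real data the same idea is applied to the fingerprint matrix; only the count of surviving columns is reported above.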
6.2c Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
First, some investigation. There are no missing values. Multicollinearity is a serious issue here: 152 predictor pairs have a correlation of exactly 1. To model this data properly we would remove the redundant member of each such pair. The first 10 perfectly correlated pairs are:
## col1 col2 correlation
## 1 X16 X133 1
## 2 X16 X138 1
## 3 X25 X37 1
## 4 X25 X49 1
## 5 X25 X71 1
## 6 X25 X72 1
## 8 X37 X49 1
## 9 X37 X71 1
## 10 X37 X72 1
## 13 X49 X71 1
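A table like the one above can be produced by scanning the upper triangle of the correlation matrix. The following is a minimal sketch under assumptions: `high_cor_pairs` is a hypothetical helper, and the toy matrix stands in for the filtered fingerprints.

```r
# Sketch: list predictor pairs whose absolute correlation meets a cutoff.
high_cor_pairs <- function(x, cutoff = 0.9) {
  cm <- cor(x)
  # Look only above the diagonal so each unordered pair appears once
  idx <- which(upper.tri(cm) & abs(cm) >= cutoff, arr.ind = TRUE)
  data.frame(
    col1 = colnames(cm)[idx[, "row"]],
    col2 = colnames(cm)[idx[, "col"]],
    correlation = cm[idx],
    stringsAsFactors = FALSE
  )
}

set.seed(2)
a <- rnorm(50)
# X2 duplicates X1 and X4 is a linear function of X1; X3 is unrelated noise
toy <- cbind(X1 = a, X2 = a, X3 = rnorm(50), X4 = 2 * a + 1)

dup_pairs <- high_cor_pairs(toy, cutoff = 0.999)
dup_pairs
```

Calling the helper with `cutoff = 0.9` instead would give the looser count of highly correlated pairs.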
PCA shows that only 27 components explain 95% of the variance in the predictors.
Because so much of the predictor space collapses onto a few latent directions, PLS should suit this dataset. We now split the data into training and test sets and tune a PLS model.
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 21.19 34.57 39.41 43.95 46.92
## .outcome 30.06 46.23 56.31 62.99 69.42
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 120, 119, 119, 117, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.11181 0.3071258 9.869550
## 2 11.95488 0.4281023 8.627336
## 3 11.90545 0.4379496 8.980025
## 4 12.08244 0.4276741 9.010439
## 5 11.94569 0.4401998 8.775094
## 6 12.07100 0.4396672 8.871424
## 7 12.13576 0.4351574 9.175004
## 8 12.27294 0.4291498 9.316122
## 9 12.31745 0.4359112 9.406491
## 10 12.56289 0.4220734 9.438185
## 11 12.81099 0.4077647 9.634144
## 12 13.08730 0.3975245 9.990265
## 13 13.29408 0.3876000 10.101604
## 14 13.41767 0.3825028 10.079740
## 15 13.91744 0.3608912 10.370610
## 16 14.48368 0.3437767 10.801940
## 17 14.73114 0.3354244 10.943813
## 18 14.92742 0.3366052 11.041042
## 19 14.94452 0.3315317 10.935378
## 20 15.11929 0.3258842 11.063864
## 21 15.27214 0.3152969 11.149444
## 22 15.20330 0.3123669 11.082855
## 23 15.26548 0.3090357 11.116738
## 24 15.24906 0.3085545 11.128537
## 25 15.36545 0.3106150 11.197423
## 26 15.55322 0.3048577 11.316846
## 27 15.77497 0.2964688 11.447383
## 28 15.90463 0.2951896 11.543086
## 29 16.13207 0.2860465 11.601928
## 30 16.36892 0.2754329 11.714083
## 31 16.58035 0.2648424 11.802269
## 32 16.53970 0.2680147 11.728692
## 33 16.44082 0.2741822 11.679299
## 34 16.47567 0.2746826 11.729323
## 35 16.54971 0.2722948 11.817746
## 36 16.55575 0.2724576 11.897099
## 37 16.60075 0.2700206 11.997000
## 38 16.68565 0.2673892 12.159780
## 39 16.72789 0.2681962 12.204228
## 40 16.81429 0.2690302 12.291826
## 41 16.99750 0.2695893 12.444222
## 42 17.20090 0.2664292 12.592565
## 43 17.49048 0.2622320 12.776624
## 44 17.74335 0.2572482 12.927515
## 45 17.91479 0.2563532 13.044397
## 46 18.11550 0.2537098 13.168929
## 47 18.26521 0.2524919 13.277247
## 48 18.39843 0.2508626 13.378727
## 49 18.54046 0.2494562 13.478581
## 50 18.64519 0.2497707 13.564966
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.
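The highest cross-validated R2 is 0.44, obtained with 5 components.
6.2d Predict the response for the test set. What is the test set estimate of R2?
Surprisingly, the model performed better on the test set: R2 = 0.50. With only 32 test samples (165 observations minus the 133 used for training), this estimate is noisy, so the gap from the resampled 0.44 should not be over-interpreted.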
## [1] 0.5037753
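The report does not say which definition of R2 produced this number. As a sketch, two common definitions are shown below: the squared correlation between observed and predicted values (the one caret reports by default) and the traditional 1 - SSE/SST. The function names and the observed and predicted vectors are hypothetical.

```r
# Two common definitions of R-squared for a hold-out set.
r2_corr <- function(obs, pred) cor(obs, pred)^2
r2_trad <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical observed and predicted permeability values
obs  <- c(12.5, 1.1, 19.4, 1.7, 1.7, 0.6, 39.5, 3.0)
pred <- c(10.0, 3.0, 15.0, 4.0, 2.0, 2.5, 30.0, 6.0)

c(corr = r2_corr(obs, pred), trad = r2_trad(obs, pred))
```

The squared-correlation version ignores systematic bias in the predictions, so it is never smaller than the traditional version; the two agree only when the predictions need no linear recalibration.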
6.2e Try building other models discussed in this chapter. Do any have better predictive performance? Would you recommend any of your models to replace the permeability laboratory experiment?
We try LASSO. It performs better, with an R2 of 0.60, so I would possibly use this model instead of PLS.
However, I would most likely clean the predictor set first.
There are more predictors than observations (388 versus 165), so we cannot fit OLS as is. There are also over 2,000 unique pairs of predictors with correlations above 0.90, including 152 perfectly correlated pairs. If we were to remove predictors that add virtually no information, we might reduce the predictor count far enough to fit OLS. In any case, cleaning the dataset first would be the best option, whatever technique we use.
## Length Class Mode
## a0 1 -none- numeric
## beta 388 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## [1] 0.6026641
## [1] "Unique pairs of predictors: 2029"
6.3
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.
6.3b A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
The number of missing values is very low, although one predictor (ManufacturingProcess03) has 15 missing values, roughly 8.5% of the 176 runs. Taking a "simple is best" approach, we impute the median, since some predictors are skewed and the median is more robust than the mean.
The data shows a high degree of multicollinearity: 15 predictor pairs have correlations of 0.90 or higher, and 12 of those pairs involve manufacturing process predictors.
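Median imputation needs nothing beyond base R. A minimal sketch follows; `impute_median` is a hypothetical helper and the toy values are made up, with only the column names borrowed from the dataset.

```r
# Sketch: replace each missing value with the median of its column.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Hypothetical values; only the column names come from the dataset
toy <- data.frame(
  ManufacturingProcess01 = c(10.8, NA, 11.4, 12.2, 14.1),
  ManufacturingProcess03 = c(1.53, 1.54, NA, NA, 1.55)
)

imputed <- impute_median(toy)
imputed
```

In a stricter workflow the medians would be computed on the training rows only and then reused for the test rows, so that no test information leaks into pre-processing.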
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
## Median :39.97 Median :6.305 Median :55.09 Median :67.22
## Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
## Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
##
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## Min. : 9.38 Min. :13.24 Min. :40.60
## 1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
## Median :12.10 Median :18.49 Median :48.46
## Mean :12.35 Mean :18.60 Mean :48.91
## 3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
## Max. :23.09 Max. :24.85 Max. :59.38
##
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## Min. :100.0 Min. :15.88 Min. :11.44
## 1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
## Median :100.0 Median :17.51 Median :12.84
## Mean :100.0 Mean :17.49 Mean :12.85
## 3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
## Max. :100.8 Max. :19.14 Max. :14.08
##
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :2.710 Median :146.1 Median :20.12
## Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :6.870 Max. :158.7 Max. :22.21
##
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## NA's :1 NA's :3 NA's :15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.2 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## NA's :1 NA's :1 NA's :2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## NA's :1 NA's :1
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.400 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## NA's :1
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
##
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
##
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
## Median :4855 Median :6047 Median :4587
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :19.90 Median : 9.100
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.80 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## NA's :5 NA's :5
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.0 Median :0.02000
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
##
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## NA's :1 NA's :1
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
##
## col1 col2 correlation
## 1 BiologicalMaterial02 BiologicalMaterial06 0.9543113
## 2 BiologicalMaterial04 BiologicalMaterial10 0.9205870
## 5 BiologicalMaterial11 BiologicalMaterial12 0.9037209
## 7 ManufacturingProcess18 ManufacturingProcess20 0.9917474
## 9 ManufacturingProcess25 ManufacturingProcess26 0.9975405
## 10 ManufacturingProcess25 ManufacturingProcess27 0.9935088
## 11 ManufacturingProcess25 ManufacturingProcess29 0.9371902
## 12 ManufacturingProcess25 ManufacturingProcess31 0.9713249
## 14 ManufacturingProcess26 ManufacturingProcess27 0.9960843
## 15 ManufacturingProcess26 ManufacturingProcess29 0.9516584
## 16 ManufacturingProcess26 ManufacturingProcess31 0.9606619
## 19 ManufacturingProcess27 ManufacturingProcess29 0.9509045
## 20 ManufacturingProcess27 ManufacturingProcess31 0.9542242
## 27 ManufacturingProcess40 ManufacturingProcess41 0.9245071
## 29 ManufacturingProcess42 ManufacturingProcess44 0.9487495
6.3c Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
We choose LASSO regression for our model. LASSO has the advantage that it shrinks the coefficients of unneeded predictors to exactly zero, which removes them from the model.
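Why the lasso produces exact zeros can be seen in an idealized case. Assuming uncorrelated, standardized predictors (which this dataset does not satisfy), the lasso estimate is the least-squares coefficient pulled toward zero by the penalty and clipped at zero. The sketch below shows that soft-thresholding step; `soft_threshold`, the coefficient values, and `lambda` are all hypothetical.

```r
# Soft-thresholding: shrink by lambda, then clip at zero.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Hypothetical least-squares coefficients on standardized predictors
ols <- c(0.42, 0.18, -0.08, 0.02, -0.01)

shrunk <- soft_threshold(ols, lambda = 0.05)
shrunk
```

Coefficients smaller in magnitude than `lambda` become exactly zero, while larger ones are shrunk by `lambda`. Ridge regression, by contrast, rescales coefficients without zeroing them.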
To compare models we use RMSE, which is expressed in the units of the response and so gives an intuitive sense of the typical size of the errors.
The RMSE on the training set is 0.40.
## [1] "Train"
## [1] 0.4032026
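On the test set the RMSE is 0.56, somewhat higher than on the training set, as expected. The mean of Yield (about 40) is not the most telling yardstick; the spread is. Yield ranges from 35.25 to 46.34 with an interquartile range of about 2.7, so an RMSE of 0.56 in Yield units would indicate reasonably good predictions. Note, however, that the near-zero intercept in the coefficient listing for part (e) suggests the response may have been centered and scaled before fitting, in which case the RMSE is in standard-deviation units and should be judged against 1.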
## [1] "Test"
## [1] 0.5644791
## [1] "Mean of yield: 40.1765340909091"
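How to read an RMSE depends on the units of the response. The sketch below uses hypothetical yields and predictions to compute RMSE in raw units and in standardized units, and shows that the two differ by exactly the standard deviation used for scaling.

```r
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Hypothetical observed and predicted yields, in percent
obs  <- c(38.0, 42.4, 39.7, 41.2, 40.1)
pred <- c(38.5, 41.8, 40.0, 40.6, 40.4)

err_raw <- rmse(obs, pred)

# The same errors after centering and scaling with the observed mean and sd
mu <- mean(obs)
s  <- sd(obs)
err_std <- rmse((obs - mu) / s, (pred - mu) / s)

# Multiplying the standardized RMSE by the sd recovers the raw-unit RMSE
c(raw = err_raw, standardized = err_std, back_transformed = err_std * s)
```

A standardized RMSE well below 1 means the model beats predicting the mean; a raw-unit RMSE is best compared with the standard deviation or interquartile range of the response.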
6.3e Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
"Dominate the list" is not defined here. There are far more process predictors to begin with (45 versus 12), so by raw count they dominate automatically.
However, as expected, the LASSO model eliminates most predictors. In all, only 14 predictors keep nonzero coefficients (13 if we discount the negligible ManufacturingProcess45): 4 biological and 10 process. Roughly a third of the biological predictors survive versus under a quarter of the process predictors, and four of the five largest coefficients by magnitude are biological. Since the predictors are scaled, the coefficients are comparable, and in that sense the biological predictors "dominate the list."
It must be noted, however, that when we ran a ridge regression the results were more mixed. Interpreting coefficients is tricky under regularization, especially with predictors this collinear, because the penalty can assign the weight of a correlated group to one member more or less arbitrarily.
## 58 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.0057754459
## BiologicalMaterial02 0.1787194114
## BiologicalMaterial03 0.0779703979
## BiologicalMaterial04 .
## BiologicalMaterial05 .
## BiologicalMaterial06 .
## BiologicalMaterial07 .
## BiologicalMaterial08 0.1468960077
## BiologicalMaterial09 .
## BiologicalMaterial10 0.4169903971
## BiologicalMaterial11 .
## BiologicalMaterial12 .
## ManufacturingProcess01 -0.0847315414
## ManufacturingProcess02 .
## ManufacturingProcess03 .
## ManufacturingProcess04 -0.0273440419
## ManufacturingProcess05 .
## ManufacturingProcess06 .
## ManufacturingProcess07 -0.0254131586
## ManufacturingProcess08 .
## ManufacturingProcess09 .
## ManufacturingProcess10 .
## ManufacturingProcess11 .
## ManufacturingProcess12 0.0194014621
## ManufacturingProcess13 .
## ManufacturingProcess14 .
## ManufacturingProcess15 .
## ManufacturingProcess16 .
## ManufacturingProcess17 .
## ManufacturingProcess18 .
## ManufacturingProcess19 .
## ManufacturingProcess20 .
## ManufacturingProcess21 .
## ManufacturingProcess22 .
## ManufacturingProcess23 -0.0255878054
## ManufacturingProcess24 .
## ManufacturingProcess25 .
## ManufacturingProcess26 .
## ManufacturingProcess27 .
## ManufacturingProcess28 .
## ManufacturingProcess29 .
## ManufacturingProcess30 .
## ManufacturingProcess31 .
## ManufacturingProcess32 .
## ManufacturingProcess33 0.0558848630
## ManufacturingProcess34 .
## ManufacturingProcess35 .
## ManufacturingProcess36 -0.0346237528
## ManufacturingProcess37 .
## ManufacturingProcess38 .
## ManufacturingProcess39 -0.0482853543
## ManufacturingProcess40 .
## ManufacturingProcess41 .
## ManufacturingProcess42 -0.0421448363
## ManufacturingProcess43 .
## ManufacturingProcess44 .
## ManufacturingProcess45 -0.0001154511
## Yield .
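The nonzero coefficients in the listing above can be ranked by absolute value to see which group leads. In this sketch the values are copied from that listing; the variable names are hypothetical.

```r
# Nonzero lasso coefficients copied from the listing above
coefs <- c(
  BiologicalMaterial02   =  0.1787194114,
  BiologicalMaterial03   =  0.0779703979,
  BiologicalMaterial08   =  0.1468960077,
  BiologicalMaterial10   =  0.4169903971,
  ManufacturingProcess01 = -0.0847315414,
  ManufacturingProcess04 = -0.0273440419,
  ManufacturingProcess07 = -0.0254131586,
  ManufacturingProcess12 =  0.0194014621,
  ManufacturingProcess23 = -0.0255878054,
  ManufacturingProcess33 =  0.0558848630,
  ManufacturingProcess36 = -0.0346237528,
  ManufacturingProcess39 = -0.0482853543,
  ManufacturingProcess42 = -0.0421448363,
  ManufacturingProcess45 = -0.0001154511
)

# Rank by magnitude, since sign only indicates direction of the effect
ranked <- coefs[order(abs(coefs), decreasing = TRUE)]
group  <- ifelse(grepl("^Biological", names(ranked)), "biological", "process")

table(group)     # how many retained predictors fall in each group
head(ranked, 5)  # the five largest coefficients by magnitude
```

Comparing magnitudes this way is meaningful only because the predictors were centered and scaled before fitting.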
6.3f Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
The key to this exercise is navigating multicollinearity. By reducing the model to a smaller set of key predictors that capture most of the information in the others, the lasso simplifies the picture of the manufacturing process considerably. The retained predictors are all associated with yield, but the process predictors are the ones we can influence directly, so they offer the most leverage for improving yield. We might drill down to find the most influential process predictors and adjust those settings, while using the biological predictors to assess the quality of the input material.
## [1] "Top 5 indicators using Lasso"
## # A tibble: 5 × 2
## term estimate
## <chr> <dbl>
## 1 BiologicalMaterial10 0.417
## 2 BiologicalMaterial02 0.179
## 3 BiologicalMaterial08 0.147
## 4 ManufacturingProcess01 -0.0847
## 5 BiologicalMaterial03 0.0780