6.2, 6.3

##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60
## 'data.frame':    165 obs. of  1 variable:
##  $ permeability: num  12.52 1.12 19.41 1.73 1.68 ...

6.2b Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

The number of predictors drops from 1107 to 388.

## [1] 1107
## [1] 388
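
The filtering rule can be sketched as follows. This is a minimal Python illustration, not the R code that produced the output above; the cutoffs (a frequency ratio above 95/5 together with at most 10% distinct values) are assumed to match the documented defaults of `nearZeroVar`, and the three toy columns are made up.

```python
from collections import Counter

def near_zero_var(column, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a column looks near-zero-variance.

    Uses two statistics: the frequency ratio of the most common value
    to the second most common, and the percentage of distinct values.
    """
    counts = Counter(column).most_common()
    if len(counts) == 1:          # a single distinct value: zero variance
        return True
    freq_ratio = counts[0][1] / counts[1][1]
    percent_unique = 100.0 * len(counts) / len(column)
    return freq_ratio > freq_cut and percent_unique <= unique_cut

# Hypothetical binary fingerprint columns (not the real permeability data).
columns = {
    "balanced": [0, 1] * 50,            # 50/50 split, kept
    "rare_bit": [0] * 98 + [1] * 2,     # 98:2 ratio, dropped
    "constant": [0] * 100,              # one value only, dropped
}
kept = [name for name, values in columns.items() if not near_zero_var(values)]
print(kept)
```

Both conditions must hold for a column to be dropped, so a continuous predictor with many distinct values survives even if one value is very common.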

6.2c Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

First, some investigation. There are no missing values. Multicollinearity is a big issue here: 152 pairs of predictors have a correlation of exactly 1. If we were to model this data properly, we would remove many of these predictors. The following are just the first 10 perfectly correlated pairs:

##    col1 col2 correlation
## 1   X16 X133           1
## 2   X16 X138           1
## 3   X25  X37           1
## 4   X25  X49           1
## 5   X25  X71           1
## 6   X25  X72           1
## 8   X37  X49           1
## 9   X37  X71           1
## 10  X37  X72           1
## 13  X49  X71           1
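
A search for such pairs might look like the sketch below. It is a plain Python illustration under assumptions: the toy columns are invented, and constant columns are assumed to have been removed already (they would cause a division by zero in the correlation).

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(columns, cutoff):
    """List (name1, name2, r) for each unordered pair with |r| >= cutoff."""
    pairs = []
    for (n1, c1), (n2, c2) in combinations(columns.items(), 2):
        r = pearson(c1, c2)
        if abs(r) >= cutoff:
            pairs.append((n1, n2, r))
    return pairs

# Made-up fingerprint-style columns; B duplicates A.
toy = {
    "A": [0, 1, 0, 1, 1, 0],
    "B": [0, 1, 0, 1, 1, 0],
    "C": [1, 1, 0, 0, 1, 0],
}
# A tiny tolerance guards against floating-point rounding.
perfect = correlated_pairs(toy, cutoff=1.0 - 1e-12)
print(perfect)
```

Each unordered pair is visited once, which is why the table above lists X16/X133 but not X133/X16.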

PCA demonstrates that 95% of the variance is explained by only 27 components.
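
The component count comes from a cumulative sum over the component variances. A minimal sketch, using made-up variances rather than the real PCA result:

```python
def components_for(variances, target=0.95):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches the target."""
    total = sum(variances)
    running = 0.0
    for k, v in enumerate(sorted(variances, reverse=True), start=1):
        running += v
        if running / total >= target:
            return k
    return len(variances)

# Illustrative component variances only (not from the permeability data).
made_up = [5.0, 2.5, 1.5, 0.6, 0.3, 0.1]
needed = components_for(made_up)
print(needed)
```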

Thus, PLS should prove to be an effective model for this dataset. Now we separate the data into train and test sets and tune a PLS model.

## Data:    X dimension: 133 388 
##  Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps
## X           21.19    34.57    39.41    43.95    46.92
## .outcome    30.06    46.23    56.31    62.99    69.42

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 120, 119, 119, 117, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.11181  0.3071258   9.869550
##    2     11.95488  0.4281023   8.627336
##    3     11.90545  0.4379496   8.980025
##    4     12.08244  0.4276741   9.010439
##    5     11.94569  0.4401998   8.775094
##    6     12.07100  0.4396672   8.871424
##    7     12.13576  0.4351574   9.175004
##    8     12.27294  0.4291498   9.316122
##    9     12.31745  0.4359112   9.406491
##   10     12.56289  0.4220734   9.438185
##   11     12.81099  0.4077647   9.634144
##   12     13.08730  0.3975245   9.990265
##   13     13.29408  0.3876000  10.101604
##   14     13.41767  0.3825028  10.079740
##   15     13.91744  0.3608912  10.370610
##   16     14.48368  0.3437767  10.801940
##   17     14.73114  0.3354244  10.943813
##   18     14.92742  0.3366052  11.041042
##   19     14.94452  0.3315317  10.935378
##   20     15.11929  0.3258842  11.063864
##   21     15.27214  0.3152969  11.149444
##   22     15.20330  0.3123669  11.082855
##   23     15.26548  0.3090357  11.116738
##   24     15.24906  0.3085545  11.128537
##   25     15.36545  0.3106150  11.197423
##   26     15.55322  0.3048577  11.316846
##   27     15.77497  0.2964688  11.447383
##   28     15.90463  0.2951896  11.543086
##   29     16.13207  0.2860465  11.601928
##   30     16.36892  0.2754329  11.714083
##   31     16.58035  0.2648424  11.802269
##   32     16.53970  0.2680147  11.728692
##   33     16.44082  0.2741822  11.679299
##   34     16.47567  0.2746826  11.729323
##   35     16.54971  0.2722948  11.817746
##   36     16.55575  0.2724576  11.897099
##   37     16.60075  0.2700206  11.997000
##   38     16.68565  0.2673892  12.159780
##   39     16.72789  0.2681962  12.204228
##   40     16.81429  0.2690302  12.291826
##   41     16.99750  0.2695893  12.444222
##   42     17.20090  0.2664292  12.592565
##   43     17.49048  0.2622320  12.776624
##   44     17.74335  0.2572482  12.927515
##   45     17.91479  0.2563532  13.044397
##   46     18.11550  0.2537098  13.168929
##   47     18.26521  0.2524919  13.277247
##   48     18.39843  0.2508626  13.378727
##   49     18.54046  0.2494562  13.478581
##   50     18.64519  0.2497707  13.564966
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.

The highest resampled R2 is 0.44, reached with 5 latent variables (ncomp = 5).
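
The choice of ncomp depends on which metric drives the selection. The sketch below copies the first ten rows of the resampling table (the remaining rows are worse on both metrics) and picks the best row by each criterion; it is a Python illustration, not part of the original R workflow.

```python
# (ncomp, RMSE, Rsquared) copied from the first ten rows of the table above.
resampled = [
    (1, 13.11181, 0.3071258),
    (2, 11.95488, 0.4281023),
    (3, 11.90545, 0.4379496),
    (4, 12.08244, 0.4276741),
    (5, 11.94569, 0.4401998),
    (6, 12.07100, 0.4396672),
    (7, 12.13576, 0.4351574),
    (8, 12.27294, 0.4291498),
    (9, 12.31745, 0.4359112),
    (10, 12.56289, 0.4220734),
]

best_by_r2 = max(resampled, key=lambda row: row[2])[0]    # largest R2
best_by_rmse = min(resampled, key=lambda row: row[1])[0]  # smallest RMSE
print(best_by_r2, best_by_rmse)
```

Selecting on R2 gives 5 components, while selecting on RMSE would give 3; the differences among 3, 5 and 6 components are small in either metric.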

6.2d Predict the response for the test set. What is the test set estimate of R2?

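Surprisingly, the model performed better on the test set, with an R2 of 0.50. With only 32 test samples (165 observations minus the 133 used for training), this estimate is noisy, so the gap over the resampled 0.44 should not be over-read.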

## [1] 0.5037753
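
R2 can be computed in more than one way, and the two common definitions can disagree. The sketch below contrasts the squared correlation between observed and predicted values (which, as far as we know, is what caret reports by default) with the traditional 1 - SSE/SST. The data are invented.

```python
from math import sqrt

def r2_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    return (sop / sqrt(soo * spp)) ** 2

def r2_traditional(observed, predicted):
    """1 - SSE/SST, which penalizes systematic bias in the predictions."""
    mo = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mo) ** 2 for o in observed)
    return 1.0 - sse / sst

# Made-up example: predictions are perfectly ordered but biased by +1.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(r2_correlation(observed, predicted), r2_traditional(observed, predicted))
```

The squared correlation ignores the constant offset, so it stays at 1, whereas the traditional form drops sharply. It is worth knowing which definition a reported R2 uses before comparing models.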

6.2e Try building other models discussed in this chapter. Do any have better predictive performance? Would you recommend any of your models to replace the permeability laboratory experiment?

We try LASSO. It performs better, with an R2 of 0.60, so on predictive performance alone I would prefer it to the PLS model.

However, I would most likely clean the dataset first. There are more predictors than observations, so we won’t be able to perform OLS as things stand. However, there are over 2,000 unique pairs of predictors with correlations above 0.90, including 152 pairs of perfectly correlated predictors. If we were to remove predictors that give us virtually no additional information, we might be able to reduce the number of predictors far enough to perform OLS. In any case, cleaning the dataset first would be the best option, no matter what technique we use.

##           Length Class     Mode   
## a0          1    -none-    numeric
## beta      388    dgCMatrix S4     
## df          1    -none-    numeric
## dim         2    -none-    numeric
## lambda      1    -none-    numeric
## dev.ratio   1    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        5    -none-    call   
## nobs        1    -none-    numeric
## [1] 0.6026641
## [1] "Unique pairs of predictors: 2029"
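
One simple way to thin out correlated predictors is a greedy pass that keeps a column only if it is not too strongly correlated with any column already kept. The sketch below is an assumption-laden Python illustration: it is not the algorithm used by caret's `findCorrelation`, the 0.90 cutoff mirrors the threshold discussed above, and the toy columns are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def greedy_filter(columns, cutoff=0.90):
    """Keep each column, in order, unless |r| with a kept column >= cutoff."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < cutoff for k in kept):
            kept.append(name)
    return kept

# Made-up columns: B and C are exact linear functions of A.
toy = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 12],
    "C": [6, 5, 4, 3, 2, 1],
    "D": [1, 3, 2, 5, 4, 4],
}
survivors = greedy_filter(toy)
print(survivors)
```

The result depends on column order, since the first member of each correlated group is the one retained.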

6.3

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.

6.3(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

The number of missing values is very low, although one predictor (ManufacturingProcess03) has 15 missing values, or about 8.5% of the 176 runs. Taking a “simple is best” approach, we impute the median, since some predictors are visibly skewed and the median is more robust than the mean.
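
Median imputation is simple enough to sketch directly. This Python illustration uses made-up values and represents a missing cell as `None`; it also reproduces the missing-value share quoted above.

```python
from statistics import median

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

# Made-up column with two missing cells.
example = [1.53, None, 1.54, 1.55, None, 1.60]
filled = impute_median(example)

# Share of missing cells in the worst predictor: 15 of 176 runs.
missing_share = 100 * 15 / 176
print(filled, round(missing_share, 1))
```

When the data are later split, computing the medians on the training rows only and reusing them for the test rows avoids leaking test information into the model.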

The data is marked by a great degree of multicollinearity. For example, 15 predictor pairs have correlations of 0.90 or higher. This is especially true among the manufacturing process predictors, which account for 12 of the 15 pairs.

##      Yield       BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##  Min.   :35.25   Min.   :4.580        Min.   :46.87        Min.   :56.97       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68        1st Qu.:64.98       
##  Median :39.97   Median :6.305        Median :55.09        Median :67.22       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69        Mean   :67.70       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74        3rd Qu.:70.43       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75        Max.   :78.25       
##                                                                                
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##  Min.   : 9.38        Min.   :13.24        Min.   :40.60       
##  1st Qu.:11.24        1st Qu.:17.23        1st Qu.:46.05       
##  Median :12.10        Median :18.49        Median :48.46       
##  Mean   :12.35        Mean   :18.60        Mean   :48.91       
##  3rd Qu.:13.22        3rd Qu.:19.90        3rd Qu.:51.34       
##  Max.   :23.09        Max.   :24.85        Max.   :59.38       
##                                                                
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##  Min.   :100.0        Min.   :15.88        Min.   :11.44       
##  1st Qu.:100.0        1st Qu.:17.06        1st Qu.:12.60       
##  Median :100.0        Median :17.51        Median :12.84       
##  Mean   :100.0        Mean   :17.49        Mean   :12.85       
##  3rd Qu.:100.0        3rd Qu.:17.88        3rd Qu.:13.13       
##  Max.   :100.8        Max.   :19.14        Max.   :14.08       
##                                                                
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##  Min.   :1.770        Min.   :135.8        Min.   :18.35       
##  1st Qu.:2.460        1st Qu.:143.8        1st Qu.:19.73       
##  Median :2.710        Median :146.1        Median :20.12       
##  Mean   :2.801        Mean   :147.0        Mean   :20.20       
##  3rd Qu.:2.990        3rd Qu.:149.6        3rd Qu.:20.75       
##  Max.   :6.870        Max.   :158.7        Max.   :22.21       
##                                                                
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.30          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.54          
##  Mean   :11.21          Mean   :16.68          Mean   :1.54          
##  3rd Qu.:12.15          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  NA's   :1              NA's   :3              NA's   :15            
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.2         Median :206.8         
##  Mean   :931.9          Mean   :1001.7         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.9         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  NA's   :1              NA's   :1              NA's   :2             
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.100         Median : 9.400         Median :   0.0        
##  Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
##  3rd Qu.: 9.550         3rd Qu.: 9.900         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  NA's   :9              NA's   :10             NA's   :1             
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4854           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##                         NA's   :1                                    
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##                                                                      
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##                                                                      
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  NA's   :1              NA's   :1              NA's   :1             
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4832           1st Qu.:6020           1st Qu.:4560          
##  Median :4855           Median :6047           Median :4587          
##  Mean   :4828           Mean   :6016           Mean   :4563          
##  3rd Qu.:4877           3rd Qu.:6070           3rd Qu.:4609          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
##  Median :10.400         Median :19.90          Median : 9.100        
##  Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
##  3rd Qu.:10.750         3rd Qu.:20.40          3rd Qu.: 9.700        
##  Max.   :11.500         Max.   :22.00          Max.   :11.200        
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.80          Median :158.0          Median :64.00         
##  Mean   :70.18          Mean   :158.5          Mean   :63.54         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  NA's   :5                                     NA's   :5             
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :495.0          Median :0.02000       
##  Mean   :2.494          Mean   :495.6          Mean   :0.01957       
##  3rd Qu.:2.500          3rd Qu.:501.5          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##                                                                      
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600         
## 

##                      col1                   col2 correlation
## 1    BiologicalMaterial02   BiologicalMaterial06   0.9543113
## 2    BiologicalMaterial04   BiologicalMaterial10   0.9205870
## 5    BiologicalMaterial11   BiologicalMaterial12   0.9037209
## 7  ManufacturingProcess18 ManufacturingProcess20   0.9917474
## 9  ManufacturingProcess25 ManufacturingProcess26   0.9975405
## 10 ManufacturingProcess25 ManufacturingProcess27   0.9935088
## 11 ManufacturingProcess25 ManufacturingProcess29   0.9371902
## 12 ManufacturingProcess25 ManufacturingProcess31   0.9713249
## 14 ManufacturingProcess26 ManufacturingProcess27   0.9960843
## 15 ManufacturingProcess26 ManufacturingProcess29   0.9516584
## 16 ManufacturingProcess26 ManufacturingProcess31   0.9606619
## 19 ManufacturingProcess27 ManufacturingProcess29   0.9509045
## 20 ManufacturingProcess27 ManufacturingProcess31   0.9542242
## 27 ManufacturingProcess40 ManufacturingProcess41   0.9245071
## 29 ManufacturingProcess42 ManufacturingProcess44   0.9487495

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

We choose LASSO regression for our model. LASSO has the advantage that it shrinks the coefficients of unneeded predictors to exactly zero, eliminating them from the model.
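
The elimination comes from the soft-thresholding step at the heart of the LASSO. The sketch below shows that operator on hypothetical unpenalized coefficients; it gives the exact LASSO solution only in the idealized case of orthonormal predictors, and the penalty value is an arbitrary choice for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical unpenalized coefficients and penalty (not fitted values).
unpenalized = [0.50, -0.12, 0.03, -0.02]
shrunk = [soft_threshold(z, lam=0.05) for z in unpenalized]
print(shrunk)
```

Large coefficients are reduced by the penalty, while small ones become exactly zero, which is what removes predictors from the model.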

To compare models, we use the RMSE, which is expressed in the units of the response and so gives an intuitive sense of the typical size of the errors our model is generating.

The RMSE on the training set is .40.

## [1] "Train"
## [1] 0.4032026

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

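On the test set, the RMSE is 0.56, somewhat higher than the 0.40 on the training set, as we would expect for held-out data. Against a mean Yield of about 40 this looks small, which suggests we are getting reasonably good predictions. A fairer yardstick, however, is the spread of Yield (interquartile range 38.75 to 41.48) rather than its mean. In addition, the near-zero intercept in the coefficient table below suggests the response was centered, and possibly scaled, before fitting; if so, these RMSE values are in the transformed units rather than in percent yield.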

## [1] "Test"
## [1] 0.5644791
## [1] "Mean of yield: 40.1765340909091"
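
RMSE is easiest to judge against a baseline that always predicts the mean of the response. A minimal Python sketch with made-up yields and predictions:

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))

# Made-up yields and predictions (not the real test set).
observed = [39.0, 40.0, 41.0, 42.0]
predicted = [39.5, 39.5, 41.5, 41.5]

# Baseline: predict the mean of the observed values for every run.
mean_yield = sum(observed) / len(observed)
baseline = [mean_yield] * len(observed)

model_rmse = rmse(observed, predicted)
baseline_rmse = rmse(observed, baseline)
print(model_rmse, baseline_rmse)
```

A model is only useful to the extent that its RMSE falls below the baseline value, which equals the standard deviation of the response (population form).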

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

“Dominate the list” is not defined here. There are far more process predictors to begin with (45 versus 12), so by count alone they automatically dominate the list.

However, as expected, the LASSO model eliminates a number of predictors. In all, only 14 predictors keep a nonzero coefficient in the table below: 4 of the 12 biological predictors and 10 of the 45 process predictors. The share of biological predictors retained is therefore higher than that of the process predictors, and the biological predictors also have the largest coefficients. Since the predictors are scaled, we may consider them to “dominate the list.”

It must be noted, however, that when we ran a ridge regression, the results were more mixed. Interpretation is tricky when regularization is present.

## 58 x 1 sparse Matrix of class "dgCMatrix"
##                                   s0
## (Intercept)             0.0057754459
## BiologicalMaterial02    0.1787194114
## BiologicalMaterial03    0.0779703979
## BiologicalMaterial04    .           
## BiologicalMaterial05    .           
## BiologicalMaterial06    .           
## BiologicalMaterial07    .           
## BiologicalMaterial08    0.1468960077
## BiologicalMaterial09    .           
## BiologicalMaterial10    0.4169903971
## BiologicalMaterial11    .           
## BiologicalMaterial12    .           
## ManufacturingProcess01 -0.0847315414
## ManufacturingProcess02  .           
## ManufacturingProcess03  .           
## ManufacturingProcess04 -0.0273440419
## ManufacturingProcess05  .           
## ManufacturingProcess06  .           
## ManufacturingProcess07 -0.0254131586
## ManufacturingProcess08  .           
## ManufacturingProcess09  .           
## ManufacturingProcess10  .           
## ManufacturingProcess11  .           
## ManufacturingProcess12  0.0194014621
## ManufacturingProcess13  .           
## ManufacturingProcess14  .           
## ManufacturingProcess15  .           
## ManufacturingProcess16  .           
## ManufacturingProcess17  .           
## ManufacturingProcess18  .           
## ManufacturingProcess19  .           
## ManufacturingProcess20  .           
## ManufacturingProcess21  .           
## ManufacturingProcess22  .           
## ManufacturingProcess23 -0.0255878054
## ManufacturingProcess24  .           
## ManufacturingProcess25  .           
## ManufacturingProcess26  .           
## ManufacturingProcess27  .           
## ManufacturingProcess28  .           
## ManufacturingProcess29  .           
## ManufacturingProcess30  .           
## ManufacturingProcess31  .           
## ManufacturingProcess32  .           
## ManufacturingProcess33  0.0558848630
## ManufacturingProcess34  .           
## ManufacturingProcess35  .           
## ManufacturingProcess36 -0.0346237528
## ManufacturingProcess37  .           
## ManufacturingProcess38  .           
## ManufacturingProcess39 -0.0482853543
## ManufacturingProcess40  .           
## ManufacturingProcess41  .           
## ManufacturingProcess42 -0.0421448363
## ManufacturingProcess43  .           
## ManufacturingProcess44  .           
## ManufacturingProcess45 -0.0001154511
## Yield                   .
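
The retained predictors can be tallied by group directly from the table. The sketch below copies the nonzero coefficients shown above (intercept excluded) and takes the group sizes, 12 biological and 45 process, from the problem statement.

```python
# Nonzero coefficients copied from the sparse matrix above.
nonzero = {
    "BiologicalMaterial02": 0.1787194114,
    "BiologicalMaterial03": 0.0779703979,
    "BiologicalMaterial08": 0.1468960077,
    "BiologicalMaterial10": 0.4169903971,
    "ManufacturingProcess01": -0.0847315414,
    "ManufacturingProcess04": -0.0273440419,
    "ManufacturingProcess07": -0.0254131586,
    "ManufacturingProcess12": 0.0194014621,
    "ManufacturingProcess23": -0.0255878054,
    "ManufacturingProcess33": 0.0558848630,
    "ManufacturingProcess36": -0.0346237528,
    "ManufacturingProcess39": -0.0482853543,
    "ManufacturingProcess42": -0.0421448363,
    "ManufacturingProcess45": -0.0001154511,
}
totals = {"BiologicalMaterial": 12, "ManufacturingProcess": 45}

# Count retained predictors per group and the share of each group retained.
kept = {g: sum(name.startswith(g) for name in nonzero) for g in totals}
share = {g: kept[g] / totals[g] for g in totals}
print(kept, share)
```

About a third of the biological predictors survive, against roughly a fifth of the process predictors.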

6.3f Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The key to this exercise is navigating multicollinearity. Reducing the model to a smaller set of key predictors that capture most of the information in the others greatly simplifies the picture of the manufacturing process. All of the retained predictors are associated with yield, but the process predictors are the ones over which we have the most influence. We might drill down to find the strongest process predictors, or we might use the biological ones to better assess quality.

## [1] "Top 5 indicators using Lasso"
## # A tibble: 5 × 2
##   term                   estimate
##   <chr>                     <dbl>
## 1 BiologicalMaterial10     0.417 
## 2 BiologicalMaterial02     0.179 
## 3 BiologicalMaterial08     0.147 
## 4 ManufacturingProcess01  -0.0847
## 5 BiologicalMaterial03     0.0780
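
The ranking above orders coefficients by absolute value, so a negative process coefficient can outrank a smaller positive biological one. A short Python sketch using the seven largest coefficients (in magnitude) from the LASSO output:

```python
# Seven largest LASSO coefficients in magnitude, copied from the output above.
coefficients = {
    "BiologicalMaterial02": 0.1787194114,
    "BiologicalMaterial03": 0.0779703979,
    "BiologicalMaterial08": 0.1468960077,
    "BiologicalMaterial10": 0.4169903971,
    "ManufacturingProcess01": -0.0847315414,
    "ManufacturingProcess33": 0.0558848630,
    "ManufacturingProcess39": -0.0482853543,
}

# Rank by magnitude, then record the sign of each top coefficient.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
top5 = [name for name, _ in ranked[:5]]
direction = {name: ("positive" if value > 0 else "negative") for name, value in ranked[:5]}
print(top5)
print(direction)
```

The sign indicates the direction of the fitted association with yield, holding the other retained predictors fixed; with correlated predictors it should be read as a description of the model rather than as a causal effect.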