6.2.

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

The fingerprints matrix contains 1,107 columns (predictors) for the 165 compounds.

dim(fingerprints)
## [1]  165 1107

Only 388 predictors remain after applying the nearZeroVar function.

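library(caret)  # provides nearZeroVar, createDataPartition, train

nzv <- nearZeroVar(fingerprints)  # indices of near-zero-variance columns
fingerprints_reduced <- fingerprints[, -nzv]
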
dim(fingerprints_reduced)
## [1] 165 388
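
To make the filtering step less of a black box, here is a rough base-R sketch of the two quantities nearZeroVar is understood to examine: the ratio of the most common value to the second most common value, and the percentage of distinct values. The toy matrix, the column names, and the cutoffs (95/5 and 10, which I believe are caret's defaults) are assumptions for illustration, and this is not caret's actual implementation.

```r
# Toy binary fingerprint matrix (hypothetical data, not the permeability set).
set.seed(1)
toy <- cbind(
  common = rbinom(100, 1, 0.5),   # substructure present in about half the rows
  rare   = c(1, rep(0, 99)),      # present in a single row
  absent = rep(0, 100)            # never present (zero variance)
)

# Ratio of the most common value's count to the second most common value's count.
freq_ratio <- apply(toy, 2, function(col) {
  counts <- sort(table(col), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)   # only one distinct value
  as.numeric(counts[1] / counts[2])
})

# Percentage of distinct values relative to the number of rows.
pct_unique <- apply(toy, 2, function(col) 100 * length(unique(col)) / length(col))

# Flag columns using the assumed cutoffs (95/5 and 10).
flagged <- freq_ratio > 95 / 5 & pct_unique <= 10
flagged
```

In this sketch the rare and absent columns are flagged, while the common column is kept, which mirrors why sparse fingerprint columns are dropped.
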
  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?

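80% of the compounds (133 of 165) are used for training, and the remaining 32 are held out as the test set.

With R^2 as the selection metric under 10-fold cross-validation, ncomp = 7 is optimal, with a corresponding resampled R^2 of 0.5394.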

set.seed(1)

train_selection <- createDataPartition(permeability, p = .80, list = FALSE)

train_x <- fingerprints_reduced[train_selection, ]
test_x <- fingerprints_reduced[-train_selection, ]
train_y <- permeability[train_selection, ]
test_y <- permeability[-train_selection, ]

PLS_fit <- train(x = train_x, y = train_y,
                method = "pls",
                metric = "Rsquared",
                tuneLength = 20, 
                trControl = trainControl(method = "cv"), 
                preProcess = c('center', 'scale')
          )
PLS_result <- PLS_fit$results
PLS_fit
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 121, 119, 120, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.35980  0.3077288   9.992025
##    2     12.02165  0.4264926   8.622295
##    3     11.61784  0.4770111   8.806476
##    4     11.48283  0.4836406   8.724123
##    5     11.26663  0.5082593   8.723191
##    6     11.27614  0.5240074   8.677272
##    7     11.27779  0.5394212   8.699414
##    8     11.75960  0.5085816   9.070277
##    9     12.09836  0.4868781   9.217395
##   10     12.17397  0.4923159   9.270026
##   11     12.26365  0.4945505   9.284900
##   12     12.30174  0.4931934   9.280955
##   13     12.54734  0.4838175   9.507611
##   14     12.79749  0.4702011   9.705002
##   15     12.86872  0.4653057   9.643805
##   16     13.10561  0.4553447   9.865946
##   17     13.49633  0.4417598  10.152908
##   18     13.81425  0.4309356  10.373471
##   19     13.95274  0.4321740  10.471727
##   20     14.32902  0.4230356  10.692783
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 7.
plot(PLS_fit)
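
The choice of ncomp depends on the metric used for selection. The sketch below copies the first eight rows of the resampling table printed above into a data frame (the name pls_results is mine; in a live session PLS_fit$results would hold the same columns) and compares the selection by R^2 with the selection by RMSE.

```r
# First eight rows of the resampling table printed above.
pls_results <- data.frame(
  ncomp    = 1:8,
  RMSE     = c(13.35980, 12.02165, 11.61784, 11.48283,
               11.26663, 11.27614, 11.27779, 11.75960),
  Rsquared = c(0.3077288, 0.4264926, 0.4770111, 0.4836406,
               0.5082593, 0.5240074, 0.5394212, 0.5085816)
)

# Number of components picked by each criterion.
best_by_r2   <- pls_results$ncomp[which.max(pls_results$Rsquared)]
best_by_rmse <- pls_results$ncomp[which.min(pls_results$RMSE)]
c(Rsquared = best_by_r2, RMSE = best_by_rmse)
```

R^2 selects 7 components while RMSE selects 5, and the RMSE values for 5, 6, and 7 components differ only in the second decimal place, so a simpler model would be a defensible choice.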

  1. Predict the response for the test set. What is the test set estimate of R^2?

The test set estimate of R^2 is 0.4457, lower than the resampled estimate of 0.5394 from the training set.

PLS_predict <- predict(PLS_fit, newdata=test_x)
postResample(pred=PLS_predict, obs=test_y)
##       RMSE   Rsquared        MAE 
## 10.8979377  0.4457228  8.4569465
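
For reference, the three numbers reported by postResample can be reproduced by hand. The sketch below uses made-up obs and pred vectors and assumes, as I understand caret's default, that Rsquared is the squared Pearson correlation between predictions and observations rather than 1 - SSE/SST.

```r
# Hypothetical observed and predicted values, for illustration only.
obs  <- c(1.2, 3.4, 0.5, 12.0, 25.3, 7.7)
pred <- c(2.0, 2.9, 1.5, 10.1, 19.8, 9.0)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
r2   <- cor(pred, obs)^2             # squared correlation (assumed definition)
mae  <- mean(abs(pred - obs))        # mean absolute error

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```
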
  1. Try building other models discussed in this chapter. Do any have better predictive performance?

Ridge regression.

R^2 was used to select the optimal model, which corresponds to lambda = 0.4.

set.seed(1)
ridge_fit <- train(x=train_x, y=train_y,
                  method='ridge', metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
ridge_fit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     24.02881  0.2627497  16.031438
##   0.1     11.83292  0.5081518   9.092937
##   0.2     11.58981  0.5330773   8.857077
##   0.3     11.79414  0.5390176   9.009380
##   0.4     12.15851  0.5405512   9.284284
##   0.5     12.62564  0.5397839   9.662863
##   0.6     13.17255  0.5384263  10.131791
##   0.7     13.76370  0.5366428  10.603431
##   0.8     14.40875  0.5347296  11.117149
##   0.9     15.08011  0.5328512  11.635077
##   1.0     15.78570  0.5309111  12.226047
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.4.
plot(ridge_fit)

Lasso regression.

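R^2 was used to select the optimal model, which corresponds to fraction = 0.15. The warnings and the NaN at fraction = 0 arise because every coefficient is shrunk to zero there, so the predictions are constant and R^2 cannot be computed.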

set.seed(1)
lasso_fit <- train(x=train_x, y=train_y,
                  method='lasso',metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x = train_x, y = train_y, method = "lasso", metric
## = "Rsquared", : missing values found in aggregated results
lasso_fit
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      15.69625        NaN  12.538724
##   0.05      12.24969  0.4843311   9.116047
##   0.10      11.56814  0.4836731   8.205552
##   0.15      11.56574  0.4898894   8.221542
##   0.20      11.79666  0.4782172   8.428298
##   0.25      12.13109  0.4736244   8.810463
##   0.30      12.59829  0.4660510   9.277051
##   0.35      13.32724  0.4463834   9.791144
##   0.40      14.14116  0.4245198  10.327665
##   0.45      14.99053  0.4048185  10.809800
##   0.50      15.82414  0.3884961  11.289059
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.15.
plot(lasso_fit)

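Ridge has the highest mean resampled R^2 (0.5406), but the margin over PLS (0.5394) is negligible, and PLS has the lowest mean RMSE (11.28 versus 12.16 for ridge and 11.57 for the lasso). The lasso has the lowest mean MAE (8.22). None of the alternatives is clearly better than PLS.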

summary(resamples(list(PLS=PLS_fit, Ridge=ridge_fit, Lasso=lasso_fit)))
## 
## Call:
## summary.resamples(object = resamples(list(PLS = PLS_fit, Ridge
##  = ridge_fit, Lasso = lasso_fit)))
## 
## Models: PLS, Ridge, Lasso 
## Number of resamples: 10 
## 
## MAE 
##           Min.  1st Qu.   Median     Mean   3rd Qu.     Max. NA's
## PLS   6.543535 7.220561 8.824647 8.699414  9.785602 10.93386    0
## Ridge 6.802083 7.680192 9.047882 9.284284 11.031392 11.95240    0
## Lasso 6.660690 7.295599 7.918300 8.221542  8.329301 11.98442    0
## 
## RMSE 
##           Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS   8.018162  9.396614 10.99087 11.27779 13.14769 15.24661    0
## Ridge 9.902231 10.446352 11.48805 12.15851 13.75999 15.48795    0
## Lasso 8.975670 10.637842 11.16021 11.56574 12.78378 14.84629    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS   0.04977656 0.4365052 0.5847232 0.5394212 0.7029962 0.8759470    0
## Ridge 0.18314118 0.4019344 0.5990213 0.5405512 0.6504959 0.8222652    0
## Lasso 0.18010290 0.3056181 0.4640828 0.4898894 0.6740984 0.8337504    0
  1. Would you recommend any of your models to replace the permeability laboratory experiment?

The mean absolute error of the three models I investigated is roughly 8 to 9 permeability units. The histogram of permeability below shows that the distribution is right-skewed, with most values under 10. An error of that size is about as large as a typical measurement, so I would not recommend replacing the lab experiment with any of these models.

hist(permeability)


6.3.

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
head(ChemicalManufacturingProcess)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                     NA                     NA                     NA
## 2                    0.0                      0                     NA
## 3                    0.0                      0                     NA
## 4                    0.0                      0                     NA
## 5                   10.7                      0                     NA
## 6                   12.0                      0                     NA
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                     NA                     NA                     NA
## 2                    917                 1032.2                  210.0
## 3                    912                 1003.6                  207.1
## 4                    911                 1014.6                  213.3
## 5                    918                 1027.5                  205.7
## 6                    924                 1016.8                  208.9
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                     NA                     NA                  43.00
## 2                    177                    178                  46.57
## 3                    178                    178                  45.07
## 4                    177                    177                  44.92
## 5                    178                    178                  44.96
## 6                    178                    178                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                     NA                     NA                     NA
## 2                     NA                     NA                      0
## 3                     NA                     NA                      0
## 4                     NA                     NA                      0
## 5                     NA                     NA                      0
## 6                     NA                     NA                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                     NA                     NA                     NA
## 2                      3                      0                      3
## 3                      4                      1                      4
## 4                      5                      2                      5
## 5                      8                      4                     18
## 6                      9                      1                      1
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                     NA                     NA                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Below is a summary of ChemicalManufacturingProcess. Many of the manufacturing process predictors contain NA's.

summary(ChemicalManufacturingProcess)
##      Yield       BiologicalMaterial01 BiologicalMaterial02
##  Min.   :35.25   Min.   :4.580        Min.   :46.87       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68       
##  Median :39.97   Median :6.305        Median :55.09       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75       
##                                                           
##  BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
##  Min.   :56.97        Min.   : 9.38        Min.   :13.24       
##  1st Qu.:64.98        1st Qu.:11.24        1st Qu.:17.23       
##  Median :67.22        Median :12.10        Median :18.49       
##  Mean   :67.70        Mean   :12.35        Mean   :18.60       
##  3rd Qu.:70.43        3rd Qu.:13.22        3rd Qu.:19.90       
##  Max.   :78.25        Max.   :23.09        Max.   :24.85       
##                                                                
##  BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
##  Min.   :40.60        Min.   :100.0        Min.   :15.88       
##  1st Qu.:46.05        1st Qu.:100.0        1st Qu.:17.06       
##  Median :48.46        Median :100.0        Median :17.51       
##  Mean   :48.91        Mean   :100.0        Mean   :17.49       
##  3rd Qu.:51.34        3rd Qu.:100.0        3rd Qu.:17.88       
##  Max.   :59.38        Max.   :100.8        Max.   :19.14       
##                                                                
##  BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
##  Min.   :11.44        Min.   :1.770        Min.   :135.8       
##  1st Qu.:12.60        1st Qu.:2.460        1st Qu.:143.8       
##  Median :12.84        Median :2.710        Median :146.1       
##  Mean   :12.85        Mean   :2.801        Mean   :147.0       
##  3rd Qu.:13.13        3rd Qu.:2.990        3rd Qu.:149.6       
##  Max.   :14.08        Max.   :6.870        Max.   :158.7       
##                                                                
##  BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
##  Min.   :18.35        Min.   : 0.00          Min.   : 0.00         
##  1st Qu.:19.73        1st Qu.:10.80          1st Qu.:19.30         
##  Median :20.12        Median :11.40          Median :21.00         
##  Mean   :20.20        Mean   :11.21          Mean   :16.68         
##  3rd Qu.:20.75        3rd Qu.:12.15          3rd Qu.:21.50         
##  Max.   :22.21        Max.   :14.10          Max.   :22.50         
##                       NA's   :1              NA's   :3             
##  ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
##  Min.   :1.47           Min.   :911.0          Min.   : 923.0        
##  1st Qu.:1.53           1st Qu.:928.0          1st Qu.: 986.8        
##  Median :1.54           Median :934.0          Median : 999.2        
##  Mean   :1.54           Mean   :931.9          Mean   :1001.7        
##  3rd Qu.:1.55           3rd Qu.:936.0          3rd Qu.:1008.9        
##  Max.   :1.60           Max.   :946.0          Max.   :1175.3        
##  NA's   :15             NA's   :1              NA's   :1             
##  ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
##  Min.   :203.0          Min.   :177.0          Min.   :177.0         
##  1st Qu.:205.7          1st Qu.:177.0          1st Qu.:177.0         
##  Median :206.8          Median :177.0          Median :178.0         
##  Mean   :207.4          Mean   :177.5          Mean   :177.6         
##  3rd Qu.:208.7          3rd Qu.:178.0          3rd Qu.:178.0         
##  Max.   :227.4          Max.   :178.0          Max.   :178.0         
##  NA's   :2              NA's   :1              NA's   :1             
##  ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
##  Min.   :38.89          Min.   : 7.500         Min.   : 7.500        
##  1st Qu.:44.89          1st Qu.: 8.700         1st Qu.: 9.000        
##  Median :45.73          Median : 9.100         Median : 9.400        
##  Mean   :45.66          Mean   : 9.179         Mean   : 9.386        
##  3rd Qu.:46.52          3rd Qu.: 9.550         3rd Qu.: 9.900        
##  Max.   :49.36          Max.   :11.600         Max.   :11.500        
##                         NA's   :9              NA's   :10            
##  ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
##  Min.   :   0.0         Min.   :32.10          Min.   :4701          
##  1st Qu.:   0.0         1st Qu.:33.90          1st Qu.:4828          
##  Median :   0.0         Median :34.60          Median :4856          
##  Mean   : 857.8         Mean   :34.51          Mean   :4854          
##  3rd Qu.:   0.0         3rd Qu.:35.20          3rd Qu.:4882          
##  Max.   :4549.0         Max.   :38.60          Max.   :5055          
##  NA's   :1                                     NA's   :1             
##  ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
##  Min.   :5904           Min.   :   0           Min.   :31.30         
##  1st Qu.:6010           1st Qu.:4561           1st Qu.:33.50         
##  Median :6032           Median :4588           Median :34.40         
##  Mean   :6039           Mean   :4566           Mean   :34.34         
##  3rd Qu.:6061           3rd Qu.:4619           3rd Qu.:35.10         
##  Max.   :6233           Max.   :4852           Max.   :40.00         
##                                                                      
##  ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
##  Min.   :   0           Min.   :5890           Min.   :   0          
##  1st Qu.:4813           1st Qu.:6001           1st Qu.:4553          
##  Median :4835           Median :6022           Median :4582          
##  Mean   :4810           Mean   :6028           Mean   :4556          
##  3rd Qu.:4862           3rd Qu.:6050           3rd Qu.:4610          
##  Max.   :4971           Max.   :6146           Max.   :4759          
##                                                                      
##  ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
##  Min.   :-1.8000        Min.   : 0.000         Min.   :0.000         
##  1st Qu.:-0.6000        1st Qu.: 3.000         1st Qu.:2.000         
##  Median :-0.3000        Median : 5.000         Median :3.000         
##  Mean   :-0.1642        Mean   : 5.406         Mean   :3.017         
##  3rd Qu.: 0.0000        3rd Qu.: 8.000         3rd Qu.:4.000         
##  Max.   : 3.6000        Max.   :12.000         Max.   :6.000         
##                         NA's   :1              NA's   :1             
##  ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
##  Min.   : 0.000         Min.   :   0           Min.   :   0          
##  1st Qu.: 4.000         1st Qu.:4832           1st Qu.:6020          
##  Median : 8.000         Median :4855           Median :6047          
##  Mean   : 8.834         Mean   :4828           Mean   :6016          
##  3rd Qu.:14.000         3rd Qu.:4877           3rd Qu.:6070          
##  Max.   :23.000         Max.   :4990           Max.   :6161          
##  NA's   :1              NA's   :5              NA's   :5             
##  ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
##  Min.   :   0           Min.   : 0.000         Min.   : 0.00         
##  1st Qu.:4560           1st Qu.: 0.000         1st Qu.:19.70         
##  Median :4587           Median :10.400         Median :19.90         
##  Mean   :4563           Mean   : 6.592         Mean   :20.01         
##  3rd Qu.:4609           3rd Qu.:10.750         3rd Qu.:20.40         
##  Max.   :4710           Max.   :11.500         Max.   :22.00         
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
##  Min.   : 0.000         Min.   : 0.00          Min.   :143.0         
##  1st Qu.: 8.800         1st Qu.:70.10          1st Qu.:155.0         
##  Median : 9.100         Median :70.80          Median :158.0         
##  Mean   : 9.161         Mean   :70.18          Mean   :158.5         
##  3rd Qu.: 9.700         3rd Qu.:71.40          3rd Qu.:162.0         
##  Max.   :11.200         Max.   :72.50          Max.   :173.0         
##  NA's   :5              NA's   :5                                    
##  ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
##  Min.   :56.00          Min.   :2.300          Min.   :463.0         
##  1st Qu.:62.00          1st Qu.:2.500          1st Qu.:490.0         
##  Median :64.00          Median :2.500          Median :495.0         
##  Mean   :63.54          Mean   :2.494          Mean   :495.6         
##  3rd Qu.:65.00          3rd Qu.:2.500          3rd Qu.:501.5         
##  Max.   :70.00          Max.   :2.600          Max.   :522.0         
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
##  Min.   :0.01700        Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.01900        1st Qu.:0.700          1st Qu.:2.000         
##  Median :0.02000        Median :1.000          Median :3.000         
##  Mean   :0.01957        Mean   :1.014          Mean   :2.534         
##  3rd Qu.:0.02000        3rd Qu.:1.300          3rd Qu.:3.000         
##  Max.   :0.02200        Max.   :2.300          Max.   :3.000         
##  NA's   :5                                                           
##  ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
##  Min.   :0.000          Min.   :0.00000        Min.   :0.00000       
##  1st Qu.:7.100          1st Qu.:0.00000        1st Qu.:0.00000       
##  Median :7.200          Median :0.00000        Median :0.00000       
##  Mean   :6.851          Mean   :0.01771        Mean   :0.02371       
##  3rd Qu.:7.300          3rd Qu.:0.00000        3rd Qu.:0.00000       
##  Max.   :7.500          Max.   :0.10000        Max.   :0.20000       
##                         NA's   :1              NA's   :1             
##  ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
##  Min.   : 0.00          Min.   : 0.0000        Min.   :0.000         
##  1st Qu.:11.40          1st Qu.: 0.6000        1st Qu.:1.800         
##  Median :11.60          Median : 0.8000        Median :1.900         
##  Mean   :11.21          Mean   : 0.9119        Mean   :1.805         
##  3rd Qu.:11.70          3rd Qu.: 1.0250        3rd Qu.:1.900         
##  Max.   :12.10          Max.   :11.0000        Max.   :2.100         
##                                                                      
##  ManufacturingProcess45
##  Min.   :0.000         
##  1st Qu.:2.100         
##  Median :2.200         
##  Mean   :2.138         
##  3rd Qu.:2.300         
##  Max.   :2.600         
## 
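
The full summary is long, so a more compact way to see where values are missing is to count NA's per column. The sketch below uses a small made-up data frame (toy, ProcessA, and ProcessB are hypothetical names); the same two lines should work on ChemicalManufacturingProcess.

```r
# Hypothetical data frame with a few missing cells.
toy <- data.frame(
  Yield    = c(38.0, 42.4, 42.0, 41.4),
  ProcessA = c(NA, 0.0, 0.0, 10.7),
  ProcessB = c(NA, NA, 917, 912)
)

# Number of missing values per column, keeping only columns that have any.
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]

# Overall fraction of missing cells.
mean(is.na(toy))
```
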

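K-nearest neighbors imputation is applied with the preProcess function to fill in the missing values. Note that method = 'knnImpute' also centers and scales every column, and because the whole data frame is passed in, Yield is standardized along with the predictors.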

(cmp_knn_impute <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute')))
## Created from 152 samples and 58 variables
## 
## Pre-processing:
##   - centered (58)
##   - ignored (0)
##   - 5 nearest neighbor imputation (58)
##   - scaled (58)
cmp_df <- predict(cmp_knn_impute, ChemicalManufacturingProcess)
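
To illustrate the idea behind nearest-neighbor imputation, here is a rough base-R sketch on a toy matrix. It is not caret's implementation: the function knn_impute_row and the data are hypothetical, and the centering and scaling that preProcess performs before computing distances are omitted for brevity.

```r
# Toy data: 6 rows, 3 columns, one missing cell (hypothetical values).
x <- matrix(c(1.0, 2.0, 3.0,
              1.1, 2.1, 3.1,
              0.9, 1.9, NA,
              5.0, 6.0, 7.0,
              5.1, 6.1, 7.1,
              4.9, 5.9, 6.9),
            ncol = 3, byrow = TRUE)

knn_impute_row <- function(row, complete_rows, k) {
  miss <- is.na(row)
  if (!any(miss)) return(row)
  # Euclidean distance to each complete row, using only the observed columns.
  diffs <- sweep(complete_rows[, !miss, drop = FALSE], 2, row[!miss])
  d <- sqrt(rowSums(diffs^2))
  nn <- order(d)[seq_len(min(k, nrow(complete_rows)))]
  # Replace each missing cell with the mean of that column over the neighbors.
  row[miss] <- colMeans(complete_rows[nn, miss, drop = FALSE])
  row
}

# Neighbors are drawn only from rows without any missing values.
complete_rows <- x[complete.cases(x), , drop = FALSE]
x_imputed <- t(apply(x, 1, knn_impute_row, complete_rows = complete_rows, k = 2))
x_imputed[3, ]
```

The third row is closest to the first two rows, so its missing cell is filled with the average of their third-column values.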

We no longer see any NA’s.

summary(cmp_df)
##      Yield         BiologicalMaterial01 BiologicalMaterial02
##  Min.   :-2.6692   Min.   :-2.5653      Min.   :-2.1858     
##  1st Qu.:-0.7716   1st Qu.:-0.6078      1st Qu.:-0.7457     
##  Median :-0.1119   Median :-0.1491      Median :-0.1484     
##  Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.0000     
##  3rd Qu.: 0.7035   3rd Qu.: 0.6423      3rd Qu.: 0.7557     
##  Max.   : 3.3394   Max.   : 3.3597      Max.   : 2.2459     
##  BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
##  Min.   :-2.6830      Min.   :-1.6731      Min.   :-2.90576    
##  1st Qu.:-0.6811      1st Qu.:-0.6222      1st Qu.:-0.73944    
##  Median :-0.1212      Median :-0.1405      Median :-0.05891    
##  Mean   : 0.0000      Mean   : 0.0000      Mean   : 0.00000    
##  3rd Qu.: 0.6804      3rd Qu.: 0.4907      3rd Qu.: 0.70568    
##  Max.   : 2.6355      Max.   : 6.0523      Max.   : 3.38985    
##  BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
##  Min.   :-2.2184      Min.   :-0.1313      Min.   :-2.38535    
##  1st Qu.:-0.7622      1st Qu.:-0.1313      1st Qu.:-0.64225    
##  Median :-0.1202      Median :-0.1313      Median : 0.02249    
##  Mean   : 0.0000      Mean   : 0.0000      Mean   : 0.00000    
##  3rd Qu.: 0.6499      3rd Qu.:-0.1313      3rd Qu.: 0.56906    
##  Max.   : 2.7948      Max.   : 7.5723      Max.   : 2.43034    
##  BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
##  Min.   :-3.39629     Min.   :-1.7202      Min.   :-2.3116     
##  1st Qu.:-0.59627     1st Qu.:-0.5685      1st Qu.:-0.6505     
##  Median :-0.03627     Median :-0.1513      Median :-0.1811     
##  Mean   : 0.00000     Mean   : 0.0000      Mean   : 0.0000     
##  3rd Qu.: 0.67428     3rd Qu.: 0.3161      3rd Qu.: 0.5491     
##  Max.   : 2.96246     Max.   : 6.7920      Max.   : 2.4431     
##  BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
##  Min.   :-2.3914      Min.   :-6.149703      Min.   :-1.969253     
##  1st Qu.:-0.6074      1st Qu.:-0.223563      1st Qu.: 0.308956     
##  Median :-0.1033      Median : 0.105667      Median : 0.509627     
##  Mean   : 0.0000      Mean   : 0.001224      Mean   : 0.009518     
##  3rd Qu.: 0.7112      3rd Qu.: 0.503487      3rd Qu.: 0.568648     
##  Max.   : 2.5986      Max.   : 1.587202      Max.   : 0.686690     
##  ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
##  Min.   :-3.10582       Min.   :-3.323233      Min.   :-2.577803     
##  1st Qu.:-0.42705       1st Qu.:-0.613828      1st Qu.:-0.487046     
##  Median : 0.37658       Median : 0.342432      Median :-0.086583     
##  Mean   : 0.04123       Mean   : 0.003213      Mean   :-0.002534     
##  3rd Qu.: 0.46587       3rd Qu.: 0.661186      3rd Qu.: 0.230347     
##  Max.   : 2.69818       Max.   : 2.254953      Max.   : 5.686954     
##  ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
##  Min.   :-1.630631      Min.   :-0.9580199     Min.   :-1.111973     
##  1st Qu.:-0.630408      1st Qu.:-0.9580199     1st Qu.:-1.111973     
##  Median :-0.222910      Median :-0.9580199     Median : 0.894164     
##  Mean   :-0.006574      Mean   :-0.0009072     Mean   :-0.001759     
##  3rd Qu.: 0.480950      3rd Qu.: 1.0378549     3rd Qu.: 0.894164     
##  Max.   : 7.408415      Max.   : 1.0378549     Max.   : 0.894164     
##  ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
##  Min.   :-4.37787       Min.   :-2.18999       Min.   :-2.63442      
##  1st Qu.:-0.49799       1st Qu.:-0.62482       1st Qu.:-0.53867      
##  Median : 0.04519       Median :-0.10310       Median : 0.02020      
##  Mean   : 0.00000       Mean   : 0.02156       Mean   : 0.03163      
##  3rd Qu.: 0.55281       3rd Qu.: 0.54906       3rd Qu.: 0.71878      
##  Max.   : 2.39252       Max.   : 3.15768       Max.   : 2.95425      
##  ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
##  Min.   :-0.480694      Min.   :-2.37172       Min.   :-2.803712     
##  1st Qu.:-0.480694      1st Qu.:-0.59881       1st Qu.:-0.488202     
##  Median :-0.480694      Median : 0.09066       Median : 0.029921     
##  Mean   :-0.002731      Mean   : 0.00000       Mean   :-0.004071     
##  3rd Qu.:-0.480694      3rd Qu.: 0.68163       3rd Qu.: 0.520534     
##  Max.   : 2.068439      Max.   : 4.03046       Max.   : 3.688885     
##  ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
##  Min.   :-2.3137        Min.   :-12.98219      Min.   :-2.43850      
##  1st Qu.:-0.4960        1st Qu.: -0.01436      1st Qu.:-0.67597      
##  Median :-0.1273        Median :  0.06312      Median : 0.04507      
##  Mean   : 0.0000        Mean   :  0.00000      Mean   : 0.00000      
##  3rd Qu.: 0.3786        3rd Qu.:  0.15126      3rd Qu.: 0.60587      
##  Max.   : 3.3283        Max.   :  0.81376      Max.   : 4.53150      
##  ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
##  Min.   :-13.08836      Min.   :-3.0321        Min.   :-13.05542     
##  1st Qu.:  0.00903      1st Qu.:-0.6022        1st Qu.: -0.01063     
##  Median :  0.06890      Median :-0.1360        Median :  0.07318     
##  Mean   :  0.00000      Mean   : 0.0000        Mean   :  0.00000     
##  3rd Qu.:  0.14237      3rd Qu.: 0.4838        3rd Qu.:  0.15197     
##  Max.   :  0.43899      Max.   : 2.5846        Max.   :  0.58033     
##  ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
##  Min.   :-2.1018        Min.   :-1.6230324     Min.   :-1.814768     
##  1st Qu.:-0.5599        1st Qu.:-0.7223009     1st Qu.:-0.611797     
##  Median :-0.1745        Median :-0.1218132     Median :-0.010311     
##  Mean   : 0.0000        Mean   : 0.0003314     Mean   : 0.004726     
##  3rd Qu.: 0.2110        3rd Qu.: 0.7789183     3rd Qu.: 0.591175     
##  Max.   : 4.8365        Max.   : 1.9798937     Max.   : 1.794146     
##  ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
##  Min.   :-1.523304      Min.   :-12.927496     Min.   :-12.940454    
##  1st Qu.:-0.833580      1st Qu.:  0.002208     1st Qu.:  0.008505    
##  Median :-0.143857      Median :  0.069146     Median :  0.063251    
##  Mean   : 0.005061      Mean   :  0.000105     Mean   :  0.000404    
##  3rd Qu.: 0.890729      3rd Qu.:  0.128720     3rd Qu.:  0.115417    
##  Max.   : 2.442608      Max.   :  0.433287     Max.   :  0.312785    
##  ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
##  Min.   :-12.888994     Min.   :-1.25583       Min.   :-12.026718    
##  1st Qu.:  0.000681     1st Qu.:-1.25583       1st Qu.: -0.186978    
##  Median :  0.066362     Median : 0.72551       Median : -0.066778    
##  Mean   :  0.001121     Mean   :-0.02868       Mean   : -0.003263    
##  3rd Qu.:  0.131337     3rd Qu.: 0.78266       3rd Qu.:  0.233723    
##  Max.   :  0.416660     Max.   : 0.93507       Max.   :  1.195326    
##  ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
##  Min.   :-9.38589       Min.   :-12.632749     Min.   :-2.86552      
##  1st Qu.:-0.37026       1st Qu.: -0.015263     1st Qu.:-0.64216      
##  Median : 0.03954       Median :  0.110732     Median :-0.08632      
##  Mean   : 0.01114       Mean   : -0.001722     Mean   : 0.00000      
##  3rd Qu.: 0.55179       3rd Qu.:  0.218728     3rd Qu.: 0.65480      
##  Max.   : 2.08855       Max.   :  0.416720     Max.   : 2.69287      
##  ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
##  Min.   :-3.037737      Min.   :-3.558813      Min.   :-3.01270      
##  1st Qu.:-0.621676      1st Qu.: 0.118269      1st Qu.:-0.51725      
##  Median : 0.183677      Median : 0.118269      Median :-0.05513      
##  Mean   :-0.005764      Mean   :-0.009176      Mean   :-0.02362      
##  3rd Qu.: 0.586354      3rd Qu.: 0.118269      3rd Qu.: 0.49941      
##  Max.   : 2.599738      Max.   : 1.956810      Max.   : 2.44032      
##  ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
##  Min.   :-2.944307      Min.   :-2.27741       Min.   :-3.9024       
##  1st Qu.:-0.655777      1st Qu.:-0.70467       1st Qu.:-0.8225       
##  Median :-0.083645      Median :-0.03064       Median : 0.7175       
##  Mean   :-0.008228      Mean   : 0.00000       Mean   : 0.0000       
##  3rd Qu.: 0.488487      3rd Qu.: 0.64339       3rd Qu.: 0.7175       
##  Max.   : 2.777017      Max.   : 2.89017       Max.   : 0.7175       
##  ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
##  Min.   :-4.5508        Min.   :-0.4626528     Min.   :-0.440588     
##  1st Qu.: 0.1653        1st Qu.:-0.4626528     1st Qu.:-0.440588     
##  Median : 0.2317        Median :-0.4626528     Median :-0.440588     
##  Mean   : 0.0000        Mean   : 0.0003392     Mean   :-0.000392     
##  3rd Qu.: 0.2982        3rd Qu.:-0.4626528     3rd Qu.:-0.440588     
##  Max.   : 0.4310        Max.   : 2.1490969     Max.   : 3.275213     
##  ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
##  Min.   :-5.77163       Min.   :-1.0506        Min.   :-5.60583      
##  1st Qu.: 0.09979       1st Qu.:-0.3594        1st Qu.:-0.01588      
##  Median : 0.20280       Median :-0.1290        Median : 0.29467      
##  Mean   : 0.00000       Mean   : 0.0000        Mean   : 0.00000      
##  3rd Qu.: 0.25430       3rd Qu.: 0.1303        3rd Qu.: 0.29467      
##  Max.   : 0.46031       Max.   :11.6224        Max.   : 0.91578      
##  ManufacturingProcess45
##  Min.   :-5.25447      
##  1st Qu.:-0.09356      
##  Median : 0.15220      
##  Mean   : 0.00000      
##  3rd Qu.: 0.39796      
##  Max.   : 1.13523
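
Scanning a 58-column summary for NA rows is easy to get wrong. A more compact check, sketched here in base R, counts the missing values per column. The helper name `count_missing` and the toy data frame are illustrative assumptions and not part of the original analysis; in this exercise the same calls would be made on `ChemicalManufacturingProcess` and `cmp_df`.

```r
# Hypothetical helper: number of NA values in each column that has any.
count_missing <- function(df) {
  n_missing <- colSums(is.na(df))
  n_missing[n_missing > 0]
}

# Toy data standing in for the real data frame (values are made up).
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

count_missing(toy)               # named vector covering columns a and b
anyNA(toy)                       # TRUE while missing values remain
anyNA(toy[, "c", drop = FALSE])  # FALSE for a complete column
```

After imputation, `anyNA(cmp_df)` returning FALSE would confirm the same thing the summary shows.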
  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
dim(cmp_df)
## [1] 176  58

Applying nearZeroVar drops one predictor, leaving 57 columns (Yield plus 56 predictors).

cmp_df2 <- cmp_df[, -nearZeroVar(cmp_df)]
dim(cmp_df2)
## [1] 176  57
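
One caveat with the negative-index idiom used above: when nearZeroVar flags nothing it returns an empty integer vector, and indexing with `-integer(0)` selects zero columns instead of all of them. The base R sketch below shows that behavior and a guarded alternative. `drop_columns` is a hypothetical helper name and the toy data frame is made up for illustration.

```r
# Hypothetical helper: drop the columns at positions `idx`,
# keeping everything when `idx` is empty.
drop_columns <- function(df, idx) {
  keep <- setdiff(seq_len(ncol(df)), idx)
  df[, keep, drop = FALSE]
}

toy <- data.frame(x = 1:3, y = 4:6)
flagged <- integer(0)              # what an "nothing flagged" result looks like

ncol(toy[, -flagged])              # negative indexing loses every column
ncol(drop_columns(toy, flagged))   # guarded version keeps both columns
ncol(drop_columns(toy, 2L))        # dropping one column still works
```

In this exercise one predictor is flagged, so the original line behaves as intended; the guard only matters when the filter finds nothing.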

Selecting on R^2 with 10-fold cross-validation, the best model uses ncomp = 3, with a resampled R^2 of 0.5753564, RMSE of 0.7947649, and MAE of 0.6188437.

set.seed(1)

train_selection <- createDataPartition(cmp_df2$Yield, times = 1, p = .80, list = FALSE)

train_x2 <- cmp_df2[train_selection, ][, -c(1)] #remove Yield
test_x2 <- cmp_df2[-train_selection, ][, -c(1)] #remove Yield
train_y2 <- cmp_df2[train_selection, ]$Yield
test_y2 <- cmp_df2[-train_selection, ]$Yield

(PLS_fit2 <- train(x = train_x2, y = train_y2,
                method = "pls",
                metric = "Rsquared",
                tuneLength = 25, 
                trControl = trainControl(method = "cv", number=10), 
                preProcess = c('center', 'scale')
          ))
## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 129, 130, 132, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.7835955  0.4798923  0.6435414
##    2     0.9037000  0.5302901  0.6625127
##    3     0.7947649  0.5753564  0.6188437
##    4     0.9309008  0.5337241  0.6538771
##    5     1.0876695  0.5205717  0.6974711
##    6     1.1939811  0.5064324  0.7262966
##    7     1.3384657  0.4934207  0.7748374
##    8     1.4551234  0.4911846  0.8123308
##    9     1.6262222  0.4866754  0.8584908
##   10     1.8605613  0.4732849  0.9097498
##   11     2.0180454  0.4569387  0.9452939
##   12     2.2476387  0.4465673  1.0009581
##   13     2.3997689  0.4432034  1.0361214
##   14     2.5216608  0.4399495  1.0769015
##   15     2.6136424  0.4415797  1.1110390
##   16     2.6609321  0.4413105  1.1272948
##   17     2.6973758  0.4430938  1.1321804
##   18     2.8105843  0.4439771  1.1693947
##   19     2.8235256  0.4417006  1.1698933
##   20     2.8828360  0.4395277  1.1837202
##   21     2.9284335  0.4322490  1.2039932
##   22     2.9829845  0.4253542  1.2243262
##   23     3.1163856  0.4173888  1.2704676
##   24     3.2500373  0.4069647  1.3146505
##   25     3.3925822  0.4039123  1.3556809
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.
plot(PLS_fit2)
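
The choice of ncomp depends on the metric passed to train. As a small illustration, the sketch below re-enters the first five rows of the resampling table printed above and picks the best row under each metric. The data frame name `res` is an assumption; with caret, the full table is available as `PLS_fit2$results`.

```r
# First five rows of the resampling table printed above.
res <- data.frame(
  ncomp    = 1:5,
  RMSE     = c(0.7835955, 0.9037000, 0.7947649, 0.9309008, 1.0876695),
  Rsquared = c(0.4798923, 0.5302901, 0.5753564, 0.5337241, 0.5205717),
  MAE      = c(0.6435414, 0.6625127, 0.6188437, 0.6538771, 0.6974711)
)

best_by_rsq  <- res$ncomp[which.max(res$Rsquared)]  # largest R^2
best_by_rmse <- res$ncomp[which.min(res$RMSE)]      # smallest RMSE
best_by_mae  <- res$ncomp[which.min(res$MAE)]       # smallest MAE

c(Rsquared = best_by_rsq, RMSE = best_by_rmse, MAE = best_by_mae)
```

R^2 and MAE both point to three components, while RMSE is slightly lower with a single component, so the answer to "how many components" is sensitive to the selection metric here.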

  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

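The test set R^2 is 0.6516400, which is higher than the resampled training estimate of 0.5753564. The test RMSE (0.5980589) is likewise lower than the resampled RMSE (0.7947649). With only 32 samples in the test set, this difference should be read cautiously.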

PLS_predict2 <- predict(PLS_fit2, newdata=test_x2)
(postResample(pred=PLS_predict2, obs=test_y2))
##      RMSE  Rsquared       MAE 
## 0.5980589 0.6516400 0.4721775
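
For reference, the three numbers reported by postResample can be reproduced by hand: RMSE is the root of the mean squared error, R^2 is the squared correlation between observed and predicted values, and MAE is the mean absolute error. The sketch below uses base R only; `metrics_by_hand` is a hypothetical helper name and the vectors are made-up values, not the exercise data.

```r
# Hypothetical helper mirroring the three regression metrics.
metrics_by_hand <- function(pred, obs) {
  c(RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,
    MAE      = mean(abs(obs - pred)))
}

# Made-up observed and predicted values for illustration.
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)

metrics_by_hand(pred, obs)
```

In the exercise the equivalent call would be `metrics_by_hand(PLS_predict2, test_y2)`, which should agree with the postResample output above.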
  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Manufacturing process predictors dominate the top 20 in the variable importance plot.

plot(varImp(PLS_fit2, scale = FALSE), top=20, scales = list(y = list(cex = 0.8)))
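
The "which group dominates" question can also be answered with a count instead of reading the plot. The sketch below ranks an importance table and tallies the top entries by name prefix. The importance values are made up for illustration; with caret, `varImp(PLS_fit2, scale = FALSE)$importance` would supply the real data frame, whose single column is named `Overall`.

```r
# Made-up importance scores; only the row-name prefixes matter here.
imp <- data.frame(
  Overall = c(0.09, 0.08, 0.07, 0.05),
  row.names = c("ManufacturingProcess32", "ManufacturingProcess36",
                "BiologicalMaterial06", "ManufacturingProcess13")
)

top_n  <- 3
ranked <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]
top    <- head(ranked, top_n)

# Classify each top predictor by its name prefix and count the groups.
group  <- ifelse(grepl("^Manufacturing", rownames(top)), "Process", "Biological")
counts <- table(group)
counts
```

Setting `top_n` to 20 on the real importance table would give the tally behind the plot.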

  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The top 3 predictors are ManufacturingProcess32, ManufacturingProcess36, and ManufacturingProcess13.

ManufacturingProcess32 has a fairly strong positive correlation with Yield, so increasing it may raise the yield, although correlation alone does not establish cause. ManufacturingProcess32 and ManufacturingProcess36 appear to be negatively correlated with each other. ManufacturingProcess36 and ManufacturingProcess13 are both moderately correlated with Yield.

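# dplyr::select is called explicitly so the line does not depend on which packages are attached
top_vars <- c('ManufacturingProcess32', 'ManufacturingProcess36', 'ManufacturingProcess13', 'Yield')
correlation <- cor(dplyr::select(cmp_df2, dplyr::all_of(top_vars)))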
corrplot::corrplot(correlation, method='pie', type="upper")
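
The pie plot shows strength but makes the sign of each relationship harder to read. A numeric listing of each predictor's correlation with the response, sorted by absolute value, is one way to make the direction explicit. The sketch below is base R on a made-up data frame; `cor_with_response` is a hypothetical helper, and the columns P1 to P3 stand in for the top predictors.

```r
# Made-up data: P1 rises with Yield, P2 falls, P3 is noisier.
toy <- data.frame(
  Yield = c(1, 2, 3, 4, 5),
  P1    = c(2, 4, 6, 8, 10),
  P2    = c(5, 4, 3, 2, 1),
  P3    = c(1, 3, 2, 5, 4)
)

# Hypothetical helper: correlation of every other column with `response`,
# ordered from strongest to weakest in absolute value.
cor_with_response <- function(df, response) {
  preds <- setdiff(names(df), response)
  r <- cor(df[, preds, drop = FALSE], df[[response]])[, 1]
  r[order(abs(r), decreasing = TRUE)]
}

r <- cor_with_response(toy, "Yield")
r
```

Applied to `cmp_df2` restricted to the top predictors and Yield, positive entries would suggest process settings to raise and negative entries settings to lower, subject to the usual caution that these are observational correlations.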