Chapter 6, Kuhn and Johnson, Applied Predictive Modeling

6.2.

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

6.2.a

Start R and use these commands to load the data:

data(permeability)
summary(permeability)

##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

str(fingerprints)

##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...

6.2.b

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

After removing the predictors with low frequencies there appear to be 388 predictors remaining out of the original 1107.

# identify predictors with low frequencies
lowFreqPredictors <- nearZeroVar(fingerprints)

# remove the above set from the data and store the result in a dataframe
fingerprintsPred <- fingerprints[,-lowFreqPredictors]

# see what is left
str(fingerprintsPred)

##  num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...

6.2.c

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

Based on the PLS model created using the training set 3 components seem optimal with a corresponding estimate of \(R^2 = 0.4657791\).

set.seed(132435)

# pre-process the data to center and scale it
pred_prep <- preProcess(fingerprintsPred, method = c("center","scale"))
resp_preo <- predict(pred_prep, fingerprintsPred)

# create a set of randomly selected training rows containing 75% of the data
trainingRows <- createDataPartition(permeability, p = 0.75, list = FALSE)

# use the training rows to split the data into training and test sets for the predictor and the response
pred_train <- resp_preo[trainingRows,]
pred_test <- resp_preo[-trainingRows,]

resp_train <- permeability[trainingRows,]
resp_test <- permeability[-trainingRows,]

# create the PLS model with the training data to maximize covariance between the
# predictor and target variables using 10-fold cross validation and measure using R2
pls_model <- train(pred_train, resp_train
                   , method = "pls"
                   , metric = "Rsquared"
                   , tuneLength = 20
                   , trControl = trainControl(method = "cv", number = 10))

pls_model

## Partial Least Squares 
## 
## 125 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 113, 113, 111, 112, 113, 112, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.26533  0.2980803  10.009040
##    2     12.22901  0.4302612   8.707710
##    3     12.15172  0.4657791   9.036021
##    4     12.40468  0.4433031   9.404771
##    5     12.19534  0.4458150   9.080123
##    6     12.22558  0.4285594   9.156984
##    7     12.21694  0.4349349   9.241427
##    8     12.14031  0.4492858   9.415098
##    9     12.51810  0.4403712   9.670983
##   10     12.77960  0.4338531   9.857639
##   11     13.29300  0.4067931  10.252781
##   12     13.83929  0.3754990  10.843743
##   13     14.32021  0.3516469  11.032974
##   14     14.53883  0.3494497  11.059203
##   15     14.75606  0.3444324  11.176290
##   16     15.29865  0.3228011  11.459704
##   17     15.85285  0.3035451  11.792657
##   18     16.11621  0.3059340  11.914719
##   19     16.25239  0.2954203  12.019663
##   20     16.63190  0.2699180  12.227163
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.

We can see visually that the model with three variables returns the highest R-squared value.

ggplot(pls_model)

6.2.d

Predict the response for the test set. What is the test set estimate of \(R^2\)?

The \(R^2\) estimate for the test set is 0.5520256, which is even better than the training result.

# predict the test set response for the PLS model
pls_test <- predict(pls_model, newdata = pred_test)
postResample(pred=pls_test, obs = resp_test)

##       RMSE   Rsquared        MAE 
## 11.6075535  0.5520256  9.0000542

6.2.e

Try building other models discussed in this chapter. Do any have better predictive performance?

Trying principal component analysis the best fit is 8 components with \(R^2 = 0.4032165\).

set.seed(45678)

pcrFit <- train(x=pred_train,
                  y=resp_train,
                  method='pcr',
                  metric = "Rsquared",
                  tuneLength = 10,
                  trControl = trainControl('cv'),
                  preProc=c('center','scale')
                  )
pcrFit

## Principal Component Analysis 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 113, 113, 111, 112, 113, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     14.54915  0.1259773  11.224252
##    2     14.61353  0.1240144  11.302737
##    3     13.25980  0.3219973  10.009028
##    4     12.30918  0.4003002   9.298525
##    5     12.68988  0.3837063   9.577919
##    6     12.75886  0.3819379   9.421907
##    7     12.71902  0.3736417   9.484800
##    8     12.53080  0.4032165   9.171209
##    9     12.72833  0.3631219   9.337410
##   10     12.46369  0.3861440   9.127543
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 8.

ggplot(pcrFit)

Using ridge regression the best fit is where labmda = 0.1 with an \(R^2 = 0.4438373\).

set.seed(45689)

ridgeGrid <- data.frame(lambda = seq(0, .1, length = 10))

ridgeFit <- train(x=pred_train,
                  y=resp_train,
                  method='ridge',
                  metric = "Rsquared",
                  tuneGrid = ridgeGrid,
                  trControl = trainControl('cv'),
                  preProc=c('center','scale')
                  )
ridgeFit

## Ridge Regression 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 113, 113, 113, 113, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE      
##   0.00000000  28.08335  0.3914377  15.374091
##   0.01111111  14.88808  0.3610279  11.094464
##   0.02222222  14.02495  0.3962475  10.345275
##   0.03333333  13.57048  0.4129958  10.032558
##   0.04444444  13.33585  0.4227866   9.867680
##   0.05555556  13.17754  0.4299039   9.770660
##   0.06666667  13.07850  0.4343141   9.718247
##   0.07777778  13.00843  0.4381398   9.675278
##   0.08888889  12.94566  0.4412164   9.636646
##   0.10000000  12.89890  0.4438373   9.606251
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.1.

ggplot(ridgeFit)

6.2.f

Would you recommend any of your models to replace the permeability laboratory experiment?

The PLS model appears to have the best performance where \(R^2 = 0.4657791\).

6.3.

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

6.3.a

Start R and use these commands to load the data:

data("ChemicalManufacturingProcess")
chem_mfg <- ChemicalManufacturingProcess

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

Looking at the described strucutre of the data.

str(chem_mfg)

## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.6 14.6 14.6 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...

6.3.b

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Since biological predictors cannot be changed, it is assumed that we can only impute missing values on the manufacturing process predictors. Checking the columns that contain the word ‘Biological’ in the label, there are none that have missing values.

# check the biologial predictors for missingness
bio <- chem_mfg %>%
  select(starts_with("Biological"))

names(which(colSums(is.na(bio))>0))

## character(0)

Checking the ‘Manufacturing’ variables for missingness we find 28 variables that have NA values.

# check the Manufacturing predictors for missingness
mfg <- chem_mfg %>%
  select(starts_with("Manufacturing"))

names(which(colSums(is.na(mfg))>0))

##  [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
##  [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"
##  [7] "ManufacturingProcess07" "ManufacturingProcess08" "ManufacturingProcess10"
## [10] "ManufacturingProcess11" "ManufacturingProcess12" "ManufacturingProcess14"
## [13] "ManufacturingProcess22" "ManufacturingProcess23" "ManufacturingProcess24"
## [16] "ManufacturingProcess25" "ManufacturingProcess26" "ManufacturingProcess27"
## [19] "ManufacturingProcess28" "ManufacturingProcess29" "ManufacturingProcess30"
## [22] "ManufacturingProcess31" "ManufacturingProcess33" "ManufacturingProcess34"
## [25] "ManufacturingProcess35" "ManufacturingProcess36" "ManufacturingProcess40"
## [28] "ManufacturingProcess41"

For most of the variables with NA’s the median and mean are very close, so imputing the mean would be the best approach. Here are the stats prior to imputation.

summary(mfg)

##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.30          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.54          
##  Mean   :11.21          Mean   :16.68          Mean   :1.54          
##  3rd Qu.:12.15          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  NA's   :1              NA's   :3              NA's   :15            
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.2         Median :206.8         
##  Mean   :931.9          Mean   :1001.7         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.9         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  NA's   :1              NA's   :1              NA's   :2             
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.100         Median : 9.400         Median :   0.0        
##  Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
##  3rd Qu.: 9.550         3rd Qu.: 9.900         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  NA's   :9              NA's   :10             NA's   :1             
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4854           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##                         NA's   :1                                    
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##                                                                      
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##                                                                      
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  NA's   :1              NA's   :1              NA's   :1             
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4832           1st Qu.:6020           1st Qu.:4560          
##  Median :4855           Median :6047           Median :4587          
##  Mean   :4828           Mean   :6016           Mean   :4563          
##  3rd Qu.:4877           3rd Qu.:6070           3rd Qu.:4609          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
##  Median :10.400         Median :19.90          Median : 9.100        
##  Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
##  3rd Qu.:10.750         3rd Qu.:20.40          3rd Qu.: 9.700        
##  Max.   :11.500         Max.   :22.00          Max.   :11.200        
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.80          Median :158.0          Median :64.00         
##  Mean   :70.18          Mean   :158.5          Mean   :63.54         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  NA's   :5                                     NA's   :5             
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :495.0          Median :0.02000       
##  Mean   :2.494          Mean   :495.6          Mean   :0.01957       
##  3rd Qu.:2.500          3rd Qu.:501.5          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##                                                                      
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600         
##

and here are the stats after imputation. The distributions hardly changed at all, and now there are no NA values.

mfg <- mfg %>%
  mutate(ManufacturingProcess01 = ifelse(is.na(ManufacturingProcess01), mean(ManufacturingProcess01, na.rm = TRUE), ManufacturingProcess01)) %>%
  mutate(ManufacturingProcess02 = ifelse(is.na(ManufacturingProcess02), mean(ManufacturingProcess02, na.rm = TRUE), ManufacturingProcess02)) %>%
  mutate(ManufacturingProcess03 = ifelse(is.na(ManufacturingProcess03), mean(ManufacturingProcess03, na.rm = TRUE), ManufacturingProcess03)) %>%
  mutate(ManufacturingProcess04 = ifelse(is.na(ManufacturingProcess04), mean(ManufacturingProcess04, na.rm = TRUE), ManufacturingProcess04)) %>%
  mutate(ManufacturingProcess05 = ifelse(is.na(ManufacturingProcess05), mean(ManufacturingProcess05, na.rm = TRUE), ManufacturingProcess05)) %>%
  mutate(ManufacturingProcess06 = ifelse(is.na(ManufacturingProcess06), mean(ManufacturingProcess06, na.rm = TRUE), ManufacturingProcess06)) %>%
  mutate(ManufacturingProcess07 = ifelse(is.na(ManufacturingProcess07), mean(ManufacturingProcess07, na.rm = TRUE), ManufacturingProcess07)) %>%
  mutate(ManufacturingProcess08 = ifelse(is.na(ManufacturingProcess08), mean(ManufacturingProcess08, na.rm = TRUE), ManufacturingProcess08)) %>%
  mutate(ManufacturingProcess10 = ifelse(is.na(ManufacturingProcess10), mean(ManufacturingProcess10, na.rm = TRUE), ManufacturingProcess10)) %>%
  mutate(ManufacturingProcess11 = ifelse(is.na(ManufacturingProcess11), mean(ManufacturingProcess11, na.rm = TRUE), ManufacturingProcess11)) %>%
  mutate(ManufacturingProcess12 = ifelse(is.na(ManufacturingProcess12), mean(ManufacturingProcess12, na.rm = TRUE), ManufacturingProcess12)) %>%
  mutate(ManufacturingProcess14 = ifelse(is.na(ManufacturingProcess14), mean(ManufacturingProcess14, na.rm = TRUE), ManufacturingProcess14)) %>%
  mutate(ManufacturingProcess22 = ifelse(is.na(ManufacturingProcess22), mean(ManufacturingProcess22, na.rm = TRUE), ManufacturingProcess22)) %>%
  mutate(ManufacturingProcess23 = ifelse(is.na(ManufacturingProcess23), mean(ManufacturingProcess23, na.rm = TRUE), ManufacturingProcess23)) %>%
  mutate(ManufacturingProcess24 = ifelse(is.na(ManufacturingProcess24), mean(ManufacturingProcess24, na.rm = TRUE), ManufacturingProcess24)) %>%
  mutate(ManufacturingProcess25 = ifelse(is.na(ManufacturingProcess25), mean(ManufacturingProcess25, na.rm = TRUE), ManufacturingProcess25)) %>%
  mutate(ManufacturingProcess26 = ifelse(is.na(ManufacturingProcess26), mean(ManufacturingProcess26, na.rm = TRUE), ManufacturingProcess26)) %>%
  mutate(ManufacturingProcess27 = ifelse(is.na(ManufacturingProcess27), mean(ManufacturingProcess27, na.rm = TRUE), ManufacturingProcess27)) %>%
  mutate(ManufacturingProcess28 = ifelse(is.na(ManufacturingProcess28), mean(ManufacturingProcess28, na.rm = TRUE), ManufacturingProcess28)) %>%
  mutate(ManufacturingProcess29 = ifelse(is.na(ManufacturingProcess29), mean(ManufacturingProcess29, na.rm = TRUE), ManufacturingProcess29)) %>%
  mutate(ManufacturingProcess30 = ifelse(is.na(ManufacturingProcess30), mean(ManufacturingProcess30, na.rm = TRUE), ManufacturingProcess30)) %>%
  mutate(ManufacturingProcess31 = ifelse(is.na(ManufacturingProcess31), mean(ManufacturingProcess31, na.rm = TRUE), ManufacturingProcess31)) %>%
  mutate(ManufacturingProcess33 = ifelse(is.na(ManufacturingProcess33), mean(ManufacturingProcess33, na.rm = TRUE), ManufacturingProcess33)) %>%
  mutate(ManufacturingProcess34 = ifelse(is.na(ManufacturingProcess34), mean(ManufacturingProcess34, na.rm = TRUE), ManufacturingProcess34)) %>%
  mutate(ManufacturingProcess35 = ifelse(is.na(ManufacturingProcess35), mean(ManufacturingProcess35, na.rm = TRUE), ManufacturingProcess35)) %>%
  mutate(ManufacturingProcess36 = ifelse(is.na(ManufacturingProcess36), mean(ManufacturingProcess36, na.rm = TRUE), ManufacturingProcess36)) %>%
  mutate(ManufacturingProcess40 = ifelse(is.na(ManufacturingProcess40), mean(ManufacturingProcess40, na.rm = TRUE), ManufacturingProcess40)) %>%
  mutate(ManufacturingProcess41 = ifelse(is.na(ManufacturingProcess41), mean(ManufacturingProcess41, na.rm = TRUE), ManufacturingProcess41))

summary(mfg)

##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.00          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.54          
##  Mean   :11.21          Mean   :16.68          Mean   :1.54          
##  3rd Qu.:12.12          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.4         Median :206.8         
##  Mean   :931.9          Mean   :1001.7         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.7         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.100         Median : 9.386         Median :   0.0        
##  Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
##  3rd Qu.: 9.500         3rd Qu.: 9.825         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4854           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4829           1st Qu.:6019           1st Qu.:4562          
##  Median :4854           Median :6045           Median :4586          
##  Mean   :4828           Mean   :6016           Mean   :4563          
##  3rd Qu.:4876           3rd Qu.:6069           3rd Qu.:4609          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
##  Median :10.400         Median :20.00          Median : 9.161        
##  Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
##  3rd Qu.:10.700         3rd Qu.:20.40          3rd Qu.: 9.700        
##  Max.   :11.500         Max.   :22.00          Max.   :11.200        
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.75          Median :158.0          Median :64.00         
##  Mean   :70.18          Mean   :158.5          Mean   :63.54         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :495.6          Median :0.01957       
##  Mean   :2.494          Mean   :495.6          Mean   :0.01957       
##  3rd Qu.:2.500          3rd Qu.:501.0          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600

Put the data all back together.

chem_mfg <- cbind(chem_mfg$Yield,bio,mfg)

names(chem_mfg)[1] <- "Yield"

Check for low frequency values and remove them. Only one variable was removed

# identify predictors with low frequencies
lowFreqPredictors <- nearZeroVar(chem_mfg)

# remove the above set from the data and store the result in a dataframe
chem_mfgPred <- chem_mfg[,-lowFreqPredictors]

# see what is left
str(chem_mfgPred)

## 'data.frame':    176 obs. of  57 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.6 14.6 14.6 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  11.2 0 0 0 10.7 ...
##  $ ManufacturingProcess02: num  16.7 0 0 0 0 ...
##  $ ManufacturingProcess03: num  1.54 1.54 1.54 1.54 1.54 ...
##  $ ManufacturingProcess04: num  932 917 912 911 918 ...
##  $ ManufacturingProcess05: num  1002 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  207 210 207 213 206 ...
##  $ ManufacturingProcess07: num  177 177 178 177 178 ...
##  $ ManufacturingProcess08: num  178 178 178 177 178 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  9.18 9.18 9.18 9.18 9.18 ...
##  $ ManufacturingProcess11: num  9.39 9.39 9.39 9.39 9.39 ...
##  $ ManufacturingProcess12: num  858 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  5.41 3 4 5 8 ...
##  $ ManufacturingProcess23: num  3.02 0 1 2 4 ...
##  $ ManufacturingProcess24: num  8.83 3 4 5 18 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  0.0177 0.1 0 0 0 ...
##  $ ManufacturingProcess41: num  0.0237 0.15 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...

6.3.c

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

The optimal value of the performance metric is \(R^2 = 0.4919181\).

set.seed(86756453)

trainingRows <- sample(c(rep(0, 0.75 * nrow(chem_mfgPred)), 
                  rep(1, 0.25 * nrow(chem_mfgPred))))

pred_train2 <- chem_mfgPred[trainingRows == 0,]
pred_test2 <- chem_mfgPred[trainingRows == 1,]

# create the PLS model with the training data to maximize covariance between the
# predictor and target variables using 10-fold cross validation and measure using R2
pls_model2 <- train(Yield ~ .
                   , pred_train2
                   , method = "pls"
                   , metric = "Rsquared"
                   , tuneLength = 20
                   , trControl = trainControl(method = "cv")
                   , preProc=c('center','scale')
                   )

pls_model2

## Partial Least Squares 
## 
## 132 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 119, 119, 120, 117, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.525766  0.4118582  1.196626
##    2     2.055335  0.4045343  1.286997
##    3     1.545976  0.4919181  1.101313
##    4     1.869902  0.4657163  1.214377
##    5     2.231519  0.4544057  1.316088
##    6     2.420712  0.4466847  1.360377
##    7     2.465001  0.4324138  1.403582
##    8     2.796541  0.4195232  1.514503
##    9     2.922994  0.4077897  1.584853
##   10     3.056973  0.4085545  1.642520
##   11     3.149211  0.3918894  1.672440
##   12     3.095849  0.3891789  1.647143
##   13     3.064822  0.3830482  1.657215
##   14     3.018326  0.3847448  1.648539
##   15     2.827756  0.3942972  1.596415
##   16     2.584247  0.4051255  1.518665
##   17     2.294588  0.4204892  1.430888
##   18     2.169118  0.4373245  1.377378
##   19     2.070201  0.4486404  1.345298
##   20     2.303367  0.3917599  1.443707
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.

…which can be seen graphically in the plot below.

ggplot(pls_model2)

6.3.d

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The \(R^2\) estimate for the test set is 0.5412972. Once again the test set performed better than the training set.

# predict the test set response for the PLS model
pls_test2 <- predict(pls_model2, newdata = pred_test2)
postResample(pred=pls_test2, obs = pred_test2$Yield)

##      RMSE  Rsquared       MAE 
## 1.1952589 0.5412972 0.9793152

6.3.e

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

It appears that ManufacturingProcess 32 is the most important predictor in the model, and it appears that Manufacturing process predictors slighty edge out Biological predictors 12:8.

varImp(pls_model2)

## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   84.70
## ManufacturingProcess17   84.28
## ManufacturingProcess36   83.77
## ManufacturingProcess09   79.59
## BiologicalMaterial02     56.38
## ManufacturingProcess06   54.10
## ManufacturingProcess12   53.67
## BiologicalMaterial06     53.18
## ManufacturingProcess11   52.93
## BiologicalMaterial08     52.76
## ManufacturingProcess33   51.15
## ManufacturingProcess34   50.89
## BiologicalMaterial04     50.23
## BiologicalMaterial12     49.49
## BiologicalMaterial11     47.99
## BiologicalMaterial01     47.12
## BiologicalMaterial03     46.01
## ManufacturingProcess28   43.95
## ManufacturingProcess04   41.07

6.3.f

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The Yield response has a high correlation with quite a few of the predictors, BiologicalMaterial12 and ManufacturingProcess33 being among the strongest. Given the correlations in the plot below, manufacturing could experiment with adjusting the mix of processes to optimize production.

predictors <- chem_mfgPred %>%
  select('Yield', 'ManufacturingProcess32',
    'ManufacturingProcess36','ManufacturingProcess13',
    'ManufacturingProcess17','ManufacturingProcess09',
    'ManufacturingProcess12','ManufacturingProcess11',
    'ManufacturingProcess33','BiologicalMaterial02',
    'BiologicalMaterial08','BiologicalMaterial06',
    'BiologicalMaterial12')

corr_pred <- cor(predictors)

ggcorr(corr_pred,
       label = T,
       label_size = 2,
       label_round = 2,
       hjust = 1,
       size = 3,
       color = "royalblue",
       layout.exp = 5,
       low = "darkorange",
       mid = "gray95",
       high = "lightgreen",
       name = "Correlation")

Data 624 Predictive Analytics

Douglas Barley

4/10/2022