In Kuhn and Johnson, do problems 6.2 and 6.3. There are only two, but they consist of many parts. Please submit a link to your RPubs page and submit the .rmd file as well.

library(MASS)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(AppliedPredictiveModeling)
library(DataExplorer)
library(RANN)
library(ResourceSelection)
## ResourceSelection 0.3-6   2023-06-27

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  (a) Start R and use these commands to load the data:
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

  (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
dim(fingerprints)
## [1]  165 1107
dim(fingerprints[,-caret::nearZeroVar(fingerprints)])
## [1] 165 388
val_fp = fingerprints[,-caret::nearZeroVar(fingerprints)]

After cutting out the sparse columns, 388 predictors are left for the model to consider.
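As a quick sanity check, we can look at the diagnostics behind the filter. This is just a sketch assuming caret's default cutoffs (freqRatio > 95/5 and percentUnique < 10); saveMetrics = TRUE returns the per-predictor metrics that nearZeroVar uses to flag columns.

nzv_metrics <- caret::nearZeroVar(fingerprints, saveMetrics = TRUE)
# predictors with the most imbalanced 0/1 frequencies
head(nzv_metrics[order(-nzv_metrics$freqRatio), ], 5)
# TRUE = flagged as near-zero variance and dropped above
table(nzv_metrics$nzv)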

  (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
set.seed(12)
trainingRows <- createDataPartition(permeability, p = .80, list= FALSE)
train_finger <- val_fp[trainingRows,]
test_finger <- val_fp[-trainingRows,]
train_permY <- permeability[trainingRows,]
test_permY <- permeability[-trainingRows,]
set.seed(49)
tr_ctrl <- trainControl(method = "cv", number = 10)
    
plsFinger <- train(x=train_finger, y=train_permY,
                 method = "pls",tuneLength = 20,
                 trControl = tr_ctrl,preProcess = c("center", "scale"))
plsFinger
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 120, 121, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.84296  0.3645242  10.076404
##    2     11.35927  0.5255717   8.098447
##    3     11.30260  0.5221696   8.491675
##    4     11.61965  0.5171470   8.692696
##    5     11.47106  0.5136132   8.394059
##    6     11.29323  0.5178519   8.293291
##    7     11.32815  0.5189196   8.284300
##    8     11.37310  0.5317987   8.677746
##    9     11.50948  0.5189617   8.901416
##   10     11.72691  0.4993286   9.092009
##   11     11.82241  0.4897281   9.211926
##   12     11.95728  0.4883831   9.396215
##   13     12.14737  0.4784274   9.662778
##   14     12.46239  0.4629380   9.857318
##   15     12.76723  0.4464990  10.124604
##   16     13.12807  0.4260383  10.209452
##   17     13.34671  0.4168155  10.408683
##   18     13.45350  0.4121383  10.339464
##   19     13.75097  0.3955095  10.523665
##   20     13.85979  0.3931836  10.530734
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
summary(plsFinger)
## Data:    X dimension: 133 388 
##  Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 6
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X           22.78    34.31    39.98    45.08    52.85    59.13
## .outcome    35.93    54.74    61.60    67.27    70.61    73.39

Based on the summarized output of the 10-fold cross-validation, the PLS model found that the optimal number of components is 6, with a resampled \(R^2\) of roughly 51.8% and an RMSE of 11.29. While this number of components doesn't have the highest \(R^2\) (8 components does, at about 53.2%), the RMSE is probably a better means of judging these model variants.
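To show exactly where those numbers come from, the resampled metrics for the selected tuning value can be pulled straight from the fitted train object (a quick check using fields caret stores on the object):

# resampled RMSE / R^2 / MAE at the chosen number of components
subset(plsFinger$results, ncomp == plsFinger$bestTune$ncomp)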

The optimization can be further demonstrated with a standard graphic:

plot(plsFinger)

xyplot(train_permY ~ predict(plsFinger),
     type = c("p", "g"), xlab = "Predicted", ylab = "Observed")

The observed vs. predicted plot does show a roughly linear relationship, although the spread of the points fans out for predicted values between about 15 and 40 and tightens again beyond 40.
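A residual plot makes the fanning easier to see; this is a small sketch reusing the same training predictions as above.

xyplot((train_permY - predict(plsFinger)) ~ predict(plsFinger),
       type = c("p", "g"), xlab = "Predicted", ylab = "Residual")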

The one-standard-error rule could feasibly be used here, as Kuhn indicates that a simpler model with comparable resampled performance may be preferable.
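A sketch of how that could be done with caret (results not shown; they would differ slightly with new resamples): trainControl accepts selectionFunction = "oneSE", so that train picks the simplest ncomp within one standard error of the best RMSE.

tr_ctrl_1se <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
pls_1se <- train(x = train_finger, y = train_permY,
                 method = "pls", tuneLength = 20,
                 trControl = tr_ctrl_1se,
                 preProcess = c("center", "scale"))
pls_1se$bestTune  # typically fewer components than the RMSE-optimal fit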

  (d) Predict the response for the test set. What is the test set estimate of \(R^2\)?
pred_finger <- predict(plsFinger,test_finger)
pls_test <- caret::defaultSummary(data.frame(obs=test_permY,pred=pred_finger))
pls_test
##       RMSE   Rsquared        MAE 
## 13.7878339  0.3262867  9.6809449

The test set \(R^2\) is 32.63%, which is substantially worse than the resampled performance on the training set. The RMSE is also considerably larger on the unseen data, although to some extent that is to be expected.

  (e) Try building other models discussed in this chapter. Do any have better predictive performance?

Let’s try using penalized regression models as an alternative means of predicting the permeability:
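As a reminder of what is being tuned (standard textbook forms, not specific to this code), ridge and lasso both penalize the least-squares objective, differing only in the penalty:

\[
\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j}\beta_j^2,
\qquad
\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j}\left|\beta_j\right|
\]

The elastic net used later combines the two penalties, which is why its grid searches over both a lambda and an L1 fraction.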

set.seed(15)
# ridgeGS <- data.frame(.lambda = seq(0, 1, length = 15)) 


ridgeRegFit <- train(x=train_finger,
                  y=train_permY,
                  method='ridge',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
## Warning: model fit failed for Fold05: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
ridgeRegFit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 121, 119, 120, 119, 121, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     16.67673  0.3892225  12.039832
##   0.1     13.18946  0.4723426   9.542513
##   0.2     13.03888  0.4947783   9.375433
##   0.3     13.23126  0.5036644   9.450973
##   0.4     13.58231  0.5082904   9.699912
##   0.5     14.03206  0.5108524  10.049187
##   0.6     14.56629  0.5124485  10.544434
##   0.7     15.15949  0.5135351  11.081939
##   0.8     15.79820  0.5143557  11.655795
##   0.9     16.48145  0.5148649  12.295152
##   1.0     17.19775  0.5152141  12.952609
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.2.

After 10-fold cross-validation the model identified an optimal lambda penalty of 0.2, which minimized RMSE even though it did not maximize \(R^2\). (The warning for lambda = 0 is expected: with 388 predictors and only 133 training samples, the unpenalized fit is ill-posed.)

ridge_var_imp <- varImp(ridgeRegFit,scale=FALSE)
top_ridge_df <- head(ridge_var_imp$importance |> arrange(-Overall),25)

ggplot(data=top_ridge_df,aes(x=Overall,y=reorder(rownames(top_ridge_df),Overall))) +
    geom_bar(stat='identity') +
    labs(title='Most important variables in Ridge Regression',y='Predictors')

lfpFit <- train(x=train_finger,
                  y=train_permY,
                  method='lasso',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.01)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
## Warning: model fit failed for Fold02: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold08: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
lfpFit
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 118, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      15.84899        NaN  12.557052
##   0.01      14.41453  0.4663622  11.335468
##   0.02      13.42165  0.4776610  10.376542
##   0.03      12.75480  0.4829840   9.600487
##   0.04      12.31003  0.4986082   9.067208
##   0.05      11.92013  0.5237388   8.642053
##   0.06      11.73624  0.5292693   8.386778
##   0.07      11.70822  0.5226766   8.313568
##   0.08      11.66058  0.5188943   8.211338
##   0.09      11.62523  0.5156880   8.153741
##   0.10      11.59529  0.5134110   8.061772
##   0.11      11.54799  0.5124494   7.961476
##   0.12      11.52212  0.5128684   7.976015
##   0.13      11.47629  0.5145714   7.970087
##   0.14      11.45083  0.5138270   7.949818
##   0.15      11.44164  0.5114545   7.924643
##   0.16      11.46520  0.5076419   7.955653
##   0.17      11.51218  0.5034053   7.997677
##   0.18      11.55970  0.4995324   8.039333
##   0.19      11.59722  0.4970285   8.071813
##   0.20      11.62118  0.4956033   8.088565
##   0.21      11.64660  0.4943499   8.108581
##   0.22      11.68438  0.4917432   8.148406
##   0.23      11.73113  0.4884401   8.187316
##   0.24      11.76452  0.4858565   8.199227
##   0.25      11.80173  0.4834556   8.212295
##   0.26      11.85367  0.4798908   8.237946
##   0.27      11.90546  0.4762125   8.274100
##   0.28      11.94583  0.4732587   8.305997
##   0.29      11.98965  0.4701541   8.339552
##   0.30      12.03446  0.4670109   8.371814
##   0.31      12.06864  0.4648629   8.399720
##   0.32      12.09809  0.4632811   8.429887
##   0.33      12.12225  0.4622335   8.456906
##   0.34      12.14739  0.4612992   8.481985
##   0.35      12.16981  0.4606670   8.503759
##   0.36      12.16694  0.4614412   8.510481
##   0.37      12.17808  0.4614279   8.523784
##   0.38      12.19186  0.4614223   8.542521
##   0.39      12.20947  0.4608577   8.564572
##   0.40      12.23086  0.4598441   8.588747
##   0.41      12.25881  0.4584283   8.617467
##   0.42      12.29273  0.4570994   8.651993
##   0.43      12.33188  0.4554681   8.687168
##   0.44      12.37083  0.4536491   8.720676
##   0.45      12.40461  0.4522291   8.759340
##   0.46      12.44470  0.4507220   8.800022
##   0.47      12.50724  0.4483127   8.851814
##   0.48      12.57106  0.4459671   8.903393
##   0.49      12.63669  0.4434637   8.956676
##   0.50      12.70035  0.4409328   9.010415
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.15.

Let’s see what terms remain important in the model:

lasso_var_imp <- varImp(lfpFit,scale=FALSE)
top_lasso_df <- head(lasso_var_imp$importance |> arrange(-Overall),25)

ggplot(data=top_lasso_df,aes(x=Overall,y=reorder(rownames(top_lasso_df),Overall))) +
    geom_bar(stat='identity') +
    labs(title='Most important variables in Lasso Regression',y='Predictors')

head(lasso_var_imp$importance,20)
##          Overall
## X1  0.0423781376
## X2  0.0560929401
## X3  0.0461494364
## X4  0.0461494364
## X5  0.0461494364
## X6  0.2558763551
## X11 0.0020932720
## X12 0.0229244226
## X15 0.1168511490
## X16 0.2613520881
## X20 0.0461494364
## X21 0.0461494364
## X25 0.0158237514
## X26 0.0461494364
## X27 0.0461494364
## X28 0.0461494364
## X29 0.0461494364
## X35 0.0007154195
## X36 0.0021717551
## X37 0.0158237514
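We can also count how many coefficients the selected lasso fit (fraction = 0.15) actually zeroes out. A hedged sketch, assuming lfpFit$finalModel is the elasticnet object caret stores for method = "lasso":

lasso_coefs <- predict(lfpFit$finalModel, s = 0.15,
                       type = "coefficients", mode = "fraction")$coefficients
sum(lasso_coefs != 0)  # predictors retained
sum(lasso_coefs == 0)  # predictors dropped by the L1 penalty
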
set.seed(6)

# Grid search for optimized penalty lambdas
enet_gs <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

# tuning penalized regression model
enet_fit <- train(train_finger, train_permY, method = "enet",
                  tuneGrid = enet_gs, trControl = tr_ctrl, preProc = c("center", "scale"))
## Warning: model fit failed for Fold01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet_fit)

enet_var_imp <- varImp(enet_fit,scale=FALSE)
top_enet_df <- head(enet_var_imp$importance |> arrange(-Overall),25)

ggplot(data=top_enet_df,aes(x=Overall,y=reorder(rownames(top_enet_df),Overall))) +
    geom_bar(stat='identity') +
    labs(title='Most important variables in ElasticNet Regression',y='Predictors')
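For completeness, the tuning values the elastic net search settled on (not printed above) can be pulled from the fitted object:

enet_fit$bestTune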

ridge_preds <- predict(ridgeRegFit,test_finger)
lasso_preds <- predict(lfpFit,test_finger)
enet_preds <- predict(enet_fit,test_finger)
ridge_test <- caret::defaultSummary(data.frame(pred=ridge_preds,obs=test_permY))
lasso_test <- caret::defaultSummary(data.frame(pred=lasso_preds,obs=test_permY))
enet_test <- caret::defaultSummary(data.frame(pred=enet_preds,obs=test_permY))
  (f) Would you recommend any of your models to replace the permeability laboratory experiment?

The models post a mix of metrics on the test set. The RMSE values are all in the same ballpark, although the lasso model has the best value for this criterion, and its mean absolute error is also the lowest, even though its \(R^2\) isn't as high as the ridge or PLS models'. Therefore, if any model would help identify potential improvements it would be the lasso model. It could also help identify the important substructures and allow the researchers to ignore ones that are not statistically predictive for their problem.

test_evals <-rbind(pls_test,ridge_test,lasso_test,enet_test)
rownames(test_evals) <- c('PLS','Ridge','Lasso','Elastic Net')
knitr::kable(test_evals)
|             |     RMSE |  Rsquared |      MAE |
|:------------|---------:|----------:|---------:|
| PLS         | 13.78783 | 0.3262867 | 9.680945 |
| Ridge       | 13.14550 | 0.4220319 | 9.065979 |
| Lasso       | 12.89992 | 0.3217248 | 8.834154 |
| Elastic Net | 13.43265 | 0.2534215 | 9.084110 |

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

summary(ChemicalManufacturingProcess)
##      Yield       BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##  Min.   :35.25   Min.   :4.580        Min.   :46.87        Min.   :56.97       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68        1st Qu.:64.98       
##  Median :39.97   Median :6.305        Median :55.09        Median :67.22       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69        Mean   :67.70       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74        3rd Qu.:70.43       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75        Max.   :78.25       
##                                                                                
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##  Min.   : 9.38        Min.   :13.24        Min.   :40.60       
##  1st Qu.:11.24        1st Qu.:17.23        1st Qu.:46.05       
##  Median :12.10        Median :18.49        Median :48.46       
##  Mean   :12.35        Mean   :18.60        Mean   :48.91       
##  3rd Qu.:13.22        3rd Qu.:19.90        3rd Qu.:51.34       
##  Max.   :23.09        Max.   :24.85        Max.   :59.38       
##                                                                
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##  Min.   :100.0        Min.   :15.88        Min.   :11.44       
##  1st Qu.:100.0        1st Qu.:17.06        1st Qu.:12.60       
##  Median :100.0        Median :17.51        Median :12.84       
##  Mean   :100.0        Mean   :17.49        Mean   :12.85       
##  3rd Qu.:100.0        3rd Qu.:17.88        3rd Qu.:13.13       
##  Max.   :100.8        Max.   :19.14        Max.   :14.08       
##                                                                
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##  Min.   :1.770        Min.   :135.8        Min.   :18.35       
##  1st Qu.:2.460        1st Qu.:143.8        1st Qu.:19.73       
##  Median :2.710        Median :146.1        Median :20.12       
##  Mean   :2.801        Mean   :147.0        Mean   :20.20       
##  3rd Qu.:2.990        3rd Qu.:149.6        3rd Qu.:20.75       
##  Max.   :6.870        Max.   :158.7        Max.   :22.21       
##                                                                
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.30          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.54          
##  Mean   :11.21          Mean   :16.68          Mean   :1.54          
##  3rd Qu.:12.15          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  NA's   :1              NA's   :3              NA's   :15            
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.2         Median :206.8         
##  Mean   :931.9          Mean   :1001.7         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.9         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  NA's   :1              NA's   :1              NA's   :2             
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.100         Median : 9.400         Median :   0.0        
##  Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
##  3rd Qu.: 9.550         3rd Qu.: 9.900         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  NA's   :9              NA's   :10             NA's   :1             
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4854           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##                         NA's   :1                                    
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##                                                                      
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##                                                                      
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  NA's   :1              NA's   :1              NA's   :1             
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4832           1st Qu.:6020           1st Qu.:4560          
##  Median :4855           Median :6047           Median :4587          
##  Mean   :4828           Mean   :6016           Mean   :4563          
##  3rd Qu.:4877           3rd Qu.:6070           3rd Qu.:4609          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
##  Median :10.400         Median :19.90          Median : 9.100        
##  Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
##  3rd Qu.:10.750         3rd Qu.:20.40          3rd Qu.: 9.700        
##  Max.   :11.500         Max.   :22.00          Max.   :11.200        
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.80          Median :158.0          Median :64.00         
##  Mean   :70.18          Mean   :158.5          Mean   :63.54         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  NA's   :5                                     NA's   :5             
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :495.0          Median :0.02000       
##  Mean   :2.494          Mean   :495.6          Mean   :0.01957       
##  3rd Qu.:2.500          3rd Qu.:501.5          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##                                                                      
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600         
## 
x_chem <- ChemicalManufacturingProcess |> select(-Yield)
y_chem <- ChemicalManufacturingProcess |> select(Yield)
  (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
x_chem |>
  pivot_longer(cols = colnames(x_chem)) |>
  filter(is.na(value)) |>
  group_by(name) |>
  summarise(cnt = n()) |>
  arrange(-cnt)
## # A tibble: 28 × 2
##    name                     cnt
##    <chr>                  <int>
##  1 ManufacturingProcess03    15
##  2 ManufacturingProcess11    10
##  3 ManufacturingProcess10     9
##  4 ManufacturingProcess25     5
##  5 ManufacturingProcess26     5
##  6 ManufacturingProcess27     5
##  7 ManufacturingProcess28     5
##  8 ManufacturingProcess29     5
##  9 ManufacturingProcess30     5
## 10 ManufacturingProcess31     5
## # ℹ 18 more rows
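Before imputing, a one-line check of the overall fraction of missing cells in the predictor matrix (output not shown) confirms the problem statement that only a small percentage of cells are missing:

mean(is.na(x_chem))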

The following graphic shows the percentage of missing values by column, which is helpful for seeing everything in one place rather than reviewing the individual counts from the code above.

plot_missing(x_chem, missing_only = TRUE,
                   ggtheme = theme_classic())

There are a couple of easy imputation methods available through preProcess; let's apply KNN to find the nearest neighbors for imputation, storing the result in a new object so the original data are preserved:

imputed <- preProcess(x_chem,method=c('center','scale','knnImpute'),k=5)
x_chem_imp <- predict(imputed,x_chem)

Let's confirm there aren't any missing values left:

sum(is.na(x_chem_imp))
## [1] 0
  (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Split the data into train test splits

set.seed(18)
chem_train <- caret::createDataPartition(y_chem$Yield, p = .70, list= FALSE)
x_train_chem <- x_chem_imp[chem_train,]
x_test_chem <- x_chem_imp[-chem_train,]
y_train_chem <- y_chem[chem_train,]
y_test_chem <- y_chem[-chem_train,]
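
Note that the imputation above was learned on the full predictor set before splitting. A stricter, leakage-free variant (a sketch only; not used for the results below) would fit the centering/scaling/KNN parameters on the training rows and then apply them to the test rows:

imputed_tr  <- preProcess(x_chem[chem_train, ], method = c('center', 'scale', 'knnImpute'), k = 5)
x_train_alt <- predict(imputed_tr, x_chem[chem_train, ])
x_test_alt  <- predict(imputed_tr, x_chem[-chem_train, ])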

Let's see whether prioritizing bias, prioritizing variance, or balancing the two minimizes the RMSE, starting with ridge regression:

set.seed(12)

rgs <- data.frame(.lambda = seq(0,0.1,length=15))
ridgeFit_chem <- train(x=x_train_chem,y=y_train_chem,
                       method='ridge',
                       tuneGrid=rgs,
                       trControl=tr_ctrl,
                       preProc = c('center','scale')
                       )
ridgeFit_chem
## Ridge Regression 
## 
## 124 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 110, 112, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE     
##   0.000000000  3.604689  0.4259973  2.121548
##   0.007142857  2.730452  0.5060252  1.636248
##   0.014285714  2.500808  0.5204816  1.527147
##   0.021428571  2.401962  0.5260009  1.483482
##   0.028571429  2.347087  0.5281048  1.460613
##   0.035714286  2.311949  0.5286619  1.446628
##   0.042857143  2.287149  0.5284755  1.436774
##   0.050000000  2.268308  0.5279328  1.430753
##   0.057142857  2.253142  0.5272322  1.426133
##   0.064285714  2.240366  0.5264792  1.422379
##   0.071428571  2.229220  0.5257302  1.419430
##   0.078571429  2.219235  0.5250147  1.417069
##   0.085714286  2.210112  0.5243477  1.414907
##   0.092857143  2.201655  0.5237355  1.412765
##   0.100000000  2.193734  0.5231794  1.410581
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

The best individual lambda is 0.1, with a resampled training RMSE of 2.19; this will be interesting to compare against the enet model. Note that 0.1 is the upper edge of the grid searched, so a wider grid might improve on it slightly.
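Since the selected lambda sits on the boundary of the search range, a natural follow-up (sketched here, results not shown) would be to widen the grid and re-tune:

rgs_wide <- data.frame(.lambda = seq(0, 0.3, length = 15))
ridgeFit_chem_wide <- train(x = x_train_chem, y = y_train_chem,
                            method = 'ridge', tuneGrid = rgs_wide,
                            trControl = tr_ctrl, preProc = c('center', 'scale'))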

Let’s run the lasso regression model now as well:

set.seed(12)

frac_gs <- data.frame(.fraction = seq(0,1,by=0.05))
lassoFit_chem <- train(x=x_train_chem,y=y_train_chem,
                       method='lasso',
                       tuneGrid=frac_gs,
                       trControl=tr_ctrl,
                       preProc = c('center','scale')
                       )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

The Lasso Model Results:

lassoFit_chem
## The lasso 
## 
## 124 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 110, 112, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      1.774324        NaN  1.4558734
##   0.05      1.185460  0.5949685  0.9761787
##   0.10      1.178212  0.5885738  0.9510623
##   0.15      1.217401  0.5817917  0.9623385
##   0.20      1.178073  0.6114951  0.9425087
##   0.25      1.271762  0.5674444  1.0151536
##   0.30      1.474895  0.5216791  1.1301455
##   0.35      1.577076  0.4987345  1.1936457
##   0.40      1.673373  0.4850159  1.2395459
##   0.45      1.786916  0.4712343  1.2884060
##   0.50      2.054639  0.4573306  1.4112541
##   0.55      2.423125  0.4482429  1.5634704
##   0.60      2.428828  0.4434366  1.5912224
##   0.65      2.460219  0.4404302  1.6118995
##   0.70      2.528606  0.4376021  1.6470107
##   0.75      2.638364  0.4349723  1.6946468
##   0.80      2.752816  0.4331351  1.7399360
##   0.85      2.878973  0.4311473  1.7886424
##   0.90      3.026342  0.4291307  1.8440035
##   0.95      3.365382  0.4275135  2.0106954
##   1.00      3.604689  0.4259973  2.1215479
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.2.

The optimal value for the fraction parameter is 0.2, which yields a resampled RMSE of 1.178 on the training set and outperforms the ridge regression.

Lastly, let's run the enet model, which tunes the ridge and lasso penalties jointly.

set.seed(12)
# Grid search for optimized penalty lambdas
enet_gs <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

# tuning penalized regression model
enet_fit_chem <- train(x_train_chem, y_train_chem, method = "enet",
                  tuneGrid = enet_gs, trControl = tr_ctrl, preProc = c("center", "scale"))

plot(enet_fit_chem)

enet_fit_chem
## Elasticnet 
## 
## 124 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 110, 112, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      1.185460  0.5949685  0.9761787
##   0.00    0.10      1.178212  0.5885738  0.9510623
##   0.00    0.15      1.217401  0.5817917  0.9623385
##   0.00    0.20      1.178073  0.6114951  0.9425087
##   0.00    0.25      1.271762  0.5674444  1.0151536
##   0.00    0.30      1.474895  0.5216791  1.1301455
##   0.00    0.35      1.577076  0.4987345  1.1936457
##   0.00    0.40      1.673373  0.4850159  1.2395459
##   0.00    0.45      1.786916  0.4712343  1.2884060
##   0.00    0.50      2.054639  0.4573306  1.4112541
##   0.00    0.55      2.423125  0.4482429  1.5634704
##   0.00    0.60      2.428828  0.4434366  1.5912224
##   0.00    0.65      2.460219  0.4404302  1.6118995
##   0.00    0.70      2.528606  0.4376021  1.6470107
##   0.00    0.75      2.638364  0.4349723  1.6946468
##   0.00    0.80      2.752816  0.4331351  1.7399360
##   0.00    0.85      2.878973  0.4311473  1.7886424
##   0.00    0.90      3.026342  0.4291307  1.8440035
##   0.00    0.95      3.365382  0.4275135  2.0106954
##   0.00    1.00      3.604689  0.4259973  2.1215479
##   0.01    0.05      1.416959  0.5625946  1.1439554
##   0.01    0.10      1.215877  0.5913592  1.0024114
##   0.01    0.15      1.179914  0.5849740  0.9769658
##   0.01    0.20      1.193521  0.5749623  0.9739957
##   0.01    0.25      1.238737  0.5599391  0.9826093
##   0.01    0.30      1.170353  0.5933902  0.9494856
##   0.01    0.35      1.165372  0.5990682  0.9442126
##   0.01    0.40      1.253421  0.5690645  0.9978396
##   0.01    0.45      1.395531  0.5461331  1.0699375
##   0.01    0.50      1.519609  0.5420232  1.1229987
##   0.01    0.55      1.570763  0.5448129  1.1451582
##   0.01    0.60      1.632989  0.5444553  1.1725678
##   0.01    0.65      1.719087  0.5426914  1.2094253
##   0.01    0.70      1.856697  0.5388151  1.2685455
##   0.01    0.75      2.047185  0.5341109  1.3463414
##   0.01    0.80      2.245804  0.5296128  1.4266243
##   0.01    0.85      2.421063  0.5249558  1.4964331
##   0.01    0.90      2.492555  0.5204042  1.5257318
##   0.01    0.95      2.553484  0.5168639  1.5515391
##   0.01    1.00      2.609982  0.5136214  1.5779534
##   0.10    0.05      1.556425  0.4794030  1.2649904
##   0.10    0.10      1.377568  0.5673686  1.1120479
##   0.10    0.15      1.259038  0.5801671  1.0327710
##   0.10    0.20      1.202020  0.5798737  0.9945356
##   0.10    0.25      1.210505  0.5612809  1.0014380
##   0.10    0.30      1.223439  0.5524961  1.0008888
##   0.10    0.35      1.264786  0.5392222  1.0080186
##   0.10    0.40      1.280128  0.5395270  1.0076313
##   0.10    0.45      1.255157  0.5490661  0.9992590
##   0.10    0.50      1.232135  0.5594726  0.9937856
##   0.10    0.55      1.206974  0.5744677  0.9769703
##   0.10    0.60      1.282965  0.5587305  1.0183883
##   0.10    0.65      1.405600  0.5470514  1.0767923
##   0.10    0.70      1.520317  0.5397956  1.1281169
##   0.10    0.75      1.650197  0.5352548  1.1842849
##   0.10    0.80      1.781200  0.5319931  1.2400027
##   0.10    0.85      1.904189  0.5292173  1.2915373
##   0.10    0.90      2.034644  0.5271368  1.3458364
##   0.10    0.95      2.124133  0.5250614  1.3828653
##   0.10    1.00      2.193734  0.5231794  1.4105805
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.35 and lambda = 0.01.

We can see from the output of the elasticnet model that a lambda of 0.01 (ridge penalty) and a fraction of 0.35 (lasso penalty) give the lowest resampled RMSE, 1.165. This is the lowest of all the methods attempted so far for the chemical manufacturing data.

  (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
x_ridge_pred_chem <- predict(ridgeFit_chem,x_test_chem)
x_lasso_pred_chem <- predict(lassoFit_chem,x_test_chem)
x_enet_pred_chem <- predict(enet_fit_chem,x_test_chem)
caret::defaultSummary(data.frame(pred=x_ridge_pred_chem,obs=y_test_chem))
##      RMSE  Rsquared       MAE 
## 2.0067806 0.3769701 1.2991752
caret::defaultSummary(data.frame(pred=x_lasso_pred_chem,obs=y_test_chem))
##      RMSE  Rsquared       MAE 
## 1.5022300 0.5067301 1.1811992
caret::defaultSummary(data.frame(pred=x_enet_pred_chem,obs=y_test_chem))
##      RMSE  Rsquared       MAE 
## 1.1323445 0.6639588 0.9507090

In line with the results from the training data, the enet model outperformed the ridge and lasso models that each focused on a single penalty. The ridge model beat its resampled training RMSE, but it was still higher than the other models tested. The lasso regularization had a higher RMSE on the test set than in training, while the enet model's test-set RMSE was slightly lower than its resampled estimate.
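For easier comparison, the three test-set summaries can be collected into one table, mirroring the table used for 6.2 (a convenience step; the values are the same as printed above):

chem_test_evals <- rbind(
  Ridge         = caret::defaultSummary(data.frame(pred = x_ridge_pred_chem, obs = y_test_chem)),
  Lasso         = caret::defaultSummary(data.frame(pred = x_lasso_pred_chem, obs = y_test_chem)),
  `Elastic Net` = caret::defaultSummary(data.frame(pred = x_enet_pred_chem,  obs = y_test_chem)))
knitr::kable(chem_test_evals)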

  (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
plot(varImp(enet_fit_chem,scale=TRUE))

It appears that the manufacturing process variables are the most important predictors, with ManufacturingProcess32 and ManufacturingProcess36 far exceeding the importance of the other available predictors. Note that many manufacturing processes are also shrunk to zero by the regularization, so not all processes contribute equally for this chemical manufacturing problem.
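To back that up numerically, here is a small sketch tallying biological versus process variables among the top 20 predictors by elastic net importance (it relies only on the dataset's BiologicalMaterial/ManufacturingProcess naming prefixes):

imp_df <- varImp(enet_fit_chem, scale = TRUE)$importance
imp_df$predictor <- rownames(imp_df)
top20 <- head(imp_df[order(-imp_df$Overall), ], 20)
table(grepl('^Biological', top20$predictor))  # FALSE = process, TRUE = biological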

  (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
imp_chem_preds <- c('ManufacturingProcess32','ManufacturingProcess36','BiologicalMaterial06','ManufacturingProcess13','BiologicalMaterial03','ManufacturingProcess31','Yield')
par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))

plot(x=x_chem_imp$ManufacturingProcess32,y=y_chem$Yield,main='Manu_Process32 vs Yield') 
plot(x=x_chem_imp$ManufacturingProcess36,y=y_chem$Yield,main='Manu_Process36 vs Yield')
plot(x=x_chem_imp$ManufacturingProcess31,y=y_chem$Yield,main='Manu_Process31 vs Yield')
plot(x=x_chem_imp$ManufacturingProcess13,y=y_chem$Yield,main='Manu_Process13 vs Yield')
plot(x=x_chem_imp$BiologicalMaterial03,y=y_chem$Yield,main='Bio_Material03 vs Yield')
plot(x=x_chem_imp$BiologicalMaterial06,y=y_chem$Yield,main='Bio_Material06 vs Yield')

Most of these predictors appear to have fairly strong positive or negative linear relationships with yield, which explains why a regression model would consider them important. It is surprising that ManufacturingProcess31 ranks as important, since its values show little variation that corresponds to changes in yield. ManufacturingProcess36 is also concentrated at a few specific values, although it still shows a fairly strong negative relationship with yield. Ultimately, this tells the manufacturer which processes to target for future R&D or improvement, since those are the factors most likely to consistently raise yield or keep it from dropping.
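As a numeric companion to the scatterplots (a quick check; the signs should match the visual trends), the correlations between the top predictors and yield can be computed directly:

# the first six entries of imp_chem_preds are the predictor names; the seventh is Yield
cor(x_chem_imp[, head(imp_chem_preds, 6)], y_chem$Yield)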