Applied Predictive Modeling - Chapter 6 Exercises: 6.2, 6.3

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  (a) Start R and use these commands to load the data:

> library(AppliedPredictiveModeling)
> data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

library(AppliedPredictiveModeling)
data(permeability)
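As a quick sanity check of the stated dimensions (a minimal sketch; in the package, permeability appears to be stored as a one-column matrix):

dim(fingerprints)   # expected: 165 1107
dim(permeability)   # expected: 165 1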
  (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# predictors
x <- as.data.frame(fingerprints)

# identify near-zero variance predictors
nzv <- nearZeroVar(x)
x_filtered <- x[, -nzv]

# number of predictors left
ncol(x_filtered)
## [1] 388
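After filtering, 388 of the original 1,107 predictors remain. For a closer look at why predictors are dropped, nearZeroVar can also return its diagnostics rather than column indices (a minimal sketch; by default a predictor is flagged when its most-common/second-most-common frequency ratio exceeds 95/5 and its percentage of unique values is low):

# per-predictor frequency-ratio and percent-unique diagnostics
nzv_metrics <- nearZeroVar(x, saveMetrics = TRUE)
head(nzv_metrics)
table(nzv_metrics$nzv)  # TRUE = flagged as near-zero variance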
  (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
# remove near-zero variance predictors
nzv <- nearZeroVar(fingerprints)
x <- fingerprints[, -nzv]
y <- permeability

# train/test split
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)

x_train <- x[trainIndex, ]
x_test  <- x[-trainIndex, ]
y_train <- y[trainIndex]
y_test  <- y[-trainIndex]

# cross-validation
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

# PLS model
set.seed(123)
pls_fit <- train(
  x = x_train,
  y = y_train,
  method = "pls",
  preProcess = c("center","scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_fit$bestTune
##   ncomp
## 7     7
# best resampled R^2 across the tuning grid (ncomp was selected on RMSE)
max(pls_fit$results$Rsquared)
## [1] 0.5190079

Seven latent variables are optimal, and the corresponding resampled estimate of \(R^2\) is approximately 0.519.
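To visualize how resampled performance changes with the number of components, one could plot the tuning profile (a minimal sketch using caret's plot method for train objects):

# RMSE (the selection metric) across the 20 candidate component counts;
# the curve should bottom out near ncomp = 7
plot(pls_fit)

# the same profile on the R^2 scale
plot(pls_fit, metric = "Rsquared")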

  (d) Predict the response for the test set. What is the test set estimate of \(R^2\)?
# predictions on test set
pls_pred <- predict(pls_fit, x_test)
postResample(pls_pred, y_test)
##       RMSE   Rsquared        MAE 
## 11.9744490  0.3618213  8.3021291

The test set estimate of \(R^2\) is approximately 0.362.
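Plotting observed against predicted permeability makes the drop from the resampled estimate easier to see (a sketch, assuming ggplot2 is available):

library(ggplot2)

# held-out compounds: points near the dashed identity line indicate good fit
ggplot(data.frame(observed = as.numeric(y_test),
                  predicted = as.numeric(pls_pred)),
       aes(x = predicted, y = observed)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")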

  (e) Try building other models discussed in this chapter. Do any have better predictive performance?
# Linear Regression: with 388 predictors and only ~133 training compounds
# the fit is rank-deficient, hence the warnings below
lm_fit <- train(
  x_train, y_train,
  method = "lm",
  preProcess = c("center","scale"),
  trControl = ctrl
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## (the warning above repeated for nearly every resample; a few folds, e.g.
## Fold10.Rep1, Fold05.Rep4, and Fold03.Rep5, also failed with "Error in
## qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
lm_pred <- predict(lm_fit, x_test)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
lm_perf <- postResample(lm_pred, y_test)

# PCR
pcr_fit <- train(
  x_train, y_train,
  method = "pcr",
  preProcess = c("center","scale"),
  tuneLength = 20,
  trControl = ctrl
)

pcr_pred <- predict(pcr_fit, x_test)
pcr_perf <- postResample(pcr_pred, y_test)

# Penalized regression (elastic net via glmnet)
glmnet_fit <- train(
  x_train, y_train,
  method = "glmnet",
  preProcess = c("center","scale"),
  trControl = ctrl,
  tuneLength = 10
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
glmnet_pred <- predict(glmnet_fit, x_test)
glmnet_perf <- postResample(glmnet_pred, y_test)

# Results
results <- rbind(
  LM = lm_perf,
  PCR = pcr_perf,
  PLS = postResample(pls_pred, y_test),
  GLMNET = glmnet_perf
)

results
##            RMSE   Rsquared       MAE
## LM     29.90647 0.07853504 19.332111
## PCR    12.20768 0.29221078  8.225112
## PLS    11.97445 0.36182132  8.302129
## GLMNET 10.98607 0.38732280  7.125565

The penalized regression model (glmnet) achieved the best predictive performance, with the lowest RMSE (10.99) and the highest \(R^2\) (0.387).
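Since the test set here holds only ~32 compounds, the resampling distributions are worth checking as well; caret's resamples() collects them across fitted models (a sketch; the comparison is strictly apples-to-apples only if each train call used the same folds, e.g. by setting the same seed immediately before each fit, and the lm fit is omitted because some of its folds failed):

# compare cross-validated performance distributions across models
resamps <- resamples(list(PLS = pls_fit, PCR = pcr_fit, GLMNET = glmnet_fit))
summary(resamps)
bwplot(resamps, metric = "Rsquared")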

  (f) Would you recommend any of your models to replace the permeability laboratory experiment?

No, I would not recommend replacing the laboratory experiment. Although the glmnet model performed best, its test set \(R^2\) of about 0.387 means it explains less than 40% of the variation in permeability, which is only modest predictive ability.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  (a) Start R and use these commands to load the data:

> library(AppliedPredictiveModeling)
> data(chemicalManufacturing)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run. (Note: in the released package the data object is actually named ChemicalManufacturingProcess, which is what the code below loads; Yield is one of its columns.)

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
  (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
library(dplyr)

# predictors: everything except the response
x <- ChemicalManufacturingProcess %>%
  select(-Yield)

y <- ChemicalManufacturingProcess$Yield

impute_model <- preProcess(x, method = "medianImpute")
x_imputed <- predict(impute_model, x)
colSums(is.na(x_imputed))
##   BiologicalMaterial01   BiologicalMaterial02   BiologicalMaterial03 
##                      0                      0                      0 
## (remaining output elided: all 57 predictor columns report 0 missing values)
sum(is.na(x_imputed))
## [1] 0
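For context, it can also help to quantify the missingness before imputation (a small sketch):

# total missing cells and the most affected columns, pre-imputation
sum(is.na(x))
head(sort(colSums(is.na(x)), decreasing = TRUE))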
  (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(123)
train_index <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)

train_data <- ChemicalManufacturingProcess[train_index, ]
test_data  <- ChemicalManufacturingProcess[-train_index, ]

# Separate predictors and yield
x_train <- train_data %>% select(-Yield)
y_train <- train_data$Yield

x_test <- test_data %>% select(-Yield)
y_test <- test_data$Yield

# Preprocess (median impute)
preprocess_model <- preProcess(x_train, method = c("medianImpute", "center", "scale"))

x_train_proc <- predict(preprocess_model, x_train)
x_test_proc  <- predict(preprocess_model, x_test)

control <- trainControl(method = "cv", number = 5)

model <- train(
  x = x_train_proc,
  y = y_train,
  method = "glmnet",
  trControl = control,
  tuneLength = 10)
model
## glmnet 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 115, 115, 116, 116, 114 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE      Rsquared   MAE      
##   0.1    0.0005188959  5.156621  0.2834696  2.1211875
##   0.1    0.0011987169  5.160377  0.2836877  2.1215589
##   ...
##   0.9    0.0147783418  1.746072  0.4655458  1.1091396
##   0.9    0.0341398863  1.067417  0.6737926  0.8581378
##   0.9    0.0788675654  1.155091  0.6298058  0.9132337
##   ...
##   1.0    0.9723162082  1.752677  0.4132036  1.4135736
## (most of the 100 grid rows elided for brevity)
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.9 and lambda = 0.03413989.
model$results[which.min(model$results$RMSE), ]
##    alpha     lambda     RMSE  Rsquared       MAE     RMSESD RsquaredSD
## 86   0.9 0.03413989 1.067417 0.6737926 0.8581378 0.08661804 0.07084266
##         MAESD
## 86 0.07864809

The optimal model used alpha = 0.9 and lambda ≈ 0.0341, achieving a cross-validated RMSE of 1.0674 with a corresponding \(R^2\) of 0.6738.
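The tuning surface can also be inspected graphically (a minimal sketch using caret's plot method for train objects; one curve per alpha value, with lambda on a log scale):

# RMSE profile over the (alpha, lambda) grid searched above
plot(model, xTrans = log10)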

  (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
pred <- predict(model, newdata = x_test_proc)
postResample(pred, y_test)
##      RMSE  Rsquared       MAE 
## 1.2890720 0.5291922 1.1195627

The test RMSE (1.2891) is higher than the resampled training RMSE (1.0674), and the test \(R^2\) (0.529) is correspondingly lower than the resampled value (0.674), indicating some mild overfitting.

  (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
# nonzero coefficients at the selected penalty
coef(model$finalModel, s = model$bestTune$lambda)
## 58 x 1 sparse Matrix of class "dgCMatrix"
##                         s=0.03413989
## (Intercept)            40.1958298249
## BiologicalMaterial01    .           
## BiologicalMaterial02    .           
## BiologicalMaterial03    0.1109873116
## BiologicalMaterial04    .           
## BiologicalMaterial05    0.2066945516
## BiologicalMaterial06    0.2629139713
## BiologicalMaterial07   -0.0122981044
## BiologicalMaterial08    .           
## BiologicalMaterial09   -0.1243129815
## BiologicalMaterial10   -0.0436836423
## BiologicalMaterial11    .           
## BiologicalMaterial12    .           
## ManufacturingProcess01  0.0008194475
## ManufacturingProcess02  .           
## ManufacturingProcess03  .           
## ManufacturingProcess04  0.2716139040
## ManufacturingProcess05  .           
## ManufacturingProcess06  0.2280267092
## ManufacturingProcess07 -0.0710738376
## ManufacturingProcess08 -0.0879278522
## ManufacturingProcess09  0.4551638572
## ManufacturingProcess10  .           
## ManufacturingProcess11  .           
## ManufacturingProcess12  0.0175168308
## ManufacturingProcess13 -0.1583335502
## ManufacturingProcess14  .           
## ManufacturingProcess15  0.1183417813
## ManufacturingProcess16  .           
## ManufacturingProcess17 -0.3360427362
## ManufacturingProcess18  0.0463377594
## ManufacturingProcess19  0.0580749700
## ManufacturingProcess20  .           
## ManufacturingProcess21  .           
## ManufacturingProcess22  .           
## ManufacturingProcess23  .           
## ManufacturingProcess24 -0.0029771992
## ManufacturingProcess25  .           
## ManufacturingProcess26  .           
## ManufacturingProcess27  .           
## ManufacturingProcess28 -0.1348603803
## ManufacturingProcess29  0.0335801096
## ManufacturingProcess30  .           
## ManufacturingProcess31  .           
## ManufacturingProcess32  0.7520304256
## ManufacturingProcess33  .           
## ManufacturingProcess34  0.1855759896
## ManufacturingProcess35  .           
## ManufacturingProcess36 -0.0557524538
## ManufacturingProcess37 -0.3218553657
## ManufacturingProcess38 -0.0110749789
## ManufacturingProcess39  0.0666967180
## ManufacturingProcess40  .           
## ManufacturingProcess41 -0.0121748092
## ManufacturingProcess42  .           
## ManufacturingProcess43  0.1635447084
## ManufacturingProcess44  0.0303830189
## ManufacturingProcess45  0.2002217835

The most important predictors are those with the largest non-zero coefficients in the elastic net model: ManufacturingProcess32 (0.75) stands out, followed by ManufacturingProcess09 (0.46), ManufacturingProcess17 (-0.34), and ManufacturingProcess37 (-0.32). Overall, manufacturing process predictors dominate the list: they account for most of the retained variables and carry the largest coefficient magnitudes.
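caret's varImp() gives a convenient ranked view of the same information (for a glmnet model it is based on the absolute coefficient magnitudes; a minimal sketch):

# top 10 predictors by scaled importance
plot(varImp(model), top = 10)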

  (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
library(ggplot2)

ggplot(ChemicalManufacturingProcess, aes(x = ManufacturingProcess32, y = Yield)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(ChemicalManufacturingProcess, aes(x = ManufacturingProcess17, y = Yield)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

The top predictors show both positive and negative relationships with yield: ManufacturingProcess32 is positively associated with yield, while ManufacturingProcess17 is negatively associated with it. Because these are process variables rather than fixed biological measurements, they can in principle be adjusted, increasing settings like ManufacturingProcess32 and reducing settings like ManufacturingProcess17, to improve yield in future runs.
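To go beyond the two scatterplots, one could also check simple marginal correlations for a handful of the largest-coefficient predictors (a sketch; the variable names are taken from the coefficient table above):

# marginal correlations between top model predictors and Yield
top_vars <- c("ManufacturingProcess32", "ManufacturingProcess09",
              "ManufacturingProcess17", "ManufacturingProcess37")
cor(ChemicalManufacturingProcess[, top_vars],
    ChemicalManufacturingProcess$Yield,
    use = "pairwise.complete.obs")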