#chunks
knitr::opts_chunk$set(eval=TRUE, message=FALSE, warning=FALSE, fig.height=5, fig.align='center')

#libraries
library(tidyverse)
library(fpp3)
library(latex2exp)
library(seasonal)
library(GGally)
library(gridExtra)
library(reshape2)
library(Hmisc)
library(corrplot)
library(e1071)
library(caret)
library(VIM)
library(forecast)
library(urca)
library(earth)
library(glmnet)
library(aTSA)
library(AppliedPredictiveModeling)
#random seed
set.seed(547)

6.2.

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a.

Start R and use these commands to load the data. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

#Load data
#library(AppliedPredictiveModeling)
data(permeability)

#Copy data
perme_df <- permeability

b.

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

The fingerprints are sparse because each predictor is a binary indicator of whether a particular substructure is present or absent. Predictors with low frequencies (i.e., near-zero variance) were filtered out using the caret package's nearZeroVar function, leaving only predictors with sufficient variability. After filtering, 388 of the original 1,107 predictors remained.

#Filter out predictors with low frequencies using nearZeroVar()
nzv <- nearZeroVar(fingerprints, saveMetrics = TRUE)
fingerprints_nzv <- fingerprints[, !nzv$nzv]

#388 predictors left after nearZeroVar
ncol(fingerprints_nzv)
## [1] 388
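
To see why predictors were flagged, the saveMetrics = TRUE output can be inspected directly. A minimal sketch, assuming caret's documented defaults (freqCut = 95/5 and uniqueCut = 10):

#Inspect the filtering criteria: a predictor is flagged as near-zero variance
#when its frequency ratio exceeds freqCut and its percentage of unique values
#falls below uniqueCut
head(nzv[nzv$nzv, c("freqRatio", "percentUnique", "zeroVar", "nzv")])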

c. 

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?

To prepare the data for modeling, we split it into a training set (80%) and a test set (20%): the training set has 133 observations and the test set has 32.

We tuned a Partial Least Squares (PLS) model by cross-validation over a grid of 1 to 10 latent variables to find the optimal number. The model was trained with train() and preProcess = c("center", "scale"), so caret centers and scales the predictors within each cross-validation fold and applies the same transformation, with parameters learned from the training data, when predicting on the test set. This ensures pre-processing is applied consistently to both sets. We also used repeated cross-validation (method = "repeatedcv" with repeats = 3) to obtain a more stable performance estimate, at the cost of additional computation.

The cross-validated results showed that a model with 8 latent variables performed best, with a resampled R^2 of approximately 0.46 and an RMSE of 11.82.

set.seed(547)

#Split fingerprints data
train_idx <- createDataPartition(perme_df, p = 0.8, list = FALSE)
train_df <- fingerprints_nzv[train_idx, ]
test_df <- fingerprints_nzv[-train_idx, ]
#Split response data
train_perme <- perme_df[train_idx]
test_perme <- perme_df[-train_idx]

str(train_df)
##  num [1:133, 1:388] 0 0 0 0 0 0 1 1 0 1 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:133] "1" "3" "4" "5" ...
##   ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...
str(test_df)
##  num [1:32, 1:388] 0 0 0 0 0 0 1 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:32] "2" "6" "9" "10" ...
##   ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...
str(train_perme)
##  num [1:133] 12.52 19.41 1.73 1.68 25.36 ...
str(test_perme)
##  num [1:32] 1.12 0.51 39.455 4.91 0.725 ...
set.seed(547)
#Tuning grid for the optimal number of latent variables: 1 to 10
plsGrid <- expand.grid(ncomp = 1:10) 

#Repeat 10-fold cross-validation 3 times and build model
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3) 
pls_model <- train(
  x = train_df,
  y = train_perme,
  method = "pls",
  tuneGrid = plsGrid,
  trControl = control,
  preProcess = c("center", "scale")
)

#Optimal # of latent variables = 8, R2 = 0.4553289 RMSE=11.82041
pls_model$bestTune
##   ncomp
## 8     8
pls_model$results
##    ncomp     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      1 12.87109 0.3252459 9.890712 2.945860  0.2258129 2.095335
## 2      2 11.85342 0.4455063 8.376335 3.669064  0.2941283 2.394397
## 3      3 12.24442 0.4287283 9.033620 3.082877  0.2386077 2.024400
## 4      4 12.55122 0.3986930 9.426808 2.898306  0.2148062 2.141508
## 5      5 12.20298 0.4274460 9.254383 3.022060  0.2254174 2.243517
## 6      6 12.17500 0.4305162 9.155104 3.062341  0.2368748 2.368329
## 7      7 11.94574 0.4425642 8.973910 2.885744  0.2297002 2.149150
## 8      8 11.82041 0.4553289 8.991281 3.066081  0.2367520 2.369858
## 9      9 12.06683 0.4404436 9.152319 3.091582  0.2375843 2.342984
## 10    10 12.10810 0.4427680 9.188245 2.964062  0.2318648 2.238204
plot(pls_model)

d.

Predict the response for the test set. What is the test set estimate of R^2?

We used the trained PLS model to predict the permeability values for the test set, evaluating the model's performance on previously unseen data. The test set R^2 of 0.632 indicates that the model explains roughly 63.2% of the variance in permeability for the test data, suggesting it predicts new data reasonably well. The RMSE was approximately 10.74. Together, the relatively high R^2 and moderate RMSE indicate that the model captures the relationship between the predictors and permeability on the test set.

#Predict test set
pls_predict <- predict(pls_model, newdata = test_df)

#Compute R2 =  0.6318318 RMSE = 10.74333
pls_r2 <- R2(pls_predict, test_perme)
pls_rmse <- RMSE(pls_predict, test_perme)
pls_r2
## [1] 0.6318318
pls_rmse
## [1] 10.74333
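
One caveat on the metric: caret's R2() defaults to the squared Pearson correlation between predictions and observations (formula = "corr"), which can differ from the traditional 1 - SS_res/SS_tot definition. A quick sketch comparing the two on the objects defined above:

#Squared-correlation R2 (caret's default)
cor(pls_predict, test_perme)^2
#Traditional R2, computed manually and via R2()'s formula argument
1 - sum((test_perme - pls_predict)^2) / sum((test_perme - mean(test_perme))^2)
R2(pls_predict, test_perme, formula = "traditional")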

e.

Try building other models discussed in this chapter. Do any have better predictive performance?

The following models were used:

  • Ridge regression reduces overfitting by applying an L2-norm penalty to the linear regression coefficients. It is effective when predictors exhibit multicollinearity. We built the model with train(), method = "ridge", and a tuning grid for lambda. The best lambda value was found to be 0.23. Ridge regression achieved an R^2 of 0.62 and an RMSE of 11.24 on the test set, slightly worse than PLS but still acceptable.

  • Lasso regression applies an L1-norm penalty that shrinks some coefficients to exactly zero, effectively performing feature selection. This is useful for sparse problems where some predictors may be irrelevant. Using train() with method = "glmnet", we fixed alpha = 1 (pure Lasso) and tuned lambda. The best lambda value was 0.45. The model outperformed PLS and Ridge, with a test set R^2 of 0.65 and an RMSE of 10.72, indicating good handling of the sparse predictors.

  • To balance the Lasso and Ridge penalties and improve prediction accuracy, we built an Elastic Net model with glmnet. The tuning parameters were chosen by repeated 10-fold cross-validation to minimize RMSE, yielding alpha = 0.9 (Lasso-like behavior) and lambda = 0.46. The Elastic Net model fit the test set well, with an RMSE of 10.48236 and an R^2 of 0.6679718. The tuning plot shows that RMSE declines rapidly as alpha approaches 1, highlighting the sparsity-inducing effect of the Lasso penalty.

The Elastic Net model offers a flexible solution that combines the Ridge and Lasso penalties. In this case, the selected model leaned strongly toward Lasso (alpha = 0.9), implying that a sparse solution with fewer nonzero coefficients suited this dataset. This approach produced a lower RMSE than the Ridge regression model and performance comparable to the Lasso model. Elastic Net is therefore a useful tool for balancing feature selection against model complexity. If further improvement is desired, exploring a finer grid around the identified optimal values may help optimize the model.
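
For reference, glmnet's documented objective makes the roles of the two tuning parameters explicit: alpha mixes the Ridge and Lasso penalties, while lambda controls their overall strength:

$$\min_{\beta_0,\ \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{T}\beta \right)^2 + \lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]$$

Setting alpha = 1 recovers the Lasso and alpha = 0 recovers Ridge, so the tuned value alpha = 0.9 sits close to the pure Lasso penalty.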

set.seed(547)
#Ridge Regression
ridgeGrid <- expand.grid(lambda = seq(0.01, 1, length = 10))
ridge_model <- train(
  x = train_df,
  y = train_perme,
  method = "ridge",
  tuneGrid = ridgeGrid,
  trControl = control,
  preProcess = c("center", "scale")
  )

#Optimal lambda=0.23, R2 = 0.4534137  RMSE=12.58217
ridge_model$bestTune
##   lambda
## 3   0.23
ridge_model$results
##    lambda      RMSE  Rsquared        MAE     RMSESD RsquaredSD      MAESD
## 1    0.01 165.62464 0.2936013 119.598952 460.874473  0.2549441 336.239329
## 2    0.12  58.44650 0.4294719  46.344005 251.599932  0.2641951 203.256374
## 3    0.23  12.58217 0.4534137   9.306198   3.440128  0.2506588   2.457004
## 4    0.34  12.81670 0.4631265   9.524281   3.516204  0.2488736   2.509973
## 5    0.45  13.17345 0.4685700   9.829729   3.625975  0.2489948   2.629169
## 6    0.56  13.61361 0.4721218  10.196704   3.758118  0.2499916   2.758101
## 7    0.67  14.12047 0.4742415  10.620389   3.901646  0.2512908   2.865080
## 8    0.78  14.68065 0.4755388  11.105972   4.053904  0.2526954   2.940731
## 9    0.89  15.27929 0.4763514  11.613721   4.213198  0.2541284   3.034378
## 10   1.00  15.91469 0.4768201  12.148460   4.376757  0.2554424   3.142172
plot(ridge_model)

#R2 = 0.6190875 RMSE= 11.23778
ridge_predict <- predict(ridge_model, newdata = test_df)
ridge_r2 <- R2(ridge_predict, test_perme)
ridge_rmse <- RMSE(ridge_predict, test_perme)
set.seed(547)
#Lasso regression
lassoGrid <- expand.grid(alpha = 1, lambda = seq(0.0001, 0.5, by = 0.05))
lasso_model <- train(
  x = train_df,
  y = train_perme,
  method = "glmnet",
  tuneGrid = lassoGrid,
  trControl = control,
  preProcess = c("center", "scale")
  )

#Optimal lambda= 0.4501, R2 = 0.4648473 RMSE=11.50082
lasso_model$bestTune
##    alpha lambda
## 10     1 0.4501
lasso_model$results
##    alpha lambda     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      1 0.0001 13.14192 0.3833492 9.614772 3.491374  0.2577702 2.696687
## 2      1 0.0501 13.14192 0.3833492 9.614772 3.491374  0.2577702 2.696687
## 3      1 0.1001 12.93078 0.3905100 9.434135 3.486369  0.2590893 2.726631
## 4      1 0.1501 12.28544 0.4131745 8.885215 3.289555  0.2540865 2.650967
## 5      1 0.2001 12.00553 0.4295871 8.702617 3.160199  0.2464232 2.484934
## 6      1 0.2501 11.80192 0.4434763 8.612644 3.100578  0.2408734 2.325351
## 7      1 0.3001 11.66372 0.4524235 8.563904 3.064874  0.2369920 2.225049
## 8      1 0.3501 11.56398 0.4595726 8.502934 3.043957  0.2336054 2.139557
## 9      1 0.4001 11.50162 0.4641904 8.433555 3.055504  0.2331719 2.084366
## 10     1 0.4501 11.50082 0.4648473 8.395057 3.074767  0.2342845 2.046224
plot(lasso_model)

#Compute R2 = 0.6479493 RMSE= 10.72072
lasso_predict <- predict(lasso_model, newdata = test_df)
lasso_r2 <- R2(lasso_predict, test_perme)
lasso_rmse <- RMSE(lasso_predict, test_perme)
set.seed(547)
#Elastic net
enetGrid <-  expand.grid(alpha = seq(0.1, 1, by = 0.1), lambda = seq(0.01, 0.5, by = 0.05))

enet_model <- train(
  x = train_df,
  y = train_perme,
  method = "glmnet",
  tuneGrid = enetGrid,
  trControl = control,
  preProcess = c("center", "scale")
  )

#Optimal alpha=0.9, lambda= 0.46, R2 = 0.466719 RMSE= 11.49296
enet_model$bestTune
##    alpha lambda
## 90   0.9   0.46
enet_model$results[enet_model$results$alpha == enet_model$bestTune$alpha &
                     enet_model$results$lambda == enet_model$bestTune$lambda, ]
##    alpha lambda     RMSE Rsquared     MAE   RMSESD RsquaredSD    MAESD
## 90   0.9   0.46 11.49296 0.466719 8.40265 3.077127  0.2338012 2.072827
plot(enet_model)

#Compute R2 = 0.6679718 RMSE= 10.48236
enet_predict <- predict(enet_model, newdata = test_df)
enet_r2 <- R2(enet_predict, test_perme)
enet_rmse <- RMSE(enet_predict, test_perme)
results <- data.frame(
  Model = c("PLS", "Ridge", "Lasso", "Elastic Net"),
  Test_R2 = c(pls_r2, ridge_r2, lasso_r2, enet_r2),
  Test_RMSE = c(pls_rmse, ridge_rmse, lasso_rmse, enet_rmse)
)
results
##         Model   Test_R2 Test_RMSE
## 1         PLS 0.6318318  10.74333
## 2       Ridge 0.6190875  11.23778
## 3       Lasso 0.6479493  10.72072
## 4 Elastic Net 0.6679718  10.48236

f.

Would you recommend any of your models to replace the permeability laboratory experiment?

Based on the performance metrics from these models, we would be cautious about recommending any of them as a full replacement for the permeability laboratory experiment. Several factors should be weighed before reaching a final recommendation:

The Elastic Net model has the best accuracy and predictive power on the test set, with an RMSE of 10.48 and an R^2 of 0.668. This is reasonably strong predictive capability, but there is still substantial room for error. Laboratory assays typically provide a higher level of precision and reliability, so the model's error rate must be weighed carefully in the context of practical application. The Elastic Net model strikes a balance between Ridge and Lasso regularization, performing some feature selection and keeping the model interpretable. However, because it relies on statistical correlations rather than physical measurements, it may miss some underlying subtleties. If consistent prediction accuracy is critical, relying exclusively on the model poses risks, particularly when new or unusual compounds appear.

If the primary goal is to lower costs while increasing efficiency, the model could be used as a supplement to the laboratory experiment rather than a complete replacement. For example, it might be used to pre-screen compounds and prioritize those that warrant laboratory testing, minimizing the number of physical trials required.

Before replacing laboratory experiments entirely, the model should be validated on a wider range of data, ideally under a variety of conditions and environments. Additional validation steps, such as testing on previously unseen compounds or in-field samples, would increase confidence in the model's generalizability.

In conclusion, while the Elastic Net model shows promise and could supplement laboratory experiments, it is likely not precise enough to replace physical testing outright. We recommend using the model as a workflow support tool, reducing costs and increasing efficiency by identifying which compounds to prioritize for laboratory testing.

6.3.

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

a.

Start R and use these commands to load the data. The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.

#library(AppliedPredictiveModeling)
#Load data
data(ChemicalManufacturingProcess)
#Copy data
chemical_df <- ChemicalManufacturingProcess

b.

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

When checking the data, we found that 28 of the 57 predictors contained missing values, so missing data affected fewer than half of the predictors. ManufacturingProcess03 had the largest proportion of missing values (8.52%), followed by ManufacturingProcess11 (5.68%) and ManufacturingProcess10 (5.11%). The remaining affected manufacturing process predictors had between 0.57% and 2.84% missing. Missing values were concentrated in the manufacturing process predictors rather than the biological material predictors. This means the missing data predominantly affects the portion of the dataset that can be controlled or altered, which is relevant to the goal of optimizing these parameters to increase yield.

We selected KNN imputation (Section 3.8) for this dataset. Unlike mean imputation, which can oversimplify the data, KNN takes the relationships between observations into account. This is beneficial when there is an underlying similarity structure in the manufacturing runs, since missing values can then be estimated from similar rows. KNN adapts to both the pattern and the magnitude of neighboring data points, which can produce more realistic imputed values. Because only a small percentage of values are missing, the computational cost of KNN is manageable for this dataset, even with 57 predictors.

To perform the imputation, we use the caret package's preProcess function with method = "knnImpute". By default, caret uses a reasonable number of neighbors, although this can be changed if necessary; to keep things simple, we use the default here. Note that knnImpute also centers and scales the predictors (distances are computed on standardized data), which is why the imputed data appear standardized in the structure listings shown later.

#Calculate the percentage of NAs
na_summary <- data.frame(predictor = colnames(chemical_df), 
                         na_percent = colSums(is.na(chemical_df)) / nrow(chemical_df) * 100) %>%
              filter(na_percent > 0) %>%
              arrange(desc(na_percent))
#Fix row names
rownames(na_summary) <- NULL
na_summary
##                 predictor na_percent
## 1  ManufacturingProcess03  8.5227273
## 2  ManufacturingProcess11  5.6818182
## 3  ManufacturingProcess10  5.1136364
## 4  ManufacturingProcess25  2.8409091
## 5  ManufacturingProcess26  2.8409091
## 6  ManufacturingProcess27  2.8409091
## 7  ManufacturingProcess28  2.8409091
## 8  ManufacturingProcess29  2.8409091
## 9  ManufacturingProcess30  2.8409091
## 10 ManufacturingProcess31  2.8409091
## 11 ManufacturingProcess33  2.8409091
## 12 ManufacturingProcess34  2.8409091
## 13 ManufacturingProcess35  2.8409091
## 14 ManufacturingProcess36  2.8409091
## 15 ManufacturingProcess02  1.7045455
## 16 ManufacturingProcess06  1.1363636
## 17 ManufacturingProcess01  0.5681818
## 18 ManufacturingProcess04  0.5681818
## 19 ManufacturingProcess05  0.5681818
## 20 ManufacturingProcess07  0.5681818
## 21 ManufacturingProcess08  0.5681818
## 22 ManufacturingProcess12  0.5681818
## 23 ManufacturingProcess14  0.5681818
## 24 ManufacturingProcess22  0.5681818
## 25 ManufacturingProcess23  0.5681818
## 26 ManufacturingProcess24  0.5681818
## 27 ManufacturingProcess40  0.5681818
## 28 ManufacturingProcess41  0.5681818
#KNN imputation
preProcess_knn <- preProcess(chemical_df, method = 'knnImpute')
chemical_df <- predict(preProcess_knn, newdata = chemical_df)

#Check for remaining NAs
sum(is.na(chemical_df))
## [1] 0
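
If a different neighborhood size is desired, it can be set explicitly; a minimal sketch, assuming k = 5 matches caret's default for knnImpute:

#Explicit neighbor count for KNN imputation (k = 5 is caret's documented default)
preProcess_knn5 <- preProcess(ChemicalManufacturingProcess, method = "knnImpute", k = 5)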

c. 

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

First, we used the nearZeroVar function to filter out predictors with very little variation. This step reduced the dataset from 57 predictors to 56 (the ncol output below reports 57 columns because it includes the Yield response). Next, we split the data 80/20 into training and testing sets, with 144 observations in the training set and 32 in the test set. Before fitting the model, we centered and scaled the predictors so that all variables are on the same scale. This is especially important for Elastic Net regularization, which combines L1 and L2 penalties. We tuned the Elastic Net model with three repeats of 10-fold cross-validation; this repeated CV strategy reduces the variance of the performance estimates, giving a more reliable measure of model performance. A grid search over alpha and lambda gave best hyperparameters alpha = 0.1 and lambda = 0.31, with a cross-validated R^2 of 0.6207435 and an RMSE of 0.6188371 on the training set.

#Filter out predictors with low frequencies using nearZeroVar()
nzv <- nearZeroVar(chemical_df, saveMetrics = TRUE)
chemical_df <- chemical_df[, !nzv$nzv]

#56 predictors left after nearZeroVar (ncol counts 57 because it includes Yield)
ncol(chemical_df)
## [1] 57
#Split data
set.seed(547) 
trainidx <- createDataPartition(chemical_df$Yield, p = 0.8, list = FALSE)
train_df <- chemical_df[trainidx, ]
test_df <- chemical_df[-trainidx, ]
str(train_df)
## 'data.frame':    144 obs. of  57 variables:
##  $ Yield                 : num  1.226 1.004 0.674 1.253 1.839 ...
##  $ BiologicalMaterial01  : num  2.239 2.239 2.239 1.483 -0.408 ...
##  $ BiologicalMaterial02  : num  1.309 1.309 1.309 1.894 0.662 ...
##  $ BiologicalMaterial03  : num  -0.0562 -0.0562 -0.0562 1.1359 -0.5986 ...
##  $ BiologicalMaterial04  : num  1.296 1.296 1.296 0.941 1.589 ...
##  $ BiologicalMaterial05  : num  0.413 0.413 0.413 -0.373 1.731 ...
##  $ BiologicalMaterial06  : num  1.129 1.129 1.129 1.535 0.619 ...
##  $ BiologicalMaterial08  : num  2.28 2.28 2.28 1.07 1.19 ...
##  $ BiologicalMaterial09  : num  -0.723 -0.723 -0.723 -0.121 -1.734 ...
##  $ BiologicalMaterial10  : num  1.101 1.101 1.101 0.416 1.635 ...
##  $ BiologicalMaterial11  : num  1.393 1.393 1.393 0.136 1.022 ...
##  $ BiologicalMaterial12  : num  1.099 1.099 1.099 1.099 0.724 ...
##  $ ManufacturingProcess01: num  -6.15 -6.15 -6.15 -0.278 0.435 ...
##  $ ManufacturingProcess02: num  -1.97 -1.97 -1.97 -1.97 -1.97 ...
##  $ ManufacturingProcess03: num  0.198 0.109 0.466 0.109 0.555 ...
##  $ ManufacturingProcess04: num  -2.37 -3.16 -3.32 -2.21 -1.25 ...
##  $ ManufacturingProcess05: num  0.9993 0.0625 0.4228 0.8454 0.4949 ...
##  $ ManufacturingProcess06: num  0.963 -0.112 2.185 -0.63 0.555 ...
##  $ ManufacturingProcess07: num  -0.958 1.038 -0.958 1.038 1.038 ...
##  $ ManufacturingProcess08: num  0.894 0.894 -1.112 0.894 0.894 ...
##  $ ManufacturingProcess09: num  0.588 -0.382 -0.479 -0.453 -0.22 ...
##  $ ManufacturingProcess10: num  0.523 0.3143 -0.0248 -0.39 0.2882 ...
##  $ ManufacturingProcess11: num  1.082 0.551 0.803 0.104 1.417 ...
##  $ ManufacturingProcess12: num  -0.481 -0.481 -0.481 -0.481 -0.481 ...
##  $ ManufacturingProcess13: num  -0.5003 0.2877 0.2877 0.0907 -0.5003 ...
##  $ ManufacturingProcess14: num  0.278 0.443 0.791 2.533 2.405 ...
##  $ ManufacturingProcess15: num  0.962 0.825 1.082 3.328 3.14 ...
##  $ ManufacturingProcess16: num  0.146 0.146 0.197 0.475 0.626 ...
##  $ ManufacturingProcess17: num  -0.275 0.366 0.366 -0.356 -0.756 ...
##  $ ManufacturingProcess18: num  0.156 0.183 0.17 0.208 0.142 ...
##  $ ManufacturingProcess19: num  1.51 1.093 0.983 1.619 1.904 ...
##  $ ManufacturingProcess20: num  0.185 0.185 0.156 0.294 0.4 ...
##  $ ManufacturingProcess21: num  0.211 0.211 0.211 -0.688 -0.56 ...
##  $ ManufacturingProcess22: num  -0.722 -0.422 -0.122 0.779 1.079 ...
##  $ ManufacturingProcess23: num  -1.815 -1.213 -0.612 0.591 -1.213 ...
##  $ ManufacturingProcess24: num  -1.006 -0.834 -0.661 1.58 -1.351 ...
##  $ ManufacturingProcess25: num  0.109 0.184 0.171 0.273 0.115 ...
##  $ ManufacturingProcess26: num  0.197 0.216 0.205 0.291 0.242 ...
##  $ ManufacturingProcess27: num  0.191 0.21 0.191 0.343 0.352 ...
##  $ ManufacturingProcess28: num  0.878 0.859 0.859 0.897 0.916 ...
##  $ ManufacturingProcess29: num  0.835 0.775 0.775 0.955 1.015 ...
##  $ ManufacturingProcess30: num  0.757 0.244 0.244 -0.165 0.962 ...
##  $ ManufacturingProcess31: num  -0.267 -0.159 -0.159 -0.141 -0.357 ...
##  $ ManufacturingProcess32: num  1.95 2.69 2.32 2.32 2.69 ...
##  $ ManufacturingProcess33: num  0.989 0.989 1.794 2.6 2.6 ...
##  $ ManufacturingProcess34: num  1.957 1.957 0.118 0.118 0.118 ...
##  $ ManufacturingProcess35: num  1.1464 1.2388 0.0373 -2.5506 -0.5173 ...
##  $ ManufacturingProcess36: num  -0.656 -1.8 -1.8 -2.944 -1.8 ...
##  $ ManufacturingProcess37: num  2.216 -0.705 0.419 -1.828 -1.379 ...
##  $ ManufacturingProcess38: num  -0.822 -0.822 -0.822 -0.822 -0.822 ...
##  $ ManufacturingProcess39: num  0.232 0.232 0.232 0.298 0.232 ...
##  $ ManufacturingProcess40: num  2.149 -0.463 -0.463 -0.463 -0.463 ...
##  $ ManufacturingProcess41: num  2.346 -0.441 -0.441 -0.441 -0.441 ...
##  $ ManufacturingProcess42: num  -0.0547 0.4088 -0.3122 -0.1062 0.1513 ...
##  $ ManufacturingProcess43: num  -0.0137 0.1015 0.2167 0.2167 1.484 ...
##  $ ManufacturingProcess44: num  0.2947 -0.0159 -0.0159 -0.3264 -0.0159 ...
##  $ ManufacturingProcess45: num  0.1522 0.398 -0.0936 -0.0936 -0.3393 ...
str(test_df)
## 'data.frame':    32 obs. of  57 variables:
##  $ Yield                 : num  -1.179 0.387 0.284 -0.99 0.918 ...
##  $ BiologicalMaterial01  : num  -0.226 2.015 0.236 0.166 1.007 ...
##  $ BiologicalMaterial02  : num  -1.514 0.627 0.667 1.485 1.413 ...
##  $ BiologicalMaterial03  : num  -2.683 -0.189 -0.134 2.361 1.598 ...
##  $ BiologicalMaterial04  : num  0.2202 1.7078 -0.0728 -0.3039 0.5188 ...
##  $ BiologicalMaterial05  : num  0.4942 1.2262 -0.0752 1.3672 1.2805 ...
##  $ BiologicalMaterial06  : num  -1.383 0.515 0.678 2.795 1.879 ...
##  $ BiologicalMaterial08  : num  -1.233 1.5 1.071 -0.273 0.126 ...
##  $ BiologicalMaterial09  : num  -3.3963 -0.3614 -0.0483 -0.0483 -0.53 ...
##  $ BiologicalMaterial10  : num  1.101 2.019 -0.185 -1.103 -0.318 ...
##  $ BiologicalMaterial11  : num  -1.839 1.217 0.319 1.578 0.84 ...
##  $ BiologicalMaterial12  : num  -1.771 0.647 1.332 2.56 0.776 ...
##  $ ManufacturingProcess01: num  0.21541 -0.05895 -0.05895 -0.44305 -0.00408 ...
##  $ ManufacturingProcess02: num  0.566 -1.969 -1.969 -1.969 -1.969 ...
##  $ ManufacturingProcess03: num  0.3766 0.9123 -0.0699 -0.8735 -0.4271 ...
##  $ ManufacturingProcess04: num  0.566 -0.614 0.821 -1.729 -0.933 ...
##  $ ManufacturingProcess05: num  -0.4459 0.8454 -0.1308 -0.0194 0.7307 ...
##  $ ManufacturingProcess06: num  -0.541 -0.556 0.814 -0.704 0.296 ...
##  $ ManufacturingProcess07: num  -0.16 -0.958 -0.958 1.038 1.038 ...
##  $ ManufacturingProcess08: num  -0.31 -1.112 -1.112 0.894 0.894 ...
##  $ ManufacturingProcess09: num  -1.7202 -0.0777 0.4978 -0.4592 0.9182 ...
##  $ ManufacturingProcess10: num  -0.077 -0.234 0.706 0.288 1.984 ...
##  $ ManufacturingProcess11: num  -0.0916 0.8585 -0.5946 -0.5387 0.1599 ...
##  $ ManufacturingProcess12: num  -0.481 -0.481 -0.481 -0.481 -0.481 ...
##  $ ManufacturingProcess13: num  0.977 1.273 -1.584 0.78 -0.993 ...
##  $ ManufacturingProcess14: num  0.809 1.616 -0.717 -0.419 -1.795 ...
##  $ ManufacturingProcess15: num  1.185 2.368 -0.633 -0.376 -1.148 ...
##  $ ManufacturingProcess16: num  0.3304 0.4527 -12.9822 -0.0279 -0.1444 ...
##  $ ManufacturingProcess17: num  0.926 1.167 1.727 0.766 -0.275 ...
##  $ ManufacturingProcess18: num  0.151 0.131 0.251 0.153 0.121 ...
##  $ ManufacturingProcess19: num  0.45638 1.57533 2.01413 -0.00436 0.89518 ...
##  $ ManufacturingProcess20: num  0.311 0.2852 0.1849 0.0474 0.1248 ...
##  $ ManufacturingProcess21: num  0.211 0.211 4.836 0.211 0.853 ...
##  $ ManufacturingProcess22: num  0.0583 -1.3228 0.4787 0.1784 0.7789 ...
##  $ ManufacturingProcess23: num  0.8318 -1.2133 -0.0103 0.5912 -0.6118 ...
##  $ ManufacturingProcess24: num  0.891 -1.351 1.408 -0.834 -0.489 ...
##  $ ManufacturingProcess25: num  0.12 0.0852 0.2084 0.0932 0.037 ...
##  $ ManufacturingProcess26: num  0.1256 0.1816 0.2547 0.0503 0.0891 ...
##  $ ManufacturingProcess27: num  0.34604 0.27259 0.17936 0.00421 0.03246 ...
##  $ ManufacturingProcess28: num  0.783 0.859 0.878 0.764 0.821 ...
##  $ ManufacturingProcess29: num  0.5943 0.7746 0.5943 0.0534 0.2337 ...
##  $ ManufacturingProcess30: num  0.757 0.859 -0.268 -0.165 0.552 ...
##  $ ManufacturingProcess31: num  -0.1953 -0.2673 -0.0153 0.1287 -0.0513 ...
##  $ ManufacturingProcess32: num  -0.457 1.581 0.284 -0.272 0.84 ...
##  $ ManufacturingProcess33: num  0.989 1.794 0.989 0.184 0.989 ...
##  $ ManufacturingProcess34: num  -1.72 0.118 -1.72 0.118 0.118 ...
##  $ ManufacturingProcess35: num  -0.8869 -0.0551 -0.3324 0.7767 -0.8869 ...
##  $ ManufacturingProcess36: num  -0.656 -0.656 -0.656 0.488 -0.656 ...
##  $ ManufacturingProcess37: num  -1.154 0.643 1.093 -0.705 -0.705 ...
##  $ ManufacturingProcess38: num  0.717 0.717 0.717 0.717 -0.822 ...
##  $ ManufacturingProcess39: num  0.232 0.365 0.365 -0.034 0.165 ...
##  $ ManufacturingProcess40: num  0.0597 -0.4627 -0.4627 -0.4627 -0.4627 ...
##  $ ManufacturingProcess41: num  -0.069 -0.441 -0.441 -0.441 -0.441 ...
##  $ ManufacturingProcess42: num  0.203 0.151 0.203 0.203 0.203 ...
##  $ ManufacturingProcess43: num  2.406 0.217 -0.129 -0.475 -0.359 ...
##  $ ManufacturingProcess44: num  -0.0159 0.2947 0.2947 0.2947 0.6052 ...
##  $ ManufacturingProcess45: num  0.6437 0.398 0.6437 0.1522 -0.0936 ...
#Train Elastic Net
enetGrid <-  expand.grid(alpha = seq(0.1, 1, by = 0.1), lambda = seq(0.01, 0.5, by = 0.05))
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3) 
enet_model <- train(
  Yield ~ ., data = train_df,
  method = "glmnet",
  tuneGrid = enetGrid,
  trControl = control,
  metric = "RMSE",
  preProcess = c("center", "scale")
)
#Optimal alpha=0.1, lambda= 0.31, R2 =0.6207435 RMSE= 0.6188371
enet_model$bestTune
##   alpha lambda
## 7   0.1   0.31
enet_model$results[enet_model$results$alpha == enet_model$bestTune$alpha &
                     enet_model$results$lambda == enet_model$bestTune$lambda, ]
##   alpha lambda      RMSE  Rsquared       MAE    RMSESD RsquaredSD    MAESD
## 7   0.1   0.31 0.6188371 0.6207435 0.5123727 0.1257053   0.157034 0.100962
plot(enet_model)

d.

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

We evaluated the Elastic Net model on the test set by generating predictions and computing the key metrics. Test set: R^2 = 0.517, RMSE = 0.708. Training set (cross-validated): R^2 = 0.621, RMSE = 0.619. The test set RMSE (0.708) is somewhat higher than the cross-validated RMSE on the training set (0.619), as expected when the model faces unseen data. The test set R^2 (0.517) is correspondingly lower than the training R^2 (0.621), indicating a modest decrease in explained variance.

The model generalizes reasonably well from training to test data, with only a modest drop in performance. The relatively close match in RMSE and R^2 across the two sets suggests no substantial overfitting, making the model a viable choice for forecasting yield in similar manufacturing datasets.

#R2 = 0.5167401 RMSE= 0.7084928
enet_predict <- predict(enet_model, newdata = test_df)
enet_r2 <-  R2(enet_predict, test_df$Yield)
enet_rmse <- RMSE(enet_predict, test_df$Yield)
enet_r2
## [1] 0.5167401
enet_rmse
## [1] 0.7084928
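
For a side-by-side view, the resampled and test metrics can be collected into one table. A convenience sketch using the objects defined above; merge() picks out the bestTune row of the results table:

#Compare cross-validated training metrics with test-set metrics
best_row <- merge(enet_model$bestTune, enet_model$results)
data.frame(Set = c("Train (CV)", "Test"),
           RMSE = c(best_row$RMSE, enet_rmse),
           R2 = c(best_row$Rsquared, enet_r2))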

e.

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

According to the variable importance plot, the most influential predictors in the Elastic Net model are process predictors:

  • ManufacturingProcess32 is the strongest predictor, with an importance score of 0.23250, followed by ManufacturingProcess09 (0.15012) and ManufacturingProcess36 (0.12482). Other manufacturing process predictors near the top of the list include ManufacturingProcess17, ManufacturingProcess13, ManufacturingProcess34, ManufacturingProcess37, and ManufacturingProcess11.

  • BiologicalMaterial03 and BiologicalMaterial06 are the highest-ranked biological predictors, with importance scores of 0.08144 and 0.06183, respectively. They are nevertheless less important than the leading manufacturing predictors, and the remaining biological predictors appear lower on the list, with less impact on yield predictions.

Overall, manufacturing process predictors dominate the importance rankings, implying that changes to the manufacturing process may have a greater impact on yield than variation in the biological materials. This is convenient for the goal of optimizing yield, since the manufacturing process variables, unlike the biological measurements, can be adjusted and controlled during production.

#Variable importance
vif <- varImp(enet_model, scale = FALSE)
vif
## glmnet variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32 0.23250
## ManufacturingProcess09 0.15012
## ManufacturingProcess36 0.12482
## ManufacturingProcess17 0.11296
## ManufacturingProcess13 0.10107
## ManufacturingProcess34 0.10042
## BiologicalMaterial03   0.08144
## ManufacturingProcess37 0.07178
## BiologicalMaterial06   0.06183
## ManufacturingProcess11 0.05091
## ManufacturingProcess04 0.04849
## ManufacturingProcess43 0.04435
## BiologicalMaterial04   0.04276
## ManufacturingProcess15 0.04051
## BiologicalMaterial09   0.03989
## ManufacturingProcess06 0.03820
## ManufacturingProcess39 0.03802
## ManufacturingProcess33 0.03502
## BiologicalMaterial02   0.02918
## ManufacturingProcess40 0.02755
#Plot top 10 predictors' importance
plot(vif, top = 10)
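
For glmnet models, varImp() reflects the absolute values of the coefficients at the selected lambda, so the ranking can be reproduced from the fitted model directly. A minimal sketch under that assumption:

#Reproduce the importance ranking from the absolute coefficients at the
#selected lambda (intercept dropped)
coefs <- as.matrix(coef(enet_model$finalModel, s = enet_model$bestTune$lambda))
coef_df <- data.frame(predictor = rownames(coefs)[-1],
                      abs_coef = abs(coefs[-1, 1]))
head(coef_df[order(-coef_df$abs_coef), ], 10)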

f.

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

When we examine the relationships between the top ten predictors and Yield, we see varying degrees of positive and negative correlation.

ManufacturingProcess32: This predictor has a strong positive correlation with Yield (0.61). As the scatter plot shows, greater values correspond to higher yields. This suggests that raising or optimizing this predictor may improve yield, making it an important parameter to monitor and adjust during the manufacturing process.

ManufacturingProcess09: Another positively correlated predictor, with a 0.50 correlation with Yield. The increasing trend in the scatter plot shows that higher predictor values are generally associated with increased yield, so focusing on this variable may also help improve yields.

ManufacturingProcess36: This predictor has a negative correlation with Yield (-0.53). The scatter plot supports the inverse relationship, with higher predictor values associated with poorer yields. Reducing or adjusting its values may enhance yield.

ManufacturingProcess17: Like ManufacturingProcess36, it is negatively correlated with Yield (-0.43). Lowering its value may contribute to higher yield.

ManufacturingProcess13: This predictor has a correlation of -0.50 with Yield. Reducing its value during the production process may help to increase yield.

BiologicalMaterial03: This biological predictor has a positive correlation of 0.45 with Yield, indicating that higher raw material quality is associated with increased yield. Although this predictor cannot be adjusted during production, it can be used to assess raw material quality prior to processing, allowing selection of higher-quality inputs.

ManufacturingProcess34: This predictor has a weaker positive correlation with Yield (0.18). While not as influential as the others, minor changes to this variable may still lead to better outcomes, although the effect is likely small.

ManufacturingProcess37: This predictor shows a weak negative correlation with Yield (-0.16). While its impact on yield is minimal, monitoring it as part of the overall manufacturing process may aid in process improvement.

BiologicalMaterial06: With a correlation of 0.48, this biological material predictor has a favorable effect on yield. Like BiologicalMaterial03, it emphasizes the importance of raw material quality and can aid in selecting appropriate raw material batches.

ManufacturingProcess11: This predictor has a moderately positive correlation with Yield (0.35). Adjusting it could help improve yield, although its impact is likely smaller than that of top predictors such as ManufacturingProcess32 and ManufacturingProcess09.

In summary, increasing positively correlated predictors such as ManufacturingProcess32 and ManufacturingProcess09 may directly increase Yield; these predictors have substantial positive correlations with yield, making them important variables to regulate. Lowering negatively correlated predictors such as ManufacturingProcess36, ManufacturingProcess17, and ManufacturingProcess13 may prevent yield losses, as larger values of these predictors correspond to lower yields. The positive associations of BiologicalMaterial03 and BiologicalMaterial06 with Yield emphasize the importance of using high-quality raw materials. Although these predictors cannot be adjusted during the manufacturing process, they can serve as quality controls to ensure that only high-quality materials enter production, potentially leading to higher yields.

#Target vs top predictors
top_predictors <- rownames(vif$importance)[order(vif$importance$Overall, decreasing = TRUE)[1:10]]
for (predictor in top_predictors) {
  p <- ggplot(chemical_df, aes(x = .data[[predictor]], y = Yield)) +
    geom_point() +
    geom_smooth(method = "lm") +
    ggtitle(paste("Relationship between", predictor, "and Yield")) +
    theme_minimal()
  
  print(p)  
}

#Corr matrix
top_predictors <- c("Yield", "ManufacturingProcess32", "ManufacturingProcess09",
                    "ManufacturingProcess36", "ManufacturingProcess17",
                    "ManufacturingProcess13", "BiologicalMaterial03",
                    "ManufacturingProcess34", "ManufacturingProcess37",
                    "BiologicalMaterial06", "ManufacturingProcess11")

#Include only the top predictors and Yield
corr_data <- chemical_df[, top_predictors]

#Correlation matrix
correlation_matrix <- cor(corr_data, use = "complete.obs")

#Plot
corrplot(correlation_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, tl.cex = 0.5,
         addCoef.col = "black", number.cex = 0.5, diag = FALSE)
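
As a compact numeric complement to the correlation plot, the correlations with Yield can also be printed as a sorted vector (a small sketch using the correlation matrix computed above):

#Correlations of the top predictors with Yield, sorted; these are the values
#quoted in the discussion
sort(round(correlation_matrix["Yield", -1], 2), decreasing = TRUE)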