APM Chapter 6 HW: Linear Regression and its Cousins

Author

Alex Ptacek

library(tidyverse)
library(caret)
library(pls)
library(elasticnet)
library(lars)
library(ggpubr)
library(janitor)
library(VIM)

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  (a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

  (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Answer: 388 predictors remain.
# Convert the binary fingerprint matrix to a tibble (as_tibble() keeps
# the column names; tibble() would pack the data frame into a single column)
fingerprints <- fingerprints |> 
  as.data.frame() |> 
  as_tibble()

# nearZeroVar() returns the column positions of near-zero-variance predictors
cols_to_remove <- nearZeroVar(fingerprints)

fp_reduced <- fingerprints |> 
  select(-all_of(cols_to_remove))
fp_reduced |> print(n = 10)
# A tibble: 165 × 388
      X1    X2    X3    X4    X5    X6   X11   X12   X15   X16   X20   X21   X25
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     0     0     0     0     0     1     0     1     0     0     0     0     0
 2     0     0     0     0     0     0     0     1     0     0     0     0     0
 3     0     0     0     0     0     1     0     0     0     0     0     0     0
 4     0     0     0     0     0     0     0     1     0     0     0     0     0
 5     0     0     0     0     0     0     0     1     0     0     0     0     0
 6     0     0     0     0     0     0     0     1     0     0     0     0     0
 7     0     0     0     0     0     1     0     0     0     1     0     0     0
 8     0     0     0     0     0     0     0     0     0     0     0     0     0
 9     0     0     0     0     0     1     0     0     1     1     0     0     0
10     0     0     0     0     0     0     0     1     0     0     0     0     0
# ℹ 155 more rows
# ℹ 375 more variables: X26 <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>, X35 <dbl>,
#   X36 <dbl>, X37 <dbl>, X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>,
#   X42 <dbl>, X43 <dbl>, X44 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>,
#   X49 <dbl>, X50 <dbl>, X51 <dbl>, X52 <dbl>, X53 <dbl>, X54 <dbl>,
#   X55 <dbl>, X56 <dbl>, X57 <dbl>, X58 <dbl>, X59 <dbl>, X60 <dbl>,
#   X61 <dbl>, X62 <dbl>, X63 <dbl>, X64 <dbl>, X65 <dbl>, X66 <dbl>, …
  (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R-squared?
Answer: Based on the lowest RMSE, two latent variables are optimal, with a corresponding resampled R-squared estimate of 50.4%.
    1. Confirm there are no missing values.
na_counts <- sapply(fp_reduced, \(x) sum(is.na(x)))
na_counts[na_counts > 0]
named integer(0)
    2. Confirm all variables are numeric.
fp_reduced |> 
  select(-where(is.numeric))
# A tibble: 165 × 0
    3. Add the permeability measurements to the data and take a random sample (80% of rows) as the training set; the rows not in the training set become the test set. Lastly, separate the training set into predictor and outcome objects.
set.seed(624)

fp_reduced <- fp_reduced |> 
  mutate(index = row_number())

# permeability is a one-column matrix; flatten it to a plain numeric vector
perm <- tibble(permeability = as.numeric(permeability))

fp_reduced <- bind_cols(fp_reduced, perm)

fp_train <- slice_sample(fp_reduced, prop = 0.8)

fp_test <- fp_reduced |> 
  filter(!index %in% fp_train$index) |> 
  select(-index)

fp_train <- fp_train |> 
  select(-index)
fp_train_x <- fp_train |> 
  select(-permeability)

fp_train_y <- fp_train |> 
  pull(permeability) |> 
  as.vector()


fp_test_x <- fp_test |> 
  select(-permeability)

fp_test_y <- fp_test |> 
  select(permeability)
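As an alternative to slice_sample(), caret's createDataPartition() builds a split stratified on the outcome, which helps keep the permeability distribution similar across the train and test sets. A minimal sketch (the object names train_idx, fp_train_alt, and fp_test_alt are illustrative and not used below):

# Stratified 80/20 split; createDataPartition() returns training row indices
set.seed(624)
train_idx <- createDataPartition(fp_reduced$permeability, p = 0.8, list = FALSE)
fp_train_alt <- fp_reduced[train_idx, ]
fp_test_alt  <- fp_reduced[-train_idx, ]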
    4. Pre-process the data and tune the PLS model.
ctrl <- trainControl(method = "cv", number = 10)

pls_tune <- train(fp_train_x, fp_train_y,
              method = "pls",
              tuneLength = 20,
              trControl = ctrl,
              preProcess = c("center", "scale"))
pls_tune
Partial Least Squares 

132 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 117, 119, 118, 118, 119, 119, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE      
   1     13.17313  0.3918669  10.176406
   2     11.70682  0.5037310   8.347076
   3     11.98798  0.4725074   9.183246
   4     11.99041  0.4777708   9.029144
   5     11.99870  0.4953262   8.790618
   6     11.83241  0.5082347   8.800271
   7     11.75907  0.5119234   8.945394
   8     11.79184  0.5185900   9.050760
   9     11.78906  0.5177758   9.000040
  10     12.08232  0.4942047   9.213010
  11     11.91992  0.5006241   9.028373
  12     11.99656  0.4918541   9.103258
  13     12.22409  0.4724377   9.217928
  14     12.51284  0.4573563   9.338774
  15     12.69779  0.4505422   9.468876
  16     12.99417  0.4334697   9.794046
  17     13.39912  0.4146691  10.010684
  18     13.53110  0.4136754  10.086156
  19     13.60291  0.4107249  10.055617
  20     13.56404  0.4129781  10.040205

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 2.
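caret supplies a ggplot method for train objects, so the resampling profile behind this table can be plotted directly (a quick sketch):

# RMSE (y-axis) against number of PLS components; the minimum at
# ncomp = 2 is the selected model
ggplot(pls_tune) +
  xlab("Number of components")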
  (d) Predict the response for the test set. What is the test set estimate of R-squared?
Answer: The test set R-squared is 36%.
pls_prediction <- predict(pls_tune, fp_test_x) |> 
                    tibble() |> 
                    select(pred_y = 1)

pls_obs_pred_combined <- bind_cols(fp_test_y, pls_prediction) |> 
                          select(obs = 1, pred = 2)

defaultSummary(pls_obs_pred_combined)
      RMSE   Rsquared        MAE 
11.4496187  0.3603048  8.0747561 
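Equivalently, caret's postResample() computes the same three statistics straight from prediction and observation vectors:

# Same RMSE/R-squared/MAE as defaultSummary(), from bare vectors
postResample(pred = pls_obs_pred_combined$pred,
             obs = pls_obs_pred_combined$obs)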
  (e) Try building other models discussed in this chapter. Do any have better predictive performance?
Answer: No; the PLS model had the best test set RMSE and R-squared (see the comparison table after the elastic net results below).
    1. Ridge regression.
ridge_grid <- data.frame(.lambda = seq(0, .1, length = 15))

ridge_tune <- train(fp_train_x, fp_train_y,
              method = "ridge",
              tuneGrid =  ridge_grid,
              trControl = ctrl,
              preProcess = c("center", "scale"))
ridge_tune
Ridge Regression 

132 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 118, 120, 119, 118, 119, 118, ... 
Resampling results across tuning parameters:

  lambda       RMSE       Rsquared   MAE       
  0.000000000   12.48844  0.4825823    9.421076
  0.007142857   12.88678  0.4604007    9.718990
  0.014285714   12.43011  0.4780770    9.514511
  0.021428571  929.36106  0.4374912  658.970448
  0.028571429   12.03508  0.4882980    9.151273
  0.035714286   12.00627  0.4930783    9.208443
  0.042857143   11.88920  0.4990345    9.129327
  0.050000000   11.82976  0.5025129    9.092501
  0.057142857   11.79544  0.5039491    9.074007
  0.064285714   11.75420  0.5085057    9.073717
  0.071428571   11.70178  0.5103219    9.020144
  0.078571429   11.67533  0.5123691    8.999777
  0.085714286   11.68942  0.5129649    9.009750
  0.092857143   11.64184  0.5158789    8.970937
  0.100000000   11.61963  0.5177365    8.945842

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.1.
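Note that the selected lambda (0.1) sits on the upper boundary of the search grid, so the optimum may lie beyond it. A wider grid would be worth trying (a sketch only; the results above and below use the original grid):

# Extend the lambda search past the previous upper bound of 0.1
ridge_grid_wide <- data.frame(.lambda = seq(0, 0.3, length = 15))

ridge_tune_wide <- train(fp_train_x, fp_train_y,
                         method = "ridge",
                         tuneGrid = ridge_grid_wide,
                         trControl = ctrl,
                         preProcess = c("center", "scale"))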
ridge_prediction <- predict(ridge_tune, fp_test_x) |> 
                    tibble() |> 
                    select(pred_y = 1)

ridge_obs_pred_combined <- bind_cols(fp_test_y, ridge_prediction) |> 
                          select(obs = 1, pred = 2)

defaultSummary(ridge_obs_pred_combined)
      RMSE   Rsquared        MAE 
12.4378610  0.3063327  9.2056500 
    2. Elastic net.
enet_grid <- expand.grid(.lambda = c(0, 0.01, .1),
                         .fraction = seq(.05, 1, length = 20))

enet_tune <- train(fp_train_x, fp_train_y,
              method = "enet",
              tuneGrid =  enet_grid,
              trControl = ctrl,
              preProcess = c("center", "scale"))
enet_tune
Elasticnet 

132 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 120, 119, 118, 118, 119, 120, ... 
Resampling results across tuning parameters:

  lambda  fraction  RMSE      Rsquared   MAE      
  0.00    0.05      12.37385  0.4569798   9.417613
  0.00    0.10      11.91431  0.4738088   8.679379
  0.00    0.15      11.72760  0.4932896   8.542255
  0.00    0.20      11.66125  0.5035412   8.635545
  0.00    0.25      11.52741  0.5211643   8.567525
  0.00    0.30      11.45773  0.5290837   8.534863
  0.00    0.35      11.48754  0.5307170   8.546461
  0.00    0.40      11.52804  0.5318989   8.540397
  0.00    0.45      11.59546  0.5302517   8.554151
  0.00    0.50      11.70025  0.5232463   8.612933
  0.00    0.55      11.83881  0.5149630   8.746009
  0.00    0.60      11.93229  0.5073544   8.853257
  0.00    0.65      11.96399  0.5029738   8.924916
  0.00    0.70      12.02942  0.4986601   8.985680
  0.00    0.75      12.13891  0.4923525   9.060554
  0.00    0.80      12.26667  0.4851202   9.129794
  0.00    0.85      12.40945  0.4775497   9.203325
  0.00    0.90      12.54648  0.4702779   9.273636
  0.00    0.95      12.66711  0.4630492   9.350421
  0.00    1.00      12.76744  0.4578404   9.435308
  0.01    0.05      11.55750  0.5050067   8.370317
  0.01    0.10      11.55979  0.5099879   8.437194
  0.01    0.15      11.51050  0.5196834   8.443315
  0.01    0.20      11.49556  0.5200467   8.532668
  0.01    0.25      11.36869  0.5274342   8.447740
  0.01    0.30      11.34925  0.5264806   8.469522
  0.01    0.35      11.63120  0.5073801   8.675634
  0.01    0.40      12.05230  0.4829232   8.930921
  0.01    0.45      12.43543  0.4627624   9.208580
  0.01    0.50      12.66122  0.4525456   9.384468
  0.01    0.55      12.82485  0.4451750   9.520327
  0.01    0.60      12.91268  0.4414757   9.596970
  0.01    0.65      12.96731  0.4396219   9.644114
  0.01    0.70      13.03687  0.4378973   9.728637
  0.01    0.75      13.12608  0.4362157   9.838214
  0.01    0.80      13.18665  0.4359350   9.928200
  0.01    0.85      13.27696  0.4344610  10.034547
  0.01    0.90      13.40082  0.4307750  10.157060
  0.01    0.95      13.46249  0.4309503  10.220609
  0.01    1.00      13.52649  0.4302445  10.265326
  0.10    0.05      12.06060  0.4873934   9.144649
  0.10    0.10      11.46520  0.5104094   8.200918
  0.10    0.15      11.51042  0.5141440   8.247399
  0.10    0.20      11.50365  0.5219817   8.310136
  0.10    0.25      11.51787  0.5277021   8.355810
  0.10    0.30      11.53871  0.5286203   8.417547
  0.10    0.35      11.50920  0.5334116   8.433741
  0.10    0.40      11.49001  0.5376385   8.486825
  0.10    0.45      11.52953  0.5363681   8.560105
  0.10    0.50      11.59258  0.5338744   8.640788
  0.10    0.55      11.68753  0.5290837   8.750482
  0.10    0.60      11.79588  0.5231061   8.848746
  0.10    0.65      11.87551  0.5198949   8.931099
  0.10    0.70      11.94925  0.5174104   9.009174
  0.10    0.75      12.02625  0.5146236   9.083999
  0.10    0.80      12.08559  0.5131460   9.143843
  0.10    0.85      12.12816  0.5128736   9.196965
  0.10    0.90      12.15609  0.5140694   9.238551
  0.10    0.95      12.18073  0.5151279   9.265680
  0.10    1.00      12.19349  0.5164887   9.280302

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.3 and lambda = 0.01.
enet_prediction <- predict(enet_tune, fp_test_x) |> 
                    tibble() |> 
                    select(pred_y = 1)

enet_obs_pred_combined <- bind_cols(fp_test_y, enet_prediction) |> 
                          select(obs = 1, pred = 2)

defaultSummary(enet_obs_pred_combined)
      RMSE   Rsquared        MAE 
12.1777357  0.2924298  8.8549332 
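Collecting the test set metrics computed above into one table makes the comparison explicit (a convenience sketch using the *_obs_pred_combined tibbles created earlier):

# One row per model; PLS has the lowest RMSE and highest R-squared
bind_rows(
  pls   = defaultSummary(pls_obs_pred_combined),
  ridge = defaultSummary(ridge_obs_pred_combined),
  enet  = defaultSummary(enet_obs_pred_combined),
  .id = "model"
)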
  (f) Would you recommend any of your models to replace the permeability laboratory experiment?
Answer: I would recommend the PLS model because it has the best RMSE and R-squared. The diagnostic plots below show some notable issues with the model's interpretability (e.g., heteroscedasticity in the residual plot), which could potentially be addressed with more data pre-processing, but the model is still more accurate than the alternatives. Given that a modeled approach could save “significant resources”, I would recommend it.
# Drop the outcome and the helper index column before predicting
fp_full_x <- fp_reduced |> 
  select(-permeability, -index)

pls_prediction <- predict(pls_tune, fp_full_x)

pls_obs_vs_fitted <- tibble(obs = fp_reduced$permeability,
                            fitted = pls_prediction) |> 
                      mutate(resid = obs - fitted)  # residual = observed - fitted



pls_resid_plot <- ggplot(pls_obs_vs_fitted, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

pls_obs_vs_fitted_plot <- ggplot(pls_obs_vs_fitted, aes(x = fitted, y = obs)) +
  geom_point() +
  xlab("Fitted values") +
  ylab("Observed values")

ggarrange(pls_resid_plot, pls_obs_vs_fitted_plot) |> 
 annotate_figure(
    top = text_grob("PLS Diagnostic Plots", size = 14))

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

  (a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

chem_manu_process <- ChemicalManufacturingProcess |> 
  as_tibble() |> 
  clean_names()

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
    1. Examine columns with NA values.
na_counts <- sapply(chem_manu_process, \(x) sum(is.na(x)))

na_counts <- tibble(vars = names(na_counts),
       na_counts = as.integer(na_counts)) |> 
  filter(na_counts > 0)
na_counts |> print(n = 5)
# A tibble: 28 × 2
  vars                    na_counts
  <chr>                       <int>
1 manufacturing_process01         1
2 manufacturing_process02         3
3 manufacturing_process03        15
4 manufacturing_process04         1
5 manufacturing_process05         1
# ℹ 23 more rows
missing_chem <- chem_manu_process |> 
  select(all_of(na_counts$vars))
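VIM (loaded above) can also visualize the missingness pattern before imputing; aggr() shows the per-variable missing proportions and which variables tend to be missing together (a quick sketch):

# Left panel: proportion missing per variable; right panel: combinations
aggr(missing_chem, sortVars = TRUE, cex.axis = 0.5)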
    2. Find predictors with correlation of at least 0.99. These highly correlated predictors are candidates for linear-regression imputation.
correlations <- cor(chem_manu_process, use = "pairwise.complete.obs")

findCorrelation(correlations, cutoff = .99, names = TRUE)
[1] "manufacturing_process27" "manufacturing_process25"
[3] "manufacturing_process20"
    3. Visualize the highly correlated variables. Fitting a linear regression looks tricky because of the outliers, so we'll impute with kNN instead.
chem_manu_process |>
  ggplot(aes(x = manufacturing_process20, y = manufacturing_process25)) +
  geom_point()

chem_manu_process |>
  ggplot(aes(x = manufacturing_process20, y = manufacturing_process27)) +
  geom_point()

chem_manu_process |>
  ggplot(aes(x = manufacturing_process25, y = manufacturing_process27)) +
  geom_point()
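A numeric check confirms what the scatterplots suggest about these three predictors:

# Pairwise correlations among the flagged predictors (complete pairs only)
chem_manu_process |>
  select(manufacturing_process20, manufacturing_process25,
         manufacturing_process27) |>
  cor(use = "pairwise.complete.obs")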

    4. Use k-nearest neighbors (kNN) to impute.
# imp_var = FALSE suppresses the extra logical columns that flag imputed cells
imp_values <- kNN(missing_chem, imp_var = FALSE) |> as_tibble()

# Confirm no missing values remain after imputation
imp_values |> 
  filter(if_any(everything(), is.na))
# A tibble: 0 × 28
# ℹ 28 variables: manufacturing_process01 <dbl>, manufacturing_process02 <dbl>,
#   manufacturing_process03 <dbl>, manufacturing_process04 <dbl>,
#   manufacturing_process05 <dbl>, manufacturing_process06 <dbl>,
#   manufacturing_process07 <dbl>, manufacturing_process08 <dbl>,
#   manufacturing_process10 <dbl>, manufacturing_process11 <dbl>,
#   manufacturing_process12 <dbl>, manufacturing_process14 <dbl>,
#   manufacturing_process22 <dbl>, manufacturing_process23 <dbl>, …
    5. Add the imputed values back to the data.
chem_data_wip <- chem_manu_process |> 
  select(-all_of(na_counts$vars)) |> 
  bind_cols(imp_values)
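For comparison, caret's preProcess() can fold k-NN imputation into the modeling pipeline; note that method = "knnImpute" also centers and scales the predictors as a side effect (a sketch only; the results below use the VIM imputation above):

# Fit the imputation recipe on the predictors, then apply it; the
# output comes back centered and scaled
pp <- chem_manu_process |>
  select(-yield) |>
  preProcess(method = "knnImpute")
chem_imputed_scaled <- predict(pp, select(chem_manu_process, -yield))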
  (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Answer: Using a PLS model, the optimal number of components is 14, with a resampled R-squared of 61%.
    1. Split data into train and test sets.
chem_data_wip <- chem_data_wip |> 
  mutate(index = row_number())


chem_train <- slice_sample(chem_data_wip, prop = 0.8)

chem_test <- chem_data_wip |> 
  filter(!index %in% chem_train$index) |> 
  select(-index)

chem_train <- chem_train |> 
  select(-index)
chem_train_x <- chem_train |> 
  select(-yield)

chem_train_y <- chem_train |> 
  pull(yield) |> 
  as.vector()


chem_test_x <- chem_test |> 
  select(-yield)

chem_test_y <- chem_test |> 
  select(yield)
    2. Pre-process the data and tune a PLS model.
ctrl <- trainControl(method = "cv", number = 10)

pls_tune <- train(chem_train_x, chem_train_y,
                  method = "pls",
                  preProcess = c("center", "scale"),
                  trControl = ctrl,
                  tuneLength = 20)
pls_tune
Partial Least Squares 

140 samples
 57 predictor

Pre-processing: centered (57), scaled (57) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 127, 126, 126, 126, 127, 125, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE      
   1     1.482324  0.4338432  1.1360310
   2     1.720307  0.4793203  1.1778567
   3     1.313428  0.5635094  1.0317300
   4     1.642100  0.5776492  1.1363541
   5     1.906716  0.5323399  1.1861195
   6     1.872045  0.5417769  1.1740765
   7     1.652362  0.5706559  1.1172741
   8     1.600043  0.5694450  1.1094587
   9     1.473252  0.5838533  1.0766138
  10     1.452794  0.5900721  1.0770033
  11     1.518600  0.5868901  1.1061338
  12     1.386258  0.6070812  1.0550334
  13     1.233115  0.6208777  0.9955550
  14     1.206864  0.6113552  0.9874143
  15     1.282925  0.5839330  1.0072365
  16     1.393245  0.5540968  1.0682785
  17     1.444606  0.5436852  1.0857940
  18     1.605452  0.5439741  1.1196716
  19     1.807649  0.5413408  1.2037221
  20     2.121396  0.5306355  1.3074498

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 14.
  (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Answer: Both metrics degrade on the test set: RMSE rises from 1.21 (resampled) to 1.46, and R-squared falls from 61% to 50%.
# as_tibble() on a bare vector is deprecated; as_tibble_col() is the
# supported equivalent
pls_prediction <- predict(pls_tune, chem_test_x) |> 
                    as_tibble_col(column_name = "pred")

pls_obs_pred_combined <- bind_cols(chem_test_y, pls_prediction) |> 
                          select(obs = 1, pred = 2)

defaultSummary(pls_obs_pred_combined)
     RMSE  Rsquared       MAE 
1.4565301 0.5011956 1.2285159 
  (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Answer: I've summed the coefficients across components for each predictor to gauge its overall weight in the model. The resulting list leans toward process predictors: they take five of the top nine spots, including the top three.
pls_coeffs <- pls_tune[["finalModel"]][["coefficients"]] |> as.data.frame() |> 
  rownames_to_column() |> 
  as_tibble() |> 
  rename(predictor = rowname) |> 
  pivot_longer(cols = 2:15, names_to = "component", values_to = "coefficient")


top_preds_overall <- pls_coeffs |> 
  group_by(predictor) |> 
  summarise(sum_coeff = sum(coefficient)) |> 
  arrange(desc(sum_coeff)) |> 
  head(9)
top_preds_overall
# A tibble: 9 × 2
  predictor               sum_coeff
  <chr>                       <dbl>
1 manufacturing_process32      9.08
2 manufacturing_process09      8.10
3 manufacturing_process04      4.41
4 biological_material01        4.32
5 biological_material08        3.95
6 manufacturing_process39      3.16
7 manufacturing_process34      3.01
8 biological_material12        2.95
9 biological_material06        2.54
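One caveat of summing signed coefficients is that positive and negative loadings can cancel. As a cross-check, caret's varImp() gives a standard importance measure for PLS models, scaled 0 to 100:

# Built-in variable importance for the tuned PLS model, top 10 shown
varImp(pls_tune) |> plot(top = 10)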
  (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Answer: This information identifies the predictors with the strongest relationship to yield. Unlike the biological predictors, the process predictors can be changed, and they make up most of the top predictors. The company can therefore prioritize adjusting these top process predictors to improve yield.
pred_vs_yield_plots <- map(top_preds_overall$predictor, ~ {
  ggplot(chem_data_wip, aes(x = .data[[.x]], y = yield)) +
    geom_point()})

ggarrange(plotlist = pred_vs_yield_plots)
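To quantify the plotted relationships, the correlation of each top predictor with yield can be computed directly (illustrative; uses the imputed data set):

# Signed correlation with yield for each of the nine top predictors
top_preds_overall$predictor |>
  set_names() |>
  map_dbl(~ cor(chem_data_wip[[.x]], chem_data_wip$yield))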