library(tidyverse)
library(caret)
library(pls)
library(elasticnet)
library(lars)
library(ggpubr)
library(janitor)
library(VIM)

APM Chapter 6 HW: Linear Regression and its Cousins
6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
- Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
- The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Answer: 388 predictors remain.
fingerprints <- fingerprints |>
data.frame() |>
tibble()
cols_to_remove <- nearZeroVar(fingerprints)
fp_reduced <- fingerprints |>
select(-all_of(cols_to_remove)) |>
tibble()
fp_reduced |> print(n = 10)

# A tibble: 165 × 388
X1 X2 X3 X4 X5 X6 X11 X12 X15 X16 X20 X21 X25
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0 0 1 0 1 0 0 0 0 0
2 0 0 0 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 0 0 0
5 0 0 0 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 0 0 1 0 0 0 0 0
7 0 0 0 0 0 1 0 0 0 1 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 1 0 0 1 1 0 0 0
10 0 0 0 0 0 0 0 1 0 0 0 0 0
# ℹ 155 more rows
# ℹ 375 more variables: X26 <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>, X35 <dbl>,
# X36 <dbl>, X37 <dbl>, X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>,
# X42 <dbl>, X43 <dbl>, X44 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>,
# X49 <dbl>, X50 <dbl>, X51 <dbl>, X52 <dbl>, X53 <dbl>, X54 <dbl>,
# X55 <dbl>, X56 <dbl>, X57 <dbl>, X58 <dbl>, X59 <dbl>, X60 <dbl>,
# X61 <dbl>, X62 <dbl>, X63 <dbl>, X64 <dbl>, X65 <dbl>, X66 <dbl>, …
- Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R-squared?
Answer: Based on the lowest RMSE, two latent variables are optimal; the corresponding resampled estimate of R-squared is 50.4%.
- Confirmed there are no missing values.
na_counts <- sapply(fp_reduced, \(x) sum(is.na(x)))
na_counts[na_counts > 0]

named integer(0)
- Confirmed all variables are numeric.
fp_reduced |>
select(-where(is.numeric))

# A tibble: 165 × 0
- Add the permeability measurements to our data and take a random sample (80% of the data) as the training set. Filter to the rows not contained in the training set to get the test set. Lastly, separate each set into predictor and outcome objects.
set.seed(624)
fp_reduced <- fp_reduced |>
mutate(index = row_number())
perm <- tibble(permeability = permeability)
fp_reduced <- bind_cols(fp_reduced, perm)
fp_train <- slice_sample(fp_reduced, prop = 0.8)
fp_test <- fp_reduced |>
filter(!index %in% fp_train$index) |>
select(-index)
fp_train <- fp_train |>
select(-index)

fp_train_x <- fp_train |>
select(-permeability)
fp_train_y <- fp_train |>
pull(permeability) |>
as.vector()
fp_test_x <- fp_test |>
select(-permeability)
fp_test_y <- fp_test |>
select(permeability)

- Tune and Preprocess PLS Model.
ctrl <- trainControl(method = "cv", number = 10)
pls_tune <- train(fp_train_x, fp_train_y,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProcess = c("center", "scale"))
pls_tune

Partial Least Squares
132 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 117, 119, 118, 118, 119, 119, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 13.17313 0.3918669 10.176406
2 11.70682 0.5037310 8.347076
3 11.98798 0.4725074 9.183246
4 11.99041 0.4777708 9.029144
5 11.99870 0.4953262 8.790618
6 11.83241 0.5082347 8.800271
7 11.75907 0.5119234 8.945394
8 11.79184 0.5185900 9.050760
9 11.78906 0.5177758 9.000040
10 12.08232 0.4942047 9.213010
11 11.91992 0.5006241 9.028373
12 11.99656 0.4918541 9.103258
13 12.22409 0.4724377 9.217928
14 12.51284 0.4573563 9.338774
15 12.69779 0.4505422 9.468876
16 12.99417 0.4334697 9.794046
17 13.39912 0.4146691 10.010684
18 13.53110 0.4136754 10.086156
19 13.60291 0.4107249 10.055617
20 13.56404 0.4129781 10.040205
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 2.
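Rather than reading R-squared off the table, the resampled estimate for the chosen model can be pulled programmatically from the results caret stores on the train object (a minimal sketch using the pls_tune object above):
pls_tune$results |>
  filter(ncomp == pls_tune$bestTune$ncomp) |>  # row for the selected ncomp = 2
  select(ncomp, RMSE, Rsquared)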
- Predict the response for the test set. What is the test set estimate of R-squared?
Answer: The test set R-squared is 36%.
pls_prediction <- predict(pls_tune, fp_test_x) |>
tibble() |>
select(pred_y = 1)
pls_obs_pred_combined <- bind_cols(fp_test_y, pls_prediction) |>
select(obs = 1, pred = 2)
defaultSummary(pls_obs_pred_combined)

      RMSE  Rsquared       MAE
11.4496187 0.3603048 8.0747561
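An observed-versus-predicted plot complements these summary metrics; a sketch reusing the pls_obs_pred_combined tibble built above:
ggplot(pls_obs_pred_combined, aes(x = pred, y = obs)) +
  geom_point() +
  geom_abline(linetype = "dashed") +  # 45-degree line marks perfect predictions
  xlab("Predicted permeability") +
  ylab("Observed permeability")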
- Try building other models discussed in this chapter. Do any have better predictive performance?
Answer: No. On the test set the PLS model still had the best RMSE (11.45) and R-squared (36%); ridge (RMSE 12.44, R-squared 31%) and elastic net (RMSE 12.18, R-squared 29%) both performed worse.
- Ridge Regression
ridge_grid <- data.frame(.lambda = seq(0, .1, length = 15))
ridge_tune <- train(fp_train_x, fp_train_y,
method = "ridge",
tuneGrid = ridge_grid,
trControl = ctrl,
preProcess = c("center", "scale"))
ridge_tune

Ridge Regression
132 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 118, 120, 119, 118, 119, 118, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.000000000 12.48844 0.4825823 9.421076
0.007142857 12.88678 0.4604007 9.718990
0.014285714 12.43011 0.4780770 9.514511
0.021428571 929.36106 0.4374912 658.970448
0.028571429 12.03508 0.4882980 9.151273
0.035714286 12.00627 0.4930783 9.208443
0.042857143 11.88920 0.4990345 9.129327
0.050000000 11.82976 0.5025129 9.092501
0.057142857 11.79544 0.5039491 9.074007
0.064285714 11.75420 0.5085057 9.073717
0.071428571 11.70178 0.5103219 9.020144
0.078571429 11.67533 0.5123691 8.999777
0.085714286 11.68942 0.5129649 9.009750
0.092857143 11.64184 0.5158789 8.970937
0.100000000 11.61963 0.5177365 8.945842
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.1.
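Two caveats on this tune: the huge RMSE at lambda ≈ 0.021 likely reflects a numerically unstable resample rather than the true profile, and the selected lambda = 0.1 sits at the upper edge of the grid, so the optimum may lie beyond it. A wider grid would be worth checking (a sketch; the 0 to 0.3 range and the object names are illustrative):
ridge_grid_wide <- data.frame(.lambda = seq(0, 0.3, length = 15))  # arbitrary wider range
ridge_tune_wide <- train(fp_train_x, fp_train_y,
                         method = "ridge",
                         tuneGrid = ridge_grid_wide,
                         trControl = ctrl,
                         preProcess = c("center", "scale"))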
ridge_prediction <- predict(ridge_tune, fp_test_x) |>
tibble() |>
select(pred_y = 1)
ridge_obs_pred_combined <- bind_cols(fp_test_y, ridge_prediction) |>
select(obs = 1, pred = 2)
defaultSummary(ridge_obs_pred_combined)

      RMSE  Rsquared       MAE
12.4378610 0.3063327 9.2056500
- Elastic Net
enet_grid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20))
enet_tune <- train(fp_train_x, fp_train_y,
method = "enet",
tuneGrid = enet_grid,
trControl = ctrl,
preProcess = c("center", "scale"))
enet_tune

Elasticnet
132 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 119, 118, 118, 119, 120, ...
Resampling results across tuning parameters:
lambda fraction RMSE Rsquared MAE
0.00 0.05 12.37385 0.4569798 9.417613
0.00 0.10 11.91431 0.4738088 8.679379
0.00 0.15 11.72760 0.4932896 8.542255
0.00 0.20 11.66125 0.5035412 8.635545
0.00 0.25 11.52741 0.5211643 8.567525
0.00 0.30 11.45773 0.5290837 8.534863
0.00 0.35 11.48754 0.5307170 8.546461
0.00 0.40 11.52804 0.5318989 8.540397
0.00 0.45 11.59546 0.5302517 8.554151
0.00 0.50 11.70025 0.5232463 8.612933
0.00 0.55 11.83881 0.5149630 8.746009
0.00 0.60 11.93229 0.5073544 8.853257
0.00 0.65 11.96399 0.5029738 8.924916
0.00 0.70 12.02942 0.4986601 8.985680
0.00 0.75 12.13891 0.4923525 9.060554
0.00 0.80 12.26667 0.4851202 9.129794
0.00 0.85 12.40945 0.4775497 9.203325
0.00 0.90 12.54648 0.4702779 9.273636
0.00 0.95 12.66711 0.4630492 9.350421
0.00 1.00 12.76744 0.4578404 9.435308
0.01 0.05 11.55750 0.5050067 8.370317
0.01 0.10 11.55979 0.5099879 8.437194
0.01 0.15 11.51050 0.5196834 8.443315
0.01 0.20 11.49556 0.5200467 8.532668
0.01 0.25 11.36869 0.5274342 8.447740
0.01 0.30 11.34925 0.5264806 8.469522
0.01 0.35 11.63120 0.5073801 8.675634
0.01 0.40 12.05230 0.4829232 8.930921
0.01 0.45 12.43543 0.4627624 9.208580
0.01 0.50 12.66122 0.4525456 9.384468
0.01 0.55 12.82485 0.4451750 9.520327
0.01 0.60 12.91268 0.4414757 9.596970
0.01 0.65 12.96731 0.4396219 9.644114
0.01 0.70 13.03687 0.4378973 9.728637
0.01 0.75 13.12608 0.4362157 9.838214
0.01 0.80 13.18665 0.4359350 9.928200
0.01 0.85 13.27696 0.4344610 10.034547
0.01 0.90 13.40082 0.4307750 10.157060
0.01 0.95 13.46249 0.4309503 10.220609
0.01 1.00 13.52649 0.4302445 10.265326
0.10 0.05 12.06060 0.4873934 9.144649
0.10 0.10 11.46520 0.5104094 8.200918
0.10 0.15 11.51042 0.5141440 8.247399
0.10 0.20 11.50365 0.5219817 8.310136
0.10 0.25 11.51787 0.5277021 8.355810
0.10 0.30 11.53871 0.5286203 8.417547
0.10 0.35 11.50920 0.5334116 8.433741
0.10 0.40 11.49001 0.5376385 8.486825
0.10 0.45 11.52953 0.5363681 8.560105
0.10 0.50 11.59258 0.5338744 8.640788
0.10 0.55 11.68753 0.5290837 8.750482
0.10 0.60 11.79588 0.5231061 8.848746
0.10 0.65 11.87551 0.5198949 8.931099
0.10 0.70 11.94925 0.5174104 9.009174
0.10 0.75 12.02625 0.5146236 9.083999
0.10 0.80 12.08559 0.5131460 9.143843
0.10 0.85 12.12816 0.5128736 9.196965
0.10 0.90 12.15609 0.5140694 9.238551
0.10 0.95 12.18073 0.5151279 9.265680
0.10 1.00 12.19349 0.5164887 9.280302
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.3 and lambda = 0.01.
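The fraction/lambda trade-off is easier to see graphically than in the table; caret supplies a ggplot method for train objects:
ggplot(enet_tune)  # RMSE profile vs. fraction, one curve per lambda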
enet_prediction <- predict(enet_tune, fp_test_x) |>
tibble() |>
select(pred_y = 1)
enet_obs_pred_combined <- bind_cols(fp_test_y, enet_prediction) |>
select(obs = 1, pred = 2)
defaultSummary(enet_obs_pred_combined)

      RMSE  Rsquared       MAE
12.1777357 0.2924298 8.8549332
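Before answering, the three models can also be compared on their resampled results with caret's resamples() helper (a sketch; for a strictly fair comparison the models would share fold assignments via trainControl's index argument, which was not set here):
model_comparison <- resamples(list(pls = pls_tune,     # illustrative object name
                                   ridge = ridge_tune,
                                   enet = enet_tune))
summary(model_comparison)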
- Would you recommend any of your models to replace the permeability laboratory experiment?
Answer: I would recommend the PLS model because it has the best RMSE and R-squared. The diagnostic plots show some real issues with the interpretability of the model (e.g. heteroscedasticity in the residual plot), which could potentially be fixed with more data pre-processing, but the model is still more accurate than the alternatives. Considering a modelled approach could save “significant resources”, I would recommend this model.
fp_full_x <- fp_reduced |> select(-permeability)
fp_full_y <- fp_reduced |> pull(permeability) |> as_vector()

pls_prediction <- predict(pls_tune, fp_full_x)
pls_obs_vs_fitted <- tibble(obs = fp_reduced$permeability,
fitted = pls_prediction) |>
mutate(resid = fitted - obs)
pls_resid_plot <- ggplot(pls_obs_vs_fitted, aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
pls_obs_vs_fitted_plot <- ggplot(pls_obs_vs_fitted, aes(x = fitted, y = obs)) +
geom_point() +
xlab("Fitted values") +
ylab("Observed values")
ggarrange(pls_resid_plot, pls_obs_vs_fitted_plot) |>
annotate_figure(
top = text_grob("PLS Diagnostic Plots", size = 14))

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
- Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
chem_manu_process <- ChemicalManufacturingProcess |>
as_tibble() |>
clean_names()

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
- A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
- Examine columns with NA values.
na_counts <- sapply(chem_manu_process, \(x) sum(is.na(x)))
na_counts <- tibble(vars = names(na_counts),
na_counts = as.integer(na_counts)) |>
filter(na_counts > 0)
na_counts |> print(n = 5)

# A tibble: 28 × 2
vars na_counts
<chr> <int>
1 manufacturing_process01 1
2 manufacturing_process02 3
3 manufacturing_process03 15
4 manufacturing_process04 1
5 manufacturing_process05 1
# ℹ 23 more rows
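To verify that this really is a small percentage of cells, the overall missingness can be quantified directly (a quick sketch):
mean(is.na(chem_manu_process)) * 100  # percent of all cells that are NA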
missing_chem <- chem_manu_process |>
select(all_of(na_counts$vars))

- Find predictors with correlation of 0.99 or higher. These highly correlated predictors are candidates for a linear-regression imputation.
correlations <- cor(chem_manu_process, use = "pairwise.complete.obs")
findCorrelation(correlations, cutoff = .99, names = TRUE)

[1] "manufacturing_process27" "manufacturing_process25"
[3] "manufacturing_process20"
- Visualize the variables with high correlation. Outliers would make a linear-regression fit tricky, so we’ll impute with KNN instead.
chem_manu_process |>
ggplot(aes(x = manufacturing_process20, y = manufacturing_process25)) +
geom_point()

chem_manu_process |>
ggplot(aes(x = manufacturing_process20, y = manufacturing_process27)) +
geom_point()

chem_manu_process |>
ggplot(aes(x = manufacturing_process25, y = manufacturing_process27)) +
geom_point()

- Use K-Nearest Neighbors (KNN) to Impute.
imp_values <- kNN(missing_chem, imp_var = FALSE) |> as_tibble()
imp_values |> as_tibble() |>
filter(if_any(everything(), is.na))

# A tibble: 0 × 28
# ℹ 28 variables: manufacturing_process01 <dbl>, manufacturing_process02 <dbl>,
# manufacturing_process03 <dbl>, manufacturing_process04 <dbl>,
# manufacturing_process05 <dbl>, manufacturing_process06 <dbl>,
# manufacturing_process07 <dbl>, manufacturing_process08 <dbl>,
# manufacturing_process10 <dbl>, manufacturing_process11 <dbl>,
# manufacturing_process12 <dbl>, manufacturing_process14 <dbl>,
# manufacturing_process22 <dbl>, manufacturing_process23 <dbl>, …
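VIM's kNN() uses k = 5 neighbors by default (with a Gower-type distance). If the imputation seemed sensitive to that choice, k could be varied; a sketch where k = 10 and the object name are arbitrary:
imp_values_k10 <- kNN(missing_chem, k = 10, imp_var = FALSE) |> as_tibble()  # illustrative k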
- Add imputed values to data.
chem_data_wip <- chem_manu_process |>
select(-all_of(na_counts$vars)) |>
bind_cols(imp_values)

- Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Answer: Using a PLS model, the optimal number of components is 14 and the resampled R-squared is 61%.
- Split data into train and test sets.
chem_data_wip <- chem_data_wip |>
mutate(index = row_number())
chem_train <- slice_sample(chem_data_wip, prop = 0.8)
chem_test <- chem_data_wip |>
filter(!index %in% chem_train$index) |>
select(-index)
chem_train <- chem_train |>
select(-index)

chem_train_x <- chem_train |>
select(-yield)
chem_train_y <- chem_train |>
pull(yield) |>
as.vector()
chem_test_x <- chem_test |>
select(-yield)
chem_test_y <- chem_test |>
select(yield)

- Preprocess and tune a model.
ctrl <- trainControl(method = "cv", number = 10)
pls_tune <- train(chem_train_x, chem_train_y,
method = "pls",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneLength = 20)
pls_tune

Partial Least Squares
140 samples
57 predictor
Pre-processing: centered (57), scaled (57)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 127, 126, 126, 126, 127, 125, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 1.482324 0.4338432 1.1360310
2 1.720307 0.4793203 1.1778567
3 1.313428 0.5635094 1.0317300
4 1.642100 0.5776492 1.1363541
5 1.906716 0.5323399 1.1861195
6 1.872045 0.5417769 1.1740765
7 1.652362 0.5706559 1.1172741
8 1.600043 0.5694450 1.1094587
9 1.473252 0.5838533 1.0766138
10 1.452794 0.5900721 1.0770033
11 1.518600 0.5868901 1.1061338
12 1.386258 0.6070812 1.0550334
13 1.233115 0.6208777 0.9955550
14 1.206864 0.6113552 0.9874143
15 1.282925 0.5839330 1.0072365
16 1.393245 0.5540968 1.0682785
17 1.444606 0.5436852 1.0857940
18 1.605452 0.5439741 1.1196716
19 1.807649 0.5413408 1.2037221
20 2.121396 0.5306355 1.3074498
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 14.
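The RMSE profile is fairly bumpy and ncomp = 14 beats ncomp = 13 only narrowly, so a more parsimonious model might generalize just as well. caret can apply the one-standard-error rule instead of the minimum via trainControl's selectionFunction argument (a sketch; ctrl_one_se is a new, illustrative object and only the selection rule changes):
ctrl_one_se <- trainControl(method = "cv", number = 10,
                            selectionFunction = "oneSE")  # prefer simpler models within one SE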
- Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Answer: Both metrics are worse on the test set: RMSE rises from the resampled 1.21 to 1.46, and R-squared falls from 61% to 50%.
pls_prediction <- predict(pls_tune, chem_test_x) |>
as_tibble()
pls_obs_pred_combined <- bind_cols(chem_test_y, pls_prediction) |>
select(obs = 1, pred = 2)
defaultSummary(pls_obs_pred_combined)

     RMSE  Rsquared       MAE
1.4565301 0.5011956 1.2285159
- Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Answer: I’ve summed the coefficients for each predictor across all components to see which carry the most overall weight in the model. The list is mostly dominated by process predictors.
pls_coeffs <- pls_tune[["finalModel"]][["coefficients"]] |> as.data.frame() |>
rownames_to_column() |>
as_tibble() |>
rename(predictor = rowname) |>
pivot_longer(cols = 2:15, names_to = "component", values_to = "coefficient")
top_preds_overall <- pls_coeffs |>
group_by(predictor) |>
summarise(sum_coeff = sum(coefficient)) |>
arrange(desc(sum_coeff)) |>
head(9)
top_preds_overall

# A tibble: 9 × 2
predictor sum_coeff
<chr> <dbl>
1 manufacturing_process32 9.08
2 manufacturing_process09 8.10
3 manufacturing_process04 4.41
4 biological_material01 4.32
5 biological_material08 3.95
6 manufacturing_process39 3.16
7 manufacturing_process34 3.01
8 biological_material12 2.95
9 biological_material06 2.54
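One caveat on this ranking: summing signed coefficients lets predictors with large negative weights fall off the list. As a cross-check, caret's varImp() gives a standard PLS importance measure based on absolute weighted coefficients:
varImp(pls_tune) |> plot(top = 9)  # compare against the coefficient sums above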
- Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Answer: This information gives us the predictors with the strongest relationship to yield. We know we can adjust the process predictors, which conveniently make up most of the top predictors, so the company can prioritize these to improve yield in future runs.
pred_vs_yield_plots <- map(top_preds_overall$predictor, ~ {
ggplot(chem_data_wip, aes(x = .data[[.x]], y = yield)) +
geom_point()})
ggarrange(plotlist = pred_vs_yield_plots)
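If the direction of each relationship is hard to judge from the points alone, a trend line can be layered on (a sketch reusing the same plotting loop):
pred_vs_yield_plots <- map(top_preds_overall$predictor, ~ {
  ggplot(chem_data_wip, aes(x = .data[[.x]], y = yield)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)  # simple linear trend per predictor
})
ggarrange(plotlist = pred_vs_yield_plots)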