library(fpp3)
library(tidyverse)
library(ggrepel)
library(seasonal)
library(fabletools)
library(corrplot)
library(GGally)
library(patchwork)
library(caret)
library(pls)
library(elasticnet)
library(impute)
theme_set(theme_bw())
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
fingerprints = as.data.frame(fingerprints)
filtered_fp = fingerprints[, nearZeroVar(fingerprints)]
After removing predictors with near zero variance, there are 719 predictors left for modeling.
set.seed(382)
train_rows = createDataPartition(permeability, p = 0.8, list = FALSE)
train_fp = filtered_fp[train_rows,]
train_perm = permeability[train_rows,]
test_fp = filtered_fp[-train_rows,]
test_perm = permeability[-train_rows,]
trCtrl = trainControl(method = "cv", number = 10,
preProcOptions = list(cutoff = 0.9))
pls_fp = train(train_fp, train_perm, method = "pls",
preProc = c("center", "scale"), tuneLength = 20,
trControl = trCtrl)
pls_fp
## Partial Least Squares
##
## 133 samples
## 719 predictors
##
## Pre-processing: centered (719), scaled (719)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 120, 119, 121, 120, 118, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 15.51316 0.05959557 11.56318
## 2 15.44321 0.10938067 11.33523
## 3 15.20303 0.14004963 11.18470
## 4 14.94071 0.15296063 10.97946
## 5 15.13426 0.14231598 11.20613
## 6 15.24549 0.12851510 11.24853
## 7 15.11638 0.14711889 11.22125
## 8 15.05906 0.16295982 11.13390
## 9 15.14544 0.16644651 11.09291
## 10 15.06708 0.17595034 10.91804
## 11 14.80301 0.19142151 10.62473
## 12 14.81347 0.20381329 10.39779
## 13 14.75270 0.21318425 10.30668
## 14 14.78499 0.21085245 10.26546
## 15 14.87369 0.20742474 10.25370
## 16 14.88846 0.20804256 10.24259
## 17 14.93306 0.20739401 10.21927
## 18 14.93448 0.20495758 10.20635
## 19 14.98019 0.20534141 10.21575
## 20 14.95069 0.20448201 10.18835
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 13.
The data was split into a training and test set, with the
proportion being 80% for training and 20% for testing. After splitting
the data, the train function was used with
the “pls” method. Pre-processing was done within the function via
scaling and centering.
Based on the output of the function, the optimal number of latent variables, based on the lowest RMSE value, is 13. The associated \(R^2\) value is roughly 0.213.
pls_fp_predict = predict(pls_fp, test_fp)
pls_fp_values = data.frame(obs = test_perm, pred = pls_fp_predict)
defaultSummary(pls_fp_values)
## RMSE Rsquared MAE
## 13.4596400 0.4300698 10.2186778
Based on the results above, using the model created using
the train method, the model’s
\(R^2\) value is about
0.430.
# Ordinary Linear Regression
lm_fp = train(train_fp, train_perm, method = "lm", trControl = trCtrl)
lm_fp
## Linear Regression
##
## 133 samples
## 719 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 120, 119, 121, 121, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 17.86945 0.1786968 12.22377
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
lm_fp_predict = predict(lm_fp, test_fp)
lm_fp_values = data.frame(obs = test_perm, pred = lm_fp_predict)
defaultSummary(lm_fp_values)
## RMSE Rsquared MAE
## 16.4943416 0.3494247 10.4833161
The RMSE for an ordinary linear regression model is 16.49, with an \(R^2\) value of about 0.349, both of which are worse than the PLS model.
# Robust Linear Regression
set.seed(382)
rlm_fp = train(train_fp, train_perm, method = "rlm", trControl = trCtrl,
preProcess = c("zv", "pca"))
rlm_fp
## Robust Linear Model
##
## 133 samples
## 719 predictors
##
## Pre-processing: principal component signal extraction (646), centered
## (646), scaled (646), remove (73)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 119, 118, 120, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 19.58557 0.1414304 14.45994
## FALSE psi.hampel 19.58860 0.1461660 14.53611
## FALSE psi.bisquare 19.56892 0.1331576 14.45298
## TRUE psi.huber 15.48049 0.1631865 10.21625
## TRUE psi.hampel 15.67475 0.1616814 10.05076
## TRUE psi.bisquare 51.13375 0.1378484 23.79015
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.
rlm_fp_predict = predict(rlm_fp, test_fp)
rlm_fp_values = data.frame(obs = test_perm, pred = rlm_fp_predict)
defaultSummary(rlm_fp_values)
## RMSE Rsquared MAE
## 15.901804 0.179719 11.385654
The RMSE for robust linear regression model is 15.90, with an \(R^2\) value of about 0.180, both of which are worse than the PLS model.
# Ridge Regression
ridgeGrid = data.frame(.lambda = seq(0, .1, length = 15))
set.seed(382)
ridge_fp = train(train_fp, train_perm, method = "ridge",
trControl = trCtrl,
preProcess = c("zv", "center", "scale"),
tuneGrid = ridgeGrid)
ridge_fp
## Ridge Regression
##
## 133 samples
## 719 predictors
##
## Pre-processing: centered (646), scaled (646), remove (73)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 119, 118, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 1.889437e+13 0.1443473 5.659084e+12
## 0.007142857 3.323334e+01 0.1824615 2.134721e+01
## 0.014285714 1.495368e+01 0.2030199 1.054933e+01
## 0.021428571 2.250643e+01 0.2020405 1.616860e+01
## 0.028571429 1.488834e+01 0.2047597 1.051622e+01
## 0.035714286 1.487595e+01 0.2043898 1.050647e+01
## 0.042857143 1.487791e+01 0.2042374 1.052246e+01
## 0.050000000 1.488814e+01 0.2034074 1.054305e+01
## 0.057142857 1.486081e+01 0.2046025 1.052731e+01
## 0.064285714 1.486735e+01 0.2043282 1.054465e+01
## 0.071428571 1.486968e+01 0.2044425 1.055675e+01
## 0.078571429 1.487970e+01 0.2040750 1.057820e+01
## 0.085714286 1.488921e+01 0.2041989 1.059451e+01
## 0.092857143 7.403398e+01 0.1933856 4.336542e+01
## 0.100000000 1.491405e+01 0.2038094 1.063059e+01
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.05714286.
ridge_fp_predict = predict(ridge_fp, test_fp)
ridge_fp_values = data.frame(obs = test_perm, pred = ridge_fp_predict)
defaultSummary(ridge_fp_values)
## RMSE Rsquared MAE
## 13.3895940 0.4433657 9.8014949
The RMSE using the Ridge-regression model is 13.39, with an \(R^2\) value of about 0.443. This model performs slightly better than the original PLS model based on both of these metrics.
# Lasso Method
lasso_grid = data.frame(.fraction = seq(0.05, 1, length = 20))
set.seed(382)
lasso_fp = train(train_fp, train_perm, method = "lasso",
trControl = trCtrl,
preProcess = c("zv", "center", "scale"),
tuneGrid = lasso_grid)
lasso_fp
## The lasso
##
## 133 samples
## 719 predictors
##
## Pre-processing: centered (646), scaled (646), remove (73)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 119, 118, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.05 1.306789e+11 0.2767820 4.939199e+10
## 0.10 6.461414e+11 0.1996242 2.463813e+11
## 0.15 1.269996e+12 0.1997062 4.553079e+11
## 0.20 1.988231e+12 0.1889881 6.807385e+11
## 0.25 2.803892e+12 0.1805097 9.302148e+11
## 0.30 3.661596e+12 0.1777971 1.192086e+12
## 0.35 4.513141e+12 0.1773766 1.452004e+12
## 0.40 5.703573e+12 0.1775799 1.801338e+12
## 0.45 6.905638e+12 0.1770407 2.153438e+12
## 0.50 8.108370e+12 0.1715360 2.505538e+12
## 0.55 9.311512e+12 0.1653216 2.857639e+12
## 0.60 1.051492e+13 0.1599203 3.209739e+12
## 0.65 1.171852e+13 0.1547929 3.561840e+12
## 0.70 1.292225e+13 0.1498646 3.913940e+12
## 0.75 1.412608e+13 0.1470529 4.266041e+12
## 0.80 1.532998e+13 0.1460634 4.618141e+12
## 0.85 1.626670e+13 0.1462140 4.891782e+12
## 0.90 1.714256e+13 0.1461200 5.147549e+12
## 0.95 1.801845e+13 0.1451322 5.403316e+12
## 1.00 1.889437e+13 0.1443473 5.659084e+12
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.05.
lasso_fp_predict = predict(lasso_fp, test_fp)
lasso_fp_values = data.frame(obs = test_perm, pred = lasso_fp_predict)
defaultSummary(lasso_fp_values)
## RMSE Rsquared MAE
## 14.6216345 0.2318661 11.0227817
The RMSE using the Lasso-regression model is 14.62, with an \(R^2\) value of about 0.232, both of which are worse than the PLS model.
# Elastic Net Model
enet_grid = expand.grid(.lambda = c(0, 0.01, 0.1),
.fraction = seq(0.05, 1, length = 20))
set.seed(382)
enet_fp = train(train_fp, train_perm, method = "enet",
trControl = trCtrl,
preProcess = c("zv", "center", "scale"),
tuneGrid = enet_grid)
enet_fp
## Elasticnet
##
## 133 samples
## 719 predictors
##
## Pre-processing: centered (646), scaled (646), remove (73)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 119, 118, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 1.306789e+11 0.2767820 4.939199e+10
## 0.00 0.10 6.461414e+11 0.1996242 2.463813e+11
## 0.00 0.15 1.269996e+12 0.1997062 4.553079e+11
## 0.00 0.20 1.988231e+12 0.1889881 6.807385e+11
## 0.00 0.25 2.803892e+12 0.1805097 9.302148e+11
## 0.00 0.30 3.661596e+12 0.1777971 1.192086e+12
## 0.00 0.35 4.513141e+12 0.1773766 1.452004e+12
## 0.00 0.40 5.703573e+12 0.1775799 1.801338e+12
## 0.00 0.45 6.905638e+12 0.1770407 2.153438e+12
## 0.00 0.50 8.108370e+12 0.1715360 2.505538e+12
## 0.00 0.55 9.311512e+12 0.1653216 2.857639e+12
## 0.00 0.60 1.051492e+13 0.1599203 3.209739e+12
## 0.00 0.65 1.171852e+13 0.1547929 3.561840e+12
## 0.00 0.70 1.292225e+13 0.1498646 3.913940e+12
## 0.00 0.75 1.412608e+13 0.1470529 4.266041e+12
## 0.00 0.80 1.532998e+13 0.1460634 4.618141e+12
## 0.00 0.85 1.626670e+13 0.1462140 4.891782e+12
## 0.00 0.90 1.714256e+13 0.1461200 5.147549e+12
## 0.00 0.95 1.801845e+13 0.1451322 5.403316e+12
## 0.00 1.00 1.889437e+13 0.1443473 5.659084e+12
## 0.01 0.05 1.381759e+01 0.2707874 1.102238e+01
## 0.01 0.10 1.342060e+01 0.2354614 1.049788e+01
## 0.01 0.15 1.318167e+01 0.2438842 1.009794e+01
## 0.01 0.20 1.324113e+01 0.2256947 9.948375e+00
## 0.01 0.25 1.341019e+01 0.2144983 9.943600e+00
## 0.01 0.30 1.352977e+01 0.2110836 9.903328e+00
## 0.01 0.35 1.359247e+01 0.2116970 9.777083e+00
## 0.01 0.40 1.367690e+01 0.2108888 9.678518e+00
## 0.01 0.45 1.380088e+01 0.2085789 9.643132e+00
## 0.01 0.50 1.393546e+01 0.2065098 9.665021e+00
## 0.01 0.55 1.405903e+01 0.2052179 9.731645e+00
## 0.01 0.60 1.419186e+01 0.2038794 9.852256e+00
## 0.01 0.65 1.436756e+01 0.2019051 1.001749e+01
## 0.01 0.70 1.451010e+01 0.2004742 1.016129e+01
## 0.01 0.75 1.465362e+01 0.1999589 1.027492e+01
## 0.01 0.80 1.474970e+01 0.1991055 1.038578e+01
## 0.01 0.85 1.473786e+01 0.2052526 1.038564e+01
## 0.01 0.90 1.478626e+01 0.2093008 1.039046e+01
## 0.01 0.95 1.489094e+01 0.2073482 1.050241e+01
## 0.01 1.00 1.495514e+01 0.2037320 1.058348e+01
## 0.10 0.05 1.383865e+01 0.2732578 1.105319e+01
## 0.10 0.10 1.346230e+01 0.2358989 1.056496e+01
## 0.10 0.15 1.321854e+01 0.2526059 1.016280e+01
## 0.10 0.20 1.320332e+01 0.2385307 9.975508e+00
## 0.10 0.25 1.337141e+01 0.2262602 9.950889e+00
## 0.10 0.30 1.348298e+01 0.2206763 9.917791e+00
## 0.10 0.35 1.355752e+01 0.2208925 9.829539e+00
## 0.10 0.40 1.360656e+01 0.2227471 9.682136e+00
## 0.10 0.45 1.370935e+01 0.2207244 9.616625e+00
## 0.10 0.50 1.383354e+01 0.2183448 9.594620e+00
## 0.10 0.55 1.395748e+01 0.2163048 9.658914e+00
## 0.10 0.60 1.405076e+01 0.2163049 9.730293e+00
## 0.10 0.65 1.415849e+01 0.2156818 9.826710e+00
## 0.10 0.70 1.427623e+01 0.2151838 9.912395e+00
## 0.10 0.75 1.440127e+01 0.2146093 1.002394e+01
## 0.10 0.80 1.452751e+01 0.2131522 1.015082e+01
## 0.10 0.85 1.462957e+01 0.2113786 1.031383e+01
## 0.10 0.90 1.473296e+01 0.2079236 1.045052e+01
## 0.10 0.95 1.482701e+01 0.2060678 1.053928e+01
## 0.10 1.00 1.491405e+01 0.2038094 1.063059e+01
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.15 and lambda = 0.01.
enet_fp_predict = predict(enet_fp, test_fp)
enet_fp_values = data.frame(obs = test_perm, pred = enet_fp_predict)
defaultSummary(enet_fp_values)
## RMSE Rsquared MAE
## 13.2283422 0.4005884 10.1874841
The RMSE using the Elastic Net regression model is 13.23, with an \(R^2\) value of about 0.401. The RMSE of the Elastic Net model is slightly better than the original PLS model, but the R2 value is slightly worse.
Based on the analysis above, the model that performed best was the Ridge-regression model.
Before confirming that this is the best value, the residuals of the model are reviewed to ensure the residuals hover around zero and don’t show any clear patterns.
residual_fp = data.frame(residuals = resid(ridge_fp), predictions = predict(ridge_fp))
residual_fp |> ggplot(aes(x = predictions, y = residuals)) +
geom_point(alpha = 0.5) +
labs(title = "Residuals for Ridge Regression Model",
x = "Predictions", y = "Residuals")
Based on the residual plot above, there adoes not appear to be any clear pattern. Additionally, the residuals appear to hover around 0. This suggests that the model has been sufficiently defined.
I would not recommend any of these models to replace the permeability laboratory experiment because the best model had only an \(R^2\) value of about 0.43, which suggests that only 43% of the variance within the data could be captured by the model.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
updated_chem = impute.knn(as.matrix(ChemicalManufacturingProcess))$data
updated_chem = as.data.frame(updated_chem)
anyNA(updated_chem)
## [1] FALSE
The above code uses K-nearest neighbors to impute the missing values in the dataset. The last line confirms that there are no longer any missing values.
# Use Ridge Method
set.seed(6192)
train_chem_rows = createDataPartition(updated_chem$Yield, p = 0.75, list = FALSE)
train_chem = updated_chem[train_chem_rows,]
test_chem = updated_chem[-train_chem_rows,]
ridgeGrid = data.frame(.lambda = seq(0, .5, length = 25))
trCtrl = trainControl(method = "cv", number = 10,
preProcOptions = list(cutoff = 0.9))
ridge_chem = train(Yield ~ ., data = train_chem, method = "ridge",
trControl = trCtrl,
preProcess = c("zv", "center", "scale"),
tuneGrid = ridgeGrid)
ridge_chem
## Ridge Regression
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 120, 119, 117, 118, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 2.687286 0.3830984 1.599036
## 0.02083333 1.647633 0.4692241 1.197913
## 0.04166667 1.535512 0.4923182 1.144000
## 0.06250000 1.487562 0.5061530 1.118976
## 0.08333333 1.458386 0.5162453 1.104536
## 0.10416667 1.437534 0.5244613 1.095393
## 0.12500000 1.421408 0.5316103 1.089254
## 0.14583333 1.408451 0.5380902 1.085783
## 0.16666667 1.397885 0.5441057 1.082990
## 0.18750000 1.389276 0.5497576 1.081002
## 0.20833333 1.382363 0.5550860 1.080384
## 0.22916667 1.376981 0.5600945 1.080435
## 0.25000000 1.373021 0.5647662 1.080713
## 0.27083333 1.370406 0.5690749 1.081185
## 0.29166667 1.369080 0.5729942 1.081827
## 0.31250000 1.368999 0.5765028 1.082619
## 0.33333333 1.370121 0.5795887 1.083990
## 0.35416667 1.372408 0.5822504 1.086099
## 0.37500000 1.375815 0.5844973 1.088447
## 0.39583333 1.380296 0.5863479 1.091214
## 0.41666667 1.385797 0.5878281 1.094411
## 0.43750000 1.392260 0.5889688 1.097649
## 0.45833333 1.399623 0.5898037 1.104530
## 0.47916667 1.407820 0.5903673 1.112732
## 0.50000000 1.416785 0.5906935 1.120874
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.3125.
Using the Ridge regression method, the optimal value for lambda is 0.3125. The associated RMSE value is about 1.417 with an \(R^2\) value of about 0.591.
ridge_chem_predict = predict(ridge_chem, test_chem |> select(-Yield))
ridge_chem_values = data.frame(obs = test_chem$Yield, pred = ridge_chem_predict)
defaultSummary(ridge_chem_values)
## RMSE Rsquared MAE
## 1.6590599 0.5383268 1.1407088
After running the model on the test dataset, the model produced an RMSE value of about 1.659 and \(R^2\) value of about 0.538. These are both worse than what was produced on the training set. Logically, this makes sense since the model is developed over the training set itself and thus knows the desired output values associated with the predictors, whereas the model is trying to predict the output values based on the predictors.
plot(varImp(ridge_chem), top = 20,
main = "Most Important Predictors for Ridge Regression Model")
Above is a VIP plot showing the top 20 predictors in the ridge regression model. The plot shows that 11 predictors are process predictors, whereas the other 9 are biological predictors. This suggests that neither type of predictor dominates the model.
chem_var_imp = varImp(ridge_chem)$importance |>
arrange(desc(Overall))
chem_var_imp = chem_var_imp |>
mutate(variable = rownames(chem_var_imp)) |>
head(20)
top_chem_var = chem_var_imp$variable
chem_long = updated_chem |> select(top_chem_var, Yield) |>
pivot_longer(cols = -Yield, names_to = "predictor", values_to = "value")
chem_long |> ggplot(aes(x = value, y = Yield)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
facet_wrap(~predictor, scales = "free_x", nrow = 5, ncol = 4) +
theme(aspect.ratio = 1)
The plots above show the relationship between the top 20
predictors from the previous exercise and Yield. Since the objective of
analyzing this data is to improve the yield, each plot can be analyzed
to determine how each predictor generally affects yield. For example,
there appears to be a strong positive relationship between
ManufacturingProcess32 and Yield, which suggests that the
value of this process should be maximized to improve yield. On the
otherhand, there appears to be a strong negative relationship between
ManufacturingProcess36 and Yield, which suggests that the
value of this process should be minimized to improve yield.
Additionally, these relationships could help point out clear
outliers. For example, ManufacturingProcess31 appears to
have a single observation of 0 that appears to be affecting the
relationship. These outliers could then be handled further (whether by
removal or imputation) when developing the model.