library(fpp3)
library(tidyverse)
library(ggrepel)
library(seasonal)
library(fabletools)
library(corrplot)
library(GGally)
library(patchwork)
library(caret)
library(pls)
library(elasticnet)
library(impute)
theme_set(theme_bw())

Problem 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
fingerprints = as.data.frame(fingerprints)
filtered_fp = fingerprints[, nearZeroVar(fingerprints)]

After removing predictors with near zero variance, there are 719 predictors left for modeling.

  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
set.seed(382)
train_rows = createDataPartition(permeability, p = 0.8, list = FALSE)

train_fp = filtered_fp[train_rows,]
train_perm = permeability[train_rows,]
test_fp = filtered_fp[-train_rows,]
test_perm = permeability[-train_rows,]

trCtrl = trainControl(method = "cv", number = 10,
                      preProcOptions = list(cutoff = 0.9))

pls_fp = train(train_fp, train_perm, method = "pls",
               preProc = c("center", "scale"), tuneLength = 20,
               trControl = trCtrl)
pls_fp
## Partial Least Squares 
## 
## 133 samples
## 719 predictors
## 
## Pre-processing: centered (719), scaled (719) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 120, 119, 121, 120, 118, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared    MAE     
##    1     15.51316  0.05959557  11.56318
##    2     15.44321  0.10938067  11.33523
##    3     15.20303  0.14004963  11.18470
##    4     14.94071  0.15296063  10.97946
##    5     15.13426  0.14231598  11.20613
##    6     15.24549  0.12851510  11.24853
##    7     15.11638  0.14711889  11.22125
##    8     15.05906  0.16295982  11.13390
##    9     15.14544  0.16644651  11.09291
##   10     15.06708  0.17595034  10.91804
##   11     14.80301  0.19142151  10.62473
##   12     14.81347  0.20381329  10.39779
##   13     14.75270  0.21318425  10.30668
##   14     14.78499  0.21085245  10.26546
##   15     14.87369  0.20742474  10.25370
##   16     14.88846  0.20804256  10.24259
##   17     14.93306  0.20739401  10.21927
##   18     14.93448  0.20495758  10.20635
##   19     14.98019  0.20534141  10.21575
##   20     14.95069  0.20448201  10.18835
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 13.

The data was split into a training and test set, with the proportion being 80% for training and 20% for testing. After splitting the data, the train function was used with the “pls” method. Pre-processing was done within the function via scaling and centering.

Based on the output of the function, the optimal number of latent variables, based on the lowest RMSE value, is 13. The associated \(R^2\) value is roughly 0.213.

  1. Predict the response for the test set. What is the test set estimate of \(R^2\)?
pls_fp_predict = predict(pls_fp, test_fp)
pls_fp_values = data.frame(obs = test_perm, pred = pls_fp_predict)
defaultSummary(pls_fp_values)
##       RMSE   Rsquared        MAE 
## 13.4596400  0.4300698 10.2186778

Based on the results above, using the model created using the train method, the model’s \(R^2\) value is about 0.430.

  1. Try building other models discussed in this chapter. Do any have better predictive performance?
# Ordinary Linear Regression
lm_fp = train(train_fp, train_perm, method = "lm", trControl = trCtrl)
lm_fp
## Linear Regression 
## 
## 133 samples
## 719 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 120, 119, 121, 121, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   17.86945  0.1786968  12.22377
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
lm_fp_predict = predict(lm_fp, test_fp)
lm_fp_values = data.frame(obs = test_perm, pred = lm_fp_predict)
defaultSummary(lm_fp_values)
##       RMSE   Rsquared        MAE 
## 16.4943416  0.3494247 10.4833161

The RMSE for an ordinary linear regression model is 16.49, with an \(R^2\) value of about 0.349, both of which are worse than the PLS model.

# Robust Linear Regression
set.seed(382)
rlm_fp = train(train_fp, train_perm, method = "rlm", trControl = trCtrl,
               preProcess = c("zv", "pca"))
rlm_fp
## Robust Linear Model 
## 
## 133 samples
## 719 predictors
## 
## Pre-processing: principal component signal extraction (646), centered
##  (646), scaled (646), remove (73) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 119, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE      Rsquared   MAE     
##   FALSE      psi.huber     19.58557  0.1414304  14.45994
##   FALSE      psi.hampel    19.58860  0.1461660  14.53611
##   FALSE      psi.bisquare  19.56892  0.1331576  14.45298
##    TRUE      psi.huber     15.48049  0.1631865  10.21625
##    TRUE      psi.hampel    15.67475  0.1616814  10.05076
##    TRUE      psi.bisquare  51.13375  0.1378484  23.79015
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.
rlm_fp_predict = predict(rlm_fp, test_fp)
rlm_fp_values = data.frame(obs = test_perm, pred = rlm_fp_predict)
defaultSummary(rlm_fp_values)
##      RMSE  Rsquared       MAE 
## 15.901804  0.179719 11.385654

The RMSE for robust linear regression model is 15.90, with an \(R^2\) value of about 0.180, both of which are worse than the PLS model.

# Ridge Regression
ridgeGrid = data.frame(.lambda = seq(0, .1, length = 15))
set.seed(382)

ridge_fp = train(train_fp, train_perm, method = "ridge",
                 trControl = trCtrl,
                 preProcess = c("zv", "center", "scale"),
                 tuneGrid = ridgeGrid)
ridge_fp
## Ridge Regression 
## 
## 133 samples
## 719 predictors
## 
## Pre-processing: centered (646), scaled (646), remove (73) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 119, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE          Rsquared   MAE         
##   0.000000000  1.889437e+13  0.1443473  5.659084e+12
##   0.007142857  3.323334e+01  0.1824615  2.134721e+01
##   0.014285714  1.495368e+01  0.2030199  1.054933e+01
##   0.021428571  2.250643e+01  0.2020405  1.616860e+01
##   0.028571429  1.488834e+01  0.2047597  1.051622e+01
##   0.035714286  1.487595e+01  0.2043898  1.050647e+01
##   0.042857143  1.487791e+01  0.2042374  1.052246e+01
##   0.050000000  1.488814e+01  0.2034074  1.054305e+01
##   0.057142857  1.486081e+01  0.2046025  1.052731e+01
##   0.064285714  1.486735e+01  0.2043282  1.054465e+01
##   0.071428571  1.486968e+01  0.2044425  1.055675e+01
##   0.078571429  1.487970e+01  0.2040750  1.057820e+01
##   0.085714286  1.488921e+01  0.2041989  1.059451e+01
##   0.092857143  7.403398e+01  0.1933856  4.336542e+01
##   0.100000000  1.491405e+01  0.2038094  1.063059e+01
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.05714286.
ridge_fp_predict = predict(ridge_fp, test_fp)
ridge_fp_values = data.frame(obs = test_perm, pred = ridge_fp_predict)
defaultSummary(ridge_fp_values)
##       RMSE   Rsquared        MAE 
## 13.3895940  0.4433657  9.8014949

The RMSE using the Ridge-regression model is 13.39, with an \(R^2\) value of about 0.443. This model performs slightly better than the original PLS model based on both of these metrics.

# Lasso Method
lasso_grid = data.frame(.fraction = seq(0.05, 1, length = 20))
set.seed(382)

lasso_fp = train(train_fp, train_perm, method = "lasso",
                 trControl = trCtrl,
                 preProcess = c("zv", "center", "scale"),
                 tuneGrid = lasso_grid)
lasso_fp
## The lasso 
## 
## 133 samples
## 719 predictors
## 
## Pre-processing: centered (646), scaled (646), remove (73) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 119, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE          Rsquared   MAE         
##   0.05      1.306789e+11  0.2767820  4.939199e+10
##   0.10      6.461414e+11  0.1996242  2.463813e+11
##   0.15      1.269996e+12  0.1997062  4.553079e+11
##   0.20      1.988231e+12  0.1889881  6.807385e+11
##   0.25      2.803892e+12  0.1805097  9.302148e+11
##   0.30      3.661596e+12  0.1777971  1.192086e+12
##   0.35      4.513141e+12  0.1773766  1.452004e+12
##   0.40      5.703573e+12  0.1775799  1.801338e+12
##   0.45      6.905638e+12  0.1770407  2.153438e+12
##   0.50      8.108370e+12  0.1715360  2.505538e+12
##   0.55      9.311512e+12  0.1653216  2.857639e+12
##   0.60      1.051492e+13  0.1599203  3.209739e+12
##   0.65      1.171852e+13  0.1547929  3.561840e+12
##   0.70      1.292225e+13  0.1498646  3.913940e+12
##   0.75      1.412608e+13  0.1470529  4.266041e+12
##   0.80      1.532998e+13  0.1460634  4.618141e+12
##   0.85      1.626670e+13  0.1462140  4.891782e+12
##   0.90      1.714256e+13  0.1461200  5.147549e+12
##   0.95      1.801845e+13  0.1451322  5.403316e+12
##   1.00      1.889437e+13  0.1443473  5.659084e+12
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.05.
lasso_fp_predict = predict(lasso_fp, test_fp)
lasso_fp_values = data.frame(obs = test_perm, pred = lasso_fp_predict)
defaultSummary(lasso_fp_values)
##       RMSE   Rsquared        MAE 
## 14.6216345  0.2318661 11.0227817

The RMSE using the Lasso-regression model is 14.62, with an \(R^2\) value of about 0.232, both of which are worse than the PLS model.

# Elastic Net Model
enet_grid = expand.grid(.lambda = c(0, 0.01, 0.1),
                        .fraction = seq(0.05, 1, length = 20))
set.seed(382)

enet_fp = train(train_fp, train_perm, method = "enet",
                 trControl = trCtrl,
                 preProcess = c("zv", "center", "scale"),
                 tuneGrid = enet_grid)
enet_fp
## Elasticnet 
## 
## 133 samples
## 719 predictors
## 
## Pre-processing: centered (646), scaled (646), remove (73) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 119, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE          Rsquared   MAE         
##   0.00    0.05      1.306789e+11  0.2767820  4.939199e+10
##   0.00    0.10      6.461414e+11  0.1996242  2.463813e+11
##   0.00    0.15      1.269996e+12  0.1997062  4.553079e+11
##   0.00    0.20      1.988231e+12  0.1889881  6.807385e+11
##   0.00    0.25      2.803892e+12  0.1805097  9.302148e+11
##   0.00    0.30      3.661596e+12  0.1777971  1.192086e+12
##   0.00    0.35      4.513141e+12  0.1773766  1.452004e+12
##   0.00    0.40      5.703573e+12  0.1775799  1.801338e+12
##   0.00    0.45      6.905638e+12  0.1770407  2.153438e+12
##   0.00    0.50      8.108370e+12  0.1715360  2.505538e+12
##   0.00    0.55      9.311512e+12  0.1653216  2.857639e+12
##   0.00    0.60      1.051492e+13  0.1599203  3.209739e+12
##   0.00    0.65      1.171852e+13  0.1547929  3.561840e+12
##   0.00    0.70      1.292225e+13  0.1498646  3.913940e+12
##   0.00    0.75      1.412608e+13  0.1470529  4.266041e+12
##   0.00    0.80      1.532998e+13  0.1460634  4.618141e+12
##   0.00    0.85      1.626670e+13  0.1462140  4.891782e+12
##   0.00    0.90      1.714256e+13  0.1461200  5.147549e+12
##   0.00    0.95      1.801845e+13  0.1451322  5.403316e+12
##   0.00    1.00      1.889437e+13  0.1443473  5.659084e+12
##   0.01    0.05      1.381759e+01  0.2707874  1.102238e+01
##   0.01    0.10      1.342060e+01  0.2354614  1.049788e+01
##   0.01    0.15      1.318167e+01  0.2438842  1.009794e+01
##   0.01    0.20      1.324113e+01  0.2256947  9.948375e+00
##   0.01    0.25      1.341019e+01  0.2144983  9.943600e+00
##   0.01    0.30      1.352977e+01  0.2110836  9.903328e+00
##   0.01    0.35      1.359247e+01  0.2116970  9.777083e+00
##   0.01    0.40      1.367690e+01  0.2108888  9.678518e+00
##   0.01    0.45      1.380088e+01  0.2085789  9.643132e+00
##   0.01    0.50      1.393546e+01  0.2065098  9.665021e+00
##   0.01    0.55      1.405903e+01  0.2052179  9.731645e+00
##   0.01    0.60      1.419186e+01  0.2038794  9.852256e+00
##   0.01    0.65      1.436756e+01  0.2019051  1.001749e+01
##   0.01    0.70      1.451010e+01  0.2004742  1.016129e+01
##   0.01    0.75      1.465362e+01  0.1999589  1.027492e+01
##   0.01    0.80      1.474970e+01  0.1991055  1.038578e+01
##   0.01    0.85      1.473786e+01  0.2052526  1.038564e+01
##   0.01    0.90      1.478626e+01  0.2093008  1.039046e+01
##   0.01    0.95      1.489094e+01  0.2073482  1.050241e+01
##   0.01    1.00      1.495514e+01  0.2037320  1.058348e+01
##   0.10    0.05      1.383865e+01  0.2732578  1.105319e+01
##   0.10    0.10      1.346230e+01  0.2358989  1.056496e+01
##   0.10    0.15      1.321854e+01  0.2526059  1.016280e+01
##   0.10    0.20      1.320332e+01  0.2385307  9.975508e+00
##   0.10    0.25      1.337141e+01  0.2262602  9.950889e+00
##   0.10    0.30      1.348298e+01  0.2206763  9.917791e+00
##   0.10    0.35      1.355752e+01  0.2208925  9.829539e+00
##   0.10    0.40      1.360656e+01  0.2227471  9.682136e+00
##   0.10    0.45      1.370935e+01  0.2207244  9.616625e+00
##   0.10    0.50      1.383354e+01  0.2183448  9.594620e+00
##   0.10    0.55      1.395748e+01  0.2163048  9.658914e+00
##   0.10    0.60      1.405076e+01  0.2163049  9.730293e+00
##   0.10    0.65      1.415849e+01  0.2156818  9.826710e+00
##   0.10    0.70      1.427623e+01  0.2151838  9.912395e+00
##   0.10    0.75      1.440127e+01  0.2146093  1.002394e+01
##   0.10    0.80      1.452751e+01  0.2131522  1.015082e+01
##   0.10    0.85      1.462957e+01  0.2113786  1.031383e+01
##   0.10    0.90      1.473296e+01  0.2079236  1.045052e+01
##   0.10    0.95      1.482701e+01  0.2060678  1.053928e+01
##   0.10    1.00      1.491405e+01  0.2038094  1.063059e+01
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.15 and lambda = 0.01.
enet_fp_predict = predict(enet_fp, test_fp)
enet_fp_values = data.frame(obs = test_perm, pred = enet_fp_predict)
defaultSummary(enet_fp_values)
##       RMSE   Rsquared        MAE 
## 13.2283422  0.4005884 10.1874841

The RMSE using the Elastic Net regression model is 13.23, with an \(R^2\) value of about 0.401. The RMSE of the Elastic Net model is slightly better than the original PLS model, but the R2 value is slightly worse.

Based on the analysis above, the model that performed best was the Ridge-regression model.

Before confirming that this is the best value, the residuals of the model are reviewed to ensure the residuals hover around zero and don’t show any clear patterns.

residual_fp = data.frame(residuals = resid(ridge_fp), predictions = predict(ridge_fp))

residual_fp |> ggplot(aes(x = predictions, y = residuals)) +
  geom_point(alpha = 0.5) +
  labs(title = "Residuals for Ridge Regression Model",
       x = "Predictions", y = "Residuals")

Based on the residual plot above, there adoes not appear to be any clear pattern. Additionally, the residuals appear to hover around 0. This suggests that the model has been sufficiently defined.

  1. Would you recommend any of your models to replace the permeability laboratory experiment.

I would not recommend any of these models to replace the permeability laboratory experiment because the best model had only an \(R^2\) value of about 0.43, which suggests that only 43% of the variance within the data could be captured by the model.

Problem 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
updated_chem = impute.knn(as.matrix(ChemicalManufacturingProcess))$data
updated_chem = as.data.frame(updated_chem)
anyNA(updated_chem)
## [1] FALSE

The above code uses K-nearest neighbors to impute the missing values in the dataset. The last line confirms that there are no longer any missing values.

  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# Use Ridge Method
set.seed(6192)
train_chem_rows = createDataPartition(updated_chem$Yield, p = 0.75, list = FALSE)

train_chem = updated_chem[train_chem_rows,]
test_chem = updated_chem[-train_chem_rows,]

ridgeGrid = data.frame(.lambda = seq(0, .5, length = 25))
trCtrl = trainControl(method = "cv", number = 10,
                      preProcOptions = list(cutoff = 0.9))

ridge_chem = train(Yield ~ ., data = train_chem, method = "ridge",
                 trControl = trCtrl,
                 preProcess = c("zv", "center", "scale"),
                 tuneGrid = ridgeGrid)
ridge_chem
## Ridge Regression 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 120, 119, 117, 118, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE     
##   0.00000000  2.687286  0.3830984  1.599036
##   0.02083333  1.647633  0.4692241  1.197913
##   0.04166667  1.535512  0.4923182  1.144000
##   0.06250000  1.487562  0.5061530  1.118976
##   0.08333333  1.458386  0.5162453  1.104536
##   0.10416667  1.437534  0.5244613  1.095393
##   0.12500000  1.421408  0.5316103  1.089254
##   0.14583333  1.408451  0.5380902  1.085783
##   0.16666667  1.397885  0.5441057  1.082990
##   0.18750000  1.389276  0.5497576  1.081002
##   0.20833333  1.382363  0.5550860  1.080384
##   0.22916667  1.376981  0.5600945  1.080435
##   0.25000000  1.373021  0.5647662  1.080713
##   0.27083333  1.370406  0.5690749  1.081185
##   0.29166667  1.369080  0.5729942  1.081827
##   0.31250000  1.368999  0.5765028  1.082619
##   0.33333333  1.370121  0.5795887  1.083990
##   0.35416667  1.372408  0.5822504  1.086099
##   0.37500000  1.375815  0.5844973  1.088447
##   0.39583333  1.380296  0.5863479  1.091214
##   0.41666667  1.385797  0.5878281  1.094411
##   0.43750000  1.392260  0.5889688  1.097649
##   0.45833333  1.399623  0.5898037  1.104530
##   0.47916667  1.407820  0.5903673  1.112732
##   0.50000000  1.416785  0.5906935  1.120874
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.3125.

Using the Ridge regression method, the optimal value for lambda is 0.3125. The associated RMSE value is about 1.417 with an \(R^2\) value of about 0.591.

  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
ridge_chem_predict = predict(ridge_chem, test_chem |> select(-Yield))
ridge_chem_values = data.frame(obs = test_chem$Yield, pred = ridge_chem_predict)
defaultSummary(ridge_chem_values)
##      RMSE  Rsquared       MAE 
## 1.6590599 0.5383268 1.1407088

After running the model on the test dataset, the model produced an RMSE value of about 1.659 and \(R^2\) value of about 0.538. These are both worse than what was produced on the training set. Logically, this makes sense since the model is developed over the training set itself and thus knows the desired output values associated with the predictors, whereas the model is trying to predict the output values based on the predictors.

  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
plot(varImp(ridge_chem), top = 20,
     main = "Most Important Predictors for Ridge Regression Model")

Above is a VIP plot showing the top 20 predictors in the ridge regression model. The plot shows that 11 predictors are process predictors, whereas the other 9 are biological predictors. This suggests that neither type of predictor dominates the model.

  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
chem_var_imp = varImp(ridge_chem)$importance |>
  arrange(desc(Overall))
chem_var_imp = chem_var_imp |>
  mutate(variable = rownames(chem_var_imp)) |>
  head(20)

top_chem_var = chem_var_imp$variable

chem_long = updated_chem |> select(top_chem_var, Yield) |>
  pivot_longer(cols = -Yield, names_to = "predictor", values_to = "value")

chem_long |> ggplot(aes(x = value, y = Yield)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  facet_wrap(~predictor, scales = "free_x", nrow = 5, ncol = 4) +
  theme(aspect.ratio = 1)

The plots above show the relationship between the top 20 predictors from the previous exercise and Yield. Since the objective of analyzing this data is to improve the yield, each plot can be analyzed to determine how each predictor generally affects yield. For example, there appears to be a strong positive relationship between ManufacturingProcess32 and Yield, which suggests that the value of this process should be maximized to improve yield. On the otherhand, there appears to be a strong negative relationship between ManufacturingProcess36 and Yield, which suggests that the value of this process should be minimized to improve yield.

Additionally, these relationships could help point out clear outliers. For example, ManufacturingProcess31 appears to have a single observation of 0 that appears to be affecting the relationship. These outliers could then be handled further (whether by removal or imputation) when developing the model.