Validation and Improving Accuracy

Machine learning steps \(\Rightarrow\) Data > Process > Split > Recipe > Model Spec > Workflow > Train > Evaluate

pacman::p_load(ISLR, glmtoolbox, DALEX, MASS, lmtest, jtools, broom, car, vip, cowplot, glmnet, caret, tidymodels, tidyverse)

# Prefer tidymodels functions to resolve conflicts between functions with the same names
tidymodels_prefer()

# Set ggplot theme as black and white
theme_set(theme_bw())

Import dataset:

data(Credit)
str(Credit)
## 'data.frame':    400 obs. of  12 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ Gender   : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
##  $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
##  $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

The name of the dataset is Credit. It contains 400 observations and 12 variables (including an ID column). You can also view the dataset in the Environment tab by clicking on it.

Train-test split (70:30 rule of thumb)

set.seed(1)
split = initial_split(Credit, prop = 0.70)
train = training(split)
test = testing(split)

We can also make the split stratified by a variable, so that its proportions are preserved in both sets, using split = initial_split(Credit, prop = 0.70, strata = Student).
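
A minimal sketch of such a stratified split, checking that the proportion of students is similar in the two sets (the object name split_strat is only illustrative):

# Stratified 70:30 split on Student, then compare class proportions
set.seed(1)
split_strat = initial_split(Credit, prop = 0.70, strata = Student)
prop.table(table(training(split_strat)$Student))
prop.table(table(testing(split_strat)$Student))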

Create recipe (formula: dependent ~ independent, data)

recipe = recipe(Balance ~ Income + Limit + Cards + Age + Student, data = Credit)
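
To check what the recipe will hand to the model, we can estimate and apply it; a minimal sketch (prep() estimates the recipe, bake() applies it to the data):

# Inspect the processed data produced by the recipe
recipe %>% prep() %>% bake(new_data = NULL) %>% head()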

Specify the model

lm_spec = linear_reg() %>% 
       set_engine('lm') %>% 
       set_mode('regression')

Workflow

workflow = workflow() %>% 
           add_model(lm_spec) %>% 
           add_recipe(recipe)

Train the model

fit = workflow %>% fit(data = train)

Evaluate the model

predictions = predict(fit, new_data = test) %>% bind_cols(test)
metrics(predictions, truth = Balance, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     108.   
## 2 rsq     standard       0.944
## 3 mae     standard      83.5

We see that the RMSE on the test set (above) is slightly higher than the RMSE obtained on the training data (below). This is expected: when predicting the training data, the model predicts the very observations it was built from, whereas in the train-test split approach the model is built on the training data and has to predict the unseen test data.

predictions_train = predict(fit, new_data = train) %>% bind_cols(train)
metrics(predictions_train, truth = Balance, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      94.5  
## 2 rsq     standard       0.958
## 3 mae     standard      74.6
# Model coefficients
tidy(fit)
## # A tibble: 6 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -481.     25.6        -18.8  3.99e- 51
## 2 Income        -7.66    0.278      -27.5  8.90e- 81
## 3 Limit          0.268   0.00419     64.1  6.19e-167
## 4 Cards         23.8     4.02         5.94 8.78e-  9
## 5 Age           -0.678   0.346       -1.96 5.10e-  2
## 6 StudentYes   424.     18.3         23.2  9.54e- 67
# Model quality indicators
glance(fit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.958         0.958  95.6     1261. 9.15e-187     5 -1671. 3356. 3381.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model outputs
model_outputs = extract_fit_engine(fit)
summ(model_outputs)
Observations: 280
Dependent variable: ..y
Type: OLS linear regression
F(5,274) = 1260.61
R² = 0.96
Adj. R² = 0.96

                 Est.    S.E.  t val.     p
(Intercept)   -480.98   25.62  -18.77  0.00
Income          -7.66    0.28  -27.50  0.00
Limit            0.27    0.00   64.07  0.00
Cards           23.84    4.02    5.94  0.00
Age             -0.68    0.35   -1.96  0.05
StudentYes     424.38   18.26   23.24  0.00

Standard errors: OLS

Model diagnostic plots

# augment() binds .pred (predicted values) and .resid (residuals) to the data being predicted
# The fitted lm object (model_outputs) can also be plotted directly: ggplot() exposes its .fitted and .resid columns
augment(fit, new_data = test) %>% head()
## # A tibble: 6 × 14
##   .pred .resid    ID Income Limit Rating Cards   Age Education Gender   Student
##   <dbl>  <dbl> <int>  <dbl> <int>  <int> <int> <int>     <int> <fct>    <fct>  
## 1  663. -82.9      3  105.   7075    514     4    71        11 " Male"  No     
## 2  975. -10.7      4  149.   9504    681     3    36        11 "Female" No     
## 3  406. -75.1      5   55.9  4897    357     2    68        16 " Male"  No     
## 4 1106.  44.6      6   80.2  8047    569     4    77        10 " Male"  No     
## 5  290. -86.5      7   21.0  3388    259     2    37        12 "Female" No     
## 6  869.   3.07     8   71.4  7114    512     2    87         9 " Male"  No     
## # ℹ 3 more variables: Married <fct>, Ethnicity <fct>, Balance <int>
# Residuals vs fitted plot, drawn from the fitted lm object (model_outputs)
ggplot(model_outputs, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted Values", y = "Residuals") 

# Q-Q plot (car::qqPlot); the printed indices identify the two most extreme residuals
qqPlot(model_outputs$residuals)

## [1] 114  34
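
As a complement to the plots, formal diagnostic tests can be run on the extracted lm object; a short sketch using lmtest and car, which are already loaded above:

bptest(model_outputs)                    # Breusch-Pagan test for heteroscedasticity (lmtest)
shapiro.test(residuals(model_outputs))   # Shapiro-Wilk test for normality of residuals
durbinWatsonTest(model_outputs)          # Durbin-Watson test for autocorrelation (car)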

Cross-validation and Bootstrapping

We have one data set. In exploratory or inferential models, we use the full data set to develop a model in order to explain the relationship between the dependent and independent variables.

In machine learning (predictive models), we need to validate the model, i.e. check how well it predicts unknown observations. Since we do not have truly unknown observations, we validate the model using observations from the main data set that were not used for model development. For this, we have two approaches: the train-test split and cross-validation.

In the train-test split approach, we split the data set into training data (a randomly selected 70% of the observations) and test data. We develop the model using the training data and check its predictive ability on the test data. This approach is simple and computationally cheap.

In the cross-validation approach, we repeat the train-test idea multiple times. In each iteration, one part of the data is held out as the test set and the remainder is used as the training set. For example, in LOOCV (Leave-One-Out Cross-Validation) we hold out a single observation for testing and train on all the others; with 1000 observations we fit 1000 models, each time leaving out one observation. This is very rigorous but computationally expensive. In 10-fold cross-validation, we split the data into 10 parts; each part is held out once as the test set while the remaining 9 parts are used for training. More folds give a less biased error estimate, but at a higher computational cost and with more variance, and accuracy is not the only goal: the model should also balance bias and variance. For estimating model errors (e.g. RMSE), 10-fold cross-validation is commonly used, and the reported error is the simple average over all cross-validation models. Cross-validation is mainly used to compare different models and select the best one (see the sketch below for how these resamples are created).
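
A minimal sketch of how these resampling schemes are created with rsample (loaded as part of tidymodels); the object names are illustrative:

loo_folds = loo_cv(Credit)                         # LOOCV: 400 resamples, one held-out observation each
ten_folds = vfold_cv(Credit, v = 10)               # 10-fold cross-validation
rep_folds = vfold_cv(Credit, v = 10, repeats = 5)  # repeated 10-fold cross-validation
loo_folds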

In bootstrapping, we draw a number of resamples from the original data set, each of the same size as the original sample (n) and drawn with replacement. Because of the replacement, some observations are selected multiple times within a resample and some are not selected at all. The selected observations are used as the training data and the observations left out are used as the test data. If we draw B bootstrap resamples, we fit B models. Bootstrapping is mainly used to estimate the confidence intervals of a model's coefficients, which in turn determine the p-values of the independent variables.

If we apply cross-validation or bootstrapping, we do not strictly need to split the data set into training and test sets. However, we can also combine the approaches, as sketched below.
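
For example, a sketch of combining the approaches: cross-validate on the training data only, then use last_fit() for a final fit on the training set and evaluation on the held-out test set (object names are illustrative):

set.seed(1)
cv_on_train = fit_resamples(workflow, resamples = vfold_cv(train, v = 10))
collect_metrics(cv_on_train)

final_fit = last_fit(workflow, split)   # fits on training(split), evaluates on testing(split)
collect_metrics(final_fit)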

Cross-validation: for more accurate model errors

Figure: Cross-validation (source: https://www.tmwr.org/resampling)

Cross-validation: the data set is split into folds. In each iteration, the model is built on all the data except one held-out fold and is then tested on that held-out fold. The result is the average across folds. More folds decrease bias but increase variance; fewer folds mean more bias but less variance. Cross-validation is used for model selection (comparing \(R^2\) and RMSE) and for tuning the hyperparameters of a model (e.g. lambda in glmnet, or the number of trees), as sketched below.
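
A hedged sketch of such tuning: a lasso model whose penalty (lambda) is tuned with 10-fold cross-validation on the training data. The lasso specification, the dummy/normalization steps, the grid, and the object names are illustrative additions needed for glmnet, not part of the model fitted above:

# Recipe for glmnet: convert factors to dummies and normalize predictors
lasso_rec = recipe(Balance ~ Income + Limit + Cards + Age + Student, data = train) %>% 
            step_dummy(all_nominal_predictors()) %>% 
            step_normalize(all_numeric_predictors())

# Lasso specification with a tunable penalty (lambda)
lasso_spec = linear_reg(penalty = tune(), mixture = 1) %>% 
             set_engine('glmnet')

lasso_wf = workflow() %>% 
           add_model(lasso_spec) %>% 
           add_recipe(lasso_rec)

# Tune the penalty over a regular grid using 10-fold cross-validation
set.seed(1)
lasso_tuned = tune_grid(
    lasso_wf,
    resamples = vfold_cv(train, v = 10),
    grid = grid_regular(penalty(), levels = 20),
    metrics = metric_set(rmse)
)

select_best(lasso_tuned, metric = 'rmse')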

# 10-fold cross-validation
set.seed(1)
cv_results = fit_resamples(
    workflow,
    resamples = vfold_cv(Credit, v = 10),
    metrics = metric_set(rmse, rsq, mae),
    control = control_resamples(save_pred = TRUE)
)

# outputs
collect_metrics(cv_results)
## # A tibble: 3 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 mae     standard   79.5      10 2.47    Preprocessor1_Model1
## 2 rmse    standard   99.4      10 2.82    Preprocessor1_Model1
## 3 rsq     standard    0.954    10 0.00413 Preprocessor1_Model1
# Recall the single test-set results (without cross-validation)
# The cross-validated RMSE and MAE are lower, and R-squared is higher, than these single-split estimates
metrics(predictions, truth = Balance, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     108.   
## 2 rsq     standard       0.944
## 3 mae     standard      83.5
# Visualization
cv_metrics = collect_metrics(cv_results, summarize = FALSE)
cv_metrics %>% head()
## # A tibble: 6 × 5
##   id     .metric .estimator .estimate .config             
##   <chr>  <chr>   <chr>          <dbl> <chr>               
## 1 Fold01 rmse    standard     113.    Preprocessor1_Model1
## 2 Fold01 rsq     standard       0.945 Preprocessor1_Model1
## 3 Fold01 mae     standard      86.3   Preprocessor1_Model1
## 4 Fold02 rmse    standard      92.7   Preprocessor1_Model1
## 5 Fold02 rsq     standard       0.972 Preprocessor1_Model1
## 6 Fold02 mae     standard      79.4   Preprocessor1_Model1
ggplot(cv_metrics %>% filter(.metric == "rmse"),
       aes(x = id, y = .estimate, fill = id)) +
  geom_col() +
  labs(title = "RMSE per CV Fold", x = "Fold", y = "RMSE") +
  theme(legend.position = 'none')

# Actual vs predicted values
cv_preds <- collect_predictions(cv_results)

ggplot(cv_preds, aes(x = Balance, y = .pred)) +
  geom_point(alpha = 0.7) +
  geom_abline(linetype = "dashed", color = "red") +
  facet_wrap(~id, nrow = 2) +
  labs(title = "Predicted vs Actual across CV folds",
       x = "Actual balance",
       y = "Predicted balance")

Bootstrapping: for more accurate error and coefficient estimates

Figure: Bootstrapping (source: https://www.tmwr.org/resampling)

In bootstrapping, a number of samples are drawn from the data set, each with the same sample size as the original and drawn with replacement. As a result, about 63.2% of the distinct observations fall into the training (analysis) set and the remaining 36.8% form the test set (OOB, the out-of-bag sample) in each bootstrap resample. The results are then averaged and a confidence interval is calculated. This process reduces variance but may increase bias.
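
A quick sketch to verify these proportions on a single bootstrap resample (analysis() and assessment() are rsample's accessors for the in-bag and out-of-bag sets):

set.seed(1)
one_boot = bootstraps(Credit, times = 1)$splits[[1]]
length(unique(analysis(one_boot)$ID)) / nrow(Credit)    # distinct in-bag observations, roughly 0.632
nrow(assessment(one_boot)) / nrow(Credit)               # out-of-bag observations, roughly 0.368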

Bootstrapping is used to estimate uncertainty in a model, i.e. confidence intervals for quantities such as coefficients, AUC, and accuracy.

# Bootstrapping with 100 resamples
set.seed(1)
boot_results = fit_resamples(
    workflow,
    resamples = bootstraps(Credit, times = 100),
    metrics = metric_set(rmse, rsq, mae),
    control = control_resamples(save_pred = TRUE)
)

collect_metrics(boot_results)
## # A tibble: 3 × 6
##   .metric .estimator    mean     n  std_err .config             
##   <chr>   <chr>        <dbl> <int>    <dbl> <chr>               
## 1 mae     standard    80.0     100 0.366    Preprocessor1_Model1
## 2 rmse    standard   101.      100 0.539    Preprocessor1_Model1
## 3 rsq     standard     0.952   100 0.000586 Preprocessor1_Model1
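
Since the metric of each resample is also available, we can look at its spread; a minimal sketch of a 95% percentile interval for RMSE across the 100 bootstrap resamples:

boot_rmse = collect_metrics(boot_results, summarize = FALSE) %>% 
            filter(.metric == 'rmse')
quantile(boot_rmse$.estimate, probs = c(0.025, 0.5, 0.975))
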
# Recall the cross-validated metrics
# MAE and RMSE are slightly lower in the CV results than in the bootstrap results
# The difference may not look like this with a different set.seed()
collect_metrics(cv_results)
## # A tibble: 3 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 mae     standard   79.5      10 2.47    Preprocessor1_Model1
## 2 rmse    standard   99.4      10 2.82    Preprocessor1_Model1
## 3 rsq     standard    0.954    10 0.00413 Preprocessor1_Model1

Bootstrap confidence interval

# Draw bootstrap samples
# The resulting tibble has a splits column and an id column
# Each split contains an analysis (training) set and an assessment (testing) set
set.seed(123)
bootstraps = bootstraps(Credit, times = 100)
bootstraps
## # Bootstrap sampling 
## # A tibble: 100 × 2
##    splits            id          
##    <list>            <chr>       
##  1 <split [400/151]> Bootstrap001
##  2 <split [400/143]> Bootstrap002
##  3 <split [400/153]> Bootstrap003
##  4 <split [400/146]> Bootstrap004
##  5 <split [400/159]> Bootstrap005
##  6 <split [400/145]> Bootstrap006
##  7 <split [400/146]> Bootstrap007
##  8 <split [400/148]> Bootstrap008
##  9 <split [400/153]> Bootstrap009
## 10 <split [400/148]> Bootstrap010
## # ℹ 90 more rows
# See the first split: 400 analysis (training) rows and 151 assessment (out-of-bag testing) rows
bootstraps$splits[1]
## [[1]]
## <Analysis/Assess/Total>
## <400/151/400>
# Fit model on bootstrap samples (analysis set => training set)
# First define a function to extract coefficients using tidy()
boot_fit = function(split) {
    fit = lm_spec %>% 
      fit(Balance ~ Income + Limit + Cards + Age + Student, data = training(split))
    tidy(fit)
}
# Apply the function to every bootstrap sample and store the coefficients in a coefs column
# map() is similar to lapply(): it applies a function to each element (here, each split in bootstraps)
boot_coefs = bootstraps %>% 
             mutate(coefs = map(splits, boot_fit)) %>% 
             unnest(coefs)

boot_coefs %>% head()
## # A tibble: 6 × 7
##   splits            id           term     estimate std.error statistic   p.value
##   <list>            <chr>        <chr>       <dbl>     <dbl>     <dbl>     <dbl>
## 1 <split [400/151]> Bootstrap001 (Interc… -458.     21.8        -21.0  2.26e- 66
## 2 <split [400/151]> Bootstrap001 Income     -7.92    0.226      -35.1  2.78e-123
## 3 <split [400/151]> Bootstrap001 Limit       0.268   0.00346     77.4  1.85e-240
## 4 <split [400/151]> Bootstrap001 Cards      20.4     3.82         5.34 1.55e-  7
## 5 <split [400/151]> Bootstrap001 Age        -0.612   0.289       -2.12 3.49e-  2
## 6 <split [400/151]> Bootstrap001 Student…  415.     15.9         26.2  3.99e- 88
# Calculate bootstrap confidence interval
bootstrap_ci <- boot_coefs %>% 
  group_by(term) %>% 
  summarize(
    coeff = mean(estimate), 
    conf.low = quantile(estimate, probs = 0.025),  
    conf.high = quantile(estimate, probs = 0.975)
  )

bootstrap_ci
## # A tibble: 6 × 4
##   term           coeff conf.low conf.high
##   <chr>          <dbl>    <dbl>     <dbl>
## 1 (Intercept) -467.    -515.    -428.    
## 2 Age           -0.635   -1.15    -0.0862
## 3 Cards         23.0     14.9     28.8   
## 4 Income        -7.79    -8.15    -7.40  
## 5 Limit          0.267    0.259    0.275 
## 6 StudentYes   428.     404.     452.
# Recall the confidence intervals without bootstrapping
# Compare these coefficients and intervals with the bootstrap results above
summ(model_outputs, confint = TRUE)
Observations: 280
Dependent variable: ..y
Type: OLS linear regression
F(5,274) = 1260.61
R² = 0.96
Adj. R² = 0.96

                 Est.     2.5%    97.5%  t val.     p
(Intercept)   -480.98  -531.42  -430.55  -18.77  0.00
Income          -7.66    -8.21    -7.11  -27.50  0.00
Limit            0.27     0.26     0.28   64.07  0.00
Cards           23.84    15.93    31.74    5.94  0.00
Age             -0.68    -1.36     0.00   -1.96  0.05
StudentYes     424.38   388.43   460.33   23.24  0.00

Standard errors: OLS
# Visualization
ggplot(boot_coefs, aes(x = estimate)) +
    geom_histogram(bins = 30, fill = 'steelblue', color = 'white') +
    facet_wrap(~term, scales = 'free') +
    labs(title = 'Bootstrap coefficient distribution')
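
As an aside, rsample also provides int_pctl(), which computes the same percentile intervals directly from the nested (not yet unnested) coefficients; a minimal sketch, assuming the list-column is named coefs (int_pctl() typically recommends on the order of 1000 resamples):

# Percentile intervals from the nested bootstrap coefficients
boot_models = bootstraps %>% mutate(coefs = map(splits, boot_fit))
int_pctl(boot_models, coefs)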