Validation and Improving Accuracy

Machine learning steps \(\Rightarrow\) Data > Process > Split > Recipe > Model Spec > Workflow > Train > Evaluate

pacman::p_load(ISLR, glmtoolbox, DALEX, MASS, lmtest, jtools, broom, car, vip, cowplot, glmnet, caret, tidymodels, tidyverse)

# Prefer tidymodels functions to resolve conflicts between functions with the same names
tidymodels_prefer()

# Set ggplot theme as black and white
theme_set(theme_bw())

Import dataset:

data(Credit)
str(Credit)
## 'data.frame':    400 obs. of  12 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ Gender   : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
##  $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
##  $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

The name of the dataset is Credit. It contains 400 observations and 12 variables (including an ID column). You can also view the dataset in the Environment tab by clicking on it.

Train-test split (70:30 rule of thumb)

set.seed(1)
split = initial_split(Credit, prop = 0.70)
train = training(split)
test = testing(split)

We can also make the split stratified by a variable, so that its proportions are preserved in both sets, using split = initial_split(Credit, prop = 0.70, strata = Student).
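
A minimal sketch of such a stratified split, checking that the proportion of students is similar in the two sets (the object name split_strat is only illustrative):

# Stratified 70:30 split on Student, then compare class proportions
set.seed(1)
split_strat = initial_split(Credit, prop = 0.70, strata = Student)
prop.table(table(training(split_strat)$Student))
prop.table(table(testing(split_strat)$Student))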

Create recipe (formula: dependent ~ independent, data)

recipe = recipe(Balance ~ Income + Limit + Cards + Age + Student, data = Credit)
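
To check what the recipe will hand to the model, we can estimate and apply it; a minimal sketch (prep() estimates the recipe, bake() applies it to the data):

# Inspect the processed data produced by the recipe
recipe %>% prep() %>% bake(new_data = NULL) %>% head()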

Specify the model

lm_spec = linear_reg() %>% 
       set_engine('lm') %>% 
       set_mode('regression')

Workflow

workflow = workflow() %>% 
           add_model(lm_spec) %>% 
           add_recipe(recipe)

Train the model

fit = workflow %>% fit(data = train)

Evaluate the model

predictions = predict(fit, new_data = test) %>% bind_cols(test)
metrics(predictions, truth = Balance, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     108.   
## 2 rsq     standard       0.944
## 3 mae     standard      83.5

We see that the RMSE on the test set (above) is slightly higher than the RMSE obtained on the training data (below). This is expected: when predicting the training data, the model predicts the very observations it was built from, whereas in the train-test split approach the model is built on the training data and has to predict the unseen test data.

predictions_train = predict(fit, new_data = train) %>% bind_cols(train)
metrics(predictions_train, truth = Balance, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      94.5  
## 2 rsq     standard       0.958
## 3 mae     standard      74.6
# Model coefficients
tidy(fit)
## # A tibble: 6 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -481.     25.6        -18.8  3.99e- 51
## 2 Income        -7.66    0.278      -27.5  8.90e- 81
## 3 Limit          0.268   0.00419     64.1  6.19e-167
## 4 Cards         23.8     4.02         5.94 8.78e-  9
## 5 Age           -0.678   0.346       -1.96 5.10e-  2
## 6 StudentYes   424.     18.3         23.2  9.54e- 67
# Model quality indicators
glance(fit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.958         0.958  95.6     1261. 9.15e-187     5 -1671. 3356. 3381.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model outputs
model_outputs = extract_fit_engine(fit)
summ(model_outputs)
Observations: 280
Dependent variable: ..y
Type: OLS linear regression
F(5,274) = 1260.61
R² = 0.96
Adj. R² = 0.96

                 Est.    S.E.  t val.     p
(Intercept)   -480.98   25.62  -18.77  0.00
Income          -7.66    0.28  -27.50  0.00
Limit            0.27    0.00   64.07  0.00
Cards           23.84    4.02    5.94  0.00
Age             -0.68    0.35   -1.96  0.05
StudentYes     424.38   18.26   23.24  0.00

Standard errors: OLS

Model diagnostic plots

# augment() binds .pred (predicted values) and .resid (residuals) to the data being predicted
# The fitted lm object (model_outputs) can also be plotted directly: ggplot() exposes its .fitted and .resid columns
augment(fit, new_data = test) %>% head()
## # A tibble: 6 × 14
##   .pred .resid    ID Income Limit Rating Cards   Age Education Gender   Student
##   <dbl>  <dbl> <int>  <dbl> <int>  <int> <int> <int>     <int> <fct>    <fct>  
## 1  663. -82.9      3  105.   7075    514     4    71        11 " Male"  No     
## 2  975. -10.7      4  149.   9504    681     3    36        11 "Female" No     
## 3  406. -75.1      5   55.9  4897    357     2    68        16 " Male"  No     
## 4 1106.  44.6      6   80.2  8047    569     4    77        10 " Male"  No     
## 5  290. -86.5      7   21.0  3388    259     2    37        12 "Female" No     
## 6  869.   3.07     8   71.4  7114    512     2    87         9 " Male"  No     
## # ℹ 3 more variables: Married <fct>, Ethnicity <fct>, Balance <int>
# Residuals vs fitted plot, drawn from the fitted lm object (model_outputs)
ggplot(model_outputs, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted Values", y = "Residuals") 

# Q-Q plot (car::qqPlot); the printed indices identify the two most extreme residuals
qqPlot(model_outputs$residuals)

## [1] 114  34
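
As a complement to the plots, formal diagnostic tests can be run on the extracted lm object; a short sketch using lmtest and car, which are already loaded above:

bptest(model_outputs)                    # Breusch-Pagan test for heteroscedasticity (lmtest)
shapiro.test(residuals(model_outputs))   # Shapiro-Wilk test for normality of residuals
durbinWatsonTest(model_outputs)          # Durbin-Watson test for autocorrelation (car)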

Cross-validation and Bootstrapping

We have one data set. In exploratory or inferential models, we use the full data set to develop a model in order to explain the relationship between the dependent and independent variables.

In machine learning (predictive models), we need to validate the model, i.e. check how well it predicts unknown observations. Since we do not have truly unknown observations, we validate the model using observations from the main data set that were not used for model development. For this, we have two approaches: the train-test split and cross-validation.

In the train-test split approach, we split the data set into training data (a randomly selected 70% of the observations) and test data. We develop the model using the training data and check its predictive ability on the test data. This approach is simple and computationally cheap.

In the cross-validation approach, we repeat the train-test idea multiple times. In each iteration, one part of the data is held out as the test set and the remainder is used as the training set. For example, in LOOCV (Leave-One-Out Cross-Validation) we hold out a single observation for testing and train on all the others; with 1000 observations we fit 1000 models, each time leaving out one observation. This is very rigorous but computationally expensive. In 10-fold cross-validation, we split the data into 10 parts; each part is held out once as the test set while the remaining 9 parts are used for training. More folds give a less biased error estimate, but at a higher computational cost and with more variance, and accuracy is not the only goal: the model should also balance bias and variance. For estimating model errors (e.g. RMSE), 10-fold cross-validation is commonly used, and the reported error is the simple average over all cross-validation models. Cross-validation is mainly used to compare different models and select the best one (see the sketch below for how these resamples are created).
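
A minimal sketch of how these resampling schemes are created with rsample (loaded as part of tidymodels); the object names are illustrative:

loo_folds = loo_cv(Credit)                         # LOOCV: 400 resamples, one held-out observation each
ten_folds = vfold_cv(Credit, v = 10)               # 10-fold cross-validation
rep_folds = vfold_cv(Credit, v = 10, repeats = 5)  # repeated 10-fold cross-validation
loo_folds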

In bootstrapping, we draw a number of resamples from the original data set, each of the same size as the original sample (n) and drawn with replacement. Because of the replacement, some observations are selected multiple times within a resample and some are not selected at all. The selected observations are used as the training data and the observations left out are used as the test data. If we draw B bootstrap resamples, we fit B models. Bootstrapping is mainly used to estimate the confidence intervals of a model's coefficients, which in turn determine the p-values of the independent variables.

If we apply cross-validation or bootstrapping, we do not strictly need to split the data set into training and test sets. However, we can also combine the approaches, as sketched below.
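
For example, a sketch of combining the approaches: cross-validate on the training data only, then use last_fit() for a final fit on the training set and evaluation on the held-out test set (object names are illustrative):

set.seed(1)
cv_on_train = fit_resamples(workflow, resamples = vfold_cv(train, v = 10))
collect_metrics(cv_on_train)

final_fit = last_fit(workflow, split)   # fits on training(split), evaluates on testing(split)
collect_metrics(final_fit)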

Cross-validation: for more accurate model errors

Figure: Cross-validation (source: https://www.tmwr.org/resampling)

Cross-validation: the data set is split into folds. In each iteration, the model is built on all the data except one held-out fold and is then tested on that held-out fold. The result is the average across folds. More folds decrease bias but increase variance; fewer folds mean more bias but less variance. Cross-validation is used for model selection (comparing \(R^2\) and RMSE) and for tuning the hyperparameters of a model (e.g. lambda in glmnet, or the number of trees), as sketched below.
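
A hedged sketch of such tuning: a lasso model whose penalty (lambda) is tuned with 10-fold cross-validation on the training data. The lasso specification, the dummy/normalization steps, the grid, and the object names are illustrative additions needed for glmnet, not part of the model fitted above:

# Recipe for glmnet: convert factors to dummies and normalize predictors
lasso_rec = recipe(Balance ~ Income + Limit + Cards + Age + Student, data = train) %>% 
            step_dummy(all_nominal_predictors()) %>% 
            step_normalize(all_numeric_predictors())

# Lasso specification with a tunable penalty (lambda)
lasso_spec = linear_reg(penalty = tune(), mixture = 1) %>% 
             set_engine('glmnet')

lasso_wf = workflow() %>% 
           add_model(lasso_spec) %>% 
           add_recipe(lasso_rec)

# Tune the penalty over a regular grid using 10-fold cross-validation
set.seed(1)
lasso_tuned = tune_grid(
    lasso_wf,
    resamples = vfold_cv(train, v = 10),
    grid = grid_regular(penalty(), levels = 20),
    metrics = metric_set(rmse)
)

select_best(lasso_tuned, metric = 'rmse')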

# 10-fold cross-validation
set.seed(1)
cv_results = fit_resamples(
    workflow,
    resamples = vfold_cv(Credit, v = 10),
    metrics = metric_set(rmse, rsq, mae),
    control = control_resamples(save_pred = TRUE)
)

# outputs
collect_metrics(cv_results)
## # A tibble: 3 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 mae     standard   79.5      10 2.47    Preprocessor1_Model1
## 2 rmse    standard   99.4      10 2.82    Preprocessor1_Model1
## 3 rsq     standard    0.954    10 0.00413 Preprocessor1_Model1
# Recall the single test-set results (without cross-validation)
# The cross-validated RMSE and MAE are lower, and R-squared is higher, than these single-split estimates
metrics(predictions, truth = Balance, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     108.   
## 2 rsq     standard       0.944
## 3 mae     standard      83.5
# Visualization
cv_metrics = collect_metrics(cv_results, summarize = FALSE)
cv_metrics %>% head()
## # A tibble: 6 × 5
##   id     .metric .estimator .estimate .config             
##   <chr>  <chr>   <chr>          <dbl> <chr>               
## 1 Fold01 rmse    standard     113.    Preprocessor1_Model1
## 2 Fold01 rsq     standard       0.945 Preprocessor1_Model1
## 3 Fold01 mae     standard      86.3   Preprocessor1_Model1
## 4 Fold02 rmse    standard      92.7   Preprocessor1_Model1
## 5 Fold02 rsq     standard       0.972 Preprocessor1_Model1
## 6 Fold02 mae     standard      79.4   Preprocessor1_Model1
ggplot(cv_metrics %>% filter(.metric == "rmse"),
       aes(x = id, y = .estimate, fill = id)) +
  geom_col() +
  labs(title = "RMSE per CV Fold", x = "Fold", y = "RMSE") +
  theme(legend.position = 'none')

# Actual vs predicted values
cv_preds <- collect_predictions(cv_results)

ggplot(cv_preds, aes(x = Balance, y = .pred)) +
  geom_point(alpha = 0.7) +
  geom_abline(linetype = "dashed", color = "red") +
  facet_wrap(~id, nrow = 2) +
  labs(title = "Predicted vs Actual across CV folds",
       x = "Actual balance",
       y = "Predicted balance")

Bootstrapping: for more accurate error and coefficient estimates

Figure: Bootstrapping (source: https://www.tmwr.org/resampling)

In bootstrapping, a number of samples are drawn from the data set, each with the same sample size as the original and drawn with replacement. As a result, about 63.2% of the distinct observations fall into the training (analysis) set and the remaining 36.8% form the test set (OOB, the out-of-bag sample) in each bootstrap resample. The results are then averaged and a confidence interval is calculated. This process reduces variance but may increase bias.
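
A quick sketch to verify these proportions on a single bootstrap resample (analysis() and assessment() are rsample's accessors for the in-bag and out-of-bag sets):

set.seed(1)
one_boot = bootstraps(Credit, times = 1)$splits[[1]]
length(unique(analysis(one_boot)$ID)) / nrow(Credit)    # distinct in-bag observations, roughly 0.632
nrow(assessment(one_boot)) / nrow(Credit)               # out-of-bag observations, roughly 0.368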

Bootstrapping is used to estimate uncertainty in a model, i.e. confidence intervals for quantities such as coefficients, AUC, and accuracy.

# Bootstrapping with 100 resamples
set.seed(1)
boot_results = fit_resamples(
    workflow,
    resamples = bootstraps(Credit, times = 100),
    metrics = metric_set(rmse, rsq, mae),
    control = control_resamples(save_pred = TRUE)
)

collect_metrics(boot_results)
## # A tibble: 3 × 6
##   .metric .estimator    mean     n  std_err .config             
##   <chr>   <chr>        <dbl> <int>    <dbl> <chr>               
## 1 mae     standard    80.0     100 0.366    Preprocessor1_Model1
## 2 rmse    standard   101.      100 0.539    Preprocessor1_Model1
## 3 rsq     standard     0.952   100 0.000586 Preprocessor1_Model1
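
Since the metric of each resample is also available, we can look at its spread; a minimal sketch of a 95% percentile interval for RMSE across the 100 bootstrap resamples:

boot_rmse = collect_metrics(boot_results, summarize = FALSE) %>% 
            filter(.metric == 'rmse')
quantile(boot_rmse$.estimate, probs = c(0.025, 0.5, 0.975))
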
# Recall the cross-validated metrics
# MAE and RMSE are slightly lower in the CV results than in the bootstrap results
# The difference may not look like this with a different set.seed()
collect_metrics(cv_results)
## # A tibble: 3 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 mae     standard   79.5      10 2.47    Preprocessor1_Model1
## 2 rmse    standard   99.4      10 2.82    Preprocessor1_Model1
## 3 rsq     standard    0.954    10 0.00413 Preprocessor1_Model1

Bootstrap confidence interval

# Draw bootstrap samples
# The resulting tibble has a splits column and an id column
# Each split contains an analysis (training) set and an assessment (testing) set
set.seed(123)
bootstraps = bootstraps(Credit, times = 100)
bootstraps
## # Bootstrap sampling 
## # A tibble: 100 × 2
##    splits            id          
##    <list>            <chr>       
##  1 <split [400/151]> Bootstrap001
##  2 <split [400/143]> Bootstrap002
##  3 <split [400/153]> Bootstrap003
##  4 <split [400/146]> Bootstrap004
##  5 <split [400/159]> Bootstrap005
##  6 <split [400/145]> Bootstrap006
##  7 <split [400/146]> Bootstrap007
##  8 <split [400/148]> Bootstrap008
##  9 <split [400/153]> Bootstrap009
## 10 <split [400/148]> Bootstrap010
## # ℹ 90 more rows
# See the first split: 400 analysis (training) rows and 151 assessment (out-of-bag testing) rows
bootstraps$splits[1]
## [[1]]
## <Analysis/Assess/Total>
## <400/151/400>
# Fit model on bootstrap samples (analysis set => training set)
# First define a function to extract coefficients using tidy()
boot_fit = function(split) {
    fit = lm_spec %>% 
      fit(Balance ~ Income + Limit + Cards + Age + Student, data = training(split))
    tidy(fit)
}
# Apply the function to every bootstrap sample and store the coefficients in a coefs column
# map() is similar to lapply(): it applies a function to each element (here, each split in bootstraps)
boot_coefs = bootstraps %>% 
             mutate(coefs = map(splits, boot_fit)) %>% 
             unnest(coefs)

boot_coefs %>% head()
## # A tibble: 6 × 7
##   splits            id           term     estimate std.error statistic   p.value
##   <list>            <chr>        <chr>       <dbl>     <dbl>     <dbl>     <dbl>
## 1 <split [400/151]> Bootstrap001 (Interc… -458.     21.8        -21.0  2.26e- 66
## 2 <split [400/151]> Bootstrap001 Income     -7.92    0.226      -35.1  2.78e-123
## 3 <split [400/151]> Bootstrap001 Limit       0.268   0.00346     77.4  1.85e-240
## 4 <split [400/151]> Bootstrap001 Cards      20.4     3.82         5.34 1.55e-  7
## 5 <split [400/151]> Bootstrap001 Age        -0.612   0.289       -2.12 3.49e-  2
## 6 <split [400/151]> Bootstrap001 Student…  415.     15.9         26.2  3.99e- 88
# Calculate bootstrap confidence interval
bootstrap_ci <- boot_coefs %>% 
  group_by(term) %>% 
  summarize(
    coeff = mean(estimate), 
    conf.low = quantile(estimate, probs = 0.025),  
    conf.high = quantile(estimate, probs = 0.975)
  )

bootstrap_ci
## # A tibble: 6 × 4
##   term           coeff conf.low conf.high
##   <chr>          <dbl>    <dbl>     <dbl>
## 1 (Intercept) -467.    -515.    -428.    
## 2 Age           -0.635   -1.15    -0.0862
## 3 Cards         23.0     14.9     28.8   
## 4 Income        -7.79    -8.15    -7.40  
## 5 Limit          0.267    0.259    0.275 
## 6 StudentYes   428.     404.     452.
# Recall the confidence intervals without bootstrapping
# Compare these coefficients and intervals with the bootstrap results above
summ(model_outputs, confint = TRUE)
Observations: 280
Dependent variable: ..y
Type: OLS linear regression
F(5,274) = 1260.61
R² = 0.96
Adj. R² = 0.96

                 Est.     2.5%    97.5%  t val.     p
(Intercept)   -480.98  -531.42  -430.55  -18.77  0.00
Income          -7.66    -8.21    -7.11  -27.50  0.00
Limit            0.27     0.26     0.28   64.07  0.00
Cards           23.84    15.93    31.74    5.94  0.00
Age             -0.68    -1.36     0.00   -1.96  0.05
StudentYes     424.38   388.43   460.33   23.24  0.00

Standard errors: OLS
# Visualization
ggplot(boot_coefs, aes(x = estimate)) +
    geom_histogram(bins = 30, fill = 'steelblue', color = 'white') +
    facet_wrap(~term, scales = 'free') +
    labs(title = 'Bootstrap coefficient distribution')
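
As an aside, rsample also provides int_pctl(), which computes the same percentile intervals directly from the nested (not yet unnested) coefficients; a minimal sketch, assuming the list-column is named coefs (int_pctl() typically recommends on the order of 1000 resamples):

# Percentile intervals from the nested bootstrap coefficients
boot_models = bootstraps %>% mutate(coefs = map(splits, boot_fit))
int_pctl(boot_models, coefs)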