Abstract

Today we are going to show you how to use a few very neat packages in R to build and evaluate multiple models using the gapminder dataset. We hope to show how these packages and workflows can be a powerful addition to any ML toolkit, letting you build and compare models more efficiently.

The suite of packages we will be emphasizing in this document includes:

  • tidyverse (in particular tidyr, purrr and dplyr)
  • broom
  • rsample
  • Metrics

These tools will allow us to create multiple models simultaneously and to compare and contrast them. The workflow for this process is based around list columns: columns whose data type is list, which means they can hold complex objects like fitted models!
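
To make the idea concrete, here is a tiny standalone illustration (a sketch, nothing to do with gapminder yet) of a tibble whose single list column holds three very different objects:

library(tibble)
# a list column can hold anything: an atomic vector, a data frame, even a fitted model
tibble(thing = list(1:3, mtcars, lm(mpg ~ wt, data = mtcars)))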

The workflow looks like this (a compact end-to-end sketch follows the list):

  1. Make a list column using nest
  2. Work with list columns using map and its family of functions
  3. Simplify the list columns using unnest
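
As a rough preview of how the three steps chain together (a minimal sketch; we walk through each step properly below):

library(tidyverse)
library(gapminder)
# the full nest -> map -> unnest pattern in one pipe
gapminder %>%
  group_by(country) %>%
  nest() %>%                                                       # 1. make the list column
  mutate(stats = map(data, ~tibble(mean_pop = mean(.x$pop)))) %>%  # 2. work on each nested tibble
  unnest(stats)                                                    # 3. simplify back to regular columns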

Using nest

Data + Function Introduction

library(tidyverse)
library(gapminder)
# explore gapminder
head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
# prepare nested dataframe: gap_nested
gap_nested <- gapminder %>%
                group_by(country) %>%
                nest()
# explore gap_nested
head(gap_nested)
## # A tibble: 6 x 2
## # Groups:   country [6]
##   country     data                 
##   <fct>       <list>               
## 1 Afghanistan <tibble[,5] [12 × 5]>
## 2 Albania     <tibble[,5] [12 × 5]>
## 3 Algeria     <tibble[,5] [12 × 5]>
## 4 Angola      <tibble[,5] [12 × 5]>
## 5 Argentina   <tibble[,5] [12 × 5]>
## 6 Australia   <tibble[,5] [12 × 5]>

You can see that all columns other than country (which we grouped by) have been bundled up together into the list column data.

Let’s now unnest that and see what happens.

gap_unnested <- gap_nested %>%
                  unnest(cols = c(data))
# explore gap_unnested
head(gap_unnested)
## # A tibble: 6 x 6
## # Groups:   country [1]
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Let’s test some operations on a subset of the nested data.

afghanistan <- gap_nested$data[[1]]
# explore the first nested tibble (it belongs to Afghanistan, the first country in gap_nested)
head(afghanistan)
## # A tibble: 6 x 5
##   continent  year lifeExp      pop gdpPercap
##   <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Asia       1952    28.8  8425333      779.
## 2 Asia       1957    30.3  9240934      821.
## 3 Asia       1962    32.0 10267083      853.
## 4 Asia       1967    34.0 11537966      836.
## 5 Asia       1972    36.1 13079460      740.
## 6 Asia       1977    38.4 14880372      786.
paste("Mean = ", round(mean(algeria$pop),2), ", Min = ", min(algeria$pop),", Max = ", max(algeria$pop) ) 
## [1] "Mean =  15823715.42 , Min =  8425333 , Max =  31889923"

Ok, we can see how the nest function works and how to interact with its resulting list column. Let's now use map to do the same calculation for each country in the gap_nested data.

pop_mean <- gap_nested %>%
                mutate(mean_pop = map_dbl(data, ~mean(.x$pop)))
# explore pop_mean
head(pop_mean)
## # A tibble: 6 x 3
## # Groups:   country [6]
##   country     data                   mean_pop
##   <fct>       <list>                    <dbl>
## 1 Afghanistan <tibble[,5] [12 × 5]> 15823715.
## 2 Albania     <tibble[,5] [12 × 5]>  2580249.
## 3 Algeria     <tibble[,5] [12 × 5]> 19875406.
## 4 Angola      <tibble[,5] [12 × 5]>  7309390.
## 5 Argentina   <tibble[,5] [12 × 5]> 28602240.
## 6 Australia   <tibble[,5] [12 × 5]> 14649312.

Let’s see what we just did here and how the map function works.

We used map_dbl because it coerces the result to a double (numeric) vector. If you were to use plain map instead, you would get a list column whose elements are length-one doubles.

Lastly, the arguments to map are as follows:
map_dbl(.x, .f), where .x is the list to iterate over (here the nested data column) and .f is the function to apply to each element, including which column of the nested tibble it should operate on (here .x$pop).
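
To see the difference between the two, a quick sketch run on the first few nested tibbles:

# map() keeps the results in a list (here, a list of length-one doubles)
map(gap_nested$data[1:3], ~mean(.x$pop))
# map_dbl() simplifies the same results into a plain numeric vector
map_dbl(gap_nested$data[1:3], ~mean(.x$pop))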

Model Building

Ok, now we have a good idea of what nest, unnest and map can do. Can you imagine how useful this might be if you want to group by a column and fit a separate model for each value of that column? This is where it becomes a very powerful technique.

gap_models <- gap_nested %>%
                mutate(model = map(data, ~lm(lifeExp ~ year, data = .x)))
# explore gap_models
head(gap_models)
## # A tibble: 6 x 3
## # Groups:   country [6]
##   country     data                  model 
##   <fct>       <list>                <list>
## 1 Afghanistan <tibble[,5] [12 × 5]> <lm>  
## 2 Albania     <tibble[,5] [12 × 5]> <lm>  
## 3 Algeria     <tibble[,5] [12 × 5]> <lm>  
## 4 Angola      <tibble[,5] [12 × 5]> <lm>  
## 5 Argentina   <tibble[,5] [12 × 5]> <lm>  
## 6 Australia   <tibble[,5] [12 × 5]> <lm>
# extract and check the summary of the Afghanistan model
afghanistan_model <- gap_models$model[[1]]
summary(afghanistan_model)
## 
## Call:
## lm(formula = lifeExp ~ year, data = .x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5447 -0.9905 -0.2757  0.8847  1.6868 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -507.53427   40.48416  -12.54 1.93e-07 ***
## year           0.27533    0.02045   13.46 9.84e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.223 on 10 degrees of freedom
## Multiple R-squared:  0.9477, Adjusted R-squared:  0.9425 
## F-statistic: 181.2 on 1 and 10 DF,  p-value: 9.835e-08

See how the new column model holds objects of class lm, which stands for linear model. Now we're going to add an extra package that will boost how powerful this method can be. The broom package allows you to extract model results as a dataframe so you can perform further operations on them. The 3 main functions we use from the broom package are:

  • tidy() - returns the model coefficients along with their standard errors, statistics and p-values
  • glance() - returns a one-row summary of model-level fit statistics such as \(R^2\), AIC and deviance
  • augment() - returns the original data augmented with per-observation values such as fitted values (.fitted) and residuals

library(broom)
# extract model coefficients for Afghanistan
tidy(afghanistan_model)
## # A tibble: 2 x 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept) -508.      40.5        -12.5 0.000000193 
## 2 year           0.275    0.0205      13.5 0.0000000984
# give us the fit statistics of the Afghanistan model
glance(afghanistan_model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic      p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>        <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.948         0.942  1.22      181. 0.0000000984     1  -18.3  42.7  44.1
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

How cool is that! It looks like the simple linear regression model works quite well with year as our single predictor. This might only be the case for Afghanistan though… See how we can read off those handy stats like the \(R^2\) and deviance. Let's now use the last function, augment, to add predicted/fitted values for each year in Afghanistan's data. They are stored in the augmented dataframe as .fitted.

# build augmented df
afghanistan_fitted <- augment(afghanistan_model)
# compare predicted with actual values for life expectancy
afghanistan_fitted %>%
        ggplot(aes(x = year)) + 
        geom_point(aes(y = lifeExp)) + 
        geom_line(aes(y = .fitted), color = "red")

Let’s now compare the coefficients of all of the countries in gapminder.

# extract coeffs from each model
model_coef <- gap_models %>%
                mutate(coef = map(model, ~tidy(.x)))
# simplify coeffs dataframe
model_coef <- model_coef %>%
                unnest(coef)
# plot histogram to see coefficient estimates
model_coef %>% 
        filter(term == "year") %>%
        ggplot(aes(x = estimate)) + 
        geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So it looks like most countries follow a positively sloped trend, with a couple decreasing slightly. That is, the majority of nations experienced growth in average life expectancy between 1952 and 2007 (the span of the gapminder data).
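
If you want to put a number on "a couple", a quick check is straightforward (a sketch; it just prints the count rather than quoting it here):

# how many countries have a negative year coefficient?
model_coef %>%
  ungroup() %>%
  filter(term == "year") %>%
  summarise(n_negative = sum(estimate < 0))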

Model Evaluation

Let’s check out the fit of the many models we have built thus far. Again, we will be using a combination of map, glance and unnest.

# Extract the fit statistics of each model into dataframes
model_perf_nested <- gap_models %>% 
    mutate(fit = map(model, ~glance(.x)))
# Simplify the fit dataframes for each model    
model_perf <- model_perf_nested %>% 
    unnest(fit)
# explore model_perf
head(model_perf)
## # A tibble: 6 x 15
## # Groups:   country [6]
##   country  data     model r.squared adj.r.squared sigma statistic  p.value    df
##   <fct>    <list>   <lis>     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>
## 1 Afghani… <tibble… <lm>      0.948         0.942 1.22      181.  9.84e- 8     1
## 2 Albania  <tibble… <lm>      0.911         0.902 1.98      102.  1.46e- 6     1
## 3 Algeria  <tibble… <lm>      0.985         0.984 1.32      662.  1.81e-10     1
## 4 Angola   <tibble… <lm>      0.888         0.877 1.41       79.1 4.59e- 6     1
## 5 Argenti… <tibble… <lm>      0.996         0.995 0.292    2246.  4.22e-13     1
## 6 Austral… <tibble… <lm>      0.980         0.978 0.621     481.  8.67e-10     1
## # … with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>, nobs <int>

Cool, how simple is that (kinda)? Let's first sort the models by \(R^2\) to see which fit worst, and then plot a histogram of the \(R^2\) distribution.

arrange(model_perf, r.squared)
## # A tibble: 142 x 15
## # Groups:   country [142]
##    country   data    model r.squared adj.r.squared sigma statistic p.value    df
##    <fct>     <list>  <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
##  1 Rwanda    <tibbl… <lm>     0.0172      -0.0811   6.56     0.175  0.685      1
##  2 Botswana  <tibbl… <lm>     0.0340      -0.0626   6.11     0.352  0.566      1
##  3 Zimbabwe  <tibbl… <lm>     0.0562      -0.0381   7.21     0.596  0.458      1
##  4 Zambia    <tibbl… <lm>     0.0598      -0.0342   4.53     0.636  0.444      1
##  5 Swaziland <tibbl… <lm>     0.0682      -0.0250   6.64     0.732  0.412      1
##  6 Lesotho   <tibbl… <lm>     0.0849      -0.00666  5.93     0.927  0.358      1
##  7 Cote d'I… <tibbl… <lm>     0.283        0.212    3.93     3.95   0.0748     1
##  8 South Af… <tibbl… <lm>     0.312        0.244    4.74     4.54   0.0588     1
##  9 Uganda    <tibbl… <lm>     0.342        0.276    3.19     5.20   0.0457     1
## 10 Congo, D… <tibbl… <lm>     0.348        0.283    2.43     5.34   0.0434     1
## # … with 132 more rows, and 6 more variables: logLik <dbl>, AIC <dbl>,
## #   BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>
# Plot a histogram of r.squared for the 142 models    
model_perf %>% 
  ggplot(aes(x = r.squared)) + 
  geom_histogram()  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Extract the 4 best fitting models
best_fit <- model_perf %>% 
                ungroup() %>%
                top_n(n = 4, wt = r.squared)
# Extract the 4 models with the worst fit
worst_fit <- model_perf %>% 
                ungroup() %>%
                top_n(n = 4, wt = -r.squared)
# explore best and worst fitters
head(best_fit)
## # A tibble: 4 x 15
##   country  data     model r.squared adj.r.squared sigma statistic  p.value    df
##   <fct>    <list>   <lis>     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>
## 1 Brazil   <tibble… <lm>      0.998         0.998 0.326     5111. 6.99e-15     1
## 2 France   <tibble… <lm>      0.998         0.997 0.220     4200. 1.86e-14     1
## 3 Maurita… <tibble… <lm>      0.998         0.997 0.408     4290. 1.68e-14     1
## 4 Switzer… <tibble… <lm>      0.997         0.997 0.215     3823. 2.98e-14     1
## # … with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>, nobs <int>
head(worst_fit)
## # A tibble: 4 x 15
##   country  data      model r.squared adj.r.squared sigma statistic p.value    df
##   <fct>    <list>    <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
## 1 Botswana <tibble[… <lm>     0.0340       -0.0626  6.11     0.352   0.566     1
## 2 Rwanda   <tibble[… <lm>     0.0172       -0.0811  6.56     0.175   0.685     1
## 3 Zambia   <tibble[… <lm>     0.0598       -0.0342  4.53     0.636   0.444     1
## 4 Zimbabwe <tibble[… <lm>     0.0562       -0.0381  7.21     0.596   0.458     1
## # … with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>, nobs <int>
# Build the augmented dataframe for each country model
best_augmented <- best_fit %>% 
                        mutate(augmented = map(model, ~augment(.x))) %>% 
                        unnest(augmented) # Expand the augmented dataframes
worst_augmented <- worst_fit %>% 
                        mutate(augmented = map(model, ~augment(.x))) %>% 
                        unnest(augmented)

So now we have extracted the augmented model values and unnested them for the 4 best and 4 worst fitting countries in our dataset. Let's visualise how they look.

best_augmented %>% 
  ggplot(aes(x = year)) +
  geom_point(aes(y = lifeExp)) + 
  geom_line(aes(y = .fitted), color = "green") +
  facet_wrap(~country, scales = "free_y")

worst_augmented %>% 
  ggplot(aes(x = year)) +
  geom_point(aes(y = lifeExp)) + 
  geom_line(aes(y = .fitted), color = "red") +
  facet_wrap(~country, scales = "free_y")

So it looks like some countries, particularly those in Africa, have a non-linear trend in their average life expectancy between 1952 and 2007. A simple linear model is pretty useless in these instances and we might want to use a different model to capture the trend. We can either extend the model to a multiple linear regression using the other variables, or use another regression method (since our outcome variable is numeric) like polynomial regression, random forests or even a neural network if you think you're that cool ;) .
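
As a minimal sketch of the first option (purely illustrative, and not tuned or validated in any way), the only change needed is the formula passed to lm:

# sketch: a richer per-country model using the other numeric predictors
gap_models_multi <- gap_nested %>%
  mutate(model = map(data, ~lm(lifeExp ~ year + pop + gdpPercap, data = .x)))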

Also, just remember: a model that fits the training data super well will not necessarily be great at predicting new data.

Tune and Rebuild Models

Alrighty! So we are almost there. We have seen how we can use these packages to do some really cool modelling quite quickly. Now we want to test and evaluate our models on new data so we can deploy them into the real world. There are a few ways you can go about this; we are going to cover 3 of the most common methods.

From the simplest to the most complicated, this is roughly what each method involves:

  • train - test split: we split the data into 2 sets, where the training set (roughly 70%) is used to build the model and the remaining 30% is used to test that model on unseen data. This gives a fairer evaluation of how the model performs out of sample.
  • train - validate - test split: a slight variation on the above that splits the training data further. The new portion (validate) is used to compare candidate models and pick the best one, which you then use to predict on the test set.
  • cross validation: the most robust of the 3. We split the data into equal parts, e.g. 10 portions (folds), train the model on 9 of them and test on the single remaining one, then rotate the held-out portion until each fold has been used for testing. The results are averaged to give a much fairer estimate of the model's accuracy.

There are a few different packages that can do this for you, and it can also be done quite easily with base R functions. We will use rsample.
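
For reference, a base R version of the 75/25 split we are about to make with rsample could look like this (a sketch only; we stick with rsample from here on):

# base R sketch of a 75/25 train-test split
set.seed(23)
idx <- sample(nrow(gapminder), size = floor(0.75 * nrow(gapminder)))
train_base <- gapminder[idx, ]
test_base  <- gapminder[-idx, ]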

library(rsample)
# use a seed for reproducibility
set.seed(23)
# prepare initial split object
gap_split <- initial_split(gapminder, prop = 0.75)
# Extract the training dataframe
training_data <- training(gap_split)
# Extract the testing dataframe
testing_data <- testing(gap_split)
# Calculate the dimensions of both training_data and testing_data
dim(training_data)
## [1] 1278    6
dim(testing_data)
## [1] 426   6

So you can see the functions in the rsample package have split our training and test sets. Training data has 1278 rows while the test data has 426 rows. We can also create cross validation partitions (aka k folds) very easily.

# Prepare the dataframe containing the cross validation partitions
cv_split <- vfold_cv(training_data, v = 5)
cv_data <- cv_split %>% 
  mutate(
    train = map(splits, ~training(.x)),  # Extract the train dataframe for each split
    validate = map(splits, ~testing(.x)) # Extract the validate dataframe for each split
  )
# explore cv_data
head(cv_data)
## #  5-fold cross-validation 
## # A tibble: 5 x 4
##   splits             id    train                    validate              
##   <list>             <chr> <list>                   <list>                
## 1 <split [1022/256]> Fold1 <tibble[,6] [1,022 × 6]> <tibble[,6] [256 × 6]>
## 2 <split [1022/256]> Fold2 <tibble[,6] [1,022 × 6]> <tibble[,6] [256 × 6]>
## 3 <split [1022/256]> Fold3 <tibble[,6] [1,022 × 6]> <tibble[,6] [256 × 6]>
## 4 <split [1023/255]> Fold4 <tibble[,6] [1,023 × 6]> <tibble[,6] [255 × 6]>
## 5 <split [1023/255]> Fold5 <tibble[,6] [1,023 × 6]> <tibble[,6] [255 × 6]>

Let us now build a model on the training data of each CV fold, this time using all of the other columns as predictors (lifeExp ~ .).

cv_models_lm <- cv_data %>% 
  mutate(model = map(train, ~lm(formula = lifeExp ~ ., data = .x)))
# create columns for actual and predicted values 
cv_prep_lm <- cv_models_lm %>% 
  mutate(
    validate_actual = map(validate, ~.x$lifeExp),
    validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
  )

Before we get to the next bit, I just wanted to quickly point out that there are a few ways one can evaluate a regression model. We have already seen one popular metric, the \(R^2\). A few others include:

  • Mean Absolute Error (MAE) - the mean absolute difference between the actual and predicted/fitted values
  • Mean Squared Error (MSE) - the mean squared difference between the actual and predicted/fitted values
  • Root Mean Squared Error (RMSE) - the square root of the MSE, which brings it back into the same units as the outcome (a quick hand-rolled sketch of all three follows)
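
These are simple enough to write by hand; a minimal sketch, assuming y is a vector of actual values and yhat the corresponding predictions:

# hand-rolled versions of the three metrics (illustrative only)
mae_manual  <- function(y, yhat) mean(abs(y - yhat))
mse_manual  <- function(y, yhat) mean((y - yhat)^2)
rmse_manual <- function(y, yhat) sqrt(mean((y - yhat)^2))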

We will be using the Metrics package to calculate these easily for us.

library(Metrics)
# Calculate the mean absolute error for each validate fold       
cv_eval_lm <- cv_prep_lm %>% 
  mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))
# Print the validate_mae column
cv_eval_lm$validate_mae
## [1] 2.931490 2.761723 2.728796 2.757771 2.886389
# Calculate the mean of validate_mae column
mean(cv_eval_lm$validate_mae)
## [1] 2.813234
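
The same map2_dbl pattern works for any of the other metrics; for example, a sketch adding RMSE with Metrics::rmse:

# add RMSE for each validate fold using the same pattern (sketch)
cv_eval_lm <- cv_eval_lm %>%
  mutate(validate_rmse = map2_dbl(validate_actual, validate_predicted, ~rmse(actual = .x, predicted = .y)))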

Building, Tuning & Evaluating but for Classification