This is the latest in my series of screencasts demonstrating how to use the tidymodels packages, from starting out with first modeling steps to tuning more complex models.

Explore the data

Our modeling goal is to understand how student debt and inequality have been changing over time. We can build a model to understand the relationship between student debt, race, and year.

library(tidyverse)
## -- Attaching packages ---------------------------------- tidyverse 1.3.1.9000 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
student_debt <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-09/student_debt.csv")
## Rows: 30 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): race
## dbl (3): year, loan_debt, loan_debt_pct
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
student_debt
## # A tibble: 30 x 4
##     year race     loan_debt loan_debt_pct
##    <dbl> <chr>        <dbl>         <dbl>
##  1  2016 White       11108.         0.337
##  2  2016 Black       14225.         0.418
##  3  2016 Hispanic     7494.         0.219
##  4  2013 White        8364.         0.285
##  5  2013 Black       10303.         0.412
##  6  2013 Hispanic     3177.         0.157
##  7  2010 White        8042.         0.280
##  8  2010 Black        9510.         0.321
##  9  2010 Hispanic     3089.         0.144
## 10  2007 White        5264.         0.197
## # ... with 20 more rows

This is a very small data set, and we can build a visualization to understand it better.

theme_set(theme_light())

student_debt %>%
  ggplot(aes(year, loan_debt_pct, color = race)) +
  geom_point(size = 2.5, alpha = 0.8) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = NULL, y = '% of families with student loan debt', color = NULL)
## `geom_smooth()` using formula 'y ~ x'

Notice that the proportion of families with student loan debt has been rising (dramatically!) but at different rates for different races/ethnicities.
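
If you want a quick numeric check on that claim (not from the original screencast, just a sketch), compare each group's earliest and latest observations:

# change in the share of families with student loan debt,
# earliest vs. latest year, per group
student_debt %>%
  group_by(race) %>%
  summarize(
    change = loan_debt_pct[which.max(year)] - loan_debt_pct[which.min(year)]
  )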

Build a model

We can start by loading the tidymodels metapackage and building a straightforward model specification for linear regression.

library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## -- Attaching packages --------------------------------- tidymodels 0.1.3.9000 --
## v broom        0.7.9      v rsample      0.1.0 
## v dials        0.0.9      v tune         0.1.6 
## v infer        1.0.0      v workflows    0.2.3 
## v modeldata    0.1.1      v workflowsets 0.1.0 
## v parsnip      0.1.7      v yardstick    0.0.8 
## v recipes      0.1.16
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
lm_spec <- linear_reg() %>% set_engine('lm')
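
If you are curious what this spec becomes under the hood, parsnip's translate() shows the engine call it will generate (here, plain stats::lm()):

# inspect how parsnip will translate this spec for the lm engine
lm_spec %>% translate()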

Let’s fit that model to our data, using an interaction term to account for how the rates/slopes have been changing at different, well, rates for the different groups.

head(student_debt)
## # A tibble: 6 x 4
##    year race     loan_debt loan_debt_pct
##   <dbl> <chr>        <dbl>         <dbl>
## 1  2016 White       11108.         0.337
## 2  2016 Black       14225.         0.418
## 3  2016 Hispanic     7494.         0.219
## 4  2013 White        8364.         0.285
## 5  2013 Black       10303.         0.412
## 6  2013 Hispanic     3177.         0.157
lm_fit <- lm_spec %>%
  fit(loan_debt_pct ~ year * race, data = student_debt)

lm_fit
## parsnip model object
## 
## Fit time:  10ms 
## 
## Call:
## stats::lm(formula = loan_debt_pct ~ year * race, data = data)
## 
## Coefficients:
##       (Intercept)               year       raceHispanic          raceWhite  
##        -21.161193           0.010690          15.563202           5.933064  
## year:raceHispanic     year:raceWhite  
##         -0.007827          -0.002986

What do we do with this now, to understand it better? We could tidy() the model to get a dataframe.

tidy(lm_fit)
## # A tibble: 6 x 5
##   term               estimate std.error statistic      p.value
##   <chr>                 <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)       -21.2       2.58        -8.20 0.0000000204
## 2 year                0.0107    0.00129      8.29 0.0000000166
## 3 raceHispanic       15.6       3.65         4.26 0.000270    
## 4 raceWhite           5.93      3.65         1.63 0.117       
## 5 year:raceHispanic  -0.00783   0.00182     -4.29 0.000250    
## 6 year:raceWhite     -0.00299   0.00182     -1.64 0.114
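
tidy() gives coefficient-level results; if you also want model-level summaries (R-squared, AIC, and so on), one option is broom's glance() on the underlying engine object, which parsnip stores in lm_fit$fit (a quick sketch, not from the original screencast):

# model-level summary statistics from the underlying lm object
glance(lm_fit$fit)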

However, I find it hard to look at model coefficients like this with an interaction term and know what is going on! This is also true of many kinds of models where the model output doesn’t give you a lot of insight into what the model is doing.
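
If you do want to decode the interaction terms by hand, each group's implied slope is the baseline year coefficient plus that group's year interaction term. Here is a minimal sketch (Black is the reference level, so its slope is the year coefficient itself):

# recover each group's implied slope from the fitted coefficients;
# lm_fit$fit is the underlying lm object created by parsnip
coefs <- coef(lm_fit$fit)
tibble(
  race  = c('Black', 'Hispanic', 'White'),
  slope = c(
    coefs['year'],
    coefs['year'] + coefs['year:raceHispanic'],
    coefs['year'] + coefs['year:raceWhite']
  )
)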

Explore results

Instead, we can use augment() to explore our model in a situation like this. The augment() function adds columns for predictions given data. To do that we need some data, so let’s make some up.

new_points <- crossing(
  race = c('Black', 'Hispanic', 'White'),
  year = 1990:2020
)

new_points
## # A tibble: 93 x 2
##    race   year
##    <chr> <int>
##  1 Black  1990
##  2 Black  1991
##  3 Black  1992
##  4 Black  1993
##  5 Black  1994
##  6 Black  1995
##  7 Black  1996
##  8 Black  1997
##  9 Black  1998
## 10 Black  1999
## # ... with 83 more rows

This is way more points than we used to train this model, actually.
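
A quick check makes the comparison concrete:

nrow(new_points)   # 93 prediction points
nrow(student_debt) # vs. 30 observations used to fit the model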

Now we can augment() this data with predictions, and then make a visualization to understand how the model is behaving.

augment(lm_fit, new_data = new_points) %>%
  ggplot(aes(year, .pred, color = race)) +
  geom_line(size = 1.2, alpha = 0.7) +
  labs(x = NULL, y = '% of families with student loan debt', color = NULL)
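
One optional extension (my addition, not from the screencast): overlay the observed data on the predicted lines, which makes it easy to see where the model is interpolating versus extrapolating beyond the years it saw.

# predictions as lines, observed values as points
augment(lm_fit, new_data = new_points) %>%
  ggplot(aes(year, .pred, color = race)) +
  geom_line(size = 1.2, alpha = 0.7) +
  geom_point(data = student_debt, aes(year, loan_debt_pct), size = 2) +
  labs(x = NULL, y = '% of families with student loan debt', color = NULL)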