Functional programming

Gosh, this is a very important topic. The ability to write functions–to concisely express some statistical calculation, and have R do all the tedious record keeping of what goes with what–is a massive multiplier for your powers as a bench scientist.

Functions let you follow maybe the most important axiom of programming – that the code you write should not be

Write
Everything
Twice

instead, it should be

Don’t
Repeat
Yourself

This quality really separates the pros from the joes. When we get started writing code, and we’re interested in estimating a model with a state-level fixed effect for counties’ presidential vote share, we might write:

library(tidyverse)
library(magrittr)

pres <- "https://github.com/thomasjwood/ps7160/raw/master/countyvote/county_pres_1868_2020.rds" %>% 
  url %>% 
  readRDS %>% 
  filter(
    election_year >= 1948
  ) %>% 
  mutate(
    # GOP percentage of the two party vote
    gop_perc = republican_raw_votes %>% 
      divide_by(
        pres_raw_county_vote_totals_two_party
      ) %>% 
      multiply_by(100)
  )

lm_pres <- lm(
  gop_perc ~ state, 
  data = pres
)

But then, we realize we’re missing the obvious and important year-to-year variation! We might want to estimate separate models for each year–so we’d write:

lm_48 <- lm(
  gop_perc ~ state, 
  data = pres %>% 
    filter(
      election_year == 1948
    )
)

lm_52 <- lm(
  gop_perc ~ state, 
  data = pres %>% 
    filter(
      election_year == 1952
    )
)

And on and on and on… gosh.

What a hassle 🥱.

How do other languages solve this problem? With a for loop:

for (year in seq(1948, 2020, 4)) {
  lm(
    gop_perc ~ state, 
    data = pres %>% 
      filter(
        election_year == year
      )
  ) %>% 
    print
}

Yeech, this is also cringe.1

In its place, we’re going to use the tidyverse’s functional programming toolkit, purrr.

How lucky we are to have Hadley, who understands our desire for powerful, intuitive software, and also appreciates our generation’s interest in cloying cuteness.

The purrr library

purrr simply replicates the existing functional programming tools in base R, but aims to be more consistent, more performant, and to use a clearer naming convention.

It’s like dplyr, in that the performance advantages are not the point. The advantage is legibility and clarity. If you pay the opening costs of the purrr approach, eventually, when you see purrr syntax, the record-keeping pieces of syntax will just vanish, and you’ll see the author’s intent.

purrr features many strengths of the tidyverse approach–most conspicuously, a common map prefix, which lets you use RStudio’s autocomplete.

| purrr function | base equivalent | Purpose |
|---|---|---|
| map() | lapply() | apply a function to each element of a list or vector |
| map2() | mapply() | apply a function to the corresponding elements of two lists |
| pmap() | --- | apply a function to the corresponding elements of n lists or vectors |
| map_dbl() / map_lgl() / map_int() / map_chr() | sapply() / vapply() | apply a function to each element of a list or vector, restricting the output to a double, logical, integer, or character vector, respectively |
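
The correspondence is direct. Here’s the same operation both ways–a list of square roots, once in base R and once in purrr:

lapply(1:3, sqrt)

1:3 %>% 
  map(sqrt)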

Whoa, there’s a lot to cover. Let’s get to typing.

Applying with map()

map(
  .x = c(1, 2, 3, 4),
  .f =  function(i)
    LETTERS[seq(i)]
)

Just indulge me a moment:

  • .x is the list/vector we’re operating on

  • .f is the function (or thing) we’ll do to each element of .x

But because of pipes and such things, we can be more concise:

1:4 %>% 
  map(
    \(i)
    LETTERS[seq(i)]
  )

A new piece of tech:

\(i)

is a nice piece of shorthand–it’s just short for function(i). (This lambda shorthand needs R 4.1 or later.)
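
So these two calls are interchangeable:

map(1:4, function(i) LETTERS[seq(i)])

map(1:4, \(i) LETTERS[seq(i)])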

How would we use map to replace the for loop we used up above? We’d do this:

seq(1948, 2020, 4) %>% 
  map(
    \(i)
    lm(
      gop_perc ~ state, 
      data = pres %>% 
        filter(
          election_year == i
        )
    )
  ) 
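
One nicety the for loop lacked: map actually returns its results, so we can keep them. And purrr’s set_names will label each element by its year–a small sketch (lm_by_year is just an illustrative name):

lm_by_year <- seq(1948, 2020, 4) %>% 
  set_names %>% # name each list element after its year
  map(
    \(i)
    lm(
      gop_perc ~ state, 
      data = pres %>% 
        filter(
          election_year == i
        )
    )
  )

# retrieve any year's model by name
lm_by_year[["1960"]]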

What’s a common social science application? When we map over a list of models!

str_c(
  "hwy ~ ",
  c(
    "displ",
    "displ + cyl",
    "displ + cyl + fl"
    )
  ) %>% 
  map(
    \(i)
    i %>% 
      lm(
        data = mpg
      ) %>% 
      summary
    ) 
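
A natural next step is pulling a single fit statistic out of each model–this is where the type variants from the table above earn their keep. A sketch:

str_c(
  "hwy ~ ",
  c(
    "displ",
    "displ + cyl",
    "displ + cyl + fl"
    )
  ) %>% 
  map(
    \(i)
    lm(i, data = mpg)
  ) %>% 
  map_dbl(
    \(i)
    summary(i)$r.squared # one R squared per specification
  )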

Or maybe you want to compare the fit of different factor analysis models, as a function of how many factors you’re trying to extract?

1:4 %>%  
  map(
    \(i)
    factanal(
      factors = i,
      covmat = mtcars %>% 
        cor
      )
    ) 
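
To actually compare those fits, we can pull the optimizer’s objective value out of each fitted object–a sketch, assuming we only need the criteria element that factanal stores:

1:4 %>%  
  map(
    \(i)
    factanal(
      factors = i,
      covmat = mtcars %>% 
        cor
      )
    ) %>% 
  map_dbl(
    \(i)
    i$criteria[["objective"]] # smaller objective = better fit
  )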

map2()

When we want to match items across two lists of equal length, we use map2().

letters[1:10] %>% 
  map2(
    1:10,
    \(i, j)
    str_c(
      i, j, 
      sep = "_"
      )
    ) 

pmap()

We use pmap when we have more than two lists/vectors and we want to map over them simultaneously. We pass them as a list of lists (or vectors):

list(
  n = c(100, 1000, 10000),
  mu = c(-5, 0, 5),
  sd = c(10, 30, 60)
  ) %>%
  pmap(
    \(n, mu, sd)
    rnorm(n, mu, sd) %>% 
      fivenum
  )
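
And since a data frame is just a list of equal-length columns, pmap happily maps over the rows of a tibble, matching column names to the function’s argument names:

tibble(
  n = c(100, 1000, 10000),
  mu = c(-5, 0, 5),
  sd = c(10, 30, 60)
  ) %>%
  pmap(
    \(n, mu, sd)
    rnorm(n, mu, sd) %>% 
      mean
  )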

We’ll come back to pmap later.

map’s type variants: map_int, map_dbl, map_lgl, map_chr.

These are useful when your function returns a single value–instead of a list, you get back an atomic vector of the matching type. Why that matters will become apparent shortly.
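
A quick taste (using the built-in mtcars data, and the fact that a data frame is a list of columns):

mtcars %>% 
  map_dbl(mean) # a named double vector, one mean per column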

map’s superpower–when we’re working with nested list columns

Up till now, we’ve been operating on simple vectors of numbers or characters. But data frames themselves can be the things we map over!

Let’s load an analysis data frame on the 2020 presidential election, with presidential votes by county:

t20 <- "https://github.com/thomasjwood/ps7160/raw/master/election_analysis/election_context_2020.csv" %>% 
  read_csv %>% 
  mutate(
    # change in the GOP share of the two party vote, 2016 to 2020
    gop_dif = trump20 %>% 
      divide_by(
        biden20 %>% add(trump20)
      ) %>% 
      subtract(
       trump16 %>%
         divide_by(
           clinton16 %>% add(trump16)
           ) 
      ) 
    )

Here’s the kind of thing political scientists often do when armed with electoral returns–checking which area-level covariates are correlated with an electoral shift.

The piece of tech we need is nest():

t20n <- t20 %>% 
  group_by(state) %>%
  nest %>%
  filter(
    state != "District of Columbia"
  ) %>% 
  mutate(
    # one income model per state, fit to that state's counties
    mods = data %>% 
      map(
        \(i)
        lm(
          gop_dif ~ median_hh_inc, data = i
        )
      )
  )

A fair bit happened, so let’s make sure we’re current:

  • t20n has 49 rows and t20 has 3,114. But no data have been lost (well, we threw away DC, which isn’t a state and doesn’t have counties).

  • Confirm this for yourself (spelled out as runnable code below):

    • print t20n$data

    • try t20n$data %>% map(head)

    • try t20n$data %>% map_int(nrow) or t20n$data %>% map_int(ncol)

    • try t20n$data %>% map(\(i) i$gop_dif %>% hist)

  • This bears emphasis:

    • t20n$data is every row of the original t20, split by state

    • t20n$mods is a list of linear models regressing change in GOP presidential vote by county on household income, split by state
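
Spelled out as runnable code:

# each element of the list column is one state's counties
t20n$data %>% 
  map(head)

# counties per state...
t20n$data %>% 
  map_int(nrow)

# ...which should sum back to (nearly) the original row count
t20n$data %>% 
  map_int(nrow) %>% 
  sum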

Whoa, this is powerful!

Let’s see which states have the largest and smallest coefficients for income:

library(broom)

t20n$coef <- t20n$mods %>% 
  map_dbl(
    \(k)
    k %>% 
      tidy %>% 
      pluck("estimate", 2) # the slope on median_hh_inc
    )
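
(Equivalently, skipping broom–base R’s coef pulls the same slope:)

t20n$mods %>% 
  map_dbl(
    \(k)
    coef(k)[2]
  )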

t20n %<>%
  arrange(
    abs(coef) %>% 
      desc
    )

Which states have the most and least income-stratified vote differences?

t20n %>% 
  ungroup %>% 
  slice(
    c(1:5, 45:49)
  )
## # A tibble: 10 × 4
##    state          data                mods            coef
##    <chr>          <list>              <list>         <dbl>
##  1 Arkansas       <tibble [75 × 42]>  <lm>   -0.00000227  
##  2 Maine          <tibble [16 × 42]>  <lm>   -0.00000207  
##  3 South Carolina <tibble [46 × 42]>  <lm>   -0.00000201  
##  4 Texas          <tibble [254 × 42]> <lm>   -0.00000198  
##  5 Delaware       <tibble [3 × 42]>   <lm>    0.00000173  
##  6 New York       <tibble [62 × 42]>  <lm>   -0.000000101 
##  7 Washington     <tibble [39 × 42]>  <lm>    0.0000000970
##  8 Montana        <tibble [56 × 42]>  <lm>    0.0000000685
##  9 Oregon         <tibble [36 × 42]>  <lm>    0.0000000626
## 10 North Dakota   <tibble [53 × 42]>  <lm>    0.0000000422

R’s taking care of which model corresponds with which state. This becomes even more apparent when we’re nesting multiple models per state (just for illustrative purposes, let’s predict by counties’ racial composition, by educational attainment, and by rural percentage):

t20n  %<>% 
  ungroup %>% 
  select(
    -c(mods, coef)
    ) 

t20n[,
     str_c(
       "mod_",
       c("race",
         "educ",
         "rural"
         )
       )
     ] <- str_c(
       "gop_dif ~ ",
       c("white_pct",
         "lesscollege_pct",
         "rural_pct")
       ) %>% 
  map(
    \(k)
    # for each formula k, fit one model per state's counties
    t20n$data %>% 
      map(
        \(l)
        lm(k, data = l)
      )
    )

t20n
## # A tibble: 49 × 5
##    state          data                mod_race mod_educ mod_rural
##    <chr>          <list>              <list>   <list>   <list>   
##  1 Arkansas       <tibble [75 × 42]>  <lm>     <lm>     <lm>     
##  2 Maine          <tibble [16 × 42]>  <lm>     <lm>     <lm>     
##  3 South Carolina <tibble [46 × 42]>  <lm>     <lm>     <lm>     
##  4 Texas          <tibble [254 × 42]> <lm>     <lm>     <lm>     
##  5 Delaware       <tibble [3 × 42]>   <lm>     <lm>     <lm>     
##  6 Oklahoma       <tibble [77 × 42]>  <lm>     <lm>     <lm>     
##  7 Rhode Island   <tibble [5 × 42]>   <lm>     <lm>     <lm>     
##  8 North Carolina <tibble [100 × 42]> <lm>     <lm>     <lm>     
##  9 Indiana        <tibble [92 × 42]>  <lm>     <lm>     <lm>     
## 10 Georgia        <tibble [159 × 42]> <lm>     <lm>     <lm>     
## # ℹ 39 more rows

Mmmm, I sense a pivot_longer in the offing! (By the way, we’re keeping track of 49 × 3 = 147 models here! 😰)

p1 <- t20n  %>% 
  ungroup %>% 
  pivot_longer(
    names_to = "mod_type", 
    values_to = "mods", 
    cols = starts_with("mod_")
    ) %>% 
  mutate(
    coef = mods %>% 
      map_dbl(
        \(i)
        i %>% 
          tidy %>% 
          pluck("estimate", 2)
      ),
    state = state %>% 
      fct_reorder(coef)
  ) %>% 
  arrange(
    mod_type, state
  ) %>% 
  ggplot(
    aes(
      coef, state, color = mod_type, group = mod_type
    )
  ) +
  geom_vline(
    xintercept = 0,
    linetype = "dashed",
    linewidth = .3
  ) +
  geom_path() +
  geom_point() +
  facet_grid(
   . ~ mod_type, space = "free_x", scales = "free_x" 
  ) +
  theme(
    legend.position = "none"
    )
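
Assigning to p1 doesn’t draw anything by itself–we still have to print it:

p1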

Which should return:

[Figure: coefficient estimates by state, one panel per predictor (education, race, rural), with a dashed reference line at zero]

Whoa, education polarization is a big thing! And just so we appreciate what we just did–we’re reporting 147 separate regression models, in ~20 lines of code! And we got to use our old friends from the tidyverse!


  1. Now, the true R elect denigrate for loops constantly as slow and inefficient, but that’s not an informed attitude–many base R functions merely wrap loops written in compiled code. For loops are just uncool and clunky.↩︎