Functional Programming II

Let’s reprise some of the purrr toolkit. It’s a very general way to solve the problems we meet in applied statistics.

First, our friend map

library(tidyverse)
library(magrittr)
library(palmerpenguins)

map(
  .x = c(2, 4, 8, 16, 32),
  .f = function(i){
    i ^ 2
  }
)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 16
## 
## [[3]]
## [1] 64
## 
## [[4]]
## [1] 256
## 
## [[5]]
## [1] 1024
map(
  .x = 1:5, 
  .f = function(i){
    letters %>% 
      extract(1:i)
  }
)
## [[1]]
## [1] "a"
## 
## [[2]]
## [1] "a" "b"
## 
## [[3]]
## [1] "a" "b" "c"
## 
## [[4]]
## [1] "a" "b" "c" "d"
## 
## [[5]]
## [1] "a" "b" "c" "d" "e"

What does each argument do?

  • .x is the list or vector you’re operating on. This is a very general thing to work with–factor levels, formula, lists of tibbles, etc. The obstacle here is really only your imagination.

  • .f is the function, or operation, you’re going to do to every element of .x . (You’ve seen me using \() as a shortcut for function() )

  • i is just a variable or index. Every element of .x you plan to do .f to, is referred to as i.

Just to make sure you understand what’s going on, and how general map is a toolkit:

1:4 %>% 
  map(
    \(i)
    str_c(
      i, " * ", i, " = ", i*i
    )
  )

list(
  c("Alabama", "Alaska", "Arizona"),
  c("California", "Colorado", "Connecticut"),
  c("Maine", "Maryland", "Massachusetts","Michigan",
    "Minnesota", "Mississippi", "Missouri", "Montana")
) %>% 
  map(
    \(g)
    us_rent_income %>% 
      filter(
        NAME %>% 
          is_in(g)
      )
  )

c("bill", "flipper", "body") %>% 
  map(
    \(k)
    penguins %>% 
      select(
        species, island, starts_with(k)
      )
  )

As I’ve said elsewhere, you’re really only limited by your ingenuity.

map also has some cousin functions which require your output to have a specific form (a particular kind of programmer autism mandates type stability. These cousins include map_chr, map_int, map_lgl, map_dbl, etc)

1:4 %>% 
  map(
    \(i)
    str_c(
      i, " * ", i, " = ", i*i
    )
  )

list(
  c("Alabama", "Alaska", "Arizona"),
  c("California", "Colorado", "Connecticut"),
  c("Maine", "Maryland", "Massachusetts","Michigan",
    "Minnesota", "Mississippi", "Missouri", "Montana")
) %>% 
  map(
    \(g)
    us_rent_income %>% 
      filter(
        NAME %>% 
          is_in(g)
      )
  )

c("bill", "flipper", "body") %>% 
  map(
    \(k)
    penguins %>% 
      select(
        species, island, starts_with(k)
      )
  )

map2 is a very useful special case, where you have two lists of equal length, and you want to operate over both lists/vectors simultaneously, matching each element by their respective position.

map2_chr(
  .x = state.name[1:10],
  .y = state.abb[1:10],
  .f = \(i, j)
  
  str_c(
    i," - ", j
  )
)

Functional Placeholders (Or, a dirty secret that a lowly bench social scientists can use to write useful functions)

A placeholder is when you manually set the indexant to a representative value so that you can practice on this representative example. You then get the function to work on this particular example, before you turn this function loose on the entire list.

I used to loathe that I did this, since it’s obviously messy programming practice, but I’ve come around in the fullness of time.

Let’s try this method with an example, recoding income from the ANES.

# if necessary
# remotes::install_github("jamesmartherus/anesr")

library(anesr)
library(haven)

data("timeseries_2020")

t20 <- timeseries_2020 %>% 
  as_factor %>% 
  select(
    V200001, V202468x
  )

inc_levs <- t20$V202468x %>%
  levels

inc_levs
##  [1] "-9. Refused"                         "-5. Breakoff, sufficient partial IW"
##  [3] "1. Under $9,999"                     "2. $10,000-14,999"                  
##  [5] "3. $15,000-19,999"                   "4. $20,000-24,999"                  
##  [7] "5. $25,000-29,999"                   "6. $30,000-34,999"                  
##  [9] "7. $35,000-39,999"                   "8. $40,000-44,999"                  
## [11] "9. $45,000-49,999"                   "10. $50,000-59,999"                 
## [13] "11. $60,000-64,999"                  "12. $65,000-69,999"                 
## [15] "13. $70,000-74,999"                  "14. $75,000-79,999"                 
## [17] "15. $80,000-89,999"                  "16. $90,000-99,999"                 
## [19] "17. $100,000-109,999"                "18. $110,000-124,999"               
## [21] "19. $125,000-149,999"                "20. $150,000-174,999"               
## [23] "21. $175,000-249,999"                "22. $250,000 or more"

Imagine we wanted to replace each factor level with a coarse numeric estimate, which is equal to the income level if the factor only lists 1, or the mean of the income levels if the factor lists two.1

How would we go about this?

First, we’d set up the general infrastructure of a map call

inc_num <- inc_levs %>% 
  map(
    \(i)
    
    i
    
  )

Now, we insert a placeholder, indicative level for i, which we can use to experiment on as an input for our function. Most importantly, we’re going to comment out this placeholder, and only run it interactively, in our global environment, so it won’t break the function when it runs on its own

inc_num <- inc_levs %>% 
  map(
    \(i)
    
    # i <- inc_levs[[3]]
    
    i
    
  )

What does the placeholder let us do? We can interactively noodle around with an indicative level, get the function to work on it, and then see how to make the function work on other levels.

Specifically, we could write

i <- inc_levs[[3]]

i %>% 
  str_extract("\\d{1,3},999|\\d{1,3},000") %>% 
  str_remove_all(",") %>% 
  as.numeric
## [1] 9999

Whoa, what a start! Let’s drop that into map, see if it works for the other levels!

inc_num <- inc_levs %>% 
  map(
    \(i)
    
    # i <- inc_levs[[3]]
    
    i %>% 
      str_extract("\\d{1,3},999|\\d{1,3},000") %>% 
      str_remove_all(",") %>% 
      as.numeric
  )

inc_num

[[1]] [1] NA

[[2]] [1] NA

[[3]] [1] 9999

[[4]] [1] 10000

[[5]] [1] 15000

[[6]] [1] 20000

[[7]] [1] 25000

[[8]] [1] 30000

[[9]] [1] 35000

[[10]] [1] 40000

[[11]] [1] 45000

[[12]] [1] 50000

[[13]] [1] 60000

[[14]] [1] 65000

[[15]] [1] 70000

[[16]] [1] 75000

[[17]] [1] 80000

[[18]] [1] 90000

[[19]] [1] 1e+05

[[20]] [1] 110000

[[21]] [1] 125000

[[22]] [1] 150000

[[23]] [1] 175000

[[24]] [1] 250000

Ahhh, dang! We need to adapt it to the levels like "7. $35,000-39,999" where we need to extract two income bounds and report a mean. Placeholders can adapt to this!

Exercise

  • set i <- inc_levs[[5]] and fix the map call above to work additionally for this level

map2

Let’s see how we could work map2 in an indicative example, using the most famous framing effect in the history of American public opinion.

t1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/gss_welfare.rds" %>% 
  url %>% 
  readRDS

t1
## # A tibble: 49,652 × 7
##    year     id polviews                     race  version          ans   ans_num
##    <fct> <dbl> <fct>                        <fct> <chr>            <fct>   <dbl>
##  1 1984      1 conservative                 white welfare          too …       3
##  2 1984      2 slightly liberal             white welfare          too …       1
##  3 1984      3 liberal                      black welfare          too …       1
##  4 1984      4 moderate, middle of the road white assistance to t… too …       1
##  5 1984      7 slightly liberal             white welfare          abou…       2
##  6 1984      8 don't know                   white assistance to t… too …       1
##  7 1984      9 don't know                   other assistance to t… too …       1
##  8 1984     11 slightly conservative        black welfare          too …       1
##  9 1984     12 moderate, middle of the road white assistance to t… too …       3
## 10 1984     14 don't know                   black assistance to t… too …       1
## # ℹ 49,642 more rows

Imagine we wanted to model the effect overall, and then interacted with ideology and race, all separately by year (so that we can figure out if these framing differences are growing or shrinking).

t2 <- t1 %>% 
  group_by(year) %>% 
  nest %>% 
  cross_join(
    tibble(
      frm = c(
        "ans_num ~ version",
        "ans_num ~ version * polviews",
        "ans_num ~ version * race"
      )
    )
  )

Can you guess which two equal-length lists over which we’re going to map2? What’s t2$frm?

t2$frm %>% length
## [1] 72
t2$data %>% length
## [1] 72

So we can just

t2$mods <- map2(
  .x = t2$frm,
  .y = t2$data, 
  \(i, j)
  lm(
    formula = i, data = j
  )
)

Now you might have heard this said–we estimate linear models, but what do we really want, to plot & table estimates or contrasts? that’s right, the emmean!

library(emmeans)
library(broom)

t2$emm_form <- t2$frm %>% 
  str_replace(
    "ans_num ~ ",
    "trt.vs.ctrl ~ "
    )

t2$emm <- map2(
  .x = t2$mods,
  .y = t2$emm_form, 
  \(i, j)
  emmeans(
    object = i,
    specs = as.formula(j), 
    options = list(
      adjust = "none"
      )
    )
  )

Now we can measure effects by year

t3 <- t2 %>% 
  filter(
    frm == "ans_num ~ version"
  ) %>% 
  select(
    year, emm
  ) %>% 
  mutate(
    cont = emm %>% 
      map(
        \(i)
        i[[2]] %>% 
          tidy
        )
    )

map2(
  t3$cont,
  t3$year,
  \(i, j)
  i %>% 
    mutate(year = j)
  ) %>% 
  list_rbind %>% 
  select(year, everything())
## # A tibble: 24 × 9
##    year  term   contrast null.value estimate std.error    df statistic   p.value
##    <fct> <chr>  <chr>         <dbl>    <dbl>     <dbl> <dbl>     <dbl>     <dbl>
##  1 1984  versi… welfare…          0    0.689    0.0484   938      14.2 8.77e- 42
##  2 1985  versi… welfare…          0    0.817    0.0373  1474      21.9 4.32e- 92
##  3 1986  versi… welfare…          0    0.724    0.0385  1413      18.8 1.85e- 70
##  4 1987  versi… welfare…          0    0.764    0.0360  1708      21.2 1.05e- 88
##  5 1988  versi… welfare…          0    0.824    0.0376  1409      21.9 8.16e- 92
##  6 1989  versi… welfare…          0    0.791    0.0382  1454      20.7 7.24e- 84
##  7 1990  versi… welfare…          0    0.771    0.0391  1295      19.7 5.20e- 76
##  8 1991  versi… welfare…          0    0.735    0.0382  1422      19.2 2.08e- 73
##  9 1993  versi… welfare…          0    0.912    0.0378  1516      24.2 2.67e-109
## 10 1994  versi… welfare…          0    0.934    0.0273  2840      34.2 5.02e-215
## # ℹ 14 more rows

Another exercise

A student asked me to help format some untidy data, and it turned out to be an elegant advertisement for the purrr approach.

Read this file here

library(openxlsx)
library(magrittr)

t4 <- "https://github.com/thomasjwood/code_lab/raw/main/data/Parliament-results%202014.xlsx" %>%
  read.xlsx(
    colNames = F,
    sheet = 1) %>% 
  tibble %>%
  set_colnames(
    str_c(
      "x", 1:6
      )
    )

For each constituency,

  • You want one table on votes received by candidate
  • You want another table reporting votes/turnout/valid votes

  1. Just to satisfy your curiosity, would it work if you wanted to simply write as.numeric(t20$V202468x) ?↩︎