Lists are important

As trained statisticians I’m sure you have a deep affinity for tabular data.

But as statistical programmers, as people who want just enough coding to be dangerous, lists are your beau ideal.

They are the most general tool you can deploy:

  • A list of models, estimated on subsets of your data
  • A list of data frames, bootstrapped from an original data frame
  • A corpus where every element is a separate text document
  • A list of ggplot objects where every figure has a different set of parameters.

Lists, we see lists everywhere!

We’re going to go through the three kinds of lists you’ll see all the time, that solve 99% of the functional programming demands in data analysis – map, map2, and pmap

1. List type 1: map

Imagine a very simple sequence of containers. Lists being so simple is precisely what makes them so general and flexible.

In this diagram

  • i indexes the separate list elements

  • [1], [2],… provides the location of each element

  • [n] details the list’s length

Map isn’t special, it’s just one highly opinionated way to address these types of challenges. These three approaches:

library(tidyverse)
map(
  .x = 1:5,
  .f = \(i)
  i^2
  )
for (i in 1:5) {
  print(i^2)
}
sapply(
  X = 1:5,
  function(i)
    i^2
  )

These are just three ways to do the same thing!

A couple of notes:

  • I’m needlessly concise/magrittr y by using a pipe to the first argument of map .

Ie this:

library(tidyverse)
library(magrittr)

letters[1:5] %>% 
  map(
    \(i)
    str_c(i, "_", i)
  )

Is the same as this

map(
  letters[1:5], 
  \(i)
  str_c(i, "_", i)
)

It would probably help embed the intuition to do it both ways to your own satisfaction.

  • You can also be a little concise in slightly unhelpful way when the function inside the map accepts the list itinerant as its first argument
list(
  seq_len(1),
  seq_len(2), 
  seq_len(3),
  seq_len(4),
  seq_len(5)
) %>% 
  map(length)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5

A quick map example

Let’s load the 2022 CES (you may need to install rio to import a compressed .rds)

t1 <- "https://github.com/thomasjwood/constraint/raw/main/ces/ces_22_c.rds" %>% 
  rio::import(
    setclass = "tbl_df"
  )

Imagine if you want to computer series of weighted vote choice tables for each US census region

state.name %>% 
  split(state.region) %>% 
  map(
    \(i)
    
    t1 %>% 
      filter(
        inputstate %>% 
          is_in(i)
      ) %>% 
      group_by(presvote20post) %>% 
      tally(commonweight) %>% 
      na.omit %>% 
      mutate(
        perc = n %>% 
          divide_by(
            n %>% sum
          ) %>% 
          multiply_by(100) %>% 
          round
      )
  )
## $Northeast
## # A tibble: 6 × 3
##   presvote20post                  n  perc
##   <fct>                       <dbl> <dbl>
## 1 Joe Biden                  4172.     41
## 2 Donald Trump               2799.     27
## 3 Jo Jorgensen                 74.8     1
## 4 Howie Hawkins                25.2     0
## 5 Other                        73.4     1
## 6 Did not vote for President 3079.     30
## 
## $South
## # A tibble: 6 × 3
##   presvote20post                  n  perc
##   <fct>                       <dbl> <dbl>
## 1 Joe Biden                  6985.     31
## 2 Donald Trump               7517.     34
## 3 Jo Jorgensen                171.      1
## 4 Howie Hawkins                47.7     0
## 5 Other                       108.      0
## 6 Did not vote for President 7373.     33
## 
## $`North Central`
## # A tibble: 6 × 3
##   presvote20post                  n  perc
##   <fct>                       <dbl> <dbl>
## 1 Joe Biden                  4355.     35
## 2 Donald Trump               4419.     35
## 3 Jo Jorgensen                113.      1
## 4 Howie Hawkins                29.8     0
## 5 Other                        58.2     0
## 6 Did not vote for President 3575.     28
## 
## $West
## # A tibble: 6 × 3
##   presvote20post                  n  perc
##   <fct>                       <dbl> <dbl>
## 1 Joe Biden                  5303.     40
## 2 Donald Trump               3814.     29
## 3 Jo Jorgensen                150.      1
## 4 Howie Hawkins                51.5     0
## 5 Other                        92.9     1
## 6 Did not vote for President 3903.     29

Or imagine if you want to do separate analysis on survey items dealing with immigration, gun control, or abortion.

First build a codebook

cb1 <- tibble(
  nms = t1 %>% 
    map(
      \(i)
      i %>% 
        attr(., "label")
    ) %>% 
    unlist %>% 
    names,
  labs = t1 %>% 
    map(
      \(i)
      i %>% 
        attr(., "label")
    ) %>% 
    unlist
)

then we look at the specific items we’re interested in

c("Immigration -- ",
  "Abortion --",
  "Gun control --") %>% 
  map(
    \(i)
    
    cb1 %>% 
      filter(
        labs %>% 
          str_detect(i)
      )
  )
## [[1]]
## # A tibble: 4 × 2
##   nms       labs                                                                
##   <chr>     <chr>                                                               
## 1 CC22_331a Immigration -- Grant legal status to all illegal immigrants who hav…
## 2 CC22_331b Immigration -- Increase the number of border patrols on the US-Mexi…
## 3 CC22_331c Immigration -- Reduce legal immigration by 50 percent over the next…
## 4 CC22_331d Immigration -- Increase spending on border security by $25 billion,…
## 
## [[2]]
## # A tibble: 6 × 2
##   nms       labs                                                                
##   <chr>     <chr>                                                               
## 1 CC22_332a Abortion -- Always allow a woman to obtain an abortion as a matter …
## 2 CC22_332b Abortion -- Permit abortion only in case of rape, incest or when th…
## 3 CC22_332c Abortion -- Prohibit all abortions after the 20th week of pregnancy 
## 4 CC22_332d Abortion -- Allow employers to decline coverage of abortions in ins…
## 5 CC22_332e Abortion -- Prohibit the expenditure of funds authorized or appropr…
## 6 CC22_332f Abortion -- Make abortions illegal in all circumstances             
## 
## [[3]]
## # A tibble: 6 × 2
##   nms       labs                                                                
##   <chr>     <chr>                                                               
## 1 CC22_330a Gun control -- Prohibit state and local governments from publishing…
## 2 CC22_330b Gun control -- Ban assault rifles                                   
## 3 CC22_330c Gun control -- Make it easier for people to obtain concealed-carry …
## 4 CC22_330d Gun control -- Provide federal funding to encourage states to take …
## 5 CC22_330e Gun control -- Improve background checks to give authorities time t…
## 6 CC22_330f Gun control -- Allow teachers and school officials to carry guns in…

We’ll pipe this map of variable names into a map using select

c("Immigration -- ",
  "Abortion --",
  "Gun control --") %>% 
  map(
    \(i)
    
    cb1 %>% 
      filter(
        labs %>% 
          str_detect(i)
      ) %>% 
      use_series(nms)
  ) %>% 
  map(
    \(i)
    
    t1 %>% 
      select(
        any_of(i)
      )
  )
## [[1]]
## # A tibble: 60,000 × 4
##    CC22_331a CC22_331b CC22_331c CC22_331d
##    <fct>     <fct>     <fct>     <fct>    
##  1 Support   Oppose    Oppose    Oppose   
##  2 Support   Support   Oppose    Support  
##  3 Support   Support   Oppose    Oppose   
##  4 Support   Oppose    Oppose    Oppose   
##  5 Support   Oppose    Support   Oppose   
##  6 Oppose    Support   Support   Support  
##  7 Support   Support   Oppose    Oppose   
##  8 Support   Support   Oppose    Oppose   
##  9 Support   Oppose    Support   Oppose   
## 10 Support   Oppose    Oppose    Oppose   
## # ℹ 59,990 more rows
## 
## [[2]]
## # A tibble: 60,000 × 6
##    CC22_332a CC22_332b CC22_332c CC22_332d CC22_332e CC22_332f
##    <fct>     <fct>     <fct>     <fct>     <fct>     <fct>    
##  1 Support   Oppose    Oppose    Oppose    Oppose    Oppose   
##  2 Support   Support   Oppose    Oppose    Support   Oppose   
##  3 Support   Oppose    Oppose    Oppose    Oppose    Oppose   
##  4 Support   Oppose    Oppose    Oppose    Oppose    Oppose   
##  5 Support   Oppose    Support   Oppose    Oppose    Oppose   
##  6 Oppose    Support   Support   Support   Support   Oppose   
##  7 Support   Support   Oppose    Oppose    Oppose    Oppose   
##  8 Support   Oppose    Oppose    Oppose    Oppose    Oppose   
##  9 Oppose    Support   Oppose    Oppose    Support   Oppose   
## 10 Support   Support   Oppose    Oppose    Support   Oppose   
## # ℹ 59,990 more rows
## 
## [[3]]
## # A tibble: 60,000 × 6
##    CC22_330a CC22_330b CC22_330c CC22_330d CC22_330e CC22_330f
##    <fct>     <fct>     <fct>     <fct>     <fct>     <fct>    
##  1 Oppose    Support   Oppose    Support   Support   Oppose   
##  2 Support   Oppose    Support   Oppose    Support   Oppose   
##  3 Oppose    Support   Oppose    Support   Support   Oppose   
##  4 Oppose    Support   Oppose    Support   Support   Oppose   
##  5 Oppose    Oppose    Support   Oppose    Support   Support  
##  6 Support   Oppose    Support   Oppose    Oppose    Support  
##  7 Oppose    Oppose    Oppose    Oppose    Support   Support  
##  8 Oppose    Support   Oppose    Support   Support   Oppose   
##  9 Oppose    Support   Oppose    Support   Support   Support  
## 10 Oppose    Support   Oppose    Support   Support   Oppose   
## # ℹ 59,990 more rows

A quick map task

Regress the 100pt tax cut spending tradeoff (CC22_421r) indicator on partisanship, ideology, and religious importance. For each of the three linear models, separately control for race and education.

2. List type 2: map2

This function appears to be a special case, but it’s super useful – when you have

  1. two lists of equal length.
  2. You want to operate on each pair of values as a single element.

(there’s a special case where one of the lists can be of length 1, and it will be recycled to the length of the paired list, but that’s actually the same as a map)

Take a moment to think about this.

In these two lists,

In this diagram

  • i indexes the separate list elements of the first list

  • j indexes the separate list elements of the second list

  • [1], [2],… provides the location of each element

  • [n] details both lists’ common length

Some toy examples

map2(
  .x = state.name,
  .y = state.abb,
  \(i, j)
  str_c(i, "_", j)
)
## [[1]]
## [1] "Alabama_AL"
## 
## [[2]]
## [1] "Alaska_AK"
## 
## [[3]]
## [1] "Arizona_AZ"
## 
## [[4]]
## [1] "Arkansas_AR"
## 
## [[5]]
## [1] "California_CA"
## 
## [[6]]
## [1] "Colorado_CO"
## 
## [[7]]
## [1] "Connecticut_CT"
## 
## [[8]]
## [1] "Delaware_DE"
## 
## [[9]]
## [1] "Florida_FL"
## 
## [[10]]
## [1] "Georgia_GA"
## 
## [[11]]
## [1] "Hawaii_HI"
## 
## [[12]]
## [1] "Idaho_ID"
## 
## [[13]]
## [1] "Illinois_IL"
## 
## [[14]]
## [1] "Indiana_IN"
## 
## [[15]]
## [1] "Iowa_IA"
## 
## [[16]]
## [1] "Kansas_KS"
## 
## [[17]]
## [1] "Kentucky_KY"
## 
## [[18]]
## [1] "Louisiana_LA"
## 
## [[19]]
## [1] "Maine_ME"
## 
## [[20]]
## [1] "Maryland_MD"
## 
## [[21]]
## [1] "Massachusetts_MA"
## 
## [[22]]
## [1] "Michigan_MI"
## 
## [[23]]
## [1] "Minnesota_MN"
## 
## [[24]]
## [1] "Mississippi_MS"
## 
## [[25]]
## [1] "Missouri_MO"
## 
## [[26]]
## [1] "Montana_MT"
## 
## [[27]]
## [1] "Nebraska_NE"
## 
## [[28]]
## [1] "Nevada_NV"
## 
## [[29]]
## [1] "New Hampshire_NH"
## 
## [[30]]
## [1] "New Jersey_NJ"
## 
## [[31]]
## [1] "New Mexico_NM"
## 
## [[32]]
## [1] "New York_NY"
## 
## [[33]]
## [1] "North Carolina_NC"
## 
## [[34]]
## [1] "North Dakota_ND"
## 
## [[35]]
## [1] "Ohio_OH"
## 
## [[36]]
## [1] "Oklahoma_OK"
## 
## [[37]]
## [1] "Oregon_OR"
## 
## [[38]]
## [1] "Pennsylvania_PA"
## 
## [[39]]
## [1] "Rhode Island_RI"
## 
## [[40]]
## [1] "South Carolina_SC"
## 
## [[41]]
## [1] "South Dakota_SD"
## 
## [[42]]
## [1] "Tennessee_TN"
## 
## [[43]]
## [1] "Texas_TX"
## 
## [[44]]
## [1] "Utah_UT"
## 
## [[45]]
## [1] "Vermont_VT"
## 
## [[46]]
## [1] "Virginia_VA"
## 
## [[47]]
## [1] "Washington_WA"
## 
## [[48]]
## [1] "West Virginia_WV"
## 
## [[49]]
## [1] "Wisconsin_WI"
## 
## [[50]]
## [1] "Wyoming_WY"
map2(
  .x = seq(2, 10, by = 2),
  .y = seq(3, 15, by = 3),
  \(a, b)
  a * b
)
## [[1]]
## [1] 6
## 
## [[2]]
## [1] 24
## 
## [[3]]
## [1] 54
## 
## [[4]]
## [1] 96
## 
## [[5]]
## [1] 150

What’s a more typical survey approach? When you’re doing exploratory work, and you want to test relationships between pairs of variables

c("CC22_420_1",
  "CC22_420_2",
  "CC22_420_3",
  "CC22_420_4") %>% 
  map2(
    c("CC22_433a",
      "newsint",
      "ideo5",
      "urbancity"),
    \(i, j)
    
    str_c(
      "commonweight ~ ", 
      i, 
      " + ", 
      j
    ) %>% 
      xtabs(
        data = t1
      ) %>% 
      prop.table(2) %>% 
      round(2)
  )
## [[1]]
##               CC22_433a
## CC22_420_1     Democrat Republican Independent Other skipped not asked
##   selected         0.17       0.34        0.21  0.15                  
##   not selected     0.83       0.66        0.79  0.85                  
##   skipped          0.00       0.00        0.00  0.00                  
##   not asked        0.00       0.00        0.00  0.00                  
## 
## [[2]]
##               newsint
## CC22_420_2     Most of the time Some of the time Only now and then
##   selected                 0.61             0.51              0.42
##   not selected             0.39             0.49              0.58
##   skipped                  0.00             0.00              0.00
##   not asked                0.00             0.00              0.00
##               newsint
## CC22_420_2     Hardly at all Don't know skipped not asked
##   selected              0.38       0.18                  
##   not selected          0.62       0.82                  
##   skipped               0.00       0.00                  
##   not asked             0.00       0.00                  
## 
## [[3]]
##               ideo5
## CC22_420_3     Very liberal Liberal Moderate Conservative Very conservative
##   selected             0.52    0.49     0.36         0.28              0.27
##   not selected         0.48    0.51     0.64         0.72              0.73
##   skipped              0.00    0.00     0.00         0.00              0.00
##   not asked            0.00    0.00     0.00         0.00              0.00
##               ideo5
## CC22_420_3     Not sure skipped not asked
##   selected         0.20                  
##   not selected     0.80                  
##   skipped          0.00                  
##   not asked        0.00                  
## 
## [[4]]
##               urbancity
## CC22_420_4     City Suburb Town Rural area Other skipped not asked
##   selected     0.18   0.16 0.14       0.15  0.21                  
##   not selected 0.82   0.84 0.86       0.85  0.79                  
##   skipped      0.00   0.00 0.00       0.00  0.00                  
##   not asked    0.00   0.00 0.00       0.00  0.00

A quick map2 task

Take the ‘use US troops survey items’ CC22_420_1:CC22_420_7, and ideology, partisanship, and income, and for every combination of item and predictor, report the categorical correspondence.

3. List type 3: pmap

This function also appears to be a special case, but is just exquisitely useful for those of use who do modelling in R and like to use list tibbles.

pmap stands for parrallel map. It can be conceived on it most fundamental level as an abstraction of map2 to map over any number of lists, which themselves are nested in a list

list(
  list("a", "pets", "sunny"),
  list("quick", "left", "days"),
  list("story", "home", "melt"),
  list("time", "alone", "snow")
) %>% 
  pmap(
    \(i, j, k, m)
    
    str_c(
      i, j, k, m, 
      sep = " "
    )
  )
## [[1]]
## [1] "a quick story time"
## 
## [[2]]
## [1] "pets left home alone"
## 
## [[3]]
## [1] "sunny days melt snow"

see what’s going on? Maybe do a quick couple of your own so you completely grok the process.

What’s the typical statistical application of a parallel map? We exploit the fact that our data frames are themselves parallel lists!

  • v(1) though v(k) are variable names

  • 1 thought n indexes each variable’s length

Here’s an example using the EPA’s expanded mileage data

bmc <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-15/big_epa_cars.csv" %>% 
  read_csv %>% 
  group_by(
    make, year
  ) %>% 
  nest

bmc
## # A tibble: 1,770 × 3
## # Groups:   make, year [1,770]
##    make        year data               
##    <chr>      <dbl> <list>             
##  1 Alfa Romeo  1985 <tibble [2 × 81]>  
##  2 Ferrari     1985 <tibble [3 × 81]>  
##  3 Dodge       1985 <tibble [190 × 81]>
##  4 Subaru      1993 <tibble [33 × 81]> 
##  5 Toyota      1993 <tibble [59 × 81]> 
##  6 Volkswagen  1993 <tibble [26 × 81]> 
##  7 Volvo       1993 <tibble [12 × 81]> 
##  8 Audi        1993 <tibble [10 × 81]> 
##  9 BMW         1993 <tibble [16 × 81]> 
## 10 Buick       1993 <tibble [18 × 81]> 
## # ℹ 1,760 more rows

we’re going to pmap over each element

library(broom)

t2 <- bmc %>% 
  pmap(
    possibly(
      \(make, year, data)
      
      lm(
        highway08 ~ displ, 
        data = data
      ) %>% 
        tidy %>% 
        mutate(
          make, year
        ),
      NULL
    )
  ) %>% 
  list_rbind %>% 
  na.omit %>% 
  filter(
    term == "displ"
  )

Oh look, with a nice long tibble, we can make a nice figure easily.

t2 %>% 
  mutate(
    make = make %>% 
      fct_reorder(estimate)
  ) %>% 
  ggplot(
    aes(
      estimate, make
    )
  ) +
  geom_point() +
  geom_vline(
    xintercept = 0,
    linetype = "dashed"
  )