As trained statisticians I’m sure you have a deep affinity for tabular data.
But as statistical programmers, as people who want just enough coding to be dangerous, lists are your beau ideal.
They are the most general tool you can deploy: think, for example, of a collection of ggplot objects where every figure has a different set of parameters. Lists, we see lists everywhere!
We're going to go through the three mapping functions you'll see all the time, which solve 99% of the functional programming demands in data analysis – map(), map2(), and pmap().
map
Imagine a very simple sequence of containers. Lists being so simple is precisely what makes them so general and flexible.
In this diagram:
i indexes the separate list elements
[1], [2], … give the location of each element
[n] is the list's length
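To make the diagram concrete, here's a minimal sketch using an arbitrary three element list:

x <- list("a", 1:3, TRUE)

x[[2]]     # the element at location [2]
## [1] 1 2 3

length(x)  # the list's length, [n]
## [1] 3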
map() isn't special, it's just one highly opinionated way to address these types of challenges. Consider three approaches to the same task (sketched below):
These are just three ways to do the same thing!
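As a sketch (assuming the three approaches are a for loop, base R's lapply(), and purrr's map()), all of these build the same list:

library(purrr)  # for map()

# a preallocated for loop
out <- vector("list", 5)
for (i in 1:5) {
  out[[i]] <- i ^ 2
}

# base R's lapply()
out <- lapply(1:5, \(i) i ^ 2)

# purrr's map()
out <- map(1:5, \(i) i ^ 2)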
A couple of notes: with magrittr, you can supply the first argument of map() by using a pipe. I.e., this:
library(tidyverse)
library(magrittr)

letters[1:5] %>%
  map(
    \(i)
    str_c(i, "_", i)
  )
is the same as this:
map(
  letters[1:5],
  \(i)
  str_c(i, "_", i)
)
To embed the intuition, it would probably help to try it both ways to your own satisfaction.
map() accepts the list it iterates over as its first argument:

list(
  seq_len(1),
  seq_len(2),
  seq_len(3),
  seq_len(4),
  seq_len(5)
) %>%
  map(length)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
Let's load the 2022 CES (you may need to install rio to import a compressed .rds file):
t1 <- "https://github.com/thomasjwood/constraint/raw/main/ces/ces_22_c.rds" %>%
  rio::import(
    setclass = "tbl_df"
  )
Imagine if you want to compute a series of weighted vote choice tables for each US census region:
state.name %>%
  split(state.region) %>%
  map(
    \(i)
    t1 %>%
      filter(
        inputstate %>%
          is_in(i)
      ) %>%
      group_by(presvote20post) %>%
      tally(commonweight) %>%
      na.omit %>%
      mutate(
        perc = n %>%
          divide_by(
            n %>% sum
          ) %>%
          multiply_by(100) %>%
          round
      )
  )
## $Northeast
## # A tibble: 6 × 3
## presvote20post n perc
## <fct> <dbl> <dbl>
## 1 Joe Biden 4172. 41
## 2 Donald Trump 2799. 27
## 3 Jo Jorgensen 74.8 1
## 4 Howie Hawkins 25.2 0
## 5 Other 73.4 1
## 6 Did not vote for President 3079. 30
##
## $South
## # A tibble: 6 × 3
## presvote20post n perc
## <fct> <dbl> <dbl>
## 1 Joe Biden 6985. 31
## 2 Donald Trump 7517. 34
## 3 Jo Jorgensen 171. 1
## 4 Howie Hawkins 47.7 0
## 5 Other 108. 0
## 6 Did not vote for President 7373. 33
##
## $`North Central`
## # A tibble: 6 × 3
## presvote20post n perc
## <fct> <dbl> <dbl>
## 1 Joe Biden 4355. 35
## 2 Donald Trump 4419. 35
## 3 Jo Jorgensen 113. 1
## 4 Howie Hawkins 29.8 0
## 5 Other 58.2 0
## 6 Did not vote for President 3575. 28
##
## $West
## # A tibble: 6 × 3
## presvote20post n perc
## <fct> <dbl> <dbl>
## 1 Joe Biden 5303. 40
## 2 Donald Trump 3814. 29
## 3 Jo Jorgensen 150. 1
## 4 Howie Hawkins 51.5 0
## 5 Other 92.9 1
## 6 Did not vote for President 3903. 29
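As an aside, if you'd rather have one long table than a named list, purrr's list_rbind() can stack the regional tables and carry the region names along. Here, region_tabs is a hypothetical name standing in for the list computed above, had it been assigned:

# region_tabs is assumed to hold the named list of regional tables from above
region_tabs %>%
  list_rbind(names_to = "region")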
Or imagine if you want to do separate analyses of survey items dealing with immigration, gun control, or abortion.
First, build a codebook:
cb1 <- tibble(
  nms = t1 %>%
    map(
      \(i)
      attr(i, "label")
    ) %>%
    unlist %>%
    names,
  labs = t1 %>%
    map(
      \(i)
      attr(i, "label")
    ) %>%
    unlist
)
Then we look at the specific items we're interested in:
c("Immigration -- ",
  "Abortion --",
  "Gun control --") %>%
  map(
    \(i)
    cb1 %>%
      filter(
        labs %>%
          str_detect(i)
      )
  )
## [[1]]
## # A tibble: 4 × 2
## nms labs
## <chr> <chr>
## 1 CC22_331a Immigration -- Grant legal status to all illegal immigrants who hav…
## 2 CC22_331b Immigration -- Increase the number of border patrols on the US-Mexi…
## 3 CC22_331c Immigration -- Reduce legal immigration by 50 percent over the next…
## 4 CC22_331d Immigration -- Increase spending on border security by $25 billion,…
##
## [[2]]
## # A tibble: 6 × 2
## nms labs
## <chr> <chr>
## 1 CC22_332a Abortion -- Always allow a woman to obtain an abortion as a matter …
## 2 CC22_332b Abortion -- Permit abortion only in case of rape, incest or when th…
## 3 CC22_332c Abortion -- Prohibit all abortions after the 20th week of pregnancy
## 4 CC22_332d Abortion -- Allow employers to decline coverage of abortions in ins…
## 5 CC22_332e Abortion -- Prohibit the expenditure of funds authorized or appropr…
## 6 CC22_332f Abortion -- Make abortions illegal in all circumstances
##
## [[3]]
## # A tibble: 6 × 2
## nms labs
## <chr> <chr>
## 1 CC22_330a Gun control -- Prohibit state and local governments from publishing…
## 2 CC22_330b Gun control -- Ban assault rifles
## 3 CC22_330c Gun control -- Make it easier for people to obtain concealed-carry …
## 4 CC22_330d Gun control -- Provide federal funding to encourage states to take …
## 5 CC22_330e Gun control -- Improve background checks to give authorities time t…
## 6 CC22_330f Gun control -- Allow teachers and school officials to carry guns in…
We'll pipe this map of variable names into a second map(), using select():
c("Immigration -- ",
  "Abortion --",
  "Gun control --") %>%
  map(
    \(i)
    cb1 %>%
      filter(
        labs %>%
          str_detect(i)
      ) %>%
      use_series(nms)
  ) %>%
  map(
    \(i)
    t1 %>%
      select(
        any_of(i)
      )
  )
## [[1]]
## # A tibble: 60,000 × 4
## CC22_331a CC22_331b CC22_331c CC22_331d
## <fct> <fct> <fct> <fct>
## 1 Support Oppose Oppose Oppose
## 2 Support Support Oppose Support
## 3 Support Support Oppose Oppose
## 4 Support Oppose Oppose Oppose
## 5 Support Oppose Support Oppose
## 6 Oppose Support Support Support
## 7 Support Support Oppose Oppose
## 8 Support Support Oppose Oppose
## 9 Support Oppose Support Oppose
## 10 Support Oppose Oppose Oppose
## # ℹ 59,990 more rows
##
## [[2]]
## # A tibble: 60,000 × 6
## CC22_332a CC22_332b CC22_332c CC22_332d CC22_332e CC22_332f
## <fct> <fct> <fct> <fct> <fct> <fct>
## 1 Support Oppose Oppose Oppose Oppose Oppose
## 2 Support Support Oppose Oppose Support Oppose
## 3 Support Oppose Oppose Oppose Oppose Oppose
## 4 Support Oppose Oppose Oppose Oppose Oppose
## 5 Support Oppose Support Oppose Oppose Oppose
## 6 Oppose Support Support Support Support Oppose
## 7 Support Support Oppose Oppose Oppose Oppose
## 8 Support Oppose Oppose Oppose Oppose Oppose
## 9 Oppose Support Oppose Oppose Support Oppose
## 10 Support Support Oppose Oppose Support Oppose
## # ℹ 59,990 more rows
##
## [[3]]
## # A tibble: 60,000 × 6
## CC22_330a CC22_330b CC22_330c CC22_330d CC22_330e CC22_330f
## <fct> <fct> <fct> <fct> <fct> <fct>
## 1 Oppose Support Oppose Support Support Oppose
## 2 Support Oppose Support Oppose Support Oppose
## 3 Oppose Support Oppose Support Support Oppose
## 4 Oppose Support Oppose Support Support Oppose
## 5 Oppose Oppose Support Oppose Support Support
## 6 Support Oppose Support Oppose Oppose Support
## 7 Oppose Oppose Oppose Oppose Support Support
## 8 Oppose Support Oppose Support Support Oppose
## 9 Oppose Support Oppose Support Support Support
## 10 Oppose Support Oppose Support Support Oppose
## # ℹ 59,990 more rows
map task
Regress the 100pt tax cut/spending tradeoff indicator (CC22_421r) on partisanship, ideology, and religious importance. For each of the three linear models, separately control for race and education.
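One possible shape of a solution, sketched under the assumption that pid7, pew_religimp, race, and educ are the relevant CES column names (verify them against the codebook before using):

c("pid7", "ideo5", "pew_religimp") %>%   # predictor columns; pid7 and pew_religimp are assumed names
  map(
    \(i)
    str_c("CC22_421r ~ ", i, " + race + educ") %>%   # race/educ control names are assumed
      as.formula() %>%
      lm(data = t1) %>%
      broom::tidy()
  )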
map2
This function appears to be a special case, but it's super useful – use it when you have two parallel lists of the same length (there's a special case where one of the lists can be of length 1, and it will be recycled to the length of the paired list, but that's actually the same as a map).
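For example, here is a minimal sketch of the recycling case:

# .y has length 1, so it is recycled across .x
map2(
  c("a", "b", "c"),
  "!",
  \(i, j)
  str_c(i, j)
)

# which gives the same result as a plain map
map(
  c("a", "b", "c"),
  \(i)
  str_c(i, "!")
)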
Take a moment to think about this.
In this diagram of the two lists:
i indexes the separate elements of the first list
j indexes the separate elements of the second list
[1], [2], … give the location of each element
[n] is both lists' common length
Some toy examples
map2(
  .x = state.name,
  .y = state.abb,
  \(i, j)
  str_c(i, "_", j)
)
## [[1]]
## [1] "Alabama_AL"
##
## [[2]]
## [1] "Alaska_AK"
##
## [[3]]
## [1] "Arizona_AZ"
##
## [[4]]
## [1] "Arkansas_AR"
##
## [[5]]
## [1] "California_CA"
##
## [[6]]
## [1] "Colorado_CO"
##
## [[7]]
## [1] "Connecticut_CT"
##
## [[8]]
## [1] "Delaware_DE"
##
## [[9]]
## [1] "Florida_FL"
##
## [[10]]
## [1] "Georgia_GA"
##
## [[11]]
## [1] "Hawaii_HI"
##
## [[12]]
## [1] "Idaho_ID"
##
## [[13]]
## [1] "Illinois_IL"
##
## [[14]]
## [1] "Indiana_IN"
##
## [[15]]
## [1] "Iowa_IA"
##
## [[16]]
## [1] "Kansas_KS"
##
## [[17]]
## [1] "Kentucky_KY"
##
## [[18]]
## [1] "Louisiana_LA"
##
## [[19]]
## [1] "Maine_ME"
##
## [[20]]
## [1] "Maryland_MD"
##
## [[21]]
## [1] "Massachusetts_MA"
##
## [[22]]
## [1] "Michigan_MI"
##
## [[23]]
## [1] "Minnesota_MN"
##
## [[24]]
## [1] "Mississippi_MS"
##
## [[25]]
## [1] "Missouri_MO"
##
## [[26]]
## [1] "Montana_MT"
##
## [[27]]
## [1] "Nebraska_NE"
##
## [[28]]
## [1] "Nevada_NV"
##
## [[29]]
## [1] "New Hampshire_NH"
##
## [[30]]
## [1] "New Jersey_NJ"
##
## [[31]]
## [1] "New Mexico_NM"
##
## [[32]]
## [1] "New York_NY"
##
## [[33]]
## [1] "North Carolina_NC"
##
## [[34]]
## [1] "North Dakota_ND"
##
## [[35]]
## [1] "Ohio_OH"
##
## [[36]]
## [1] "Oklahoma_OK"
##
## [[37]]
## [1] "Oregon_OR"
##
## [[38]]
## [1] "Pennsylvania_PA"
##
## [[39]]
## [1] "Rhode Island_RI"
##
## [[40]]
## [1] "South Carolina_SC"
##
## [[41]]
## [1] "South Dakota_SD"
##
## [[42]]
## [1] "Tennessee_TN"
##
## [[43]]
## [1] "Texas_TX"
##
## [[44]]
## [1] "Utah_UT"
##
## [[45]]
## [1] "Vermont_VT"
##
## [[46]]
## [1] "Virginia_VA"
##
## [[47]]
## [1] "Washington_WA"
##
## [[48]]
## [1] "West Virginia_WV"
##
## [[49]]
## [1] "Wisconsin_WI"
##
## [[50]]
## [1] "Wyoming_WY"
map2(
  .x = seq(2, 10, by = 2),
  .y = seq(3, 15, by = 3),
  \(a, b)
  a * b
)
## [[1]]
## [1] 6
##
## [[2]]
## [1] 24
##
## [[3]]
## [1] 54
##
## [[4]]
## [1] 96
##
## [[5]]
## [1] 150
What's a more typical survey approach? Say you're doing exploratory work and you want to test relationships between pairs of variables:
c("CC22_420_1",
  "CC22_420_2",
  "CC22_420_3",
  "CC22_420_4") %>%
  map2(
    c("CC22_433a",
      "newsint",
      "ideo5",
      "urbancity"),
    \(i, j)
    str_c(
      "commonweight ~ ",
      i,
      " + ",
      j
    ) %>%
      xtabs(
        data = t1
      ) %>%
      prop.table(2) %>%
      round(2)
  )
## [[1]]
## CC22_433a
## CC22_420_1 Democrat Republican Independent Other skipped not asked
## selected 0.17 0.34 0.21 0.15
## not selected 0.83 0.66 0.79 0.85
## skipped 0.00 0.00 0.00 0.00
## not asked 0.00 0.00 0.00 0.00
##
## [[2]]
## newsint
## CC22_420_2 Most of the time Some of the time Only now and then
## selected 0.61 0.51 0.42
## not selected 0.39 0.49 0.58
## skipped 0.00 0.00 0.00
## not asked 0.00 0.00 0.00
## newsint
## CC22_420_2 Hardly at all Don't know skipped not asked
## selected 0.38 0.18
## not selected 0.62 0.82
## skipped 0.00 0.00
## not asked 0.00 0.00
##
## [[3]]
## ideo5
## CC22_420_3 Very liberal Liberal Moderate Conservative Very conservative
## selected 0.52 0.49 0.36 0.28 0.27
## not selected 0.48 0.51 0.64 0.72 0.73
## skipped 0.00 0.00 0.00 0.00 0.00
## not asked 0.00 0.00 0.00 0.00 0.00
## ideo5
## CC22_420_3 Not sure skipped not asked
## selected 0.20
## not selected 0.80
## skipped 0.00
## not asked 0.00
##
## [[4]]
## urbancity
## CC22_420_4 City Suburb Town Rural area Other skipped not asked
## selected 0.18 0.16 0.14 0.15 0.21
## not selected 0.82 0.84 0.86 0.85 0.79
## skipped 0.00 0.00 0.00 0.00 0.00
## not asked 0.00 0.00 0.00 0.00 0.00
map2 task
Take the 'use US troops' survey items (CC22_420_1:CC22_420_7), and ideology, partisanship, and income, and for every combination of item and predictor, report the categorical correspondence.
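One way to scaffold this, assuming pid7 and faminc_new are the partisanship and income columns (check the codebook), is to build the 21 item/predictor pairs with expand_grid() and then feed the two columns to map2() in parallel:

prs <- expand_grid(
  item = str_c("CC22_420_", 1:7),
  pred = c("ideo5", "pid7", "faminc_new")  # pid7 and faminc_new are assumed names
)

map2(
  prs$item,
  prs$pred,
  \(i, j)
  str_c("commonweight ~ ", i, " + ", j) %>%
    xtabs(data = t1) %>%
    prop.table(2) %>%
    round(2)
)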
pmap
This function also appears to be a special case, but it is exquisitely useful for those of us who do modelling in R and like to use list-column tibbles.
pmap stands for parallel map. At its most fundamental level, it can be conceived of as an abstraction of map2 that maps over any number of lists, which are themselves nested in a list:
list(
  list("a", "pets", "sunny"),
  list("quick", "left", "days"),
  list("story", "home", "melt"),
  list("time", "alone", "snow")
) %>%
  pmap(
    \(i, j, k, m)
    str_c(
      i, j, k, m,
      sep = " "
    )
  )
## [[1]]
## [1] "a quick story time"
##
## [[2]]
## [1] "pets left home alone"
##
## [[3]]
## [1] "sunny days melt snow"
See what's going on? Maybe do a quick couple of your own so you completely grok the process.
What's the typical statistical application of a parallel map? We exploit the fact that our data frames are themselves lists of parallel columns!
v(1) through v(k) are variable names
1 through n indexes each variable's length
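A tiny sketch of that idea: pmap() treats a data frame as a list of parallel columns and passes each row's values as named arguments.

tibble(
  a = 1:3,
  b = c(10, 20, 30)
) %>%
  pmap(
    \(a, b)
    a + b
  )
## [[1]]
## [1] 11
##
## [[2]]
## [1] 22
##
## [[3]]
## [1] 33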
Here’s an example using the EPA’s expanded mileage data
bmc <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-15/big_epa_cars.csv" %>%
  read_csv %>%
  group_by(
    make, year
  ) %>%
  nest

bmc
## # A tibble: 1,770 × 3
## # Groups: make, year [1,770]
## make year data
## <chr> <dbl> <list>
## 1 Alfa Romeo 1985 <tibble [2 × 81]>
## 2 Ferrari 1985 <tibble [3 × 81]>
## 3 Dodge 1985 <tibble [190 × 81]>
## 4 Subaru 1993 <tibble [33 × 81]>
## 5 Toyota 1993 <tibble [59 × 81]>
## 6 Volkswagen 1993 <tibble [26 × 81]>
## 7 Volvo 1993 <tibble [12 × 81]>
## 8 Audi 1993 <tibble [10 × 81]>
## 9 BMW 1993 <tibble [16 × 81]>
## 10 Buick 1993 <tibble [18 × 81]>
## # ℹ 1,760 more rows
We're going to pmap() over each element:
library(broom)

t2 <- bmc %>%
  pmap(
    possibly(
      \(make, year, data)
      lm(
        highway08 ~ displ,
        data = data
      ) %>%
        tidy %>%
        mutate(
          make, year
        ),
      NULL
    )
  ) %>%
  list_rbind %>%
  na.omit %>%
  filter(
    term == "displ"
  )
Oh look, with a nice long tibble, we can make a nice figure easily.
t2 %>%
  mutate(
    make = make %>%
      fct_reorder(estimate)
  ) %>%
  ggplot(
    aes(
      estimate, make
    )
  ) +
  geom_point() +
  geom_vline(
    xintercept = 0,
    linetype = "dashed"
  )