Gosh, this is a very important topic. The ability to write functions–to concisely express some statistical calculation, and have R do all the tedious record keeping of what goes with what–is a massive multiplier for your powers as a bench scientist.
Functions let you follow maybe the most important axiom of programming–that the code you write should not be

Write
Everything
Twice

instead, it should be

Don’t
Repeat
Yourself
This quality really separates the pros from the joes. When we get started writing code, and we’re interested in estimating a model with a state-level fixed effect for counties’ presidential vote share, we might write:
library(tidyverse)
library(magrittr)

pres <- "https://github.com/thomasjwood/ps7160/raw/master/countyvote/county_pres_1868_2020.rds" %>%
  url %>%
  readRDS %>%
  filter(
    election_year >= 1948
  ) %>%
  mutate(
    gop_perc = republican_raw_votes %>%
      divide_by(
        pres_raw_county_vote_totals_two_party
      ) %>%
      multiply_by(100)
  )

lm_pres <- lm(
  gop_perc ~ state,
  data = pres
)
But then, we realize we’re missing the obvious and important year-to-year variation! We might want to estimate separate models for each year–so we’d write:
lm_48 <- lm(
  gop_perc ~ state,
  data = pres %>%
    filter(
      election_year == 1948
    )
)

lm_52 <- lm(
  gop_perc ~ state,
  data = pres %>%
    filter(
      election_year == 1952
    )
)
And on and on and on… gosh. What a hassle 🥱.
How do other languages solve this problem? With a `for` loop:
for (year in seq(1948, 2020, 4)) {
  lm(
    gop_perc ~ state,
    data = pres %>%
      filter(
        election_year == year
      )
  ) %>%
    print
}
Yeech, this is also cringe.1
In its place, we’re going to use the tidyverse’s functional toolkit, `purrr`. How lucky we are to have Hadley, who understands our desire for powerful, intuitive software, and also appreciates our generation’s interest in cloying cuteness. `purrr` simply replicates the existing functional programming tools in R, but aims to be more consistent, more performant, and to use a clearer naming convention.
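To see the correspondence concretely, here’s a quick sketch on a toy vector (mine, not the lecture’s data), comparing the base and purrr versions:

lapply(1:3, sqrt)  # base R: a list of 1, 1.414..., 1.732...
1:3 %>% map(sqrt)  # purrr: the same list, pipe-friendly

Same output; purrr’s version just slots naturally into a pipeline.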
It’s like `dplyr`, in that the performance advantages are not the point. The advantage is the legibility and clarity. If you pay the opening costs of the `purrr` approach, eventually, when you see `purrr` syntax, the record-keeping pieces of syntax will just vanish, and you’ll see the author’s intent. `purrr` features many strengths of the tidyverse’s choices–most conspicuously, a common `map_` prefix which lets you use RStudio’s autocomplete.
| purrr function | base equivalent | Purpose |
|---|---|---|
| `map()` | `lapply()` | apply a function to each element of a list or vector. |
| `map2()` | `mapply()` | apply a function to the corresponding elements of two lists. |
| `pmap()` | (none) | apply a function to the corresponding elements of n lists or vectors. |
| `map_dbl()`/`map_lgl()`/`map_int()`/`map_chr()` | `sapply()`/`vapply()` | apply a function to each element of a list or vector, and restrict the output to a double, a logical, an integer, or a character vector, respectively. |
Whoa, there’s a lot to cover. Let’s get to typing.
map()
map(
  .x = c(1, 2, 3, 4),
  .f = function(i)
    LETTERS[seq(i)]
)
Just indulge me a moment:

- `.x` is the list/vector we’re operating on
- `.f` is the function or thing we’ll do to each element of `.x`
But because of pipes and such things, we can be more concise:
1:4 %>%
  map(
    \(i)
    LETTERS[seq(i)]
  )
A new piece of tech: `\(i)` is a nice piece of shorthand–it’s just short for `function(i)`.
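A quick sanity check of that equivalence (the function names here are mine; the shorthand itself requires R ≥ 4.1.0):

f1 <- function(i) i + 1  # classic anonymous-function syntax
f2 <- \(i) i + 1         # the \() lambda shorthand; exactly equivalent
f1(41) == f2(41)         # TRUE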
How would we use `map` to replace the `for` loop we used up above? We’d do this:
seq(1948, 2020, 4) %>%
  map(
    \(i)
    lm(
      gop_perc ~ state,
      data = pres %>%
        filter(
          election_year == i
        )
    )
  )
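One advantage over the `for` loop: `map` actually returns the models, rather than just printing them. A small extension of the same call (the `set_names()` step is my addition, so we can retrieve models by year):

lms <- seq(1948, 2020, 4) %>%
  set_names() %>%
  map(
    \(i)
    lm(
      gop_perc ~ state,
      data = pres %>%
        filter(election_year == i)
    )
  )

lms[["1972"]]  # the 1972 model, by name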
What’s a common social science application? When we map over a list of model specifications!
str_c(
  "hwy ~ ",
  c(
    "displ",
    "displ + cyl",
    "displ + cyl + fl"
  )
) %>%
  map(
    \(i)
    i %>%
      lm(
        data = mpg
      ) %>%
      summary
  )
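And if all we wanted was a single fit statistic per model, we could chain a type variant (covered below) onto the same pipeline; a sketch pulling each model’s R²:

str_c(
  "hwy ~ ",
  c("displ", "displ + cyl", "displ + cyl + fl")
) %>%
  map(\(i) lm(i, data = mpg)) %>%
  map_dbl(\(m) summary(m)$r.squared)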
Or maybe you want to compare the fit of different factor analysis models, as a function of how many factors you are trying to extract?
1:4 %>%
  map(
    \(i)
    factanal(
      factors = i,
      covmat = mtcars %>%
        cor
    )
  )
When we want to match items on two lists of equal length, we use `map2()`.
letters[1:10] %>%
  map2(
    1:10,
    \(i, j)
    str_c(
      i, j,
      sep = "_"
    )
  )
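Since each call here returns a single string, the type variant `map2_chr()` would flatten the same result into a character vector; a quick sketch:

letters[1:10] %>%
  map2_chr(
    1:10,
    \(i, j)
    str_c(i, j, sep = "_")
  )  # "a_1" "b_2" ... "j_10"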
We use `pmap` when we have more than 2 lists/vectors and we want to map over them simultaneously. We pass them as a list of lists:
list(
  n = c(100, 1000, 10000),
  mu = c(-5, 0, 5),
  sd = c(10, 30, 60)
) %>%
  pmap(
    \(n, mu, sd)
    rnorm(n, mu, sd) %>%
      fivenum
  )
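Because a data frame is just a list of equal-length columns, the same call works if we keep the parameters in a tibble instead; a sketch with the same values:

tibble(
  n  = c(100, 1000, 10000),
  mu = c(-5, 0, 5),
  sd = c(10, 30, 60)
) %>%
  pmap(
    \(n, mu, sd)
    rnorm(n, mu, sd) %>%
      fivenum
  )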
We’ll come back to `pmap` later.
`map`’s type variants: `map_int`, `map_dbl`, `map_lgl`, and `map_chr`. These are useful when your function returns exactly one value per element.
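The variants behave like `map`, but insist on a flat, typed output; a few toy examples (mine, not the lecture’s) to fix ideas:

1:5 %>% map_int(\(i) i * 2L)        # integer vector: 2, 4, 6, ...
1:5 %>% map_dbl(\(i) sqrt(i))       # double vector
1:5 %>% map_lgl(\(i) i %% 2 == 0)   # logical vector
1:5 %>% map_chr(\(i) LETTERS[i])    # character vector: "A", "B", ...

Each call errors if the function returns anything other than a single value of the named type–that strictness is the point.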
The variants will also make apparent `map`’s superpower: working with nested list columns. Up til now, we’ve been operating on simple vectors of numbers or characters. But data frames themselves can be elements to be `map`ed over!
Let’s load an analysis data frame on the 2020 presidential election, with presidential votes by county:
t20 <- "https://github.com/thomasjwood/ps7160/raw/master/election_analysis/election_context_2020.csv" %>%
  read_csv %>%
  mutate(
    # gop_dif: change in Trump's share of the two-party vote, 2016 to 2020
    gop_dif = trump20 %>%
      divide_by(
        biden20 %>% add(trump20)
      ) %>%
      subtract(
        trump16 %>%
          divide_by(
            clinton16 %>% add(trump16)
          )
      )
  )
Here’s the kind of thing political scientists often do when armed with electoral returns–checking which area-level covariates are correlated with an electoral shift.
The piece of tech we need is `nest()`:
t20n <- t20 %>%
  group_by(state) %>%
  nest %>%
  filter(
    state != "District of Columbia"
  ) %>%
  mutate(
    mods = data %>%
      map(
        \(i)
        lm(
          gop_dif ~ median_hh_inc, data = i
        )
      )
  )
A fair bit happened, so let’s make sure we’re current: `t20n` has 49 rows and `t20` has 3,114. But no data have been lost (well, we threw away DC, which isn’t a state and doesn’t have counties).
Confirm this for yourself:

- print `t20n$data`
- try `t20n$data %>% map(head)`
- try `t20n$data %>% map_int(nrow)` or `t20n$data %>% map_int(ncol)`
- try `t20n$data %>% map(\(i) i$gop_dif %>% hist)`
This bears emphasis:

- `t20n$data` is every row of the original `t20`, split by state
- `t20n$mods` is a list of linear models, each regressing the change in GOP presidential vote by county on household income, split by state
Whoa, this is powerful!
Let’s see which states have the largest and smallest coefficients on income.
library(broom)

t20n$coef <- t20n$mods %>%
  map_dbl(
    \(k)
    k %>%
      tidy %>%
      pluck("estimate", 2)
  )
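The same numbers are available without `broom`, since `coef()` returns a named vector; an equivalent sketch:

t20n$mods %>%
  map_dbl(
    \(k) coef(k)[["median_hh_inc"]]
  )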
t20n %<>%
  arrange(
    abs(coef) %>%
      desc
  )
Which states have the most and least income-stratified vote differences?
t20n %>%
  ungroup %>%
  slice(
    c(1:5, 45:49)
  )
## # A tibble: 10 × 4
## state data mods coef
## <chr> <list> <list> <dbl>
## 1 Arkansas <tibble [75 × 42]> <lm> -0.00000227
## 2 Maine <tibble [16 × 42]> <lm> -0.00000207
## 3 South Carolina <tibble [46 × 42]> <lm> -0.00000201
## 4 Texas <tibble [254 × 42]> <lm> -0.00000198
## 5 Delaware <tibble [3 × 42]> <lm> 0.00000173
## 6 New York <tibble [62 × 42]> <lm> -0.000000101
## 7 Washington <tibble [39 × 42]> <lm> 0.0000000970
## 8 Montana <tibble [56 × 42]> <lm> 0.0000000685
## 9 Oregon <tibble [36 × 42]> <lm> 0.0000000626
## 10 North Dakota <tibble [53 × 42]> <lm> 0.0000000422
R’s taking care of which model corresponds with which state. This becomes even more apparent when we’re nesting multiple models per state (just for illustrative purposes, let’s predict the vote shift by counties’ racial composition, by educational attainment, or by rural percentage):
t20n %<>%
  ungroup %>%
  select(
    -c(mods, coef)
  )

t20n[,
  str_c(
    "mod_",
    c("race",
      "educ",
      "rural")
  )
] <- str_c(
  "gop_dif ~ ",
  c("white_pct",
    "lesscollege_pct",
    "rural_pct")
) %>%
  map(
    \(k)
    t20n$data %>%
      map(
        \(l)
        lm(k, data = l)
      )
  )
t20n
## # A tibble: 49 × 5
## state data mod_race mod_educ mod_rural
## <chr> <list> <list> <list> <list>
## 1 Arkansas <tibble [75 × 42]> <lm> <lm> <lm>
## 2 Maine <tibble [16 × 42]> <lm> <lm> <lm>
## 3 South Carolina <tibble [46 × 42]> <lm> <lm> <lm>
## 4 Texas <tibble [254 × 42]> <lm> <lm> <lm>
## 5 Delaware <tibble [3 × 42]> <lm> <lm> <lm>
## 6 Oklahoma <tibble [77 × 42]> <lm> <lm> <lm>
## 7 Rhode Island <tibble [5 × 42]> <lm> <lm> <lm>
## 8 North Carolina <tibble [100 × 42]> <lm> <lm> <lm>
## 9 Indiana <tibble [92 × 42]> <lm> <lm> <lm>
## 10 Georgia <tibble [159 × 42]> <lm> <lm> <lm>
## # ℹ 39 more rows
Mmmm, I sense a `pivot_longer` in the offing! (By the way, we’re keeping track of 49 * 3 = 147 models here! 😰)
p1 <- t20n %>%
  ungroup %>%
  pivot_longer(
    names_to = "mod_type",
    values_to = "mods",
    cols = starts_with("mod_")
  ) %>%
  mutate(
    coef = mods %>%
      map_dbl(
        \(i)
        i %>%
          tidy %>%
          pluck("estimate", 2)
      ),
    state = state %>%
      fct_reorder(coef)
  ) %>%
  arrange(
    mod_type, state
  ) %>%
  ggplot(
    aes(
      coef, state, color = mod_type, group = mod_type
    )
  ) +
  geom_vline(
    xintercept = 0,
    linetype = "dashed",
    linewidth = .3
  ) +
  geom_path() +
  geom_point() +
  facet_grid(
    . ~ mod_type, space = "free_x", scales = "free_x"
  ) +
  theme(
    legend.position = "none"
  )

p1
Which should return:
Whoa, education polarization is a big thing! And just so we appreciate what we just did–we’re reporting 147 separate regression models, in ~20 lines of code! And we got to use our old friends from the tidyverse!
Now, the true R elect constantly denigrate `for` loops as slow and inefficient, but that’s actually not an informed attitude–many base R functions merely wrap different C++ loops. But loops are uncool and clunky, no question.↩︎