dplyr’s 5 verbs (and 1 adverb)

The principle advantage to the dplyr approach is cognitive–that a huge, potentially limitless set of challenges and statistical tasks…

how do I do report means and standard errors by groups?
how do I include a new variable with population mean evaluation with each separate respondent evaluation?
how do I compare respondents’ answer to this variable, but plot the level of support compared to the typical member of the respondent’s partisanship?

….are actually special cases of a more general problem.

Let me emphasize this: the reason you make an effort with dplyr is that you will eventually come to learn that each seemingly unique analytics challenge, which begins is actually, a special case of a more general statistical process.

You used to be verklempt! “gosh, how to proceed with this new challenge, given all the rich complexity inherent herein?!”

Armed with this approach, you instead see patterns and similarities. Becoming adept at this pattern recognition is the goal of the tidyverse approach.

Just to ruin the suspense, our verbs are

filter (to return a subset of rows)
select (to return a subset of columns)
arrange (to sort a table)
mutate (to add a new column)
summarize (to generate a new table, usually with statistical result)

and our one adverb

group_by (to repeat some analysis separately by each group)

Ok let’s get started. filtering to begin:

`dplyr::filter`

We filter to select cases. We have a data.frame with n rows, and we’re actually only interested in k rows which meet some other condition.

A rich and illustrative graphical example

We need some data—this is the CMAG¹ table with all the television ads run in 2020 presidential election.

library(tidyverse)
library(magrittr)

d1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/cmag_2020_codelab.rds" %>% 
  url %>% 
  gzcon %>% 
  readRDS

We can do the all the following

d1 %>% 
  filter(
    sponsor %>% 
      equals("TRUMP, DONALD")
  )

d1 %>% 
  filter(
    sponsor %>% 
      is_in(
        c("BIDEN, JOE",
          "AMERICA FIRST ACTION")
        )
    )

d1 %>% 
  filter(
    est_spending %>% 
      is_greater_than(10000) &
    air_date %>% 
      is_weakly_greater_than(
        "2020-10-01" %>% 
          as.Date
        )
    )

Stringing together logical conditions is a hugely powerful technique which you’ll use often. Imagine, for instance, you want to check the spots run by President Trump in October or November, or the ads run by then-Vice President Biden in June or July. You might use a nested sets of intersection operators &, or union operators |, and () to communicate the operators’ logical precedence, as follows:

d1 %>% 
  filter(
    (
    sponsor %>% 
      equals("TRUMP, DONALD") &
    month %>% 
      is_in(
        c("Oct", 
          "Nov")
        )
    ) | (
    sponsor %>% 
      equals("BIDEN, JOE") &
    month %>% 
      is_in(
        c("Jun",
          "Jul")
        )
    )
  )

Take a moment to work this chunk by removing the nesting or the logical operators to really understand what’s going on here.

`dplyr::select`

If filter uses logical operations on rows, select is the command to do logical operations on columns (or variables).

We get to select columns any number of ways. Really, the flexibility here is almost a pedagogical drawback.

d1 %>% 
  select(1:3)

d1 %>% 
  select(
    ad_type, market, time
  )

d1 %>% 
  select(
    air_date:market
  )

We can use a - to make an selection exclusion

d1 %>% 
  select(-c(1:3))

d1 %>% 
  select(
    -ad_type
  )

d1 %>% 
  select(
    -c(air_date:market)
  )

We get a bunch of utility helpers

d1 %>% 
  select(starts_with("a"))

d1 %>% 
  select(ends_with("e"))

d1 %>% 
  select(contains("_"))

d1 %>% 
  select(
    where(is.character), where(is.ordered)
  )

`dplyr::arrange`

arrange sorts tables by variables.

This is useful when you need to do something like

Get the row which is the highest or lowest for some other value
Building labels

…actually come to think of it, arranging tables is pretty specific and mid. But it’s a public university, and the taxpayers

Anyway, here’s how it works

d1 %>% 
  arrange(
    desc(est_spending)
    )
  
d1 %>% 
  arrange(
    air_date
  )

`dplyr::mutate`

mutate takes an existing table and adds a new column, usually from a computation. It becomes super super super useful, when used in conjunction with group_by

Let’s do an example–an indicator grouping weekday or weekend ads

d1 %>% 
  mutate(
    day_type = wday %>% 
      str_detect("Sat|Sun") %>% 
      ifelse(
        "weekend", "weekday"
      )
  )

or grouping together dayparts

d1 %>% 
  mutate(
    part_type = daypart %>% 
      case_match(
        "EARLY MORNING" ~ "morning",
        c("DAYTIME", 
          "EARLY FRINGE",
          "LATE FRINGE") ~ "daytime",
        c("EARLY NEWS",
          "PRIME ACCESS",
          "PRIME TIME",
          "LATE NEWS") ~ "evening"
      )
  )

or figuring out how many days before the election an ad was broadcast

library(lubridate)

d1 %>% 
  mutate(
    days_before = "2020-11-03" %>% 
      as.Date %>% 
      subtract(
        air_date
      ) %>% 
      as.numeric
  )

`dplyr::summarize`

summarize takes an existing table and generates a new table of results. You will normally want to assign this resuling table to a new variable

How much was spent on TV advertising in the 2020 presidential election?

d1 %>% 
  summarize(
    tote_spend = est_spending %>% sum
  )

Or how many different advertisers where there?

d1 %>% 
  summarize(
    uniq_spons = sponsor %>% unique %>% length
  )

Now, perhaps it’s already occurred to you–

“gosh, summarize is not too useful, a single stat returned from the table?

and I would agree! We’re statisticians, we like counting things, and most of all, we like counting things by group. Which brings us to…

`dplyr::group_by`

This adverb really makes the dplyr approach powerful and compact. group_by repeats a statistical task for some arbitrary number of groups (defined by a variable) and then either returns a single value for each group (when used with summarize) or the underlying table (when used with mutate.)

Let’s do some examples. How many ads were run by each political team, and how much was spent?

d1 %>% 
  group_by(affiliation) %>% 
  summarize(
    tote_spend = est_spending %>% 
      sum, 
    ads = n()
    )

## # A tibble: 3 × 3
##   affiliation tote_spend    ads
##   <chr>            <dbl>  <int>
## 1 DEMOCRAT     558857940 751596
## 2 OTHER            87570    243
## 3 REPUBLICAN   409299330 478918

Did the political parties run ads during the same dayparts?

d1 %>% 
  group_by(affiliation, daypart) %>% 
  summarize(
    tote_spend = est_spending %>% 
      sum, 
    ads = n()
    ) %>% 
  filter(
    affiliation %>% 
      equals("OTHER") %>% 
      not
    )

## # A tibble: 16 × 4
## # Groups:   affiliation [2]
##    affiliation daypart       tote_spend    ads
##    <chr>       <chr>              <dbl>  <int>
##  1 DEMOCRAT    DAYTIME         90442270 197746
##  2 DEMOCRAT    EARLY FRINGE    74519960 104256
##  3 DEMOCRAT    EARLY MORNING   98470600 173379
##  4 DEMOCRAT    EARLY NEWS      45078880  47891
##  5 DEMOCRAT    LATE FRINGE     52588070  87493
##  6 DEMOCRAT    LATE NEWS       40284700  42283
##  7 DEMOCRAT    PRIME ACCESS    55331660  57216
##  8 DEMOCRAT    PRIME TIME     102141800  41332
##  9 REPUBLICAN  DAYTIME         51512260  95777
## 10 REPUBLICAN  EARLY FRINGE    55410960  64729
## 11 REPUBLICAN  EARLY MORNING   66368550 117236
## 12 REPUBLICAN  EARLY NEWS      30599170  32693
## 13 REPUBLICAN  LATE FRINGE     34513880  51875
## 14 REPUBLICAN  LATE NEWS       34978000  32920
## 15 REPUBLICAN  PRIME ACCESS    41314930  42547
## 16 REPUBLICAN  PRIME TIME      94601580  41141

Notice how we passed two variables to group_by, but there’s only 1 group specified when we printed the tibble. The last group specified–d1$daypart has been peeled off by the summarize call, so that we can do additional analyses on these summary tables.

Pause here.

This is a very specific design choice that causes a ton of confusion online. In making this design, Hadley understood a very specific analytic loop:

Compute a statistic (in the above, just a count/frequency of ads and spending by daypart and political party)
Then, compute a new summary within only one of these groups–ie, within parties, was the distribution of ads between parties the same?

So, that the parties run different numbers of ads is not so interesting–instead, given that there were different number of ads, what was the proportion of GOP and Democratic ads run on a specific daypart?

d1 %>% 
  group_by(affiliation, daypart) %>% 
  summarize(
    tote_spend = est_spending %>% 
      sum, 
    ads = n()
    ) %>% 
  filter(
    affiliation %>% 
      equals("OTHER") %>% 
      not
    ) %>% 
  mutate(
    perc_ads = ads %>% 
      divide_by(
        ads %>% sum
      ) %>% 
      multiply_by(100),
    perc_spend = tote_spend %>% 
      divide_by(
        tote_spend %>% sum
      ) %>% 
      multiply_by(100)
  )

## # A tibble: 16 × 6
## # Groups:   affiliation [2]
##    affiliation daypart       tote_spend    ads perc_ads perc_spend
##    <chr>       <chr>              <dbl>  <int>    <dbl>      <dbl>
##  1 DEMOCRAT    DAYTIME         90442270 197746    26.3       16.2 
##  2 DEMOCRAT    EARLY FRINGE    74519960 104256    13.9       13.3 
##  3 DEMOCRAT    EARLY MORNING   98470600 173379    23.1       17.6 
##  4 DEMOCRAT    EARLY NEWS      45078880  47891     6.37       8.07
##  5 DEMOCRAT    LATE FRINGE     52588070  87493    11.6        9.41
##  6 DEMOCRAT    LATE NEWS       40284700  42283     5.63       7.21
##  7 DEMOCRAT    PRIME ACCESS    55331660  57216     7.61       9.90
##  8 DEMOCRAT    PRIME TIME     102141800  41332     5.50      18.3 
##  9 REPUBLICAN  DAYTIME         51512260  95777    20.0       12.6 
## 10 REPUBLICAN  EARLY FRINGE    55410960  64729    13.5       13.5 
## 11 REPUBLICAN  EARLY MORNING   66368550 117236    24.5       16.2 
## 12 REPUBLICAN  EARLY NEWS      30599170  32693     6.83       7.48
## 13 REPUBLICAN  LATE FRINGE     34513880  51875    10.8        8.43
## 14 REPUBLICAN  LATE NEWS       34978000  32920     6.87       8.55
## 15 REPUBLICAN  PRIME ACCESS    41314930  42547     8.88      10.1 
## 16 REPUBLICAN  PRIME TIME      94601580  41141     8.59      23.1

For these reasons, pay a ton of attention to the ordering of the variables passed to group_by

Ok–a few more little pieces of useful dplyr stuff:

slice() returns a number of rows
top_n() is like slice, but allows you to define an additional variable on which you’d like to make a specific selection
rename lets your rename variables
tally reports group counts, normally after you
count reports group counts, but lets you define the group inside the function

CMAG has a pretty wide use among academic papers–it can be ordered here http://mediaproject.wesleyan.edu/dataaccess/↩︎

Code Lab-Lab 1 (dplyr’s 5 verbs)

Thomas Wood (wood.1080@osu.edu)

May 10, 2023