dplyr’s 5 verbs (and 1 adverb)

The principle advantage to the dplyr approach is cognitive–that a huge, potentially limitless set of challenges and statistical tasks…

….are actually special cases of a more general problem.

Let me emphasize this: the reason you make an effort with dplyr is that you will eventually come to learn that each seemingly unique analytics challenge, which begins is actually, a special case of a more general statistical process.

You used to be verklempt! “gosh, how to proceed with this new challenge, given all the rich complexity inherent herein?!”

Armed with this approach, you instead see patterns and similarities. Becoming adept at this pattern recognition is the goal of the tidyverse approach.

Just to ruin the suspense, our verbs are

  1. filter (to return a subset of rows)
  2. select (to return a subset of columns)
  3. arrange (to sort a table)
  4. mutate (to add a new column)
  5. summarize (to generate a new table, usually with statistical result)

and our one adverb

  1. group_by (to repeat some analysis separately by each group)

Ok let’s get started. filtering to begin:

dplyr::filter

We filter to select cases. We have a data.frame with n rows, and we’re actually only interested in k rows which meet some other condition.

A rich and illustrative graphical example

We need some data—this is the CMAG1 table with all the television ads run in 2020 presidential election.

library(tidyverse)
library(magrittr)

d1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/cmag_2020_codelab.rds" %>% 
  url %>% 
  gzcon %>% 
  readRDS

We can do the all the following

d1 %>% 
  filter(
    sponsor %>% 
      equals("TRUMP, DONALD")
  )

d1 %>% 
  filter(
    sponsor %>% 
      is_in(
        c("BIDEN, JOE",
          "AMERICA FIRST ACTION")
        )
    )

d1 %>% 
  filter(
    est_spending %>% 
      is_greater_than(10000) &
    air_date %>% 
      is_weakly_greater_than(
        "2020-10-01" %>% 
          as.Date
        )
    )

Stringing together logical conditions is a hugely powerful technique which you’ll use often. Imagine, for instance, you want to check the spots run by President Trump in October or November, or the ads run by then-Vice President Biden in June or July. You might use a nested sets of intersection operators &, or union operators |, and () to communicate the operators’ logical precedence, as follows:

d1 %>% 
  filter(
    (
    sponsor %>% 
      equals("TRUMP, DONALD") &
    month %>% 
      is_in(
        c("Oct", 
          "Nov")
        )
    ) | (
    sponsor %>% 
      equals("BIDEN, JOE") &
    month %>% 
      is_in(
        c("Jun",
          "Jul")
        )
    )
  )

Take a moment to work this chunk by removing the nesting or the logical operators to really understand what’s going on here.

dplyr::select

If filter uses logical operations on rows, select is the command to do logical operations on columns (or variables).

We get to select columns any number of ways. Really, the flexibility here is almost a pedagogical drawback.

d1 %>% 
  select(1:3)

d1 %>% 
  select(
    ad_type, market, time
  )

d1 %>% 
  select(
    air_date:market
  )

We can use a - to make an selection exclusion

d1 %>% 
  select(-c(1:3))

d1 %>% 
  select(
    -ad_type
  )

d1 %>% 
  select(
    -c(air_date:market)
  )

We get a bunch of utility helpers

d1 %>% 
  select(starts_with("a"))

d1 %>% 
  select(ends_with("e"))

d1 %>% 
  select(contains("_"))

d1 %>% 
  select(
    where(is.character), where(is.ordered)
  )

dplyr::arrange

arrange sorts tables by variables.

This is useful when you need to do something like

  • Get the row which is the highest or lowest for some other value
  • Building labels

…actually come to think of it, arranging tables is pretty specific and mid. But it’s a public university, and the taxpayers

Anyway, here’s how it works

d1 %>% 
  arrange(
    desc(est_spending)
    )
  
d1 %>% 
  arrange(
    air_date
  )

dplyr::mutate

mutate takes an existing table and adds a new column, usually from a computation. It becomes super super super useful, when used in conjunction with group_by

Let’s do an example–an indicator grouping weekday or weekend ads

d1 %>% 
  mutate(
    day_type = wday %>% 
      str_detect("Sat|Sun") %>% 
      ifelse(
        "weekend", "weekday"
      )
  )

or grouping together dayparts

d1 %>% 
  mutate(
    part_type = daypart %>% 
      case_match(
        "EARLY MORNING" ~ "morning",
        c("DAYTIME", 
          "EARLY FRINGE",
          "LATE FRINGE") ~ "daytime",
        c("EARLY NEWS",
          "PRIME ACCESS",
          "PRIME TIME",
          "LATE NEWS") ~ "evening"
      )
  )

or figuring out how many days before the election an ad was broadcast

library(lubridate)

d1 %>% 
  mutate(
    days_before = "2020-11-03" %>% 
      as.Date %>% 
      subtract(
        air_date
      ) %>% 
      as.numeric
  )

dplyr::summarize

summarize takes an existing table and generates a new table of results. You will normally want to assign this resuling table to a new variable

How much was spent on TV advertising in the 2020 presidential election?

d1 %>% 
  summarize(
    tote_spend = est_spending %>% sum
  )

Or how many different advertisers where there?

d1 %>% 
  summarize(
    uniq_spons = sponsor %>% unique %>% length
  )

Now, perhaps it’s already occurred to you–

“gosh, summarize is not too useful, a single stat returned from the table?

and I would agree! We’re statisticians, we like counting things, and most of all, we like counting things by group. Which brings us to…

dplyr::group_by

This adverb really makes the dplyr approach powerful and compact. group_by repeats a statistical task for some arbitrary number of groups (defined by a variable) and then either returns a single value for each group (when used with summarize) or the underlying table (when used with mutate.)

Let’s do some examples. How many ads were run by each political team, and how much was spent?

d1 %>% 
  group_by(affiliation) %>% 
  summarize(
    tote_spend = est_spending %>% 
      sum, 
    ads = n()
    )
## # A tibble: 3 × 3
##   affiliation tote_spend    ads
##   <chr>            <dbl>  <int>
## 1 DEMOCRAT     558857940 751596
## 2 OTHER            87570    243
## 3 REPUBLICAN   409299330 478918

Did the political parties run ads during the same dayparts?

d1 %>% 
  group_by(affiliation, daypart) %>% 
  summarize(
    tote_spend = est_spending %>% 
      sum, 
    ads = n()
    ) %>% 
  filter(
    affiliation %>% 
      equals("OTHER") %>% 
      not
    )
## # A tibble: 16 × 4
## # Groups:   affiliation [2]
##    affiliation daypart       tote_spend    ads
##    <chr>       <chr>              <dbl>  <int>
##  1 DEMOCRAT    DAYTIME         90442270 197746
##  2 DEMOCRAT    EARLY FRINGE    74519960 104256
##  3 DEMOCRAT    EARLY MORNING   98470600 173379
##  4 DEMOCRAT    EARLY NEWS      45078880  47891
##  5 DEMOCRAT    LATE FRINGE     52588070  87493
##  6 DEMOCRAT    LATE NEWS       40284700  42283
##  7 DEMOCRAT    PRIME ACCESS    55331660  57216
##  8 DEMOCRAT    PRIME TIME     102141800  41332
##  9 REPUBLICAN  DAYTIME         51512260  95777
## 10 REPUBLICAN  EARLY FRINGE    55410960  64729
## 11 REPUBLICAN  EARLY MORNING   66368550 117236
## 12 REPUBLICAN  EARLY NEWS      30599170  32693
## 13 REPUBLICAN  LATE FRINGE     34513880  51875
## 14 REPUBLICAN  LATE NEWS       34978000  32920
## 15 REPUBLICAN  PRIME ACCESS    41314930  42547
## 16 REPUBLICAN  PRIME TIME      94601580  41141

Notice how we passed two variables to group_by, but there’s only 1 group specified when we printed the tibble. The last group specified–d1$daypart has been peeled off by the summarize call, so that we can do additional analyses on these summary tables.

Pause here.

This is a very specific design choice that causes a ton of confusion online. In making this design, Hadley understood a very specific analytic loop:

  1. Compute a statistic (in the above, just a count/frequency of ads and spending by daypart and political party)
  2. Then, compute a new summary within only one of these groups–ie, within parties, was the distribution of ads between parties the same?

So, that the parties run different numbers of ads is not so interesting–instead, given that there were different number of ads, what was the proportion of GOP and Democratic ads run on a specific daypart?

d1 %>% 
  group_by(affiliation, daypart) %>% 
  summarize(
    tote_spend = est_spending %>% 
      sum, 
    ads = n()
    ) %>% 
  filter(
    affiliation %>% 
      equals("OTHER") %>% 
      not
    ) %>% 
  mutate(
    perc_ads = ads %>% 
      divide_by(
        ads %>% sum
      ) %>% 
      multiply_by(100),
    perc_spend = tote_spend %>% 
      divide_by(
        tote_spend %>% sum
      ) %>% 
      multiply_by(100)
  )
## # A tibble: 16 × 6
## # Groups:   affiliation [2]
##    affiliation daypart       tote_spend    ads perc_ads perc_spend
##    <chr>       <chr>              <dbl>  <int>    <dbl>      <dbl>
##  1 DEMOCRAT    DAYTIME         90442270 197746    26.3       16.2 
##  2 DEMOCRAT    EARLY FRINGE    74519960 104256    13.9       13.3 
##  3 DEMOCRAT    EARLY MORNING   98470600 173379    23.1       17.6 
##  4 DEMOCRAT    EARLY NEWS      45078880  47891     6.37       8.07
##  5 DEMOCRAT    LATE FRINGE     52588070  87493    11.6        9.41
##  6 DEMOCRAT    LATE NEWS       40284700  42283     5.63       7.21
##  7 DEMOCRAT    PRIME ACCESS    55331660  57216     7.61       9.90
##  8 DEMOCRAT    PRIME TIME     102141800  41332     5.50      18.3 
##  9 REPUBLICAN  DAYTIME         51512260  95777    20.0       12.6 
## 10 REPUBLICAN  EARLY FRINGE    55410960  64729    13.5       13.5 
## 11 REPUBLICAN  EARLY MORNING   66368550 117236    24.5       16.2 
## 12 REPUBLICAN  EARLY NEWS      30599170  32693     6.83       7.48
## 13 REPUBLICAN  LATE FRINGE     34513880  51875    10.8        8.43
## 14 REPUBLICAN  LATE NEWS       34978000  32920     6.87       8.55
## 15 REPUBLICAN  PRIME ACCESS    41314930  42547     8.88      10.1 
## 16 REPUBLICAN  PRIME TIME      94601580  41141     8.59      23.1

For these reasons, pay a ton of attention to the ordering of the variables passed to group_by

Ok–a few more little pieces of useful dplyr stuff:

  • slice() returns a number of rows
  • top_n() is like slice, but allows you to define an additional variable on which you’d like to make a specific selection
  • rename lets your rename variables
  • tally reports group counts, normally after you
  • count reports group counts, but lets you define the group inside the function

a fun youthful exercise

let’s download the Billboard 200 album sales data to do a couple of exercises and try out the dplyr approach.

d2 <- "https://github.com/thomasjwood/code_lab/raw/main/data/billboard_200_68-23.rds"  %>% 
  url %>% 
  gzcon %>% 
  readRDS
  1. No artist better encapsulates Gen-Z’s mid tastes than T-Swift. How many times has Taylor had 5 or more separate albums on a single Billboard 200 chart?

  2. By decade, report the top 5 artists who’ve spent the most weeks with at least 1 top 10 album. Please exclude soundtracks and albums by ‘various artists.’

  3. By decade, report the top 5 artists who’ve had the most separate albums which spent at least 1 week in the top 10. Please again exclude soundtracks and albums by ‘various artists.’

  4. By decade, report the top 5 albums who spent the most weeks in the top 100. Bonus By decade, report the mean number of weeks the start and end of each album’s period in the top 100.


  1. CMAG has a pretty wide use among academic papers–it can be ordered here http://mediaproject.wesleyan.edu/dataaccess/↩︎