R Tutorial 2: Data Transformation

MKT 410: Marketing Analytics

Author

Levin Zhu

Learning Objectives

In this tutorial, we will cover data transformation using the dplyr package, which is one of the most important parts of understanding your data. We will assume, for now, that the data being used has already been imported and tidied. We will cover each of these steps in the next two tutorials.

The key elements of data transformation are:

  • Operating on rows and columns
  • The “pipe” tool to combine “verbs” when transforming data
  • Working with groups within the data

Prerequisites

In this tutorial, we will focus on the dplyr package from tidyverse.

library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

To help us learn the basics of data transformation, we will use the flights dataset within nycflights13 (note, you can call specific functions or datasets from libraries that have not been directly loaded by using two colons ::, e.g. nycflights13::flights). This dataset contains all 336,776 flights that departed from New York City in 2013.

# install.packages("nycflights13") ## install first if needed
library(nycflights13)
flights
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

flights is a “tibble”, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.

Ways to view more of the dataset include:

  • View(flights)

  • print(flights, width = Inf)

  • glimpse(flights)

    glimpse(flights)
    #> Rows: 336,776
    #> Columns: 19
    #> $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
    #> $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
    #> $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
    #> $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…
    #> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…
    #> $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…
    #> $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…
    #> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…
    #> $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…
    #> $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"…
    #> $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…
    #> $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N…
    #> $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG…
    #> $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA…
    #> $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…
    #> $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…
    #> $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…
    #> $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…
    #> $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…

dplyr Basics

There are some basic “rules” for when using dplyr functions (or verbs):

  1. The first argument is always a data frame.
  2. The subsequent arguments typically describe which columns to operate on using the variable names (without quotes).
  3. The output is always a new data frame.

Each verb does one specific transformation to the data frame, so solving complex problems will usually require combining multiple verbs. To do this, we will use a pipe %>%, which takes the thing before it and passes it along as the first argument in the function to the right of it.

  • x %>% f(y) is equivalent to f(x, y)
  • x %>% f(y) %>% g(z) is equivalent to g(f(x, y), z)

In the context of working with data, we might have something that looks like this:

flights %>%
  filter(dest == "IAH") %>%
  group_by(year, month, day) %>%
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )

dplyr verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables. We’ll go through each of these groups in the following sections.

Rows

We’ll cover four row operations:

  • filter(): changes which rows are present without changing their order
  • arrange(): changes the order of the rows without changing which are present
  • distinct(): finds rows with unique values
  • count(): counts the number of rows within groups of rows

filter

filter() allows you to keep rows based on the values of the columns. The arguments are (in order)

  1. The data frame
  2. (and subsequent) The conditions that must be true to keep the row

Let’s find all flights that departed more than 120 minutes late:

flights %>%
  filter(dep_delay > 120)
#> # A tibble: 9,723 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      848           1835       853     1001           1950
#> 2  2013     1     1      957            733       144     1056            853
#> 3  2013     1     1     1114            900       134     1447           1222
#> 4  2013     1     1     1540           1338       122     2020           1825
#> 5  2013     1     1     1815           1325       290     2120           1542
#> 6  2013     1     1     1842           1422       260     1958           1535
#> # ℹ 9,717 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

The conditions you can use include:

  • >: greater than
  • >=: greater than or equal to
  • <: less than
  • <=: less than or equal to
  • ==: equal to
  • !=: not equal to
  • %in%: matching specific values

You can also combine conditions using & or , to indicate “and” (i.e. check both conditions) and | to indicate “or” (i.e. check for either condition):

# Flights that departed on January 1
flights %>%
  filter(month == 1 & day == 1)
#> # A tibble: 842 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 836 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
# Flights that departed in January or February from LGA or JFK
flights %>%
  filter(
    month %in% c(1, 2),
    origin %in% c("LGA", "JFK")
  )
#> # A tibble: 32,955 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      533            529         4      850            830
#> 2  2013     1     1      542            540         2      923            850
#> 3  2013     1     1      544            545        -1     1004           1022
#> 4  2013     1     1      554            600        -6      812            837
#> 5  2013     1     1      557            600        -3      709            723
#> 6  2013     1     1      557            600        -3      838            846
#> # ℹ 32,949 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

Note that the filter() function simply executes the operation and prints out the result, but it does not modify the existing flights dataset. To save the result, we need to assign it to a new object:

jan1 <- flights %>%
  filter(month == 1, day == 1)

Common Mistakes

When specifying equality in the filter() function, make sure to use double equals signs == and not just one =. The single equal sign = is always used to refer to the value of an argument in a function.

flights %>%
  filter(month = 1)
#> Error in `filter()`:
#> ! We detected a named input.
#> ℹ This usually means that you've used `=` instead of `==`.
#> ℹ Did you mean `month == 1`?

Another common mistake is using the “or” | operator as follows:

flights %>%
  filter(month == 1 | 2)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

While this doesn’t give an error message, what’s actually happening is that the function checks two conditions, the first being month == 1 and the second being 2. The second condition does not mean anything in this context; the result is a filtered dataset with all observations that are in January only.

arrange()

arrange() changes the order of the rows based on the values of the columns. The arguments are (in order)

  1. The data frame
  2. (and subsequent) The columns (or more complicated expressions) to order by

When more than one column name is provided, each additional column will be used to break ties in the values of the preceding columns.

For example, let’s try sorting flights by departure time.

flights %>%
  arrange(year, month, day, dep_time)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

What if we specified the ordered columns in reverse order? Try it out.

flights %>%
  arrange(dep_time, day, month, year)
## does the output make sense?

You can use desc() on a column inside of arrange() to re-order the data frame based on that column in descending order. The following code orders flights from most to least delayed:

flights %>%
  arrange(desc(dep_delay))
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     9      641            900      1301     1242           1530
#> 2  2013     6    15     1432           1935      1137     1607           2120
#> 3  2013     1    10     1121           1635      1126     1239           1810
#> 4  2013     9    20     1139           1845      1014     1457           2210
#> 5  2013     7    22      845           1600      1005     1044           1815
#> 6  2013     4    10     1100           1900       960     1342           2211
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

distinct()

distinct() finds all the unique rows in a dataset. The arguments are:

  1. A data frame.
  2. (Optional) Column name(s) for which you want the distinct combinations of. If not provided, the function will return all the unique rows in the data frame.
# Remove duplicate rows, if any
flights %>%
  distinct()
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

There are no unique rows in the dataset.

# Find all unique origin and destination pairs
flights %>%
  distinct(origin, dest)
#> # A tibble: 224 × 2
#>   origin dest 
#>   <chr>  <chr>
#> 1 EWR    IAH  
#> 2 LGA    IAH  
#> 3 JFK    MIA  
#> 4 JFK    BQN  
#> 5 LGA    ATL  
#> 6 EWR    ORD  
#> # ℹ 218 more rows

If you want to keep other columns when filtering for unique rows, you can use the .keep_all = TRUE option.

flights %>%
  distinct(origin, dest, .keep_all = TRUE)
#> # A tibble: 224 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 218 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

Note that distinct() will find the first occurrence of a unique row in the dataset and discard the rest. This is why we see all the distinct flights above departed on January 1.

count()

count() finds the number of occurrences within each unique set of rows. For example, if we wanted to find the number of flights that occurred between any two airports, you can write:

flights %>%
  count(origin, dest, sort = TRUE)
#> # A tibble: 224 × 3
#>   origin dest      n
#>   <chr>  <chr> <int>
#> 1 JFK    LAX   11262
#> 2 LGA    ATL   10263
#> 3 LGA    ORD    8857
#> 4 JFK    SFO    8204
#> 5 LGA    CLT    6168
#> 6 EWR    ORD    6100
#> # ℹ 218 more rows

The sort = TRUE argument arranges the number of occurrences in descending order.

Exercises

  1. In a single pipeline for each condition, find all flights that meet the conditions:

    • Had an arrival delay of two or more hours
    • Flew to Houston (IAH or HOU)
    • Were operated by United, American, or Delta
    • Departed in summer (July, August, and September)
    • Arrived more than two hours late but didn’t leave late
    • Were delayed by at least an hour, but made up over 30 minutes in flight
  2. Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.

  3. Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

  4. Was there a flight on every day of 2013?

  5. Which flights traveled the farthest distance? Which traveled the least distance?

  6. Does it matter what order you used filter() and `arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

Columns

There are four important column operations that we will cover:

  • mutate(): creates new columns that are derived from the existing columns
  • select(): changes which columns are present
  • rename(): changes the names of the columns
  • relocate(): changes the positions of the columns

mutate()

mutate() adds new columns that are calculated from the existing columns.

Let us compute the gain, or how much time a delayed flight made up in the air, and the speed in miles per hour:

flights %>%
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
#> # A tibble: 336,776 × 21
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

By default, mutate() adds new columns on the right-hand side of your dataset, making it difficult to see what’s happening here. The .before argument can be used to add the variables to the left-hand side of the data frame.

flights %>%
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1 # before the first column
  )
#> # A tibble: 336,776 × 21
#>    gain speed  year month   day dep_time sched_dep_time dep_delay arr_time
#>   <dbl> <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1    -9  370.  2013     1     1      517            515         2      830
#> 2   -16  374.  2013     1     1      533            529         4      850
#> 3   -31  408.  2013     1     1      542            540         2      923
#> 4    17  517.  2013     1     1      544            545        -1     1004
#> 5    19  394.  2013     1     1      554            600        -6      812
#> 6   -16  288.  2013     1     1      554            558        -4      740
#> # ℹ 336,770 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, …

We can alternatively specify .after a specific column, e.g.:

flights %>%
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day
  )

Another useful argument in mutate() is the .keep argument, which specifies which columns to keep among four different options (“all” which is the default; “used” which only retains those used to create the new columns; “unused” which retains only columns that were not used; and “none” which only keeps the newly created column and any grouping variables.)

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
#> # A tibble: 336,776 × 6
#>   dep_delay arr_delay air_time  gain hours gain_per_hour
#>       <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
#> 1         2        11      227    -9  3.78         -2.38
#> 2         4        20      227   -16  3.78         -4.23
#> 3         2        33      160   -31  2.67        -11.6 
#> 4        -1       -18      183    17  3.05          5.57
#> 5        -6       -25      116    19  1.93          9.83
#> 6        -4        12      150   -16  2.5          -6.4 
#> # ℹ 336,770 more rows

select()

select() allows you to focus only on specific variables that you’re interested in. There are a few ways to specify the subset of columns.

  • Select columns by name:

    flights %>%
      select(year, month, day)
  • Select all columns between year and day (inclusive):

    flights %>%
      select(year:day)
  • Select all columns except those from year to day (inclusive):

    flights %>%
      select(!year:day)
  • Select all columns that are characters:

    flights %>%
      select(where(is.character))

Other helper functions that you can use within select() are:

  • start_with("abc"): matches names that begin with “abc”
  • ends_with("xyz"): matches names that end with “xyz”
  • contains("ijk"): matches names that contain “ijk”
  • num_range("x", 1:3): matches x1, x2, and x3

See ?select for more details.

You can also rename variables as you select() then by using =. The new name is on the left-hand side of the =, and the old variable name is on the right-hand side.

flights %>%
  select(tail_num = tailnum)
#> # A tibble: 336,776 × 1
#>   tail_num
#>   <chr>   
#> 1 N14228  
#> 2 N24211  
#> 3 N619AA  
#> 4 N804JB  
#> 5 N668DN  
#> 6 N39463  
#> # ℹ 336,770 more rows

rename()

rename() is used to keep all existing variables whiel renaming a few of them.

flights %>%
  rename(tail_num = tailnum)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

relocate()

relocate() moves variables around. By default, relocate() moves variables to the front:

flights %>%
  relocate(time_hour, air_time)
#> # A tibble: 336,776 × 19
#>   time_hour           air_time  year month   day dep_time sched_dep_time
#>   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
#> 1 2013-01-01 05:00:00      227  2013     1     1      517            515
#> 2 2013-01-01 05:00:00      227  2013     1     1      533            529
#> 3 2013-01-01 05:00:00      160  2013     1     1      542            540
#> 4 2013-01-01 05:00:00      183  2013     1     1      544            545
#> 5 2013-01-01 06:00:00      116  2013     1     1      554            600
#> 6 2013-01-01 05:00:00      150  2013     1     1      554            558
#> # ℹ 336,770 more rows
#> # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, …

You can also specify where to put them using the .before and .after arguments, just like in mutate():

flights %>%
  relocate(year:dep_time, .after = time_hour)
flights %>%
  relocate(starts_with("arr"), .before = dep_time)

Exercises

  1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

  2. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

  3. What happens if you specify the name of the same variable multiple times in a select() call?

  4. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

    variables <- c("year", "month", "day", "dep_delay", "arr_delay")
  5. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

    flights %>% select(contains("TIME"))
  6. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

  7. Why doesn’t the following work, and what does the error mean?

    flights %>%
      select(tailnum) %>%
      arrange(arr_delay)
    #> Error in `arrange()`:
    #> ℹ In argument: `..1 = arr_delay`.
    #> Caused by error:
    #> ! object 'arr_delay' not found

The Pipe %>%

We have learned a few different ways to transform our dataset flights, and we saw how to do so using the pipe %>%. But its real power arises when start to combine multiple verbs.

For example, let’s say you wanted to find the fastest flights to Houston’s IAH airport. How would we do that?

  1. Filter the destination airport to IAH using filter()
  2. Create a new variable for the speed of each flight using mutate()
  3. Order by the speed using arrange()
  4. View relevant columns using select()

Let’s see it in action:

flights %>%
  filter(dest == "IAH") %>%
  mutate(speed = distance / air_time * 60) %>%
  arrange(desc(speed)) %>%
  select(year:day, dep_time, carrier, flight, speed)
#> # A tibble: 7,198 × 7
#>    year month   day dep_time carrier flight speed
#>   <int> <int> <int>    <int> <chr>    <int> <dbl>
#> 1  2013     7     9      707 UA         226  522.
#> 2  2013     8    27     1850 UA        1128  521.
#> 3  2013     8    28      902 UA        1711  519.
#> 4  2013     8    28     2122 UA        1022  519.
#> 5  2013     6    11     1628 UA        1178  515.
#> 6  2013     8    27     1017 UA         333  515.
#> # ℹ 7,192 more rows

Compare the above code using the pipeline vs. nested function calls:

select(
  arrange(
    mutate(
      filter(
        flights, 
        dest == "IAH"
      ),
      speed = distance / air_time * 60
    ),
    desc(speed)
  ),
  year:day, dep_time, carrier, flight, speed
)

Or using intermediate objects:

flights1 <- filter(flights, dest == "IAH")
flights2 <- mutate(flights1, speed = distance / air_time * 60)
flights3 <- select(flights2, year:day, dep_time, carrier, flight, speed)
arrange(flights3, desc(speed))

In contrast, the pipeline generally produces data analysis code that is easier to write and read.

A shortcut for adding a pipe to your code is simply Ctrl/Cmd + Shift + M.

Groups

One of the most powerful aspects of dplyr is its ability to work with groups. We’ll focus on three core functions:

  • group_by(): divides your dataset into groups
  • summarize(): calculates a single summary statistic (for each group)
  • The slice_ functions: extracts specific rows within each group

group_by()

group_by() divides your dataset into groups meaningful for your analysis:

flights %>%
  group_by(month)
#> # A tibble: 336,776 × 19
#> # Groups:   month [12]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

As you’ll notice, the data doesn’t change. However, the output indicates that it is “grouped by” month (Groups: month [12]). This means subsequent operations will now work “by month”. group_by() adds this grouped feature to the data frame, which changes the behavior of the subsequent verbs applied to the data.

summarize()

summarize() calculates summary statistics for each group, reducing the data frame to have a single row for each group.

Let’s try to compute the average departure delay by month.

flights %>%
  group_by(month) %>%
  summarize(
    avg_delay = mean(dep_delay)
  )
#> # A tibble: 12 × 2
#>   month avg_delay
#>   <int>     <dbl>
#> 1     1        NA
#> 2     2        NA
#> 3     3        NA
#> 4     4        NA
#> 5     5        NA
#> 6     6        NA
#> # ℹ 6 more rows

Why are there all the NAs? This happened due to some observed flights having missing data in the delay column. The mean() function cannot calculate a mean when there is missing data. In order to deal with this, we need to tell mean() to ignore missing values using the na.rm argument:

flights %>%
  group_by(month) %>%
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE)
  )
#> # A tibble: 12 × 2
#>   month avg_delay
#>   <int>     <dbl>
#> 1     1      10.0
#> 2     2      10.8
#> 3     3      13.2
#> 4     4      13.9
#> 5     5      13.0
#> 6     6      20.8
#> # ℹ 6 more rows

You can create any number of summaries in a single call of summarize(). There are a lot of different summary statistics that we can specify. One very useful one is n(), which returns the number of rows in each group:

flights %>%
  group_by(month) %>%
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE), 
    n = n()
  )
#> # A tibble: 12 × 3
#>   month avg_delay     n
#>   <int>     <dbl> <int>
#> 1     1      10.0 27004
#> 2     2      10.8 24951
#> 3     3      13.2 28834
#> 4     4      13.9 28330
#> 5     5      13.0 28796
#> 6     6      20.8 28243
#> # ℹ 6 more rows

The slice_ functions

There are five useful functions that allow you to extract specific rows within each group:

  • df |> slice_head(n = 1) takes the first row from each group
  • df |> slice_tail(n = 1) takes the last row in each group
  • df |> slice_min(x, n = 1) takes the row with the smallest value of column x
  • df |> slice_max(x, n = 1) takes the row with the largest value of column x
  • df |> slice_sample(n = 1) takes one random row

You can vary n to select more than one row, or use prop instead of n to select a proportion of rows (e.g. prop = 0.1 to select 10% of the rows in each group).

The following code finds the flights that are most delayed upon arrival at each destination:

flights %>%
  group_by(dest) %>%
  slice_max(arr_delay, n = 1) %>%
  relocate(dest)
#> # A tibble: 108 × 19
#> # Groups:   dest [105]
#>   dest   year month   day dep_time sched_dep_time dep_delay arr_time
#>   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1 ABQ    2013     7    22     2145           2007        98      132
#> 2 ACK    2013     7    23     1139            800       219     1250
#> 3 ALB    2013     1    25      123           2000       323      229
#> 4 ANC    2013     8    17     1740           1625        75     2042
#> 5 ATL    2013     7    22     2257            759       898      121
#> 6 AUS    2013     7    10     2056           1505       351     2347
#> # ℹ 102 more rows
#> # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, …

Note that there are 105 destinations but we get 108 rows here. This is because slice_min() and slice_max() keep tied values. If you want exactly one row per group, set with_ties = FALSE.

While we could compute the max delay with summarize(), using the above code allows you to get the whole corresponding row(s) instead of the single summary statistic.

Grouping by multiple variables

You can create groups using more than one variable. For example, we could create a group for each date (year, month, day):

daily <- flights %>%
  group_by(year, month, day)
daily
#> # A tibble: 336,776 × 19
#> # Groups:   year, month, day [365]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

Note that when you summarize a tibble grouped by more than one variable, each summary “peels off” the last group. For example, if we use summarize(n = n()) on daily, we will be left with a data frame that has year & month groups, but will no longer have each date as a specific subgroup.

daily_flights <- daily %>%
  summarize(n = n())
#> `summarise()` has grouped output by 'year', 'month'. You can override using
#> the `.groups` argument.

The warning message tells you how to change this default behavior. You can explicitly request the same behavior (.groups = "drop_last") in order to suppress the message.

daily_flights <- daily %>%
  summarize(
    n = n(),
    .groups = "drop_last"
  )

Alternatively, .groups = "drop" drops all groupings while .groups = "keep" preserves the same groups as the original data frame.

daily_flights <- daily %>%
  summarize(
    n = n(),
    .groups = "drop"
  )
daily_flights
#> # A tibble: 365 × 4
#>    year month   day     n
#>   <int> <int> <int> <int>
#> 1  2013     1     1   842
#> 2  2013     1     2   943
#> 3  2013     1     3   914
#> 4  2013     1     4   915
#> 5  2013     1     5   720
#> 6  2013     1     6   832
#> # ℹ 359 more rows

Ungrouping

You can also directly ungroup a data frame using ungroup(), instead of using summarize().

daily %>%
  ungroup()
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

Let’s see what happens when we summarize an ungrouped data frame.

daily %>%
  ungroup() %>%
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE), 
    flights = n()
  )
#> # A tibble: 1 × 2
#>   avg_delay flights
#>       <dbl>   <int>
#> 1      12.6  336776

.by

Instead of using group_by() as a separate function in the pipeline, dplyr allows you to specify the groups you want to summarize data for within the summarize() function itself using the .by argument.

flights %>%
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .by = month
  )
#> # A tibble: 12 × 3
#>   month delay     n
#>   <int> <dbl> <int>
#> 1     1 10.0  27004
#> 2    10  6.24 28889
#> 3    11  5.44 27268
#> 4    12 16.6  28135
#> 5     2 10.8  24951
#> 6     3 13.2  28834
#> # ℹ 6 more rows

Or if you want to group by multiple variables:

flights %>%
  summarize(
    delay = mean(dep_delay, na.rm = TRUE), 
    n = n(),
    .by = c(origin, dest)
  )
#> # A tibble: 224 × 4
#>   origin dest  delay     n
#>   <chr>  <chr> <dbl> <int>
#> 1 EWR    IAH   11.8   3973
#> 2 LGA    IAH    9.06  2951
#> 3 JFK    MIA    9.34  3314
#> 4 JFK    BQN    6.67   599
#> 5 LGA    ATL   11.4  10263
#> 6 EWR    ORD   14.6   6100
#> # ℹ 218 more rows

Note that you can use the .by argument with all the verbs in the dplyr toolkit. It can be very useful and may be more convenient to use than group_by().

Exercises

  1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarize(n()))
  2. Find the flights that are most delayed upon departure from each destination.
  3. How do delays vary over the course of the day? Illustrate your answer with a plot.
  4. What happens if you supply a negative n to slice_min() and friends?
  5. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

Summary

In this tutorial, we learned the tools that dplyr provides for working with data frames. There are roughly three categories of tools:

  1. Those that manipulate rows (e.g. filter() and arrange())
  2. Those that manipulate columns (e.g. select() and mutate())
  3. Those that manipulate groups (e.g. group_by() and summarize(), or summarize() with the .by argument)

In the next two tutorials, we’ll talk about the steps that lead up to the data transformation step: data importing and data tidying.