Data wrangling

Introduction

Prerequisites

In this chapter, we will introduce the dplyr package, another core member of the Tidyverse. Because we still have not discussed importing our own datasets, we will showcase data manipulation with dplyr with the nycflights13::flights dataset. We will also apply our knowledge from the previous chapter on ggplot2 by creating various visualizations.

In this chapter, we will need the following packages:

library(ggplot2)
library(dplyr)
library(nycflights13)

After activating dplyr, a message informs us that several functions from base R have been masked, namely filter(), lag(), intersect(), setdiff(), setequal(), and union(). This means that calling filter() after dplyr has been activated does not refer to the base R function from the stats package anymore. If you still want to use that function, you need to specify its full name stats::filter().

The nycflights13 package

The nycflights13 package contains five datasets on all 336,776 flights that departed from New York City in 2013. We will only be using the flights data frame in this chapter. Before we start, I recommend that you take a look at the documentation (?flights) to find out what each column (variable) represents.

flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
 1  2013     1     1      517            515         2      830            819        11 UA        1545
 2  2013     1     1      533            529         4      850            830        20 UA        1714
 3  2013     1     1      542            540         2      923            850        33 AA        1141
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
 5  2013     1     1      554            600        -6      812            837       -25 DL         461
 6  2013     1     1      554            558        -4      740            728        12 UA        1696
 7  2013     1     1      555            600        -5      913            854        19 B6         507
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708
 9  2013     1     1      557            600        -3      838            846        -8 B6          79
10  2013     1     1      558            600        -2      753            745         8 AA         301
# … with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Take note of the different column data types: <int> and <dbl> are numeric data types that encode integers and doubles (real numbers), respectively. There are also several character columns (<chr>) and one column containing a date and a time (<dttm>).

Basic verbs

The dplyr package offers six functions that form the verbs of a powerful data manipulation language. With these six functions, we can solve almost any data manipulation problem. In particular, the functions perform the following tasks:

filter() selects observations (rows) by certain values.
arrange() arranges rows in a different order.
select() picks specific variables (columns).
mutate() creates new variables based on existing columns.
summarize() aggregates multiple data points to a single value.
group_by() changes subsequent operations to work on grouped subsets instead of the entire dataset.

We will discuss each function using example operations on the nycflights::flights dataset. All functions share some common consistent behavior, because they take a data frame as their first argument and always return a new data frame as their result. Therefore, dplyr functions never modify the original data frame.

Filter rows

We can filter rows in a data frame based on their values in specific columns with filter(). As mentioned previously, the first argument is the data frame. Additional arguments specify how rows should be filtered. For example, we could filter our data to get all flights that departed on January 1st:

filter(flights, month == 1, day == 1)

# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
 1  2013     1     1      517            515         2      830            819        11 UA        1545
 2  2013     1     1      533            529         4      850            830        20 UA        1714
 3  2013     1     1      542            540         2      923            850        33 AA        1141
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
 5  2013     1     1      554            600        -6      812            837       -25 DL         461
 6  2013     1     1      554            558        -4      740            728        12 UA        1696
 7  2013     1     1      555            600        -5      913            854        19 B6         507
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708
 9  2013     1     1      557            600        -3      838            846        -8 B6          79
10  2013     1     1      558            600        -2      753            745         8 AA         301
# … with 832 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The result is a new data frame consisting of 842 rows with flights on that date. Notice that we can specify column names directly without having to prepend flights$, because filter() is smart enough to know that we are referring to column names inside the flights data frame.

R supports the usual comparison operators:

== (equal)
!= (not equal)
> (greater than)
>= (greater than or equal to)
< (less than)
<= (less than or equal to)

When we pass two or more comparison arguments, filter() combines them with “and” (as in the previous example). In addition, we can combine comparisons with these logical operators:

& (and)
| (or)
! (not)
xor() (exclusive or)

We could rewrite the previous example as follows:

filter(flights, month == 1 & day == 1)

Here’s how we could find all flights departing in November or December:

filter(flights, month == 11 | month == 12)

The %in% operator is a handy shortcut for chaining several comparisons with “or”. The previous example could be rewritten as:

filter(flights, month %in% c(11, 12))

Filtering data with filter() will automatically drop all NA values. If you want to retain missing values you need to be explicit:

df = tibble(x=c(1, NA, 3))
filter(df, x > 1)  # NA values are automatically dropped

# A tibble: 1 × 1
      x
  <dbl>
1     3

filter(df, is.na(x) | x > 1)

# A tibble: 2 × 1
      x
  <dbl>
1    NA
2     3

Exercises

See https://r4ds.had.co.nz/transform.html#exercises-8.

Arrange rows

It is often useful to arrange rows in a specific order. Similar to filter(), the arrange() function takes a data frame as its first argument. One or more additional arguments specify how sorting should be performed. Here’s how to sort the data by scheduled departure time as well as year, month, and day:

arrange(flights, year, month, day, sched_dep_time)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
 1  2013     1     1      517            515         2      830            819        11 UA        1545
 2  2013     1     1      533            529         4      850            830        20 UA        1714
 3  2013     1     1      542            540         2      923            850        33 AA        1141
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
 5  2013     1     1      554            558        -4      740            728        12 UA        1696
 6  2013     1     1      559            559         0      702            706        -4 B6        1806
 7  2013     1     1      554            600        -6      812            837       -25 DL         461
 8  2013     1     1      555            600        -5      913            854        19 B6         507
 9  2013     1     1      557            600        -3      709            723       -14 EV        5708
10  2013     1     1      557            600        -3      838            846        -8 B6          79
# … with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

By default, arrange() uses ascending order, but we can use the desc() function to sort in descending order:

arrange(flights, desc(dep_delay))

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
 1  2013     1     9      641            900      1301     1242           1530      1272 HA          51
 2  2013     6    15     1432           1935      1137     1607           2120      1127 MQ        3535
 3  2013     1    10     1121           1635      1126     1239           1810      1109 MQ        3695
 4  2013     9    20     1139           1845      1014     1457           2210      1007 AA         177
 5  2013     7    22      845           1600      1005     1044           1815       989 MQ        3075
 6  2013     4    10     1100           1900       960     1342           2211       931 DL        2391
 7  2013     3    17     2321            810       911      135           1020       915 DL        2119
 8  2013     6    27      959           1900       899     1236           2226       850 DL        2007
 9  2013     7    22     2257            759       898      121           1026       895 DL        2047
10  2013    12     5      756           1700       896     1058           2020       878 AA         172
# … with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note that missing values will always be sorted at the end.

Exercises

See https://r4ds.had.co.nz/transform.html#exercises-9.

Select variables

Often, only a subset of columns in the original data is needed for a particular analysis. The select() function picks only those columns you specify as arguments. Note that ranges can be specified with the : operator, and negative values indicate the complement of a specific selection.

select(flights, year, month, day)  # columns year, month, and day
select(flights, year:day)  # columns year through day
select(flights, -(year:day))  # all columns except year through day

The following functions are very useful in combination with select():

starts_with("abc") finds names that start with “abc”.
ends_with("xyz") finds names that end with “xyz”.
contains("ijk") finds names that contain “ijk”.
matches("(.)\\1") finds names that match the regular expression (.)\\1 (repeated characters in this particular example). Regular expressions are a powerful tool for matching almost any kind of pattern, but we do not have time to cover them in this workshop.
num_range("x", 1:3) finds names x1, x2, and x3.

Here are two examples with the flights dataset:

select(flights, contains("arr"))  # select all columns that contain "arr"

# A tibble: 336,776 × 4
   arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl> <chr>  
 1      830            819        11 UA     
 2      850            830        20 UA     
 3      923            850        33 AA     
 4     1004           1022       -18 B6     
 5      812            837       -25 DL     
 6      740            728        12 UA     
 7      913            854        19 B6     
 8      709            723       -14 EV     
 9      838            846        -8 B6     
10      753            745         8 AA     
# … with 336,766 more rows

select(flights, ends_with("time"))  # select all columns that end with "time"

# A tibble: 336,776 × 5
   dep_time sched_dep_time arr_time sched_arr_time air_time
      <int>          <int>    <int>          <int>    <dbl>
 1      517            515      830            819      227
 2      533            529      850            830      227
 3      542            540      923            850      160
 4      544            545     1004           1022      183
 5      554            600      812            837      116
 6      554            558      740            728      150
 7      555            600      913            854      158
 8      557            600      709            723       53
 9      557            600      838            846      140
10      558            600      753            745      138
# … with 336,766 more rows

Although select() can be used to rename columns, it is often more convenient to use the rename() function instead. Whereas select() drops all columns not explicity mentioned, rename() keeps all of those columns.

rename(flights, tail_num=tailnum)  # rename tailnum column to tail_num

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
 1  2013     1     1      517            515         2      830            819        11 UA        1545
 2  2013     1     1      533            529         4      850            830        20 UA        1714
 3  2013     1     1      542            540         2      923            850        33 AA        1141
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
 5  2013     1     1      554            600        -6      812            837       -25 DL         461
 6  2013     1     1      554            558        -4      740            728        12 UA        1696
 7  2013     1     1      555            600        -5      913            854        19 B6         507
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708
 9  2013     1     1      557            600        -3      838            846        -8 B6          79
10  2013     1     1      558            600        -2      753            745         8 AA         301
# … with 336,766 more rows, and 8 more variables: tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Finally, the everything() function is useful if you want to reorder specific columns (for example moving them to the beginning of the data frame):

select(flights, time_hour, air_time, everything())  # move time_hour and air_time columns to the beginning

# A tibble: 336,776 × 19
   time_hour           air_time  year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1 2013-01-01 05:00:00      227  2013     1     1      517            515         2      830            819
 2 2013-01-01 05:00:00      227  2013     1     1      533            529         4      850            830
 3 2013-01-01 05:00:00      160  2013     1     1      542            540         2      923            850
 4 2013-01-01 05:00:00      183  2013     1     1      544            545        -1     1004           1022
 5 2013-01-01 06:00:00      116  2013     1     1      554            600        -6      812            837
 6 2013-01-01 05:00:00      150  2013     1     1      554            558        -4      740            728
 7 2013-01-01 06:00:00      158  2013     1     1      555            600        -5      913            854
 8 2013-01-01 06:00:00       53  2013     1     1      557            600        -3      709            723
 9 2013-01-01 06:00:00      140  2013     1     1      557            600        -3      838            846
10 2013-01-01 06:00:00      138  2013     1     1      558            600        -2      753            745
# … with 336,766 more rows, and 9 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>

Exercises

See https://r4ds.had.co.nz/transform.html#exercises-10.

Create new variables

The mutate() function adds new columns based on values from existing columns. Because new columns are always added to the end, we will first create a subset of the flights data frame with fewer columns to better see the effect of mutate().

(df = select(flights, year:day, ends_with("delay"), distance, air_time))

# A tibble: 336,776 × 7
    year month   day dep_delay arr_delay distance air_time
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
 1  2013     1     1         2        11     1400      227
 2  2013     1     1         4        20     1416      227
 3  2013     1     1         2        33     1089      160
 4  2013     1     1        -1       -18     1576      183
 5  2013     1     1        -6       -25      762      116
 6  2013     1     1        -4        12      719      150
 7  2013     1     1        -5        19     1065      158
 8  2013     1     1        -3       -14      229       53
 9  2013     1     1        -3        -8      944      140
10  2013     1     1        -2         8      733      138
# … with 336,766 more rows

mutate(df, gain=arr_delay - dep_delay, hours=air_time / 60, speed=distance / hours)

# A tibble: 336,776 × 10
    year month   day dep_delay arr_delay distance air_time  gain hours speed
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>
 1  2013     1     1         2        11     1400      227     9 3.78   370.
 2  2013     1     1         4        20     1416      227    16 3.78   374.
 3  2013     1     1         2        33     1089      160    31 2.67   408.
 4  2013     1     1        -1       -18     1576      183   -17 3.05   517.
 5  2013     1     1        -6       -25      762      116   -19 1.93   394.
 6  2013     1     1        -4        12      719      150    16 2.5    288.
 7  2013     1     1        -5        19     1065      158    24 2.63   404.
 8  2013     1     1        -3       -14      229       53   -11 0.883  259.
 9  2013     1     1        -3        -8      944      140    -5 2.33   405.
10  2013     1     1        -2         8      733      138    10 2.3    319.
# … with 336,766 more rows

Note that we can even refer to columns that we have only just created (the hours column in the previous example is immediately used for creating the speed column).

Whereas mutate() always keeps all columns, transmute() keeps only the new ones.

Exercises

See https://r4ds.had.co.nz/transform.html#exercises-11.

Summarize grouped data

Summarizing variables is important in almost any data analysis task. It is most useful on grouped data, but to showcase the summarize() function we will start by applying it on the whole (ungrouped) data first:

summarize(flights, delay=mean(dep_delay, na.rm=TRUE))

# A tibble: 1 × 1
  delay
  <dbl>
1  12.6

The group_by function partitions the dataset into groups. Applying summarize() to grouped data will automatically compute the summary for each group. In the following example, we compute the average delay for each day:

by_day = group_by(flights, year, month, day)
summarize(by_day, delay=mean(dep_delay, na.rm=TRUE))

# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day delay
   <int> <int> <int> <dbl>
 1  2013     1     1 11.5 
 2  2013     1     2 13.9 
 3  2013     1     3 11.0 
 4  2013     1     4  8.95
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.55
 9  2013     1     9  2.28
10  2013     1    10  2.84
# … with 355 more rows

Or what about the average delay per month?

by_month = group_by(flights, year, month)
summarize(by_month, delay=mean(dep_delay, na.rm=TRUE))

# A tibble: 12 × 3
# Groups:   year [1]
    year month delay
   <int> <int> <dbl>
 1  2013     1 10.0 
 2  2013     2 10.8 
 3  2013     3 13.2 
 4  2013     4 13.9 
 5  2013     5 13.0 
 6  2013     6 20.8 
 7  2013     7 21.7 
 8  2013     8 12.6 
 9  2013     9  6.72
10  2013    10  6.24
11  2013    11  5.44
12  2013    12 16.6

Seems like November 2013 was a good time to fly. One other thing to notice in the previous example is that we needed to create a new variable by_month that contained the grouped data. However, we actually never really needed this variable except as input to the summarize() function. In the next section, we will introduce a method that allows us to get rid of such temporary (unneeded) variables: the pipe operator.

The pipe operator

One of the key properties of Tidyverse functions is their support for the pipe operator %>%, which is defined in the magrittr package. R has gained a native pipe operator as of version 4.1. It is written as |> and we can use it instead of %>% in almost all cases. We will use the native pipe operator |> throughout the course material, but if you explicitly want your code to run on R versions prior to 4.1, you need to use the magrittr pipe %>%.

In RStudio, you can use the keyboard shortcut ⌘+⇧+m (macOS) or Ctrl+⇧+m (Windows and Linux) to insert the pipe operator (you can select between the native and magrittr variants in the “Code” section of RStudio’s preferences).

The idea behind piping is simple: instead of having to store an intermediate value in a variable, the pipe operator forwards it as the first argument to the next function. This means that instead of writing f(x), we can express the same command as x |> f(). For example, mean(x) could also be written as x |> mean(). This is not really helpful yet, but if we wanted to feed the result of f(x) into another function g(), instead of writing g(f(x)) we could use a pipe x |> f() |> g(). That way, computations that depend on previous calculations can be written in a very intuitive way.

You might be wondering why this syntactic sugar is so popular in the Tidyverse (after all, it doesn’t add any new functionality). One of the main reasons is that it allows us to write code that corresponds to the flow of operations. Consider the previous example g(f(x)), where we take x, compute f(), and feed its result into g(). It is much easier to see the order of operations when we write this expression as x |> f() |> g().

If this example was too abstract, let’s revisit our previous code snippet where we computed average delays grouped by month. Using the pipe operator, we can express the same calculation as follows:

flights |>
    group_by(year, month) |>
    summarize(delay=mean(dep_delay, na.rm=TRUE))

# A tibble: 12 × 3
# Groups:   year [1]
    year month delay
   <int> <int> <dbl>
 1  2013     1 10.0 
 2  2013     2 10.8 
 3  2013     3 13.2 
 4  2013     4 13.9 
 5  2013     5 13.0 
 6  2013     6 20.8 
 7  2013     7 21.7 
 8  2013     8 12.6 
 9  2013     9  6.72
10  2013    10  6.24
11  2013    11  5.44
12  2013    12 16.6

Notice how it is immediately clear that each function in this pipeline receives the result from the previous computation as its input. In fact, if we read this code from left to right and top to bottom, we automatically know how the data is processed: first, we take the flights data, then group it by year and month, and finally summarize it by computing the average delay.

Counts

R comes with a plethora of useful summary functions. Computing counts in grouped data is always a good idea, because it can provide important first insights. The n() function can be used to determine group sizes, but note that missing values are included in the counts. You can use sum(!is.na(x)) to count the number of non-missing data points in a (grouped) variable x:

flights |>
    group_by(year, month) |>
    summarize(n=n(), na=sum(is.na(dep_delay)), a=sum(!is.na(dep_delay)))

# A tibble: 12 × 5
# Groups:   year [1]
    year month     n    na     a
   <int> <int> <int> <int> <int>
 1  2013     1 27004   521 26483
 2  2013     2 24951  1261 23690
 3  2013     3 28834   861 27973
 4  2013     4 28330   668 27662
 5  2013     5 28796   563 28233
 6  2013     6 28243  1009 27234
 7  2013     7 29425   940 28485
 8  2013     8 29327   486 28841
 9  2013     9 27574   452 27122
10  2013    10 28889   236 28653
11  2013    11 27268   233 27035
12  2013    12 28135  1025 27110

Let’s take a look at the average arrival delay of each individual airplane (as identified by its unique tail number). First, we drop all rows where either departure delay or arrival delay values are missing. Then we group by tailnum and summarize these groups by computing the mean arrival delay. Finally, because the result of this pipeline is a data frame, we can directly pipe it to ggplot() to create a visualization of the average delay distribution.

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(tailnum) |>
    summarize(delay=mean(arr_delay)) |>
    ggplot(mapping=aes(x=delay)) +
    geom_histogram(bins=100)

This plot shows that most airplanes are delayed by only a few minutes, but there are some airplanes with an average arrival delay of around five hours! In this context, it would be interesting to know the number of data points used for each average – it is likely that airplanes with extremely high average delays are really based on only a few flights. We can use the n() function to find out and create a scatterplot of average delay versus number of data points for each delay:

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(tailnum) |>
    summarize(n=n(), delay=mean(arr_delay)) |>
    ggplot(mapping=aes(x=delay, y=n)) +
    geom_point(alpha=0.25)

As expected, airplanes with extremely high average delays are based on only a single flight. We can actually take a look at the data sorted by average delay in descending order to see the exact numbers:

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(tailnum) |>
    summarize(n=n(), delay=mean(arr_delay)) |>
    arrange(desc(delay))

# A tibble: 4,037 × 3
   tailnum     n delay
   <chr>   <int> <dbl>
 1 N844MH      1  320 
 2 N911DA      1  294 
 3 N922EV      1  276 
 4 N587NW      1  264 
 5 N851NW      1  219 
 6 N928DN      1  201 
 7 N7715E      1  188 
 8 N654UA      1  185 
 9 N665MQ      6  175.
10 N427SW      1  157 
# … with 4,027 more rows

Indeed, we can create a much more reliable plot by excluding outliers. Let’s create the same scatterplot with airplanes that have more than 25 flights:

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(tailnum) |>
    summarize(n=n(), delay=mean(arr_delay)) |>
    filter(n > 25) |>
    ggplot(mapping=aes(x=delay, y=n)) +
    geom_point(alpha=0.25)

Other summary functions

In addition to n(), there are many more useful summary functions, including:

Measures of location: mean() and median()
Measures of spread: sd(), IQR(), and mad()
Measures of rank: min(), max(), and quantile()
Measures of position: first(), last(), and nth()
Counts: n(), n_distinct(), and count()

Let’s go through some examples demonstrating these summary functions. First, we compute mean and median arrival delays for each airplane:

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(tailnum) |>
    summarize(mean=mean(arr_delay), median=median(arr_delay))

# A tibble: 4,037 × 3
   tailnum   mean median
   <chr>    <dbl>  <dbl>
 1 D942DN  31.5     23  
 2 N0EGMQ   9.98    -2  
 3 N10156  12.7      2  
 4 N102UW   2.94    -6.5
 5 N103US  -6.93    -7  
 6 N104UW   1.80    -8.5
 7 N10575  20.7     -2  
 8 N105UW  -0.267   -7  
 9 N107US  -5.73    -9  
10 N108UW  -1.25   -10  
# … with 4,027 more rows

We might be interested in the variability of the distance of destination airports:

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(dest) |>
    summarize(sd=sd(distance)) |>
    arrange(desc(sd))

# A tibble: 104 × 2
   dest     sd
   <chr> <dbl>
 1 EGE   10.5 
 2 SAN   10.4 
 3 SFO   10.2 
 4 HNL   10.0 
 5 SEA    9.98
 6 LAS    9.91
 7 PDX    9.87
 8 PHX    9.86
 9 LAX    9.66
10 IND    9.46
# … with 94 more rows

How about finding out the first and last departure times on each day?

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(year, month, day) |>
    summarize(first=min(dep_time), last=max(dep_time))

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day first  last
   <int> <int> <int> <int> <int>
 1  2013     1     1   517  2356
 2  2013     1     2    42  2354
 3  2013     1     3    32  2349
 4  2013     1     4    25  2358
 5  2013     1     5    14  2357
 6  2013     1     6    16  2355
 7  2013     1     7    49  2359
 8  2013     1     8   454  2351
 9  2013     1     9     2  2252
10  2013     1    10     3  2320
# … with 355 more rows

Since the data is actually sorted by dep_time, we can also take the first and last entry in each group to get the earliest and latest departure times:

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(year, month, day) |>
    summarize(first=first(dep_time), last=last(dep_time))

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day first  last
   <int> <int> <int> <int> <int>
 1  2013     1     1   517  2356
 2  2013     1     2    42  2354
 3  2013     1     3    32  2349
 4  2013     1     4    25  2358
 5  2013     1     5    14  2357
 6  2013     1     6    16  2355
 7  2013     1     7    49  2359
 8  2013     1     8   454  2351
 9  2013     1     9     2  2252
10  2013     1    10     3  2320
# … with 355 more rows

Which destinations have the highest number of carriers?

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(dest) |>
    summarize(carriers=n_distinct(carrier)) |>
    arrange(desc(carriers))

# A tibble: 104 × 2
   dest  carriers
   <chr>    <int>
 1 ATL          7
 2 BOS          7
 3 CLT          7
 4 ORD          7
 5 TPA          7
 6 AUS          6
 7 DCA          6
 8 DTW          6
 9 IAD          6
10 MSP          6
# … with 94 more rows

We have computed the number of flights of each airplane previously as follows:

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    group_by(tailnum) |>
    summarize(n=n())

# A tibble: 4,037 × 2
   tailnum     n
   <chr>   <int>
 1 D942DN      4
 2 N0EGMQ    352
 3 N10156    145
 4 N102UW     48
 5 N103US     46
 6 N104UW     46
 7 N10575    269
 8 N105UW     45
 9 N107US     41
10 N108UW     60
# … with 4,027 more rows

If that’s all you want, there is a shorthand notation for this operation using the count() function:

flights |>
    filter(!is.na(dep_delay), !is.na(arr_delay)) |>
    count(tailnum)

# A tibble: 4,037 × 2
   tailnum     n
   <chr>   <int>
 1 D942DN      4
 2 N0EGMQ    352
 3 N10156    145
 4 N102UW     48
 5 N103US     46
 6 N104UW     46
 7 N10575    269
 8 N105UW     45
 9 N107US     41
10 N108UW     60
# … with 4,027 more rows

Ungrouping

Sometimes it is necessary to remove grouping and return to the full dataset. This can be achieved with the ungroup() function.

Exercises

See https://r4ds.had.co.nz/transform.html#exercises-12.