The dplyr summarize Function

summarize()

The flights data will be used to demonstrate what this powerful function can do. Recall what the flights dataset looks like:

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

summarize() applies a function (or functions) to the variables in a data table to collapse it to a single row, showing the results of that function (or those functions). For example, this calculates the mean value of the departure delay (dep_delay) column:

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6
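
To see why na.rm = TRUE appears in that call, here is a minimal sketch on a made-up table (the toy tibble and its values are hypothetical, not taken from flights). It assumes only that dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data: four departure delays in minutes, one missing
toy <- tibble(dep_delay = c(2, 4, NA, -6))

# Without na.rm = TRUE the missing value propagates, so the mean is NA
with_na <- summarize(toy, delay = mean(dep_delay))

# With na.rm = TRUE the NA is dropped first: (2 + 4 - 6) / 3 = 0
without_na <- summarize(toy, delay = mean(dep_delay, na.rm = TRUE))

with_na
without_na
```

Either way the result is a one-row tibble; only the value in the delay column differs.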

You can include multiple functions in a summarize() call. There will be one “column” per function in the resulting (single-row) table. Note that I break the function arguments across multiple lines for better readability:

summarize(flights, 
          delay = mean(dep_delay, na.rm = TRUE),
          count = n())

## # A tibble: 1 x 2
##   delay  count
##   <dbl>  <int>
## 1  12.6 336776
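
Any function that reduces a column to a single value can be used the same way. As a rough sketch, again on a hypothetical toy table rather than flights, here are a few base R summaries side by side:

```r
library(dplyr)

# Hypothetical toy data: five delays in minutes, one missing
toy <- tibble(dep_delay = c(2, 4, NA, -6, 10))

stats <- summarize(toy,
                   mean_delay   = mean(dep_delay, na.rm = TRUE),    # (2 + 4 - 6 + 10) / 4
                   median_delay = median(dep_delay, na.rm = TRUE),  # middle of -6, 2, 4, 10
                   max_delay    = max(dep_delay, na.rm = TRUE),
                   count        = n())                              # rows, including the NA row
stats
```

The result is still a single row, now with four columns, one per named argument.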

Grouping

While interesting, so far that is not very enlightening, and there are other (albeit cruder) ways to get that same information. Where summarize() really shines is when it is used together with the group_by() function.

group_by() tells R to think of the table as a collection of sub-tables, where each sub-table is associated with a specific value of one of the variables (columns). For example, let’s group the flights data by airline:

group_by(flights, carrier)

## # A tibble: 336,776 x 19
## # Groups:   carrier [16]
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Applying group_by() to the table did not change the content of the table, but as you can see in the second line of the output, R now knows that it should group the rows by carrier, and it shows there are 16 values for the carrier variable.
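
If you want to check the grouping without reading the printed header, dplyr provides a few helpers. The sketch below uses a hypothetical five-row table (the carrier codes and delays are made up) to show that grouping only adds metadata:

```r
library(dplyr)

# Hypothetical toy data: five flights from three carriers
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(2, 4, -1, NA, 10)
)

by_carrier <- group_by(toy, carrier)

group_vars(by_carrier)          # name(s) of the grouping column(s)
n_groups(by_carrier)            # number of distinct groups
nrow(by_carrier) == nrow(toy)   # the rows themselves are unchanged

# ungroup() removes the grouping metadata again
no_groups <- ungroup(by_carrier)
group_vars(no_groups)
```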

group_by() is almost always used together with summarize(). If the same summarize() call used before is applied to the grouped table, a much more interesting summary results, because now the average delays of the different carriers can be compared:

grouped_flights <- group_by(flights, carrier)
summarize(grouped_flights, 
          delay = mean(dep_delay, na.rm = TRUE),
          count = n())

## # A tibble: 16 x 3
##    carrier delay count
##    <chr>   <dbl> <int>
##  1 9E      16.7  18460
##  2 AA       8.59 32729
##  3 AS       5.80   714
##  4 B6      13.0  54635
##  5 DL       9.26 48110
##  6 EV      20.0  54173
##  7 F9      20.2    685
##  8 FL      18.7   3260
##  9 HA       4.90   342
## 10 MQ      10.6  26397
## 11 OO      12.6     32
## 12 UA      12.1  58665
## 13 US       3.78 20536
## 14 VX      12.9   5162
## 15 WN      17.7  12275
## 16 YV      19.0    601
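
One detail worth keeping in mind when reading the count column: n() counts rows, so it includes flights whose dep_delay is missing. The following sketch, on a hypothetical toy table small enough to check by hand, puts n() next to a count of non-missing delays:

```r
library(dplyr)

# Hypothetical toy data: five flights from three carriers, one missing delay
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(2, 4, -1, NA, 10)
)

res <- summarize(group_by(toy, carrier),
                 delay       = mean(dep_delay, na.rm = TRUE),
                 count       = n(),                    # all rows in the group
                 non_missing = sum(!is.na(dep_delay))) # rows that enter the mean
res
# carrier AA: delay  3, count 2, non_missing 2
# carrier DL: delay 10, count 1, non_missing 1
# carrier UA: delay -1, count 2, non_missing 1
```

For UA the mean is based on a single flight even though count reports two.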

Pipe

The pipe is one of my favorite operators. Because dplyr functions output the same type (tibble) that they take as input, they can be chained together, which avoids creating intermediate, temporary variables (like “grouped_flights” above).

There is another way to chain the function calls together without using intermediate variables, but the pipe operator %>% yields the most readable code. Here the pipe operator is used to redo the previous summarization:

flights %>%
  group_by(carrier) %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE),
            count = n())

## # A tibble: 16 x 3
##    carrier delay count
##    <chr>   <dbl> <int>
##  1 9E      16.7  18460
##  2 AA       8.59 32729
##  3 AS       5.80   714
##  4 B6      13.0  54635
##  5 DL       9.26 48110
##  6 EV      20.0  54173
##  7 F9      20.2    685
##  8 FL      18.7   3260
##  9 HA       4.90   342
## 10 MQ      10.6  26397
## 11 OO      12.6     32
## 12 UA      12.1  58665
## 13 US       3.78 20536
## 14 VX      12.9   5162
## 15 WN      17.7  12275
## 16 YV      19.0    601

The other way to chain the function calls is to nest them (recall print3num(threenum…)). This may seem shorter in this example, but it is harder to see what is going on at a quick glance, and that becomes more onerous as the call chain grows.

summarize(group_by(flights, carrier),
          delay = mean(dep_delay, na.rm = TRUE),
          count = n())

## # A tibble: 16 x 3
##    carrier delay count
##    <chr>   <dbl> <int>
##  1 9E      16.7  18460
##  2 AA       8.59 32729
##  3 AS       5.80   714
##  4 B6      13.0  54635
##  5 DL       9.26 48110
##  6 EV      20.0  54173
##  7 F9      20.2    685
##  8 FL      18.7   3260
##  9 HA       4.90   342
## 10 MQ      10.6  26397
## 11 OO      12.6     32
## 12 UA      12.1  58665
## 13 US       3.78 20536
## 14 VX      12.9   5162
## 15 WN      17.7  12275
## 16 YV      19.0    601
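
The three styles are interchangeable because x %>% f(y) is simply another way of writing f(x, y). As a small check, here is a sketch that runs all three on a hypothetical toy table (made-up carriers and delays) and compares the results:

```r
library(dplyr)

# Hypothetical toy data: five flights from three carriers
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(2, 4, -1, NA, 10)
)

# 1. Intermediate variable
grouped_toy   <- group_by(toy, carrier)
with_variable <- summarize(grouped_toy, delay = mean(dep_delay, na.rm = TRUE))

# 2. Pipe: the left-hand side becomes the first argument of the next call
with_pipe <- toy %>%
  group_by(carrier) %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE))

# 3. Nested calls, read from the inside out
with_nesting <- summarize(group_by(toy, carrier),
                          delay = mean(dep_delay, na.rm = TRUE))

# Compare the three results
identical(with_variable, with_pipe)
identical(with_pipe, with_nesting)
```

Which style to use is a readability choice; the computation is the same.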

Kevin Fowler

February 3, 2019
