In this chapter, we will introduce the dplyr package, another core
member of the Tidyverse. Because we still have not discussed importing
our own datasets, we will showcase data manipulation with dplyr with the
nycflights13::flights dataset. We will also apply our
knowledge from the previous chapter on ggplot2 by creating various
visualizations.
In this chapter, we will need the following packages:
library(ggplot2)
library(dplyr)
library(nycflights13)
After activating dplyr, a message informs us that
several functions from base R have been masked, namely
filter(), lag(), intersect(),
setdiff(), setequal(), and
union(). This means that calling filter()
after dplyr has been activated does not refer to the base R
function from the stats package anymore. If you still want
to use that function, you need to specify its full name
stats::filter().
The nycflights13
package contains five datasets on all 336,776 flights that departed from
New York City in 2013. We will only be using the flights
data frame in this chapter. Before we start, I recommend that you take a
look at the documentation (?flights) to find out what each
column (variable) represents.
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 1 1 517 515 2 830 819 11 UA 1545
2 2013 1 1 533 529 4 850 830 20 UA 1714
3 2013 1 1 542 540 2 923 850 33 AA 1141
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
5 2013 1 1 554 600 -6 812 837 -25 DL 461
6 2013 1 1 554 558 -4 740 728 12 UA 1696
7 2013 1 1 555 600 -5 913 854 19 B6 507
8 2013 1 1 557 600 -3 709 723 -14 EV 5708
9 2013 1 1 557 600 -3 838 846 -8 B6 79
10 2013 1 1 558 600 -2 753 745 8 AA 301
# … with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Take note of the different column data types:
<int> and <dbl> are numeric data
types that encode integers and doubles (real numbers), respectively.
There are also several character columns (<chr>) and
one column containing a date and a time (<dttm>).
The dplyr package offers six functions that form the verbs of a powerful data manipulation language. With these six functions, we can solve almost any data manipulation problem. In particular, the functions perform the following tasks:
filter() selects observations (rows) by certain
values.arrange() arranges rows in a different order.select() picks specific variables (columns).mutate() creates new variables based on existing
columns.summarize() aggregates multiple data points to a single
value.group_by() changes subsequent operations to work on
grouped subsets instead of the entire dataset.We will discuss each function using example operations on the
nycflights::flights dataset. All functions share some
common consistent behavior, because they take a data frame as their
first argument and always return a new data frame as their result.
Therefore, dplyr functions never modify the original data
frame.
We can filter rows in a data frame based on their values in specific
columns with filter(). As mentioned previously, the first
argument is the data frame. Additional arguments specify how rows should
be filtered. For example, we could filter our data to get all flights
that departed on January 1st:
filter(flights, month == 1, day == 1)
# A tibble: 842 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 1 1 517 515 2 830 819 11 UA 1545
2 2013 1 1 533 529 4 850 830 20 UA 1714
3 2013 1 1 542 540 2 923 850 33 AA 1141
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
5 2013 1 1 554 600 -6 812 837 -25 DL 461
6 2013 1 1 554 558 -4 740 728 12 UA 1696
7 2013 1 1 555 600 -5 913 854 19 B6 507
8 2013 1 1 557 600 -3 709 723 -14 EV 5708
9 2013 1 1 557 600 -3 838 846 -8 B6 79
10 2013 1 1 558 600 -2 753 745 8 AA 301
# … with 832 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The result is a new data frame consisting of 842 rows with flights on
that date. Notice that we can specify column names directly without
having to prepend flights$, because filter()
is smart enough to know that we are referring to column names inside the
flights data frame.
R supports the usual comparison operators:
== (equal)!= (not equal)> (greater than)>= (greater than or equal to)< (less than)<= (less than or equal to)When we pass two or more comparison arguments, filter()
combines them with “and” (as in the previous example). In addition, we
can combine comparisons with these logical operators:
& (and)| (or)! (not)xor() (exclusive or)We could rewrite the previous example as follows:
filter(flights, month == 1 & day == 1)
Here’s how we could find all flights departing in November or December:
filter(flights, month == 11 | month == 12)
The %in% operator is a handy shortcut for chaining
several comparisons with “or”. The previous example could be rewritten
as:
filter(flights, month %in% c(11, 12))
Filtering data with filter() will automatically drop all
NA values. If you want to retain missing values you need to
be explicit:
df = tibble(x=c(1, NA, 3))
filter(df, x > 1) # NA values are automatically dropped
# A tibble: 1 × 1
x
<dbl>
1 3
filter(df, is.na(x) | x > 1)
# A tibble: 2 × 1
x
<dbl>
1 NA
2 3
It is often useful to arrange rows in a specific order. Similar to
filter(), the arrange() function takes a data
frame as its first argument. One or more additional arguments specify
how sorting should be performed. Here’s how to sort the data by
scheduled departure time as well as year, month, and day:
arrange(flights, year, month, day, sched_dep_time)
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 1 1 517 515 2 830 819 11 UA 1545
2 2013 1 1 533 529 4 850 830 20 UA 1714
3 2013 1 1 542 540 2 923 850 33 AA 1141
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
5 2013 1 1 554 558 -4 740 728 12 UA 1696
6 2013 1 1 559 559 0 702 706 -4 B6 1806
7 2013 1 1 554 600 -6 812 837 -25 DL 461
8 2013 1 1 555 600 -5 913 854 19 B6 507
9 2013 1 1 557 600 -3 709 723 -14 EV 5708
10 2013 1 1 557 600 -3 838 846 -8 B6 79
# … with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
By default, arrange() uses ascending order, but we can
use the desc() function to sort in descending order:
arrange(flights, desc(dep_delay))
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 1 9 641 900 1301 1242 1530 1272 HA 51
2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535
3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695
4 2013 9 20 1139 1845 1014 1457 2210 1007 AA 177
5 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075
6 2013 4 10 1100 1900 960 1342 2211 931 DL 2391
7 2013 3 17 2321 810 911 135 1020 915 DL 2119
8 2013 6 27 959 1900 899 1236 2226 850 DL 2007
9 2013 7 22 2257 759 898 121 1026 895 DL 2047
10 2013 12 5 756 1700 896 1058 2020 878 AA 172
# … with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note that missing values will always be sorted at the end.
Often, only a subset of columns in the original data is needed for a
particular analysis. The select() function picks only those
columns you specify as arguments. Note that ranges can be specified with
the : operator, and negative values indicate the complement
of a specific selection.
select(flights, year, month, day) # columns year, month, and day
select(flights, year:day) # columns year through day
select(flights, -(year:day)) # all columns except year through day
The following functions are very useful in combination with
select():
starts_with("abc") finds names that start with
“abc”.ends_with("xyz") finds names that end with “xyz”.contains("ijk") finds names that contain “ijk”.matches("(.)\\1") finds names that match the regular
expression (.)\\1 (repeated characters in this
particular example). Regular expressions are a powerful tool for
matching almost any kind of pattern, but we do not have time to cover
them in this workshop.num_range("x", 1:3) finds names x1,
x2, and x3.Here are two examples with the flights dataset:
select(flights, contains("arr")) # select all columns that contain "arr"
# A tibble: 336,776 × 4
arr_time sched_arr_time arr_delay carrier
<int> <int> <dbl> <chr>
1 830 819 11 UA
2 850 830 20 UA
3 923 850 33 AA
4 1004 1022 -18 B6
5 812 837 -25 DL
6 740 728 12 UA
7 913 854 19 B6
8 709 723 -14 EV
9 838 846 -8 B6
10 753 745 8 AA
# … with 336,766 more rows
select(flights, ends_with("time")) # select all columns that end with "time"
# A tibble: 336,776 × 5
dep_time sched_dep_time arr_time sched_arr_time air_time
<int> <int> <int> <int> <dbl>
1 517 515 830 819 227
2 533 529 850 830 227
3 542 540 923 850 160
4 544 545 1004 1022 183
5 554 600 812 837 116
6 554 558 740 728 150
7 555 600 913 854 158
8 557 600 709 723 53
9 557 600 838 846 140
10 558 600 753 745 138
# … with 336,766 more rows
Although select() can be used to rename columns, it is
often more convenient to use the rename() function instead.
Whereas select() drops all columns not explicity mentioned,
rename() keeps all of those columns.
rename(flights, tail_num=tailnum) # rename tailnum column to tail_num
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 1 1 517 515 2 830 819 11 UA 1545
2 2013 1 1 533 529 4 850 830 20 UA 1714
3 2013 1 1 542 540 2 923 850 33 AA 1141
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
5 2013 1 1 554 600 -6 812 837 -25 DL 461
6 2013 1 1 554 558 -4 740 728 12 UA 1696
7 2013 1 1 555 600 -5 913 854 19 B6 507
8 2013 1 1 557 600 -3 709 723 -14 EV 5708
9 2013 1 1 557 600 -3 838 846 -8 B6 79
10 2013 1 1 558 600 -2 753 745 8 AA 301
# … with 336,766 more rows, and 8 more variables: tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Finally, the everything() function is useful if you want
to reorder specific columns (for example moving them to the beginning of
the data frame):
select(flights, time_hour, air_time, everything()) # move time_hour and air_time columns to the beginning
# A tibble: 336,776 × 19
time_hour air_time year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<dttm> <dbl> <int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013-01-01 05:00:00 227 2013 1 1 517 515 2 830 819
2 2013-01-01 05:00:00 227 2013 1 1 533 529 4 850 830
3 2013-01-01 05:00:00 160 2013 1 1 542 540 2 923 850
4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1 1004 1022
5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6 812 837
6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4 740 728
7 2013-01-01 06:00:00 158 2013 1 1 555 600 -5 913 854
8 2013-01-01 06:00:00 53 2013 1 1 557 600 -3 709 723
9 2013-01-01 06:00:00 140 2013 1 1 557 600 -3 838 846
10 2013-01-01 06:00:00 138 2013 1 1 558 600 -2 753 745
# … with 336,766 more rows, and 9 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>
The mutate() function adds new columns based on values
from existing columns. Because new columns are always added to the end,
we will first create a subset of the flights data frame
with fewer columns to better see the effect of
mutate().
(df = select(flights, year:day, ends_with("delay"), distance, air_time))
# A tibble: 336,776 × 7
year month day dep_delay arr_delay distance air_time
<int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227
2 2013 1 1 4 20 1416 227
3 2013 1 1 2 33 1089 160
4 2013 1 1 -1 -18 1576 183
5 2013 1 1 -6 -25 762 116
6 2013 1 1 -4 12 719 150
7 2013 1 1 -5 19 1065 158
8 2013 1 1 -3 -14 229 53
9 2013 1 1 -3 -8 944 140
10 2013 1 1 -2 8 733 138
# … with 336,766 more rows
mutate(df, gain=arr_delay - dep_delay, hours=air_time / 60, speed=distance / hours)
# A tibble: 336,776 × 10
year month day dep_delay arr_delay distance air_time gain hours speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 3.78 370.
2 2013 1 1 4 20 1416 227 16 3.78 374.
3 2013 1 1 2 33 1089 160 31 2.67 408.
4 2013 1 1 -1 -18 1576 183 -17 3.05 517.
5 2013 1 1 -6 -25 762 116 -19 1.93 394.
6 2013 1 1 -4 12 719 150 16 2.5 288.
7 2013 1 1 -5 19 1065 158 24 2.63 404.
8 2013 1 1 -3 -14 229 53 -11 0.883 259.
9 2013 1 1 -3 -8 944 140 -5 2.33 405.
10 2013 1 1 -2 8 733 138 10 2.3 319.
# … with 336,766 more rows
Note that we can even refer to columns that we have only just created
(the hours column in the previous example is immediately
used for creating the speed column).
Whereas mutate() always keeps all columns,
transmute() keeps only the new ones.
Summarizing variables is important in almost any data analysis task.
It is most useful on grouped data, but to showcase the
summarize() function we will start by applying it on the
whole (ungrouped) data first:
summarize(flights, delay=mean(dep_delay, na.rm=TRUE))
# A tibble: 1 × 1
delay
<dbl>
1 12.6
The group_by function partitions the dataset into
groups. Applying summarize() to grouped data will
automatically compute the summary for each group. In the following
example, we compute the average delay for each day:
by_day = group_by(flights, year, month, day)
summarize(by_day, delay=mean(dep_delay, na.rm=TRUE))
# A tibble: 365 × 4
# Groups: year, month [12]
year month day delay
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# … with 355 more rows
Or what about the average delay per month?
by_month = group_by(flights, year, month)
summarize(by_month, delay=mean(dep_delay, na.rm=TRUE))
# A tibble: 12 × 3
# Groups: year [1]
year month delay
<int> <int> <dbl>
1 2013 1 10.0
2 2013 2 10.8
3 2013 3 13.2
4 2013 4 13.9
5 2013 5 13.0
6 2013 6 20.8
7 2013 7 21.7
8 2013 8 12.6
9 2013 9 6.72
10 2013 10 6.24
11 2013 11 5.44
12 2013 12 16.6
Seems like November 2013 was a good time to fly. One other thing to
notice in the previous example is that we needed to create a new
variable by_month that contained the grouped data. However,
we actually never really needed this variable except as input to the
summarize() function. In the next section, we will
introduce a method that allows us to get rid of such temporary
(unneeded) variables: the pipe operator.
One of the key properties of Tidyverse functions is their support for
the pipe operator %>%, which is defined in the magrittr package. R has
gained a native pipe operator as of version 4.1. It is written as
|> and we can use it instead of %>% in
almost all cases. We will use the native pipe operator
|> throughout the course material, but if you explicitly
want your code to run on R versions prior to 4.1, you need to use the
magrittr pipe %>%.
In RStudio, you can use the keyboard shortcut ⌘+⇧+m (macOS) or Ctrl+⇧+m (Windows and Linux) to insert the pipe operator (you can select between the native and magrittr variants in the “Code” section of RStudio’s preferences).
The idea behind piping is simple: instead of having to store an
intermediate value in a variable, the pipe operator forwards it as the
first argument to the next function. This means that instead of writing
f(x), we can express the same command as
x |> f(). For example, mean(x) could also
be written as x |> mean(). This is not really helpful
yet, but if we wanted to feed the result of f(x) into
another function g(), instead of writing
g(f(x)) we could use a pipe
x |> f() |> g(). That way, computations that depend
on previous calculations can be written in a very intuitive way.
You might be wondering why this syntactic sugar is so popular in the
Tidyverse (after all, it doesn’t add any new functionality). One of the
main reasons is that it allows us to write code that corresponds to the
flow of operations. Consider the previous example g(f(x)),
where we take x, compute f(), and feed its
result into g(). It is much easier to see the order of
operations when we write this expression as
x |> f() |> g().
If this example was too abstract, let’s revisit our previous code snippet where we computed average delays grouped by month. Using the pipe operator, we can express the same calculation as follows:
flights |>
group_by(year, month) |>
summarize(delay=mean(dep_delay, na.rm=TRUE))
# A tibble: 12 × 3
# Groups: year [1]
year month delay
<int> <int> <dbl>
1 2013 1 10.0
2 2013 2 10.8
3 2013 3 13.2
4 2013 4 13.9
5 2013 5 13.0
6 2013 6 20.8
7 2013 7 21.7
8 2013 8 12.6
9 2013 9 6.72
10 2013 10 6.24
11 2013 11 5.44
12 2013 12 16.6
Notice how it is immediately clear that each function in this
pipeline receives the result from the previous computation as its input.
In fact, if we read this code from left to right and top to bottom, we
automatically know how the data is processed: first, we take the
flights data, then group it by year and month, and finally
summarize it by computing the average delay.
R comes with a plethora of useful summary functions. Computing counts
in grouped data is always a good idea, because it can provide important
first insights. The n() function can be used to determine
group sizes, but note that missing values are included in the counts.
You can use sum(!is.na(x)) to count the number of
non-missing data points in a (grouped) variable x:
flights |>
group_by(year, month) |>
summarize(n=n(), na=sum(is.na(dep_delay)), a=sum(!is.na(dep_delay)))
# A tibble: 12 × 5
# Groups: year [1]
year month n na a
<int> <int> <int> <int> <int>
1 2013 1 27004 521 26483
2 2013 2 24951 1261 23690
3 2013 3 28834 861 27973
4 2013 4 28330 668 27662
5 2013 5 28796 563 28233
6 2013 6 28243 1009 27234
7 2013 7 29425 940 28485
8 2013 8 29327 486 28841
9 2013 9 27574 452 27122
10 2013 10 28889 236 28653
11 2013 11 27268 233 27035
12 2013 12 28135 1025 27110
Let’s take a look at the average arrival delay of each individual
airplane (as identified by its unique tail number). First, we drop all
rows where either departure delay or arrival delay values are missing.
Then we group by tailnum and summarize these groups by
computing the mean arrival delay. Finally, because the result of this
pipeline is a data frame, we can directly pipe it to
ggplot() to create a visualization of the average delay
distribution.
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(tailnum) |>
summarize(delay=mean(arr_delay)) |>
ggplot(mapping=aes(x=delay)) +
geom_histogram(bins=100)
This plot shows that most airplanes are delayed by only a few
minutes, but there are some airplanes with an average arrival delay of
around five hours! In this context, it would be interesting to know the
number of data points used for each average – it is likely that
airplanes with extremely high average delays are really based on only a
few flights. We can use the n() function to find out and
create a scatterplot of average delay versus number of data points for
each delay:
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(tailnum) |>
summarize(n=n(), delay=mean(arr_delay)) |>
ggplot(mapping=aes(x=delay, y=n)) +
geom_point(alpha=0.25)
As expected, airplanes with extremely high average delays are based on only a single flight. We can actually take a look at the data sorted by average delay in descending order to see the exact numbers:
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(tailnum) |>
summarize(n=n(), delay=mean(arr_delay)) |>
arrange(desc(delay))
# A tibble: 4,037 × 3
tailnum n delay
<chr> <int> <dbl>
1 N844MH 1 320
2 N911DA 1 294
3 N922EV 1 276
4 N587NW 1 264
5 N851NW 1 219
6 N928DN 1 201
7 N7715E 1 188
8 N654UA 1 185
9 N665MQ 6 175.
10 N427SW 1 157
# … with 4,027 more rows
Indeed, we can create a much more reliable plot by excluding outliers. Let’s create the same scatterplot with airplanes that have more than 25 flights:
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(tailnum) |>
summarize(n=n(), delay=mean(arr_delay)) |>
filter(n > 25) |>
ggplot(mapping=aes(x=delay, y=n)) +
geom_point(alpha=0.25)
In addition to n(), there are many more useful summary
functions, including:
mean() and
median()sd(), IQR(), and
mad()min(), max(), and
quantile()first(), last(), and
nth()n(), n_distinct(), and
count()Let’s go through some examples demonstrating these summary functions. First, we compute mean and median arrival delays for each airplane:
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(tailnum) |>
summarize(mean=mean(arr_delay), median=median(arr_delay))
# A tibble: 4,037 × 3
tailnum mean median
<chr> <dbl> <dbl>
1 D942DN 31.5 23
2 N0EGMQ 9.98 -2
3 N10156 12.7 2
4 N102UW 2.94 -6.5
5 N103US -6.93 -7
6 N104UW 1.80 -8.5
7 N10575 20.7 -2
8 N105UW -0.267 -7
9 N107US -5.73 -9
10 N108UW -1.25 -10
# … with 4,027 more rows
We might be interested in the variability of the distance of destination airports:
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(dest) |>
summarize(sd=sd(distance)) |>
arrange(desc(sd))
# A tibble: 104 × 2
dest sd
<chr> <dbl>
1 EGE 10.5
2 SAN 10.4
3 SFO 10.2
4 HNL 10.0
5 SEA 9.98
6 LAS 9.91
7 PDX 9.87
8 PHX 9.86
9 LAX 9.66
10 IND 9.46
# … with 94 more rows
How about finding out the first and last departure times on each day?
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(year, month, day) |>
summarize(first=min(dep_time), last=max(dep_time))
# A tibble: 365 × 5
# Groups: year, month [12]
year month day first last
<int> <int> <int> <int> <int>
1 2013 1 1 517 2356
2 2013 1 2 42 2354
3 2013 1 3 32 2349
4 2013 1 4 25 2358
5 2013 1 5 14 2357
6 2013 1 6 16 2355
7 2013 1 7 49 2359
8 2013 1 8 454 2351
9 2013 1 9 2 2252
10 2013 1 10 3 2320
# … with 355 more rows
Since the data is actually sorted by dep_time, we can
also take the first and last entry in each group to get the earliest and
latest departure times:
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(year, month, day) |>
summarize(first=first(dep_time), last=last(dep_time))
# A tibble: 365 × 5
# Groups: year, month [12]
year month day first last
<int> <int> <int> <int> <int>
1 2013 1 1 517 2356
2 2013 1 2 42 2354
3 2013 1 3 32 2349
4 2013 1 4 25 2358
5 2013 1 5 14 2357
6 2013 1 6 16 2355
7 2013 1 7 49 2359
8 2013 1 8 454 2351
9 2013 1 9 2 2252
10 2013 1 10 3 2320
# … with 355 more rows
Which destinations have the highest number of carriers?
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(dest) |>
summarize(carriers=n_distinct(carrier)) |>
arrange(desc(carriers))
# A tibble: 104 × 2
dest carriers
<chr> <int>
1 ATL 7
2 BOS 7
3 CLT 7
4 ORD 7
5 TPA 7
6 AUS 6
7 DCA 6
8 DTW 6
9 IAD 6
10 MSP 6
# … with 94 more rows
We have computed the number of flights of each airplane previously as follows:
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
group_by(tailnum) |>
summarize(n=n())
# A tibble: 4,037 × 2
tailnum n
<chr> <int>
1 D942DN 4
2 N0EGMQ 352
3 N10156 145
4 N102UW 48
5 N103US 46
6 N104UW 46
7 N10575 269
8 N105UW 45
9 N107US 41
10 N108UW 60
# … with 4,027 more rows
If that’s all you want, there is a shorthand notation for this
operation using the count() function:
flights |>
filter(!is.na(dep_delay), !is.na(arr_delay)) |>
count(tailnum)
# A tibble: 4,037 × 2
tailnum n
<chr> <int>
1 D942DN 4
2 N0EGMQ 352
3 N10156 145
4 N102UW 48
5 N103US 46
6 N104UW 46
7 N10575 269
8 N105UW 45
9 N107US 41
10 N108UW 60
# … with 4,027 more rows
Sometimes it is necessary to remove grouping and return to the full
dataset. This can be achieved with the ungroup()
function.