Tristan Mahr, @tjmahr
March 18, 2015
Madison R Users Group
Repository for this talk: https://github.com/tjmahr/MadR_Pipelines
We just covered pipelines. Now we'll use pipelines to work on tabular data.
library("magrittr")
library("dplyr")
library("nycflights13")
The examples for the main dplyr verbs were lifted from the introductory vignette.
Convert data to a tbl_df so that it uses dplyr's nice print method.
flights <- tbl_df(flights)
flights
Source: local data frame [336,776 x 16]
year month day dep_time dep_delay
1 2013 1 1 517 2
2 2013 1 1 533 4
3 2013 1 1 542 2
4 2013 1 1 544 -1
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
Use glimpse to few some values in each column.
glimpse(flights)
Observations: 336776
Variables:
$ year (int) 2013, 2013, 2013, 2013...
$ month (int) 1, 1, 1, 1, 1, 1, 1, 1...
$ day (int) 1, 1, 1, 1, 1, 1, 1, 1...
$ dep_time (int) 517, 533, 542, 544, 55...
$ dep_delay (dbl) 2, 4, 2, -1, -6, -4, -...
$ arr_time (int) 830, 850, 923, 1004, 8...
$ arr_delay (dbl) 11, 20, 33, -18, -25, ...
$ carrier (chr) "UA", "UA", "AA", "B6"...
$ tailnum (chr) "N14228", "N24211", "N...
$ flight (int) 1545, 1714, 1141, 725,...
$ origin (chr) "EWR", "LGA", "JFK", "...
$ dest (chr) "IAH", "IAH", "MIA", "...
$ air_time (dbl) 227, 227, 160, 183, 11...
$ distance (dbl) 1400, 1416, 1089, 1576...
$ hour (dbl) 5, 5, 5, 5, 5, 5, 5, 5...
$ minute (dbl) 17, 33, 42, 44, 54, 54...
Is dpylr's print method really that useful since it's just printing four rows of four columns on these slides?
It normally fills width of the console. I cheated to accommodate the width of these slides, by setting some hidden flags:
options(
width = 45,
dplyr.width = 44,
dplyr.print_min = 4,
dplyr.print_max = 4)
How many flights flew to Madison in 2013?
flights %>%
filter(dest == "MSN")
Source: local data frame [572 x 16]
year month day dep_time dep_delay
1 2013 1 1 1353 -4
2 2013 1 2 1422 25
3 2013 1 3 1415 24
4 2013 1 4 1345 -5
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
How many flights flew to Madison in first week of January?
# Comma separated conditions are combined with '&'
flights %>%
filter(dest == "MSN", month == 1, day <= 7)
Source: local data frame [6 x 16]
year month day dep_time dep_delay
1 2013 1 1 1353 -4
2 2013 1 2 1422 25
3 2013 1 3 1415 24
4 2013 1 4 1345 -5
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
Commas in the filter statement are implicit
&(and) operators. Is there anything similar for|(or)?
Logical or statements are supported, but there's no shorthand.
flights %>%
filter(dest == "MSN" | dest == "ORD" | dest == "MDW")
For more complicated checks, I would try a set operation.
flights %>%
filter(is.element(dest, c("MSN", "ORD", "MDW")))
Sort by which airport they departed from in NYC, then year, month, day.
flights %>%
arrange(origin, year, month, day)
Source: local data frame [336,776 x 16]
year month day dep_time dep_delay
1 2013 1 1 517 2
2 2013 1 1 554 -4
3 2013 1 1 555 -5
4 2013 1 1 558 -2
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
Find longest delayed flights to Madison.
flights %>%
filter(dest == "MSN") %>%
arrange(desc(dep_delay))
Source: local data frame [572 x 16]
year month day dep_time dep_delay
1 2013 12 5 2000 340
2 2013 11 17 45 310
3 2013 3 8 1907 302
4 2013 9 12 1841 291
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
Select the columns you want.
flights %>%
select(origin, year, month, day)
Source: local data frame [336,776 x 4]
origin year month day
1 EWR 2013 1 1
2 LGA 2013 1 1
3 JFK 2013 1 1
4 JFK 2013 1 1
.. ... ... ... ...
select has many helper functions. See ?select.
flights %>%
select(origin, year:day, starts_with("dep"))
Source: local data frame [336,776 x 6]
origin year month day dep_time dep_delay
1 EWR 2013 1 1 517 2
2 LGA 2013 1 1 533 4
3 JFK 2013 1 1 542 2
4 JFK 2013 1 1 544 -1
.. ... ... ... ... ... ...
We can drop columns by “negating” the name. Since helpers give us column names, we can negate them too.
flights %>%
select(-dest, -starts_with("arr"),
-ends_with("time"))
Source: local data frame [336,776 x 11]
year month day dep_delay carrier tailnum
1 2013 1 1 2 UA N14228
2 2013 1 1 4 UA N24211
3 2013 1 1 2 AA N619AA
4 2013 1 1 -1 B6 N804JB
.. ... ... ... ... ... ...
Variables not shown: flight (int), origin
(chr), distance (dbl), hour (dbl), minute
(dbl)
tbl_df - nice print methodglimpse - some of each columnfilter - subsettingarrange - sorting (desc to reverse the sort)select - picking (and omiting) columnsRename columns with rename(NewName = OldName). To keep the order
correct, read/remember the renaming = as “was”.
flights %>%
rename(y = year, m = month, d = day)
Source: local data frame [336,776 x 16]
y m d dep_time dep_delay arr_time
1 2013 1 1 517 2 830
2 2013 1 1 533 4 850
3 2013 1 1 542 2 923
4 2013 1 1 544 -1 1004
.. ... . . ... ... ...
Variables not shown: arr_delay (dbl),
carrier (chr), tailnum (chr), flight
(int), origin (chr), dest (chr), air_time
(dbl), distance (dbl), hour (dbl), minute
(dbl)
How much departure delay did the flight make up for in the air?
flights %>%
mutate(
gain = arr_delay - dep_delay,
speed = (distance / air_time) * 60,
gain_per_hour = gain / (air_time / 60)) %>%
select(gain:gain_per_hour)
Source: local data frame [336,776 x 3]
gain speed gain_per_hour
1 9 370.0441 2.378855
2 16 374.2731 4.229075
3 31 408.3750 11.625000
4 -17 516.7213 -5.573770
.. ... ... ...
Let's compute the average delay per month of flights to Madison.
Normally–in aggregate, by or plyr's d*ply functions–you
specify the grouping as an argument to the aggregation function.
aggregate(dep_delay ~ month, flights, mean,
subset = flights$dest == "MSN")
month dep_delay
1 1 18.07692
2 2 20.11111
3 3 41.87097
4 4 29.40741
5 5 21.54839
6 6 29.83333
7 7 11.13333
8 8 19.06667
9 9 15.97183
10 10 19.26190
11 11 15.96250
12 12 42.68831
In dplyr, grouping is its own action. It is done as its own step in
the pipeline. Here, we filter to the flights to Madison and
group_by month.
msn_by_month <- flights %>%
filter(dest == "MSN") %>%
group_by(month)
msn_by_month
Source: local data frame [572 x 16]
Groups: month
year month day dep_time dep_delay
1 2013 1 1 1353 -4
2 2013 1 2 1422 25
3 2013 1 3 1415 24
4 2013 1 4 1345 -5
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
Now we use summarise to compute (several) aggregate values within
each group (month). summarise returns one row per group.
msn_by_month %>%
summarise(
flights = n(),
avg_delay = mean(dep_delay, na.rm = TRUE),
n_planes = n_distinct(tailnum))
Source: local data frame [12 x 4]
month flights avg_delay n_planes
1 1 27 18.07692 23
2 2 47 20.11111 36
3 3 31 41.87097 29
4 4 27 29.40741 25
.. ... ... ... ...
tally is a shortcut for counting number of items per group.
Number of flights from NYC by destination by month:
flights %>%
group_by(dest, month) %>%
tally
Source: local data frame [1,113 x 3]
Groups: dest
dest month n
1 ABQ 4 9
2 ABQ 5 31
3 ABQ 6 30
4 ABQ 7 31
.. ... ... ..
Remove the grouping structure with ungroup.
msn_by_month %>% ungroup
Source: local data frame [572 x 16]
year month day dep_time dep_delay
1 2013 1 1 1353 -4
2 2013 1 2 1422 25
3 2013 1 3 1415 24
4 2013 1 4 1345 -5
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
Each summarise statement peels off one layer of grouping (from the
right of the list of groups).
# day gets peeled off
per_day <- flights %>%
group_by(dest, year, month, day) %>%
summarise(flights = n())
per_day
Source: local data frame [31,229 x 5]
Groups: dest, year, month
dest year month day flights
1 ABQ 2013 4 22 1
2 ABQ 2013 4 23 1
3 ABQ 2013 4 24 1
4 ABQ 2013 4 25 1
.. ... ... ... ... ...
Peel off month grouping.
per_month <- per_day %>%
summarise(flights = sum(flights))
per_month
Source: local data frame [1,113 x 4]
Groups: dest, year
dest year month flights
1 ABQ 2013 4 9
2 ABQ 2013 5 31
3 ABQ 2013 6 30
4 ABQ 2013 7 31
.. ... ... ... ...
Peel off year grouping.
per_year <- per_month %>%
summarise(flights = sum(flights))
per_year
Source: local data frame [105 x 3]
Groups: dest
dest year flights
1 ABQ 2013 254
2 ACK 2013 265
3 ALB 2013 439
4 ANC 2013 8
.. ... ... ...
Is there a way to keep a group after a
summarisestatement?
The grouping variable goes away because summarise returns one row
for each value of the grouping variable. If you group_by month and
summarise, you have a row for each month. You can add the month
grouping back to the data-frame, but this is almost always not
necessary because you have only row per group at that point.
Relatedly, you can group by row with rowwise.
Rank flight delays within each destination and month. (dplyr provides a bunch of ranking functions. See the cheatsheet.)
flights %>%
group_by(dest, month) %>%
mutate(timely = row_number(dep_delay),
late = row_number(desc(dep_delay))) %>%
select(dep_delay, timely:late)
Source: local data frame [336,776 x 5]
Groups: dest, month
dest month dep_delay timely late
1 IAH 1 2 347 195
2 IAH 1 4 380 162
3 MIA 1 2 662 294
4 BQN 1 -1 46 38
.. ... ... ... ... ...
This is also good for mean-centering variables within groups.
mean_center <- function(xs) {
xs - mean(xs, na.rm = TRUE)
}
flights %>%
group_by(dest, month) %>%
mutate(c_delay = mean_center(dep_delay)) %>%
select(dep_delay, c_delay)
Source: local data frame [336,776 x 4]
Groups: dest, month
dest month dep_delay c_delay
1 IAH 1 2 -5.294643
2 IAH 1 4 -3.294643
3 MIA 1 2 -5.360656
4 BQN 1 -1 -6.870968
.. ... ... ... ...
For fun, I demonstrated how I use pipelines when making tables.