A Wrangling Problem

Just a sample of what we can do with dplyr.

Homer White http://gccs.surge.sh (Georgetown College)https://www.georgetowncollege.edu
09-22-2018

The Challenge

Consider the nycflights13 package. For each week of the year, find the plane(s) that made the most flights in the month, along with the tail number.

Solution

The following code returns only the top departer(s) in each week.


flights %>% 
  filter(!is.na(tailnum)) %>% 
  mutate(week_number = lubridate::week(time_hour)) %>% 
  group_by(week_number, tailnum) %>% 
  summarize(departures = n()) %>% 
  filter(departures == max(departures)) %>% 
  arrange(week_number) %>%  
  DT::datatable()

If we like we can retain all of the departures, but sort them within their week-groups. The key is to arrange() by weeknumber and by departures within weeknumber. See the code below.


flights %>% 
  filter(!is.na(tailnum)) %>% 
  mutate(week_number = lubridate::week(time_hour)) %>% 
  group_by(week_number, tailnum) %>% 
  summarize(departures = n()) %>% 
  arrange(week_number, desc(departures))

# A tibble: 109,324 x 3
# Groups:   week_number [53]
   week_number tailnum departures
         <dbl> <chr>        <int>
 1           1 N14542          17
 2           1 N711MQ          17
 3           1 N725MQ          17
 4           1 N730MQ          17
 5           1 N16561          16
 6           1 N737MQ          16
 7           1 N739MQ          16
 8           1 N281JB          15
 9           1 N509MQ          15
10           1 N723MQ          15
# ... with 109,314 more rows

You can’t show the whole thing in a DT datatable, however. If you try then you get a message in the Console:


It seems your data is too big for client-side DataTables. 
You may consider server-side processing: 
https://rstudio.github.io/DT/server.html

You could content yourself with finding top performers in each week:


flights %>% 
  filter(!is.na(tailnum)) %>% 
  mutate(week_number = lubridate::week(time_hour)) %>% 
  group_by(week_number, tailnum) %>% 
  summarize(departures = n()) %>% 
  top_n(5) %>% 
  arrange(week_number, desc(departures))

# A tibble: 348 x 3
# Groups:   week_number [53]
   week_number tailnum departures
         <dbl> <chr>        <int>
 1           1 N14542          17
 2           1 N711MQ          17
 3           1 N725MQ          17
 4           1 N730MQ          17
 5           1 N16561          16
 6           1 N737MQ          16
 7           1 N739MQ          16
 8           2 N719MQ          17
 9           2 N730MQ          17
10           2 N713MQ          16
# ... with 338 more rows

Note that when there are ties in a group more than the requested 5 items are returned.

This table is small enough for DT to handle:


flights %>% 
  filter(!is.na(tailnum)) %>% 
  mutate(week_number = lubridate::week(time_hour)) %>% 
  group_by(week_number, tailnum) %>% 
  summarize(departures = n()) %>% 
  top_n(5) %>% 
  arrange(week_number, desc(departures)) %>% 
  DT::datatable()

A Plot

For fun, let’s find plot the number of cancelled flights, during 2013, by day:


flights %>% 
  mutate(day_number = lubridate::yday(time_hour)) %>% 
  group_by(day_number) %>% 
  summarise(cancellations = sum(is.na(dep_delay))) %>% 
  mutate(date = lubridate::as_date(day_number, 
                                   origin = "2013-01-01")) %>% 
  ggplot(aes(x = date, y = cancellations)) +
    geom_line() +
    labs(x = NULL)

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".