Introduction

We will work with the data frame flights, which is included in the nycflights13 package. To get started, load tidyverse and nycflights13 with:

library(tidyverse)
library(nycflights13)

If nycflights13 is not yet installed, run install.packages("nycflights13") in your RStudio Console pane first.

Package nycflights13 contains a data frame flights that has on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013. Take a few minutes to examine the variables and their descriptions: run ?flights in your RStudio Console pane.

flights

Object flights is a tibble, so printing it shows only the first few rows and the columns that fit on screen. To see every variable, use function glimpse().

glimpse(flights)
Observations: 336,776
Variables: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

Comparison operators

Before you get started, take a few minutes to review R’s comparison and logical operators, listed below.

Operator   Description
>          greater than
<          less than
>=         greater than or equal to
<=         less than or equal to
==         equal to
!=         not equal to
&          and (ex: (5 > 7) & (6*7 == 42) will return the value FALSE)
|          or (ex: (5 > 7) | (6*7 == 42) will return the value TRUE)
%in%       group membership

# group membership example
set.seed(634789234)
die.out <- sample(x = 1:6, size = 10, replace = TRUE)
die.out

die.out %in% c(3, 4)
c(3, 4) %in% die.out
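The operators above work element by element on vectors, which is what makes them useful inside filter(). The following is a small base R sketch on made-up delay values (the vector delay is purely illustrative); it also shows how missing values behave.

```r
# Made-up delay values in minutes; NA marks a missing measurement.
delay <- c(-3, 12, 45, NA, 0)

# Comparisons are vectorised: one TRUE/FALSE (or NA) per element.
late <- delay > 0
late

# & combines two logical vectors element by element.
moderately_late <- delay > 0 & delay <= 30
moderately_late

# A comparison with NA gives NA; is.na() tests for missingness directly.
missing <- is.na(delay)
missing

# %in% returns FALSE rather than NA when a missing value is not in the set.
in_set <- delay %in% c(0, 12)
in_set
```

Note that the order of the operands matters for %in%: the result always has the length of the left-hand side.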

dplyr

Package dplyr is based on the concept of functions as verbs that manipulate data frames.

Function                     Action and purpose
filter()                     choose rows matching a set of criteria
slice()                      choose rows using indices
select()                     choose columns by name
pull()                       grab a column as a vector
rename()                     rename specific columns
arrange()                    reorder rows
mutate()                     add new variables to the data frame
transmute()                  create a new data frame with new variables
distinct()                   filter for unique rows
sample_n() / sample_frac()   randomly sample rows
summarise()                  reduce variables to values
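Three of the verbs in the table, pull(), rename(), and distinct(), can be tried on a tiny data frame. This is a minimal sketch; the tibble toy and its values are made up for illustration and are not taken from flights.

```r
library(dplyr)

# Toy data, made up for illustration (not taken from flights).
toy <- tibble(
  carrier = c("UA", "UA", "AA", "B6"),
  dest    = c("ORD", "ORD", "MIA", "BQN"),
  delay   = c(5, 5, -2, 10)
)

# pull() returns a plain vector rather than a data frame.
delays <- toy %>% pull(delay)

# rename() uses the form new_name = old_name.
renamed <- toy %>% rename(arr_delay = delay)

# distinct() drops duplicated rows; rows 1 and 2 are identical here.
unique_rows <- toy %>% distinct()
```

Here delays is a numeric vector of length 4, renamed has the columns carrier, dest, and arr_delay, and unique_rows has 3 rows.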

Exercise set 1

Use the %>% operator and any of the functions in package dplyr to answer the following questions.

  1. Filter flights for those in January with a destination of Detroit Metro (DTW) or Chicago O’Hare (ORD).

  2. Filter flights for those before April that had an origin of JFK and a destination other than Detroit Metro (DTW).

  3. Choose rows 1, 3, 7, 20 from flights.

  4. Arrange flights by distance and then by departure delay, with the sorting being in descending order in both cases. Hint: desc()

  5. Select only columns month, origin, and destination (dest) from flights.

  6. Add a new variable to flights called gain, where gain is the arrival delay minus the departure delay.

  7. Use summarise() to obtain the mean departure delay and mean arrival delay for all flights with an origin of EWR.

Exercise set 2

Grouping adds substantially to the power of the dplyr functions. We will focus on using summarise() with group_by(), but grouping can also be used with other dplyr functions.
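As an illustration of grouping with verbs other than summarise(), here is a minimal sketch with mutate() and filter(). The tibble toy and its column values are made up for this example.

```r
library(dplyr)

# Toy data, made up for illustration (not taken from flights).
toy <- tibble(
  origin = c("EWR", "EWR", "JFK", "JFK", "JFK"),
  delay  = c(10, 30, 0, 6, 12)
)

# Grouped mutate(): each row is compared with the mean of its own group.
centred <- toy %>%
  group_by(origin) %>%
  mutate(delay_vs_group = delay - mean(delay)) %>%
  ungroup()

# Grouped filter(): keep the row with the largest delay in each group.
worst <- toy %>%
  group_by(origin) %>%
  filter(delay == max(delay)) %>%
  ungroup()
```

Unlike summarise(), a grouped mutate() keeps every row; only the calculation is done within groups. Calling ungroup() at the end avoids carrying the grouping into later steps by accident.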

  1. Create a data frame that contains the number of flights and the mean arrival delay for flights on carrier UA (United Airlines) whose destination is O’Hare Airport (ORD). The number of flights and the mean arrival delay should be calculated separately for each of the origin airports.

  2. Recreate the plot below. Hints:

  • wrangle flights to get the data necessary for the plot
  • geom_tile()
  • scale_fill_distiller(palette = "RdPu", direction = 1)
  • inside labs() a \n will place text on a new line

Examples

filter()

To choose the flights in March with a destination of Detroit

flights %>% 
  filter(month == 3 & dest == "DTW")
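Two details of filter() are easy to miss: conditions separated by commas are combined with "and", and rows where a condition evaluates to NA are dropped. The sketch below illustrates both on a made-up tibble (toy and its values are hypothetical, not taken from flights).

```r
library(dplyr)

# Toy data, made up for illustration (not taken from flights).
toy <- tibble(
  month     = c(3, 3, 4, 3),
  dest      = c("DTW", "ORD", "DTW", "DTW"),
  dep_delay = c(12, 3, 7, NA)
)

# Conditions separated by commas are combined with "and".
with_amp   <- toy %>% filter(month == 3 & dest == "DTW")
with_comma <- toy %>% filter(month == 3, dest == "DTW")

# Rows where the condition is NA are dropped, not only rows where it is FALSE.
delayed <- toy %>% filter(dep_delay > 5)

# To keep the missing values, ask for them explicitly.
delayed_or_missing <- toy %>% filter(dep_delay > 5 | is.na(dep_delay))
```

Here with_amp and with_comma contain the same two rows, delayed has 2 rows, and delayed_or_missing has 3.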

slice()

To choose rows 1, 2, 3, 4, 50, and 100

flights %>% 
  slice(c(1:4, 50, 100))

arrange()

To arrange flights by arrival time and then by carrier

flights %>% 
  arrange(arr_time, carrier)
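By default arrange() sorts in ascending order; wrapping a column in desc() reverses that. The following sketch uses a made-up tibble (toy is hypothetical) and also shows where missing values end up.

```r
library(dplyr)

# Toy data, made up for illustration (not taken from flights).
toy <- tibble(
  distance = c(200, 500, 200, NA),
  delay    = c(5, 1, 9, 2)
)

# Sort by distance, largest first; ties are broken by delay, largest first.
sorted <- toy %>%
  arrange(desc(distance), desc(delay))

sorted
```

The row with the missing distance is placed last, because arrange() always sorts NA values to the end, even inside desc().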

select()

To select the variables month, origin, and destination (dest)

flights %>% 
  select(month, origin, dest)
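Besides listing column names, select() accepts helper functions and negative selections. This is a minimal sketch on a two-row tibble whose values are copied from the first two rows of the glimpse() output shown earlier; the object name toy is made up.

```r
library(dplyr)

# Two rows with values copied from the glimpse() output of flights.
toy <- tibble(
  dep_time  = c(517, 533),
  dep_delay = c(2, 4),
  arr_delay = c(11, 20),
  carrier   = c("UA", "UA")
)

# starts_with() picks every column whose name begins with a prefix.
dep_cols <- toy %>% select(starts_with("dep"))

# A minus sign drops a column and keeps the rest.
no_carrier <- toy %>% select(-carrier)

# everything() fills in the remaining columns, here moving carrier first.
carrier_first <- toy %>% select(carrier, everything())
```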

mutate()

To add a variable air_time_hr

flights %>% 
  mutate(air_time_hr = air_time / 60) %>% 
  glimpse()
Observations: 336,776
Variables: 20
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
$ air_time_hr    <dbl> 3.7833333, 3.7833333, 2.6666667, 3.0500000, 1.9...

transmute()

To create a new data frame with a new variable air_time_hr

flights %>% 
  transmute(air_time_hr = air_time / 60)
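The difference between mutate() and transmute() is which columns survive. A small sketch on made-up data (the tibble toy is hypothetical):

```r
library(dplyr)

# Toy data, made up for illustration (not taken from flights).
toy <- tibble(
  carrier  = c("UA", "AA"),
  air_time = c(60, 90)
)

# mutate() keeps the existing columns and appends the new one.
kept_all <- toy %>% mutate(air_time_hr = air_time / 60)

# transmute() keeps only the columns it creates.
only_new <- toy %>% transmute(air_time_hr = air_time / 60)

# The same result can be reached with mutate() followed by select().
same_as_transmute <- toy %>%
  mutate(air_time_hr = air_time / 60) %>%
  select(air_time_hr)
```

Here kept_all has three columns, while only_new and same_as_transmute each contain the single column air_time_hr.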

summarise()

To compute the median distance and median air travel time for all flights originating from EWR

flights %>% 
  filter(origin == "EWR") %>% 
  summarise(med_dist = median(distance, na.rm = TRUE),
            med_air_time = median(air_time, na.rm = TRUE))
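The argument na.rm = TRUE matters whenever a column contains missing values, as air_time does in flights. The sketch below shows the effect on a made-up column (the tibble toy is hypothetical).

```r
library(dplyr)

# Toy data, made up for illustration; one air time is missing.
toy <- tibble(air_time = c(100, 120, NA, 200))

result <- toy %>%
  summarise(
    med_default = median(air_time),                # NA: one missing value poisons the result
    med_na_rm   = median(air_time, na.rm = TRUE),  # median of the three observed values
    n_missing   = sum(is.na(air_time))             # how many values were dropped
  )

result
```

Reporting the number of missing values next to the summary makes it clear how much data the statistic is based on.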

group_by()

To compute the number of flights, the mean distance of flights, and the mean arrival delay of flights, for each month of the year

flights %>% 
  group_by(month) %>% 
  summarise(number = n(), 
            mean_distance = mean(distance, na.rm = TRUE), 
            mean_arrival_delay = mean(arr_delay, na.rm = TRUE))
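A grouped summarise() returns one row per group. The following sketch, on a made-up tibble (toy is hypothetical), shows that n() counts every row in the group, including rows with a missing delay, and that count() is a shortcut when only the group sizes are needed.

```r
library(dplyr)

# Toy data, made up for illustration (not taken from flights).
toy <- tibble(
  month     = c(1, 1, 2, 2, 2),
  arr_delay = c(10, NA, 0, 4, 8)
)

# One output row per month; n() counts rows regardless of NA values.
by_month <- toy %>%
  group_by(month) %>%
  summarise(number = n(),
            mean_arrival_delay = mean(arr_delay, na.rm = TRUE))

# count() combines group_by() and summarise(n = n()) in one step.
sizes <- toy %>% count(month)
```

For month 1 the count is 2 even though the mean is based on a single observed delay.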