We will work with the data frame flights, which is included in the nycflights13 package. To get started, load tidyverse and nycflights13:
library(tidyverse)
library(nycflights13)
You may need to install nycflights13 first. If so, run install.packages("nycflights13") in your RStudio Console pane.
The nycflights13 package contains the data frame flights, which holds on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013. Take a few minutes to examine the variables in flights and their descriptions by running ?flights in your RStudio Console pane.
flights
The object flights is a tibble. Printing a tibble shows only the columns that fit on screen; to see all variables, use the function glimpse().
glimpse(flights)
Observations: 336,776
Variables: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
Before you get started, take a few minutes to review some of R’s comparison and logical operators, detailed below.
| Operator | Description |
|---|---|
| > | greater than |
| < | less than |
| >= | greater than or equal to |
| <= | less than or equal to |
| == | equal to |
| != | not equal to |
| & | and (ex: (5 > 7) & (6*7 == 42) will return the value FALSE) |
| \| | or (ex: (5 > 7) \| (6*7 == 42) will return the value TRUE) |
| %in% | group membership |
# group membership example
set.seed(634789234)
die.out <- sample(x = 1:6, size = 10, replace = TRUE)
die.out
die.out %in% c(3, 4)
c(3, 4) %in% die.out
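The operators in the table work element by element on vectors, which is what makes them useful inside filter(). The following is a minimal base R sketch on a made-up vector x (not part of the flights data) that also shows why %in% is usually the safer choice than == when testing against several values.

```r
# x is a hypothetical example vector
x <- c(2, 5, 8, 11)

x > 4             # FALSE  TRUE  TRUE  TRUE
x > 4 & x < 10    # FALSE  TRUE  TRUE FALSE
x < 4 | x > 10    #  TRUE FALSE FALSE  TRUE
x %in% c(5, 11)   # FALSE  TRUE FALSE  TRUE

# == recycles the right-hand side, so x is compared against 5, 11, 5, 11 in turn.
# This is rarely what is intended when testing for membership in a set.
x == c(5, 11)     # FALSE FALSE FALSE  TRUE
```

Note that the last comparison misses the 5 in x, because that element happened to be compared against 11.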
Package dplyr is based on the concept of functions as verbs that manipulate data frames.
| Function | Action and purpose |
|---|---|
| filter() | choose rows matching a set of criteria |
| slice() | choose rows using indices |
| select() | choose columns by name |
| pull() | grab a column as a vector |
| rename() | rename specific columns |
| arrange() | reorder rows |
| mutate() | add new variables to the data frame |
| transmute() | create a new data frame with new variables |
| distinct() | filter for unique rows |
| sample_n() / sample_frac() | randomly sample rows |
| summarise() | reduce variables to values |
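Several verbs in the table (pull(), rename(), distinct(), and sample_n()) are easiest to understand on a tiny data frame. The sketch below uses a hypothetical five-row tibble called toy, whose column names merely imitate flights; it is an illustration under that assumption, not output from the real data.

```r
library(dplyr)

# toy is a made-up stand-in for flights
toy <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "AA"),
  dest      = c("ORD", "ORD", "MIA", "BQN", "MIA"),
  dep_delay = c(2, 4, 2, -1, 10)
)

# pull() returns a plain vector rather than a one-column tibble
delays <- toy %>% pull(dep_delay)

# rename() uses new_name = old_name
renamed <- toy %>% rename(destination = dest)

# distinct() keeps one row per unique carrier/dest combination (3 here)
pairs <- toy %>% distinct(carrier, dest)

# sample_n() draws rows at random; set a seed for reproducibility
set.seed(1)
two_rows <- toy %>% sample_n(2)
```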
Use the %>% operator and any of the functions in package dplyr to answer the following questions.
Filter flights for those in January with a destination of Detroit Metro (DTW) or Chicago O’Hare (ORD).
Filter flights for those before April with a destination that is not Detroit Metro (DTW) and had an origin of JFK.
Choose rows 1, 3, 7, 20 from flights.
Arrange flights by distance and then by departure delay, with the sorting being in descending order in both cases. Hint: desc()
Select only columns month, origin, and destination from flights.
Add a new variable to flights called gain, where gain is the arrival delay minus the departure delay.
Use summarise to obtain the mean departure delay and mean arrival delay for all flights with an origin of EWR.
Grouping adds substantially to the power of the dplyr functions. We will focus on using summarise() with group_by(), but grouping can also be used with other dplyr functions.
Create a data frame that contains the number of flights and the mean arrival delay for flights on carrier UA (United Airlines) whose destination is O’Hare Airport (ORD). The number of flights and the mean arrival delay should be calculated separately for each of the origin airports.
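As a brief illustration of grouping with verbs other than summarise(), the sketch below applies mutate() and filter() to a grouped tibble. The data frame toy is hypothetical and only borrows column names from flights.

```r
library(dplyr)

# toy is a made-up stand-in for flights
toy <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA"),
  arr_delay = c(10, 30, -5, 5, 12)
)

# Grouped mutate(): the mean is computed within each origin and every row is kept
by_origin <- toy %>%
  group_by(origin) %>%
  mutate(delay_vs_origin_mean = arr_delay - mean(arr_delay)) %>%
  ungroup()

# Grouped filter(): keep the row with the largest arrival delay within each origin
worst <- toy %>%
  group_by(origin) %>%
  filter(arr_delay == max(arr_delay)) %>%
  ungroup()
```

Unlike summarise(), a grouped mutate() does not collapse rows: by_origin still has five rows, while worst has one row per origin. Calling ungroup() at the end avoids carrying the grouping into later steps.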
Recreate the below plot. Hints:
- geom_tile()
- scale_fill_distiller(palette = "RdPu", direction = 1)
- labs(): a \n will place text on a new line

filter(): To choose the flights in March with a destination of Detroit
flights %>%
  filter(month == 3 & dest == "DTW")
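filter() also accepts several conditions separated by commas, which it combines with "and", and it pairs naturally with %in% when a column may match any of several values. The sketch below demonstrates both on a hypothetical tibble toy that only imitates the column names of flights.

```r
library(dplyr)

# toy is a made-up stand-in for flights
toy <- tibble(
  month   = c(3, 3, 4, 3, 1),
  dest    = c("DTW", "ATL", "DTW", "BOS", "DTW"),
  carrier = c("DL", "DL", "MQ", "B6", "DL")
)

# These two calls keep the same rows (only the first row of toy)
with_and   <- toy %>% filter(month == 3 & dest == "DTW")
with_comma <- toy %>% filter(month == 3, dest == "DTW")

# %in% keeps rows whose dest matches any value in the set (rows 2 and 4 here)
atl_or_bos <- toy %>% filter(dest %in% c("ATL", "BOS"))
```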
slice(): To choose rows 1, 2, 3, 4, 50, and 100
flights %>%
  slice(c(1:4, 50, 100))
arrange(): To arrange flights by arrival time and then by carrier
flights %>%
  arrange(arr_time, carrier)
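Sorting direction can be set per column with desc(), and missing values are placed last regardless of direction. The following sketch shows this on a hypothetical four-row tibble toy (column names imitate flights; the values are invented).

```r
library(dplyr)

# toy is a made-up stand-in for flights
toy <- tibble(
  carrier  = c("UA", "AA", "B6", "DL"),
  arr_time = c(830, NA, 1004, 830)
)

# Latest arrival first; ties on arr_time are broken by carrier (ascending)
sorted <- toy %>%
  arrange(desc(arr_time), carrier)

sorted$carrier   # "B6" "DL" "UA" "AA"  (the NA row is last)
```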
select(): To select the variables month, origin, and destination
flights %>%
  select(month, origin, dest)
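Besides listing columns by name, select() supports ranges, helper functions such as starts_with() and ends_with(), and exclusion with a minus sign. The sketch below uses a hypothetical one-row tibble toy whose column names are borrowed from flights.

```r
library(dplyr)

# toy is a made-up one-row stand-in for flights
toy <- tibble(
  month = 1, day = 1,
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  origin = "EWR", dest = "IAH"
)

toy %>% select(month:dep_time) %>% names()       # "month" "day" "dep_time"
toy %>% select(ends_with("delay")) %>% names()   # "dep_delay" "arr_delay"
toy %>% select(starts_with("arr")) %>% names()   # "arr_time" "arr_delay"
toy %>% select(-origin, -dest) %>% names()       # all columns except origin and dest
```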
mutate(): To add a variable air_time_hr
flights %>%
  mutate(air_time_hr = air_time / 60) %>%
  glimpse()
Observations: 336,776
Variables: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
$ air_time_hr <dbl> 3.7833333, 3.7833333, 2.6666667, 3.0500000, 1.9...
transmute(): To create a new data frame with a new variable air_time_hr
flights %>%
  transmute(air_time_hr = air_time / 60)
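The difference between mutate() and transmute() lies in which columns survive. The sketch below compares them on a hypothetical two-row tibble toy. It also shows mutate(.keep = "none"), which recent dplyr documentation suggests in place of transmute(); this assumes dplyr 1.0.0 or later, where the .keep argument is available.

```r
library(dplyr)

# toy is a made-up stand-in for flights
toy <- tibble(carrier = c("UA", "AA"), air_time = c(227, 160))

# mutate() keeps the existing columns and appends the new one
toy %>% mutate(air_time_hr = air_time / 60) %>% names()
# "carrier" "air_time" "air_time_hr"

# transmute() keeps only the columns it is given
toy %>% transmute(air_time_hr = air_time / 60) %>% names()
# "air_time_hr"

# Existing columns can be carried along by naming them
toy %>% transmute(carrier, air_time_hr = air_time / 60) %>% names()
# "carrier" "air_time_hr"

# Assumes dplyr >= 1.0.0: same idea using mutate()
toy %>% mutate(air_time_hr = air_time / 60, .keep = "none") %>% names()
# "air_time_hr"
```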
summarise(): To compute the median distance and median air travel time for all flights originating from EWR
flights %>%
  filter(origin == "EWR") %>%
  summarise(med_dist = median(distance, na.rm = TRUE),
            med_air_time = median(air_time, na.rm = TRUE))
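The na.rm = TRUE argument matters because a single missing value otherwise makes the whole summary missing. A minimal base R sketch on a made-up vector (the values are illustrative, not taken from flights):

```r
# Hypothetical air times in minutes, with one missing value
air_time <- c(227, 160, NA, 53)

median(air_time)                  # NA
median(air_time, na.rm = TRUE)    # 160
sum(is.na(air_time))              # 1 missing value
```

Counting the missing values, as in the last line, is a quick way to see how many observations na.rm = TRUE silently drops.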
group_by(): To compute the number of flights, the mean distance of flights, and the mean arrival delay of flights, for each month of the year
flights %>%
  group_by(month) %>%
  summarise(number = n(),
            mean_distance = mean(distance, na.rm = TRUE),
            mean_arrival_delay = mean(arr_delay, na.rm = TRUE))