We will work with the data frame `flights`, which is included in the `nycflights13` package. To get started, load `tidyverse` and `nycflights13` with
library(tidyverse)
library(nycflights13)
You may need to install `nycflights13` first. If so, run `install.packages("nycflights13")` in your RStudio Console pane.
Package `nycflights13` contains a data frame `flights` that has on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013. Take a few minutes to examine the variables in `flights` and their descriptions by running `?flights` in your RStudio Console pane.
flights
Object `flights` is a tibble. Another way to view the tibble, so that all of its variables are visible, is with the function `glimpse()`.
glimpse(flights)
Observations: 336,776
Variables: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
Before you get started, take a few minutes to review some of R’s comparison and logical operators, detailed below.
Operator | Description |
---|---|
`>` | greater than |
`<` | less than |
`>=` | greater than or equal to |
`<=` | less than or equal to |
`==` | equal to |
`!=` | not equal to |
`&` | and (ex: `(5 > 7) & (6*7 == 42)` will return the value `FALSE`) |
`\|` | or (ex: `(5 > 7) \| (6*7 == 42)` will return the value `TRUE`) |
`%in%` | group membership |
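# group membership example
set.seed(634789234)
# simulate 10 rolls of a six-sided die
die.out <- sample(x = 1:6, size = 10, replace = TRUE)
die.out
# for each roll: is it a 3 or a 4?
die.out %in% c(3, 4)
# was a 3 rolled at least once? was a 4?
c(3, 4) %in% die.out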
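The operators in the table work element by element on vectors, which is what later lets them pick out rows of a data frame. The following is a minimal sketch using a made-up vector `delay` (it is not part of `flights`); the variable names are chosen only for illustration.

```r
# made-up departure delays in minutes, for illustration only
delay <- c(-3, 0, 12, 45, 7)

# a comparison returns one TRUE/FALSE value per element
late <- delay > 0
late

# & and | combine conditions element by element
moderate <- delay > 0 & delay <= 15
extreme  <- delay < 0 | delay > 30

# %in% checks each element against a set of values
on_list <- delay %in% c(0, 7)

# sum() counts the TRUE values
n_late <- sum(late)
n_late
```

Counting `TRUE` values with `sum()` is a handy way to check how many elements satisfy a condition before filtering on it.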
Package dplyr is based on the concept of functions as verbs that manipulate data frames.
Function | Action and purpose |
---|---|
`filter()` | choose rows matching a set of criteria |
`slice()` | choose rows using indices |
`select()` | choose columns by name |
`pull()` | grab a column as a vector |
`rename()` | rename specific columns |
`arrange()` | reorder rows |
`mutate()` | add new variables to the data frame |
`transmute()` | create a new data frame with new variables |
`distinct()` | filter for unique rows |
`sample_n()` / `sample_frac()` | randomly sample rows |
`summarise()` | reduce variables to values |
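Three of the verbs in the table, `distinct()`, `rename()`, and `pull()`, can be tried out on a very small data frame. The sketch below uses a toy tibble invented for illustration; its column names only mimic those in `flights`.

```r
library(dplyr)

# toy data invented for illustration; column names mimic flights
toy <- tibble(
  carrier   = c("UA", "AA", "UA", "DL", "UA"),
  dest      = c("ORD", "MIA", "ORD", "ATL", "IAH"),
  dep_delay = c(5, -2, 5, 10, 0)
)

# distinct() drops the duplicated UA / ORD / 5 row
unique_rows <- toy %>%
  distinct()

# rename() takes new_name = old_name
renamed <- toy %>%
  rename(destination = dest)

# pull() returns the column as a plain vector, not a data frame
delays <- toy %>%
  filter(carrier == "UA") %>%
  pull(dep_delay)
```

Note that `select(dep_delay)` would instead return a one-column data frame, which is the main reason to reach for `pull()` when a vector is needed.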
Make use of the `%>%` operator and any of the functions in package `dplyr` to answer the following questions.
1. Filter `flights` for those in January with a destination of Detroit Metro (DTW) or Chicago O’Hare (ORD).
2. Filter `flights` for those before April that had an origin of JFK and a destination other than Detroit Metro (DTW).
3. Choose rows 1, 3, 7, and 20 from `flights`.
4. Arrange `flights` by distance and then by departure delay, sorting in descending order in both cases. Hint: `desc()`
5. Select only the columns for month, origin, and destination from `flights`.
6. Add a new variable to `flights` called `gain`, where `gain` is the arrival delay minus the departure delay.
7. Use `summarise()` to obtain the mean departure delay and mean arrival delay for all flights with an origin of EWR.
Grouping adds substantially to the power of the `dplyr` functions. We will focus on using `summarise()` with `group_by()`, but grouping also can be used with other `dplyr` functions.
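As a rough sketch of grouping with verbs other than `summarise()`, the example below applies `mutate()` and `filter()` to a grouped toy tibble. The data and the column name `delay_vs_origin` are invented for illustration.

```r
library(dplyr)

# toy data invented for illustration; column names mimic flights
toy <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA"),
  arr_delay = c(10, 30, -5, 15, 0)
)

# grouped mutate(): mean() is computed within each origin,
# so every row is compared to its own airport's average
centered <- toy %>%
  group_by(origin) %>%
  mutate(delay_vs_origin = arr_delay - mean(arr_delay)) %>%
  ungroup()

# grouped filter(): keep the row with the largest delay in each origin
worst <- toy %>%
  group_by(origin) %>%
  filter(arr_delay == max(arr_delay)) %>%
  ungroup()
```

Calling `ungroup()` at the end is a defensive habit: it keeps later steps from silently operating within groups.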
Create a data frame that contains the number of flights and the mean arrival delay for flights on carrier UA (United Airlines) whose destination is O’Hare Airport (ORD). The number of flights and the mean arrival delay should be calculated separately for each of the origin airports.
Recreate the plot below. Hints:

- `geom_tile()`
- `scale_fill_distiller(palette = "RdPu", direction = 1)`
- `labs()`
- a `\n` will place text on a new line

filter()
To choose the flights in March with a destination of Detroit
flights %>%
  filter(month == 3 & dest == "DTW")
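Two variations on this pattern may be useful: `filter()` treats comma-separated conditions as "and", and `%in%` is a compact way to match any of several values. The sketch below uses a toy tibble invented for illustration rather than `flights` itself.

```r
library(dplyr)

# toy data invented for illustration; column names mimic flights
toy <- tibble(
  month = c(1, 3, 3, 3, 7),
  dest  = c("DTW", "DTW", "BOS", "ATL", "BOS")
)

# comma-separated conditions are combined with "and" ...
with_comma <- toy %>%
  filter(month == 3, dest == "DTW")

# ... so this is equivalent to using & explicitly
with_amp <- toy %>%
  filter(month == 3 & dest == "DTW")

# %in% replaces a chain of == comparisons joined by |
either <- toy %>%
  filter(month == 3, dest %in% c("BOS", "ATL"))
```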
slice()
To choose rows 1, 2, 3, 4, 50, and 100
flights %>%
  slice(c(1:4, 50, 100))
arrange()
To arrange `flights` by arrival time and then by carrier
flights %>%
  arrange(arr_time, carrier)
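By default `arrange()` sorts in ascending order; wrapping a variable in `desc()` reverses the order for that variable only. The sketch below, on a toy tibble invented for illustration, also shows that missing values are placed at the end.

```r
library(dplyr)

# toy data invented for illustration; column names mimic flights
toy <- tibble(
  carrier  = c("UA", "AA", "DL", "B6"),
  distance = c(719, 1089, 719, NA)
)

# distance is sorted in descending order;
# ties in distance are broken by carrier in ascending order;
# the row with a missing distance is placed last
sorted <- toy %>%
  arrange(desc(distance), carrier)

sorted
```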
select()
To select the variables month, origin, and destination
flights %>%
  select(month, origin, dest)
mutate()
To add a variable air_time_hr that gives the air time in hours
flights %>%
  mutate(air_time_hr = air_time / 60) %>%
  glimpse()
Observations: 336,776
Variables: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
$ air_time_hr <dbl> 3.7833333, 3.7833333, 2.6666667, 3.0500000, 1.9...
transmute()
To create a new data frame with a new variable air_time_hr
flights %>%
  transmute(air_time_hr = air_time / 60)
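The difference between `mutate()` and `transmute()` is easiest to see by comparing the columns each one returns. This is a small sketch on a toy tibble invented for illustration.

```r
library(dplyr)

# toy data invented for illustration; column names mimic flights
toy <- tibble(
  carrier  = c("UA", "AA"),
  air_time = c(227, 160)
)

# mutate() keeps every existing column and appends the new one
kept <- toy %>%
  mutate(air_time_hr = air_time / 60)
names(kept)

# transmute() returns only the columns it is given
only_new <- toy %>%
  transmute(air_time_hr = air_time / 60)
names(only_new)

# to carry an existing column along, name it explicitly
some <- toy %>%
  transmute(carrier, air_time_hr = air_time / 60)
names(some)
```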
summarise()
To compute the median distance and median air travel time for all flights originating from EWR
flights %>%
  filter(origin == "EWR") %>%
  summarise(med_dist = median(distance, na.rm = TRUE),
            med_air_time = median(air_time, na.rm = TRUE))
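The `na.rm = TRUE` argument matters here because some flights have missing values (for example, a flight with no recorded air time). The sketch below, on a toy tibble invented for illustration, shows what happens with and without it.

```r
library(dplyr)

# toy data invented for illustration; one air time is missing
toy <- tibble(air_time = c(100, 200, NA, 300))

result <- toy %>%
  summarise(
    with_na    = median(air_time),                # NA: the missing value propagates
    without_na = median(air_time, na.rm = TRUE),  # computed from the 3 known values
    n_missing  = sum(is.na(air_time))             # how many values were dropped
  )

result
```

Reporting the number of missing values alongside the summary makes it clear how much data each statistic is actually based on.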
group_by()
To compute the number of flights, the mean distance of flights, and the mean arrival delay of flights, for each month of the year
flights %>%
  group_by(month) %>%
  summarise(number = n(),
            mean_distance = mean(distance, na.rm = TRUE),
            mean_arrival_delay = mean(arr_delay, na.rm = TRUE))
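Grouping is not limited to a single variable. The sketch below groups a toy tibble (invented for illustration) by two variables and compares the result with `count()`, which is a shorthand for this kind of tally. It assumes dplyr 1.0.0 or later, where `summarise()` accepts the `.groups` argument.

```r
library(dplyr)

# toy data invented for illustration; column names mimic flights
toy <- tibble(
  origin  = c("EWR", "EWR", "JFK", "JFK", "JFK"),
  carrier = c("UA", "UA", "AA", "UA", "AA")
)

# one output row per origin / carrier combination;
# .groups = "drop" returns an ungrouped result (assumes dplyr >= 1.0.0)
by_pair <- toy %>%
  group_by(origin, carrier) %>%
  summarise(number = n(), .groups = "drop")

# count() gives the same tally in a column named n
counted <- toy %>%
  count(origin, carrier)
```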