First, I load the dplyr package and the nycflights13 package, which provides the flights dataset.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.3.3
flights |>
  filter(arr_delay >= 120) |>
  relocate(arr_delay)
## # A tibble: 10,200 × 19
## arr_delay year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 137 2013 1 1 811 630 101 1047
## 2 851 2013 1 1 848 1835 853 1001
## 3 123 2013 1 1 957 733 144 1056
## 4 145 2013 1 1 1114 900 134 1447
## 5 127 2013 1 1 1505 1310 115 1638
## 6 125 2013 1 1 1525 1340 105 1831
## 7 136 2013 1 1 1549 1445 64 1912
## 8 123 2013 1 1 1558 1359 119 1718
## 9 123 2013 1 1 1732 1630 62 2028
## 10 138 2013 1 1 1803 1620 103 2008
## # ℹ 10,190 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
  mutate(flight_speed = distance / air_time) |>
  arrange(desc(flight_speed)) |>
  relocate(flight_speed)
## # A tibble: 336,776 × 20
## flight_speed year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 11.7 2013 5 25 1709 1700 9 1923
## 2 10.8 2013 7 2 1558 1513 45 1745
## 3 10.8 2013 5 13 2040 2025 15 2225
## 4 10.7 2013 3 23 1914 1910 4 2045
## 5 9.86 2013 1 12 1559 1600 -1 1849
## 6 9.4 2013 11 17 650 655 -5 1059
## 7 9.29 2013 2 21 2355 2358 -3 412
## 8 9.27 2013 11 17 759 800 -1 1212
## 9 9.24 2013 11 16 2003 1925 38 17
## 10 9.24 2013 11 16 2349 2359 -10 402
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The new flight_speed column is in miles per minute, because distance is recorded in miles and air_time in minutes. The fastest flight of the year was from LGA to ATL on May 25, with an average speed of 11.72 miles per minute (about 703 miles per hour).
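As a minimal sketch of the unit conversion, the speed can also be expressed in miles per hour by multiplying by 60. The column name flight_speed_mph and the object name fastest are my own choices for illustration, and selecting origin and dest makes the route visible without scrolling through the hidden columns.

```r
library(dplyr)
library(nycflights13)

fastest <- flights |>
  mutate(
    flight_speed = distance / air_time,    # miles per minute
    flight_speed_mph = flight_speed * 60   # miles per hour (hypothetical column name)
  ) |>
  arrange(desc(flight_speed)) |>           # fastest first; NA speeds sort last
  select(origin, dest, month, day, distance, air_time,
         flight_speed, flight_speed_mph) |>
  head(1)                                  # keep only the single fastest flight

fastest
```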
flights |>
  distinct(month, day)
## # A tibble: 365 × 2
## month day
## <int> <int>
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 1 7
## 8 1 8
## 9 1 9
## 10 1 10
## # ℹ 355 more rows
distinct() returns a tibble with 365 distinct month and day combinations. Because 2013 was not a leap year and therefore had 365 days, yes, there was a flight on every day of 2013.
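The same check can be made without counting rows by eye. The sketch below compares the number of distinct dates in flights with the number of calendar days in 2013; the object names days_with_flights and days_in_2013 are just illustrative.

```r
library(dplyr)
library(nycflights13)

# Number of distinct (month, day) combinations that have at least one flight
days_with_flights <- flights |>
  distinct(month, day) |>
  nrow()

# Number of calendar days in 2013, built from a daily date sequence
days_in_2013 <- length(seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day"))

# TRUE when every calendar day of 2013 has at least one flight
days_with_flights == days_in_2013
```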
flights |>
  arrange(desc(distance)) |>
  head()
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 857 900 -3 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 -2 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
  arrange(desc(distance)) |>
  tail()
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 3 9 1959 2000 -1 2052 2054
## 2 2013 3 16 1947 1950 -3 2055 2044
## 3 2013 3 23 1946 1950 -4 2030 2044
## 4 2013 3 30 1942 1950 -8 2026 2044
## 5 2013 4 6 1948 1950 -2 2034 2044
## 6 2013 7 27 NA 106 NA NA 245
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
The flights that travelled the farthest were those from JFK to HNL (Honolulu, Hawaii), with a distance of 4,983 miles.
The shortest distance in the data, 17 miles, belongs to the irregular flight from EWR to LGA that was discussed in class; it is the last row of the tail() output, with missing departure and arrival times. For the purposes of my analysis, I exclude this record. Among the remaining flights, the shortest were those from EWR to PHL (Philadelphia, Pennsylvania), with a distance of 80 miles.
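Because origin, dest, and distance are hidden in the printed output above, a compact way to see the routes directly is to reduce the data to one row per route before sorting. This is only a sketch; route_distances is an illustrative name.

```r
library(dplyr)
library(nycflights13)

# One row per distinct route and distance, longest first
route_distances <- flights |>
  distinct(origin, dest, distance) |>
  arrange(desc(distance))

head(route_distances, 3)  # longest routes
tail(route_distances, 3)  # shortest routes, including the irregular EWR to LGA record
```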
When filter() and arrange() are used together, their order does not affect the resulting tibble. You can either filter on your desired conditions and then arrange the remaining rows, or arrange all of the rows and then filter on those conditions. For example,
# Version 1
flights |>
  filter(dest == "MSN") |>
  arrange(desc(arr_delay))
## # A tibble: 572 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 9 12 1841 1350 291 2135 1531
## 2 2013 12 5 2000 1420 340 2132 1555
## 3 2013 10 7 1912 1425 287 2048 1547
## 4 2013 3 8 1907 1405 302 2031 1533
## 5 2013 3 14 1845 1405 280 2026 1533
## 6 2013 11 17 45 1935 310 206 2132
## 7 2013 6 30 1855 1415 280 2013 1539
## 8 2013 10 11 1857 1425 272 2011 1547
## 9 2013 2 6 2242 1825 257 15 2008
## 10 2013 12 31 1819 1505 194 2023 1641
## # ℹ 562 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# Version 2
flights |>
  arrange(desc(arr_delay)) |>
  filter(dest == "MSN")
## # A tibble: 572 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 9 12 1841 1350 291 2135 1531
## 2 2013 12 5 2000 1420 340 2132 1555
## 3 2013 10 7 1912 1425 287 2048 1547
## 4 2013 3 8 1907 1405 302 2031 1533
## 5 2013 3 14 1845 1405 280 2026 1533
## 6 2013 11 17 45 1935 310 206 2132
## 7 2013 6 30 1855 1415 280 2013 1539
## 8 2013 10 11 1857 1425 272 2011 1547
## 9 2013 2 6 2242 1825 257 15 2008
## 10 2013 12 31 1819 1505 194 2023 1641
## # ℹ 562 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
These two versions produce the same result. In terms of computational efficiency, however, the order does matter. When arrange() comes first, the code must sort all 336,776 rows, only for most of them to be discarded by the filter. When filter() comes first, the code narrows the data to the 572 rows of interest and then sorts only this much smaller subset. Therefore, using filter() and then arrange() is more efficient.
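Both claims can be checked empirically. The sketch below stores each version, compares them with identical(), and times repeated runs with base R's system.time(). The object names v1 and v2 and the choice of 20 repetitions are arbitrary, and the measured times will vary by machine, so treat them as a rough indication rather than a benchmark.

```r
library(dplyr)
library(nycflights13)

# Version 1: filter first, then sort the small subset
v1 <- flights |>
  filter(dest == "MSN") |>
  arrange(desc(arr_delay))

# Version 2: sort every row, then filter
v2 <- flights |>
  arrange(desc(arr_delay)) |>
  filter(dest == "MSN")

# Compare the two tibbles, including row order
identical(v1, v2)

# Rough timing of each ordering over 20 repetitions
system.time({
  for (i in 1:20) {
    flights |> filter(dest == "MSN") |> arrange(desc(arr_delay))
  }
})

system.time({
  for (i in 1:20) {
    flights |> arrange(desc(arr_delay)) |> filter(dest == "MSN")
  }
})
```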