1a.
Delay of two or more hours:
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 811 630 101 1047 830 137 MQ
## 2 2013 1 1 848 1835 853 1001 1950 851 MQ
## 3 2013 1 1 957 733 144 1056 853 123 UA
## 4 2013 1 1 1114 900 134 1447 1222 145 UA
## 5 2013 1 1 1505 1310 115 1638 1431 127 EV
## 6 2013 1 1 1525 1340 105 1831 1626 125 B6
## 7 2013 1 1 1549 1445 64 1912 1656 136 EV
## 8 2013 1 1 1558 1359 119 1718 1515 123 EV
## 9 2013 1 1 1732 1630 62 2028 1825 123 EV
## 10 2013 1 1 1803 1620 103 2008 1750 138 MQ
## # … with 10,190 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
1b.
Flew to Houston:
filter(flights, dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 623 627 -4 933 932 1 UA
## 4 2013 1 1 728 732 -4 1041 1038 3 UA
## 5 2013 1 1 739 739 0 1104 1038 26 UA
## 6 2013 1 1 908 908 0 1228 1219 9 UA
## 7 2013 1 1 1028 1026 2 1350 1339 11 UA
## 8 2013 1 1 1044 1045 -1 1352 1351 1 UA
## 9 2013 1 1 1114 900 134 1447 1222 145 UA
## 10 2013 1 1 1205 1200 5 1503 1505 -2 UA
## # … with 9,303 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
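The same kind of "either destination" filter can also be written with %in%, which stays readable if more airports are added later. A minimal sketch on a small made-up tibble (the toy data and its flight column are assumptions for illustration, not rows from flights):

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(
  flight = 1:5,
  dest   = c("IAH", "ORD", "HOU", "LAX", "IAH")
)

# Two ways of keeping Houston-bound rows
with_or <- filter(toy, dest == "IAH" | dest == "HOU")
with_in <- filter(toy, dest %in% c("IAH", "HOU"))

# Both keep the same flights (1, 3 and 5 in this toy data)
with_or$flight
with_in$flight
```

The %in% form is the same idiom used for the carrier filter in 1c.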
1c.
Flights operated by United (UA), American (AA), or Delta (DL)
filter(flights, carrier %in% c("AA", "DL", "UA"))
## # A tibble: 139,504 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 554 600 -6 812 837 -25 DL
## 5 2013 1 1 554 558 -4 740 728 12 UA
## 6 2013 1 1 558 600 -2 753 745 8 AA
## 7 2013 1 1 558 600 -2 924 917 7 UA
## 8 2013 1 1 558 600 -2 923 937 -14 UA
## 9 2013 1 1 559 600 -1 941 910 31 AA
## 10 2013 1 1 559 600 -1 854 902 -8 UA
## # … with 139,494 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
1d.
Departed in July, August, or September
filter(flights, month >= 7, month <= 9)
## # A tibble: 86,326 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # … with 86,316 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
1e.
Arrived more than two hours late, but didn't leave late:
filter(flights, arr_delay > 120, dep_delay <= 0)
## # A tibble: 29 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 27 1419 1420 -1 1754 1550 124 MQ
## 2 2013 10 7 1350 1350 0 1736 1526 130 EV
## 3 2013 10 7 1357 1359 -2 1858 1654 124 AA
## 4 2013 10 16 657 700 -3 1258 1056 122 B6
## 5 2013 11 1 658 700 -2 1329 1015 194 VX
## 6 2013 3 18 1844 1847 -3 39 2219 140 UA
## 7 2013 4 17 1635 1640 -5 2049 1845 124 MQ
## 8 2013 4 18 558 600 -2 1149 850 179 AA
## 9 2013 4 18 655 700 -5 1213 950 143 AA
## 10 2013 5 22 1827 1830 -3 2217 2010 127 MQ
## # … with 19 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
1f.
Delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay - arr_delay > 30, dep_delay >= 60)
## # A tibble: 1,844 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 2205 1720 285 46 2040 246 AA
## 2 2013 1 1 2326 2130 116 131 18 73 B6
## 3 2013 1 3 1503 1221 162 1803 1555 128 UA
## 4 2013 1 3 1839 1700 99 2056 1950 66 AA
## 5 2013 1 3 1850 1745 65 2148 2120 28 AA
## 6 2013 1 3 1941 1759 102 2246 2139 67 UA
## 7 2013 1 3 1950 1845 65 2228 2227 1 B6
## 8 2013 1 3 2015 1915 60 2135 2111 24 9E
## 9 2013 1 3 2257 2000 177 45 2224 141 9E
## 10 2013 1 4 1917 1700 137 2135 1950 105 AA
## # … with 1,834 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
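The time made up in flight is the departure delay minus the arrival delay. As a small worked sketch (the toy tibble and the made_up column name are assumptions for illustration), computing that difference explicitly makes the filter easier to check by hand:

```r
library(dplyr)

# Hypothetical delays in minutes
toy <- tibble(
  dep_delay = c(65, 90, 120, 30),
  arr_delay = c(28, 70, 85, -10)
)

result <- toy %>%
  mutate(made_up = dep_delay - arr_delay) %>%  # 37, 20, 35, 40
  filter(dep_delay >= 60, made_up > 30)

# Rows 1 and 3 remain: row 2 made up only 20 minutes,
# and row 4 left just 30 minutes late
result
```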
1g.
Departed between midnight and 6 am
filter(flights, dep_time <= 600 | dep_time == 2400)
## # A tibble: 9,373 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 9,363 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
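Because dep_time is stored as a clock reading (HHMM), midnight appears as 2400 rather than 0, which is why the extra condition is needed. One possible alternative, sketched here on made-up values, is to wrap 2400 around to 0 with the modulo operator before comparing:

```r
library(dplyr)

# Hypothetical departure times in HHMM form, including midnight and a missing value
toy <- tibble(dep_time = c(1L, 517L, 600L, 601L, 2359L, 2400L, NA))

# Version used above: handle midnight as a special case
a <- filter(toy, dep_time <= 600 | dep_time == 2400)

# Alternative: 2400 %% 2400 is 0, so midnight passes the same test
b <- filter(toy, dep_time %% 2400 <= 600)

# Both keep 1, 517, 600 and 2400; filter() drops the NA row in either version
a$dep_time
b$dep_time
```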
The between() function simplifies conditions that pair >= with <=, such as the month range in 1d above: between(x, left, right) is equivalent to x >= left & x <= right.
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # … with 86,316 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
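To check the equivalence on something small enough to read in full, here is a sketch using a made-up tibble of the twelve month numbers (the toy data are an assumption for illustration). It also shows why both bounds have to point inward: two >= conditions collapse to the stricter one.

```r
library(dplyr)

toy <- tibble(month = 1:12)

explicit  <- filter(toy, month >= 7, month <= 9)  # lower and upper bound
compact   <- filter(toy, between(month, 7, 9))    # same range, inclusive on both ends
one_sided <- filter(toy, month >= 7, month >= 9)  # reduces to month >= 9

explicit$month   # 7 8 9
compact$month    # 7 8 9
one_sided$month  # 9 10 11 12
```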
Number of flights with a missing dep_time:
filter(flights, is.na(dep_time))
## # A tibble: 8,255 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 NA 1630 NA NA 1815 NA EV
## 2 2013 1 1 NA 1935 NA NA 2240 NA AA
## 3 2013 1 1 NA 1500 NA NA 1825 NA AA
## 4 2013 1 1 NA 600 NA NA 901 NA B6
## 5 2013 1 2 NA 1540 NA NA 1747 NA EV
## 6 2013 1 2 NA 1620 NA NA 1746 NA EV
## 7 2013 1 2 NA 1355 NA NA 1459 NA EV
## 8 2013 1 2 NA 1420 NA NA 1644 NA EV
## 9 2013 1 2 NA 1321 NA NA 1536 NA EV
## 10 2013 1 2 NA 1545 NA NA 1910 NA AA
## # … with 8,245 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
For these flights, dep_delay, arr_time, and arr_delay are missing as well, which suggests that the flights were cancelled.
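If only the count is needed, rather than the rows themselves, one option is to sum the logical vector returned by is.na(). A minimal sketch on made-up data (the toy tibble and the summary column names are assumptions for illustration):

```r
library(dplyr)

# Hypothetical flights: rows 2 and 4 never departed
toy <- tibble(
  dep_time  = c(517L, NA, 542L, NA),
  dep_delay = c(2, NA, 2, NA),
  arr_time  = c(830L, NA, 923L, NA)
)

missing_counts <- summarise(
  toy,
  n_missing_dep = sum(is.na(dep_time)),                   # TRUE counts as 1
  n_missing_arr = sum(is.na(arr_time)),
  n_both        = sum(is.na(dep_time) & is.na(arr_time))  # missing in both columns
)

missing_counts
```

When n_both equals n_missing_dep, every flight without a departure time also lacks an arrival time, which is the pattern we would expect from cancellations.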
Most delayed flight:
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
## 6 2013 4 10 1100 1900 960 1342 2211 931 DL
## 7 2013 3 17 2321 810 911 135 1020 915 DL
## 8 2013 6 27 959 1900 899 1236 2226 850 DL
## 9 2013 7 22 2257 759 898 121 1026 895 DL
## 10 2013 12 5 756 1700 896 1058 2020 878 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Flights that departed earliest relative to schedule:
arrange(flights, dep_delay)
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 12 7 2040 2123 -43 40 2352 48 B6
## 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL
## 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV
## 4 2013 1 11 1900 1930 -30 2233 2243 -10 DL
## 5 2013 1 29 1703 1730 -27 1947 1957 -10 F9
## 6 2013 8 9 729 755 -26 1002 955 7 MQ
## 7 2013 10 23 1907 1932 -25 2143 2143 0 EV
## 8 2013 3 30 2030 2055 -25 2213 2250 -37 MQ
## 9 2013 3 2 1431 1455 -24 1601 1631 -30 9E
## 10 2013 5 5 934 958 -24 1225 1309 -44 B6
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
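When only the single most extreme row is wanted, sorting the whole table is more than we need. A possible shortcut, assuming dplyr 1.0.0 or later, is slice_max() and slice_min(). The sketch below uses a made-up tibble rather than flights:

```r
library(dplyr)

# Hypothetical departure delays in minutes (negative means early)
toy <- tibble(
  carrier   = c("UA", "HA", "B6", "DL"),
  dep_delay = c(5, 1301, -43, 60)
)

most_delayed <- slice_max(toy, dep_delay, n = 1)  # largest dep_delay
earliest     <- slice_min(toy, dep_delay, n = 1)  # smallest dep_delay

most_delayed$carrier  # "HA"
earliest$carrier      # "B6"
```

By default both functions keep ties, so more than one row can come back if several flights share the extreme value.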
Fastest flights, taking "fastest" to mean the shortest time in the air. (Not sure if you were looking for fastest in terms of speed or time?)
head(arrange(flights, air_time))
## # A tibble: 6 × 19
## year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 16 1355 1315 40 1442 1411 31 EV
## 2 2013 4 13 537 527 10 622 628 -6 EV
## 3 2013 12 6 922 851 31 1021 954 27 EV
## 4 2013 2 3 2153 2129 24 2247 2224 23 EV
## 5 2013 2 5 1303 1315 -12 1342 1411 -29 EV
## 6 2013 2 12 2123 2130 -7 2211 2225 -14 EV
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
## # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `colnames()` to see all variable names
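If "fastest" is read as highest average speed instead, the ordering can differ from the air_time ordering. A minimal sketch on made-up data, assuming distance is in miles and air_time in minutes as in the nycflights13 documentation (the speed_mph column name is an assumption for illustration):

```r
library(dplyr)

# Hypothetical flights
toy <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(100, 500, 2500),  # miles
  air_time = c(20, 60, 330)      # minutes
)

# Shortest time in the air first: A, B, C
by_time <- arrange(toy, air_time)

# Highest average speed first: B (500 mph), C (about 455 mph), A (300 mph)
by_speed <- toy %>%
  mutate(speed_mph = distance / air_time * 60) %>%
  arrange(desc(speed_mph))

by_time$flight
by_speed$flight
```

The shortest flight (A) is the slowest in this toy data, so the two readings of the question can give different answers.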
Traveled the longest:
arrange(flights, desc(distance))
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 857 900 -3 1516 1530 -14 HA
## 2 2013 1 2 909 900 9 1525 1530 -5 HA
## 3 2013 1 3 914 900 14 1504 1530 -26 HA
## 4 2013 1 4 900 900 0 1516 1530 -14 HA
## 5 2013 1 5 858 900 -2 1519 1530 -11 HA
## 6 2013 1 6 1019 900 79 1558 1530 28 HA
## 7 2013 1 7 1042 900 102 1620 1530 50 HA
## 8 2013 1 8 901 900 1 1504 1530 -26 HA
## 9 2013 1 9 641 900 1301 1242 1530 1272 HA
## 10 2013 1 10 859 900 -1 1449 1530 -41 HA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Traveled the shortest:
arrange(flights, distance)
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 27 NA 106 NA NA 245 NA US
## 2 2013 1 3 2127 2129 -2 2222 2224 -2 EV
## 3 2013 1 4 1240 1200 40 1333 1306 27 EV
## 4 2013 1 4 1829 1615 134 1937 1721 136 EV
## 5 2013 1 4 2128 2129 -1 2218 2224 -6 EV
## 6 2013 1 5 1155 1200 -5 1241 1306 -25 EV
## 7 2013 1 6 2125 2129 -4 2224 2224 0 EV
## 8 2013 1 7 2124 2129 -5 2212 2224 -12 EV
## 9 2013 1 8 2127 2130 -3 2304 2225 39 EV
## 10 2013 1 9 2126 2129 -3 2217 2224 -7 EV
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Brainstorm:
dplyr::select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
## # ℹ Use `print(n = ...)` to see more rows
dplyr::select(flights, starts_with("dep_"), starts_with("arr_"))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
## # ℹ Use `print(n = ...)` to see more rows
brainstorm <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
dplyr::select(flights, all_of(brainstorm))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
## # ℹ Use `print(n = ...)` to see more rows
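When the column names live in a character vector, tidyselect provides all_of() and any_of() to make the intent explicit. A small sketch on a made-up one-row tibble (the toy data and the name not_a_column are assumptions for illustration):

```r
library(dplyr)

toy <- tibble(dep_time = 517L, dep_delay = 2, arr_time = 830L, carrier = "UA")

wanted <- c("dep_time", "arr_time")

# all_of() is strict: it errors if any name in the vector is not a column
strict <- select(toy, all_of(wanted))

# any_of() is lenient: names that are not columns are silently skipped
lenient <- select(toy, any_of(c(wanted, "not_a_column")))

names(strict)   # "dep_time" "arr_time"
names(lenient)  # "dep_time" "arr_time"
```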
Given code:
dplyr::select(flights, contains("TIME"))
## # A tibble: 336,776 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # … with 336,766 more rows
## # ℹ Use `print(n = ...)` to see more rows
The output shows that contains() ignores case by default: searching for "TIME" matches every column whose name contains "time". To make the match case-sensitive, we have to set ignore.case = FALSE.
Example:
dplyr::select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 × 0
## # ℹ Use `print(n = ...)` to see more rows
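No column in flights has an upper-case "TIME" in its name, so the case-sensitive call returns zero columns. To see the argument select something, here is a sketch on a made-up tibble with one upper-case name (the toy data and the TIME_zone column are assumptions for illustration):

```r
library(dplyr)

toy <- tibble(dep_time = 517L, air_time = 227, TIME_zone = "EST", carrier = "UA")

# Default (ignore.case = TRUE): matches "time" in any case
loose <- select(toy, contains("TIME"))

# Case-sensitive: only the name containing upper-case "TIME" matches
exact <- select(toy, contains("TIME", ignore.case = FALSE))

names(loose)  # "dep_time" "air_time" "TIME_zone"
names(exact)  # "TIME_zone"
```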