nycflights13 if necessary.
library(nycflights13)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
Note: you will probably have to install nycflights13 using install.packages and then load it with the library command. nycflights13 is a relational database containing the following tables (data frames).
| data frame | description |
|---|---|
| airlines | Airline names |
| airports | Airport metadata |
| flights | Flights data |
| planes | Plane metadata |
| weather | Hourly weather data |
This is data about all airline flights departing from New York City airports in 2013. This project will parallel Chapter 5 of R for Data Science by Wickham and Grolemund. You should start reading this chapter now, and try to finish reading it by the end of this week.
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Write up your solutions to the exercises in 5.2.4 in this document, including the code chunks you use to determine the answer.
#1.
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 811 630 101 1047 830
## 2 2013 1 1 848 1835 853 1001 1950
## 3 2013 1 1 957 733 144 1056 853
## 4 2013 1 1 1114 900 134 1447 1222
## 5 2013 1 1 1505 1310 115 1638 1431
## 6 2013 1 1 1525 1340 105 1831 1626
## 7 2013 1 1 1549 1445 64 1912 1656
## 8 2013 1 1 1558 1359 119 1718 1515
## 9 2013 1 1 1732 1630 62 2028 1825
## 10 2013 1 1 1803 1620 103 2008 1750
## # … with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dest %in% c('IAH', 'HOU'))
## # A tibble: 9,313 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 623 627 -4 933 932
## 4 2013 1 1 728 732 -4 1041 1038
## 5 2013 1 1 739 739 0 1104 1038
## 6 2013 1 1 908 908 0 1228 1219
## 7 2013 1 1 1028 1026 2 1350 1339
## 8 2013 1 1 1044 1045 -1 1352 1351
## 9 2013 1 1 1114 900 134 1447 1222
## 10 2013 1 1 1205 1200 5 1503 1505
## # … with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, carrier %in% c('UA','AA','DL'))
## # A tibble: 139,504 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 -6 812 837
## 5 2013 1 1 554 558 -4 740 728
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 600 -1 941 910
## 10 2013 1 1 559 600 -1 854 902
## # … with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, month %in% c(7, 8, 9))
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # … with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, arr_delay > 120, dep_delay <= 0)
## # A tibble: 29 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 27 1419 1420 -1 1754 1550
## 2 2013 10 7 1350 1350 0 1736 1526
## 3 2013 10 7 1357 1359 -2 1858 1654
## 4 2013 10 16 657 700 -3 1258 1056
## 5 2013 11 1 658 700 -2 1329 1015
## 6 2013 3 18 1844 1847 -3 39 2219
## 7 2013 4 17 1635 1640 -5 2049 1845
## 8 2013 4 18 558 600 -2 1149 850
## 9 2013 4 18 655 700 -5 1213 950
## 10 2013 5 22 1827 1830 -3 2217 2010
## # … with 19 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
## # A tibble: 1,844 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 2205 1720 285 46 2040
## 2 2013 1 1 2326 2130 116 131 18
## 3 2013 1 3 1503 1221 162 1803 1555
## 4 2013 1 3 1839 1700 99 2056 1950
## 5 2013 1 3 1850 1745 65 2148 2120
## 6 2013 1 3 1941 1759 102 2246 2139
## 7 2013 1 3 1950 1845 65 2228 2227
## 8 2013 1 3 2015 1915 60 2135 2111
## 9 2013 1 3 2257 2000 177 45 2224
## 10 2013 1 4 1917 1700 137 2135 1950
## # … with 1,834 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dep_time <= 600 | dep_time == 2400)
## # A tibble: 9,373 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
between() from dplyr checks whether a numeric value falls within a given range. It takes a lower and an upper bound, both inclusive, so between(x, left, right) is shorthand for x >= left & x <= right.
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # … with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, !between(dep_time, 601, 2359))
## # A tibble: 9,373 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
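As a quick check of how between() treats its bounds, the sketch below compares it with the explicit comparison it abbreviates. The vector x is made up for illustration and is not taken from flights.

```r
library(dplyr)

# Hypothetical month values, including a missing one
x <- c(6, 7, 8, 9, 10, NA)

# between() is inclusive on both ends
in_summer <- between(x, 7, 9)

# The explicit form gives the same answer, including NA for the missing value
explicit <- x >= 7 & x <= 9

in_summer
identical(in_summer, explicit)
```

Because both bounds are included, !between(dep_time, 601, 2359) keeps 600 and 2400 but drops 601.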
8255 flights have a missing dep_time, 8255 have a missing dep_delay, 8713 have a missing arr_time, 9430 have a missing arr_delay, and 9430 have a missing air_time. Flights with a missing dep_time most likely never departed (cancelled flights), and flights missing only the arrival fields most likely departed but did not arrive as scheduled. It could also be an issue of lost flight data.
summary(flights)
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :29.00 Median :2013-07-03 10:00:00
## Mean :26.23 Mean :2013-07-03 05:22:54
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :59.00 Max. :2013-12-31 23:00:00
##
NA ^ 0 equals 1 because any number raised to the power of 0 is 1, so the result is the same whatever unknown value NA stands for.
NA | TRUE equals TRUE because the | operator returns TRUE if either term is TRUE. Here the right side is TRUE, so the answer is TRUE regardless of the missing value.
FALSE & NA equals FALSE because & returns TRUE only when both terms are TRUE. Here the left side is FALSE, so the answer is FALSE regardless of the missing value.
NA * 0 equals NA because the missing value could be any number, finite or infinite. If the number is finite, the result of the multiplication is 0. However, if the number is Inf or -Inf, the result is NaN. We don't know whether the multiplication gives 0 or NaN, so the result is NA.
NA ^ 0
## [1] 1
NA | TRUE
## [1] TRUE
FALSE & NA
## [1] FALSE
NA*0
## [1] NA
Inf*0
## [1] NaN
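The rule behind these results can be sketched as: an expression involving NA is only known when the answer would be the same for every possible value of the missing term. The named vector is_unknown below is introduced just for this illustration.

```r
# For each expression, check whether R reports the result as unknown (NA)
is_unknown <- c(
  "NA ^ 0"     = is.na(NA ^ 0),      # always 1
  "NA | TRUE"  = is.na(NA | TRUE),   # always TRUE
  "FALSE & NA" = is.na(FALSE & NA),  # always FALSE
  "NA | FALSE" = is.na(NA | FALSE),  # depends on the missing value
  "TRUE & NA"  = is.na(TRUE & NA),   # depends on the missing value
  "NA * 0"     = is.na(NA * 0)       # 0 or NaN, so unknown
)
is_unknown
```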
Write up your solutions to the exercises in 5.3.1 in this document, including the code chunks you use to determine the answer.
df <- tibble(x = c(5, 2, NA))
arrange(df, desc(is.na(x)))
## # A tibble: 3 × 1
## x
## <dbl>
## 1 NA
## 2 5
## 3 2
arrange(df, -(is.na(x)))
## # A tibble: 3 × 1
## x
## <dbl>
## 1 NA
## 2 5
## 3 2
#2. Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, dep_delay)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
  mutate(travel_time = ifelse(arr_time - dep_time < 0,
                              2400 + (arr_time - dep_time),
                              arr_time - dep_time)) %>%
  arrange(travel_time) %>%
  select(arr_time, dep_time, travel_time)
## # A tibble: 336,776 × 3
## arr_time dep_time travel_time
## <int> <int> <dbl>
## 1 1358 1323 35
## 2 1347 1312 35
## 3 1238 1203 35
## 4 758 722 36
## 5 758 722 36
## 6 754 718 36
## 7 1455 1418 37
## 8 53 16 37
## 9 754 717 37
## 10 1353 1315 38
## # … with 336,766 more rows
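One caveat on the result above: arr_time and dep_time are clock times in HHMM form, so their difference only equals elapsed minutes when both times fall in the same hour. Another way to read "fastest" is highest average speed, which could be sketched from distance (miles) and air_time (minutes). The four rows below are copied from the str(flights) output in this document; sample_flights and speed_mph are names made up for this example.

```r
library(dplyr)

# First four flights, values copied from the str(flights) output
sample_flights <- tibble(
  dest     = c("IAH", "IAH", "MIA", "BQN"),
  distance = c(1400, 1416, 1089, 1576),   # miles
  air_time = c(227, 227, 160, 183)        # minutes
)

# Average speed in miles per hour, fastest first
fastest <- sample_flights %>%
  mutate(speed_mph = distance / air_time * 60) %>%
  arrange(desc(speed_mph))
fastest
```

On the full data the same pipeline would start from flights instead of sample_flights.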
arrange(flights, desc(distance)) %>% select(1:5, distance)
## # A tibble: 336,776 × 6
## year month day dep_time sched_dep_time distance
## <int> <int> <int> <int> <int> <dbl>
## 1 2013 1 1 857 900 4983
## 2 2013 1 2 909 900 4983
## 3 2013 1 3 914 900 4983
## 4 2013 1 4 900 900 4983
## 5 2013 1 5 858 900 4983
## 6 2013 1 6 1019 900 4983
## 7 2013 1 7 1042 900 4983
## 8 2013 1 8 901 900 4983
## 9 2013 1 9 641 900 4983
## 10 2013 1 10 859 900 4983
## # … with 336,766 more rows
arrange(flights, distance) %>% select(1:5, distance)
## # A tibble: 336,776 × 6
## year month day dep_time sched_dep_time distance
## <int> <int> <int> <int> <int> <dbl>
## 1 2013 7 27 NA 106 17
## 2 2013 1 3 2127 2129 80
## 3 2013 1 4 1240 1200 80
## 4 2013 1 4 1829 1615 80
## 5 2013 1 4 2128 2129 80
## 6 2013 1 5 1155 1200 80
## 7 2013 1 6 2125 2129 80
## 8 2013 1 7 2124 2129 80
## 9 2013 1 8 2127 2130 80
## 10 2013 1 9 2126 2129 80
## # … with 336,766 more rows
Write up your solutions to the exercises in 5.4.1 in this document, including the code chunks you use to determine the answer.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
select(flights, c(dep_time, dep_delay, arr_time, arr_delay))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
flights %>% select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If a variable name is included more than once in a select() call, the column appears only once in the result.
flights %>% select(arr_delay, arr_delay, arr_delay)
## # A tibble: 336,776 × 1
## arr_delay
## <dbl>
## 1 11
## 2 20
## 3 33
## 4 -18
## 5 -25
## 6 12
## 7 19
## 8 -14
## 9 -8
## 10 8
## # … with 336,766 more rows
one_of() returns all the variables named in a character vector, so a vector like vars can be stored once and passed to select() as one_of(vars).
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
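A minimal sketch of selecting with a character vector follows. It uses any_of(), the newer tidyselect helper that, like one_of(), accepts a character vector and tolerates names that are not in the data. The tibble toy is a hypothetical stand-in for flights.

```r
library(dplyr)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# Hypothetical data with only some of the columns named in vars
toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2, carrier = "UA")

# any_of() keeps the names in vars that exist in toy and ignores the rest
picked <- select(toy, any_of(vars))
names(picked)
```

With flights itself, select(flights, any_of(vars)) would return all five columns, since they all exist there.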
contains() ignores case by default, so contains("TIME") selected every variable whose name contains "time". The default can be changed by setting ignore.case = FALSE, in which case no variable names match.
select(flights, contains("TIME"))
## # A tibble: 336,776 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # … with 336,766 more rows
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 × 0
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Write up your solutions to the exercises in 5.5.2 in this document, including the code chunks you use to determine the answer.
mutate(flights,
dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100),
sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <dbl> <dbl> <dbl> <int> <int>
## 1 2013 1 1 317 315 2 830 819
## 2 2013 1 1 333 329 4 850 830
## 3 2013 1 1 342 340 2 923 850
## 4 2013 1 1 344 345 -1 1004 1022
## 5 2013 1 1 354 360 -6 812 837
## 6 2013 1 1 354 358 -4 740 728
## 7 2013 1 1 355 360 -5 913 854
## 8 2013 1 1 357 360 -3 709 723
## 9 2013 1 1 357 360 -3 838 846
## 10 2013 1 1 358 360 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
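The same conversion can be sketched as a small reusable function. hhmm_to_minutes is a hypothetical helper name, and the extra %% 1440 is an addition to the code above: it maps a clock time of 2400 (midnight, which appears as the maximum dep_time in the summary) to 0 instead of 1440.

```r
# Convert clock times in HHMM form to minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  ((hhmm %/% 100) * 60 + hhmm %% 100) %% 1440
}

# 5:17 am, 6:00 am, 11:59 pm, and midnight written as 2400
hhmm_to_minutes(c(517, 600, 2359, 2400))
```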
In the original flights dataset, air_time is in minutes, but arr_time and dep_time are clock times (HHMM or HMM), so they must be converted before they can be compared.
For flights that depart before midnight and arrive after it, arr_time - dep_time is a large negative number, which is why the code below takes the difference modulo 60*24.
Even after conversion, arr_time - dep_time differs substantially from air_time, likely because the clock times are local to each airport's time zone and include time spent on the ground.
flights %>%
mutate(dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100),
sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100),
arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
sched_arr_time = (sched_arr_time %/% 100) * 60 + (sched_arr_time %% 100)) %>%
transmute((arr_time - dep_time) %% (60*24) - air_time)
## # A tibble: 336,776 × 1
## `(arr_time - dep_time)%%(60 * 24) - air_time`
## <dbl>
## 1 -34
## 2 -30
## 3 61
## 4 77
## 5 22
## 6 -44
## 7 40
## 8 19
## 9 21
## 10 -23
## # … with 336,766 more rows
Both approaches below return exactly 10 rows, so there were no ties at the cutoff among the 10 most delayed departures. The min_rank() function does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). By default the smallest values get the smallest ranks; use desc(x) to give the largest values the smallest ranks. If there had been a tie for 10th place, filtering on min_rank() would have returned more than 10 rows. The arrange() function could also be used to break a tie. Instead of selecting rows, it changes their order by a set of column names (or more complicated expressions), and each additional column is used to break ties in the values of the preceding columns.
filter(flights, min_rank(desc(dep_delay)) <= 10)
## # A tibble: 10 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 1 10 1121 1635 1126 1239 1810
## 3 2013 12 5 756 1700 896 1058 2020
## 4 2013 3 17 2321 810 911 135 1020
## 5 2013 4 10 1100 1900 960 1342 2211
## 6 2013 6 15 1432 1935 1137 1607 2120
## 7 2013 6 27 959 1900 899 1236 2226
## 8 2013 7 22 845 1600 1005 1044 1815
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 9 20 1139 1845 1014 1457 2210
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% top_n(n = 10, wt = dep_delay)
## # A tibble: 10 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 1 10 1121 1635 1126 1239 1810
## 3 2013 12 5 756 1700 896 1058 2020
## 4 2013 3 17 2321 810 911 135 1020
## 5 2013 4 10 1100 1900 960 1342 2211
## 6 2013 6 15 1432 1935 1137 1607 2120
## 7 2013 6 27 959 1900 899 1236 2226
## 8 2013 7 22 845 1600 1005 1044 1815
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 9 20 1139 1845 1014 1457 2210
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
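To see what a tie would do, here is a small sketch with made-up delays (the vector delays is hypothetical, not from flights).

```r
library(dplyr)

# Hypothetical delays with a tie for second place
delays <- c(90, 120, 120, 45, 300)

# Rank 1 is the largest delay; tied values share the smaller rank
ranks <- min_rank(desc(delays))
ranks

# Asking for the top 2 returns three values because of the tie
top_two <- delays[ranks <= 2]
top_two
```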
1:3 + 1:10 produces a length 10 vector and a warning message. This is because the shorter vector is recycled out to the length of the longer one. However, since 10 is not a multiple of 3, the recycling stops partway through the shorter vector, so R issues a warning (not an error).
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 5 7 9 8 10 12 11
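The recycling can be written out by hand to confirm where each value comes from. This is only a sketch; even, recycled, and manual are names chosen for the example.

```r
# Lengths divide evenly: 1:2 is recycled five times, with no warning
even <- 1:2 + 1:10

# Writing out the recycling behind 1:3 + 1:10 explicitly
recycled <- rep(1:3, length.out = 10)   # 1 2 3 1 2 3 1 2 3 1
manual <- recycled + 1:10
manual
```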
These are R's standard trigonometric functions, with angles measured in radians. They respectively compute the cosine, sine, tangent, arc-cosine, arc-sine, arc-tangent, and the two-argument arc-tangent.
cospi(x), sinpi(x), and tanpi(x) compute cos(pi*x), sin(pi*x), and tan(pi*x).
Usage:
cos(x) sin(x) tan(x)
acos(x) asin(x) atan(x) atan2(y, x)
cospi(x) sinpi(x) tanpi(x)
?Trig
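A few calls illustrate the points above. This is just a sketch; the variable names are made up.

```r
# Angles are in radians, so a right angle is pi / 2
right_angle_sine <- sin(pi / 2)

# sin(pi) is not exactly zero in floating point, but sinpi(1) is
approx_zero <- sin(pi)
exact_zero <- sinpi(1)
c(approx_zero, exact_zero)

# atan2(y, x) gives the angle of the point (x, y) from the positive x-axis
angle_degrees <- atan2(1, 1) * 180 / pi
angle_degrees
```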
Write up your solutions to the exercises in 5.6.7 in this document, including the code chunks you use to determine the answer.
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flight_delay_summary <- flights %>%
  group_by(flight) %>%
  summarise(
    num_flights = n(),
    percentage_on_time = sum(arr_time == sched_arr_time) / num_flights,
    percentage_early = sum(arr_time < sched_arr_time) / num_flights,
    percentage_15_mins_early = sum(sched_arr_time - arr_time == 15) / num_flights,
    percentage_30_mins_early = sum(sched_arr_time - arr_time == 30) / num_flights,
    percentage_late = sum(arr_time > sched_arr_time) / num_flights,
    percentage_10_mins_late = sum(arr_time - sched_arr_time == 10) / num_flights,
    percentage_15_mins_late = sum(arr_time - sched_arr_time == 15) / num_flights,
    percentage_30_mins_late = sum(arr_time - sched_arr_time == 30) / num_flights,
    percentage_2_hours_late = sum(arr_time - sched_arr_time == 120) / num_flights
  )
flight_delay_summary
## # A tibble: 3,844 × 11
## flight num_flights percentage_on_time percentage_early percentage_15_mins_ea…
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 701 NA NA NA
## 2 2 51 0.0392 0.725 0.0392
## 3 3 631 NA NA NA
## 4 4 393 NA NA NA
## 5 5 324 0.00617 0.716 0.00926
## 6 6 210 NA NA NA
## 7 7 237 NA NA NA
## 8 8 236 NA NA NA
## 9 9 153 NA NA NA
## 10 10 61 0.0164 0.721 0.0164
## # … with 3,834 more rows, and 6 more variables: percentage_30_mins_early <dbl>,
## # percentage_late <dbl>, percentage_10_mins_late <dbl>,
## # percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## # percentage_2_hours_late <dbl>
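Many rows in the summary above are NA because sum() returns NA whenever a flight number has a missing arr_time. Differences of HHMM clock times are also not always minutes. One possible alternative, sketched here on made-up data, works from arr_delay (already in minutes) and takes the mean of a logical vector with na.rm = TRUE. The tibble toy and the prop_* column names are illustrative only and are not part of the solution above.

```r
library(dplyr)

# Hypothetical arrival delays (minutes) for two flight numbers
toy <- tibble(
  flight    = c(1, 1, 1, 1, 2, 2),
  arr_delay = c(-15, 15, NA, 0, 10, 10)
)

# Proportion of non-missing arrivals matching each condition
delay_summary <- toy %>%
  group_by(flight) %>%
  summarise(
    num_flights       = n(),
    prop_on_time      = mean(arr_delay == 0, na.rm = TRUE),
    prop_15_min_early = mean(arr_delay == -15, na.rm = TRUE),
    prop_10_min_late  = mean(arr_delay == 10, na.rm = TRUE)
  )
delay_summary
```

Note that with na.rm = TRUE the proportions are out of flights with a recorded arrival delay, not out of num_flights.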
A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
filter(flight_delay_summary, percentage_15_mins_early == 0.5 & percentage_15_mins_late == 0.5)
## # A tibble: 0 × 11
## # … with 11 variables: flight <int>, num_flights <int>,
## # percentage_on_time <dbl>, percentage_early <dbl>,
## # percentage_15_mins_early <dbl>, percentage_30_mins_early <dbl>,
## # percentage_late <dbl>, percentage_10_mins_late <dbl>,
## # percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## # percentage_2_hours_late <dbl>
A flight is always 10 minutes late.
filter(flight_delay_summary, percentage_10_mins_late == 1.00)
## # A tibble: 3 × 11
## flight num_flights percentage_on_time percentage_early percentage_15_mins_ear…
## <int> <int> <dbl> <dbl> <dbl>
## 1 2254 1 0 0 0
## 2 3880 1 0 0 0
## 3 5854 1 0 0 0
## # … with 6 more variables: percentage_30_mins_early <dbl>,
## # percentage_late <dbl>, percentage_10_mins_late <dbl>,
## # percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## # percentage_2_hours_late <dbl>
A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
filter(flight_delay_summary, percentage_30_mins_early == 0.5 & percentage_30_mins_late == 0.5)
## # A tibble: 0 × 11
## # … with 11 variables: flight <int>, num_flights <int>,
## # percentage_on_time <dbl>, percentage_early <dbl>,
## # percentage_15_mins_early <dbl>, percentage_30_mins_early <dbl>,
## # percentage_late <dbl>, percentage_10_mins_late <dbl>,
## # percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## # percentage_2_hours_late <dbl>
99% of the time a flight is on time. 1% of the time it’s 2 hours late.
filter(flight_delay_summary, percentage_on_time == 0.99 & percentage_2_hours_late == 0.01)
## # A tibble: 0 × 11
## # … with 11 variables: flight <int>, num_flights <int>,
## # percentage_on_time <dbl>, percentage_early <dbl>,
## # percentage_15_mins_early <dbl>, percentage_30_mins_early <dbl>,
## # percentage_late <dbl>, percentage_10_mins_late <dbl>,
## # percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## # percentage_2_hours_late <dbl>
Which is more important: arrival delay or departure delay?
Unfortunately, the data alone does not tell us whether arrival delay or departure delay is more important. It depends on how the data will be used: for example, an airline might be using it for accounting purposes, to improve routes, or to improve the service it provides to its customers.
not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(dest) %>%
tally()
## # A tibble: 104 × 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # … with 94 more rows
not_cancelled %>%
group_by(tailnum) %>%
summarise(n = sum(distance))
## # A tibble: 4,037 × 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # … with 4,027 more rows
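The departure column (dep_delay) is the most important, because a flight cannot arrive unless it has departed. The counts below show 1175 flights that departed but have no recorded arr_delay, so a missing arr_delay alone does not mean a flight was cancelled.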
flights %>%
group_by(departed = !is.na(dep_delay), arrived = !is.na(arr_delay)) %>%
summarise(n=n())
## `summarise()` has grouped output by 'departed'. You can override using the `.groups` argument.
## # A tibble: 3 × 3
## # Groups: departed [2]
## departed arrived n
## <lgl> <lgl> <int>
## 1 FALSE FALSE 8255
## 2 TRUE FALSE 1175
## 3 TRUE TRUE 327346
Among flights that arrived late (arr_delay > 0), the carrier with the worst delays is SkyWest Airlines Inc. (OO), with an average arrival delay of 60.6 minutes.
It is difficult to separate the effects of bad airports from bad carriers. In this dataset there are 16 carriers, 3 origin airports, and 105 destination airports. Many airports act as hubs, so some destinations are served by only one or two carriers, which makes it hard to tell how much of a delay is due to the carrier and how much is due to the airport. We also have to consider route delays, weather, and technical problems with the plane, which could also result in delays.
flights %>%
filter(arr_delay > 0) %>%
group_by(carrier) %>%
summarise(average_arr_delay = mean(arr_delay, na.rm=TRUE)) %>%
arrange(desc(average_arr_delay))
## # A tibble: 16 × 2
## carrier average_arr_delay
## <chr> <dbl>
## 1 OO 60.6
## 2 YV 51.1
## 3 9E 49.3
## 4 EV 48.3
## 5 F9 47.6
## 6 VX 43.8
## 7 FL 41.1
## 8 WN 40.7
## 9 B6 40.0
## 10 AA 38.3
## 11 MQ 37.9
## 12 DL 37.7
## 13 UA 36.7
## 14 HA 35.0
## 15 AS 34.4
## 16 US 29.0
flights %>%
summarise(n_distinct(carrier),
n_distinct(origin),
n_distinct(dest))
## # A tibble: 1 × 3
## `n_distinct(carrier)` `n_distinct(origin)` `n_distinct(dest)`
## <int> <int> <int>
## 1 16 3 105
flights %>%
mutate(dep_date = lubridate::make_datetime(year, month, day)) %>%
group_by(tailnum) %>%
arrange(dep_date) %>%
filter(!cumany(arr_delay>60)) %>%
tally(sort = TRUE)
## # A tibble: 3,748 × 2
## tailnum n
## <chr> <int>
## 1 N705TW 97
## 2 N765US 97
## 3 N12125 94
## 4 N320AA 94
## 5 N13110 91
## 6 N3763D 82
## 7 N58101 82
## 8 N17122 81
## 9 N961UW 80
## 10 N950UW 79
## # … with 3,738 more rows
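The idea behind !cumany(arr_delay > 60) can be sketched in base R on a single made-up plane. For a vector with no missing values, cumsum(...) == 0 marks the same flights. The delays below are hypothetical.

```r
# Hypothetical arrival delays (minutes) for one plane, in date order
arr_delay <- c(5, -3, 40, 75, 10, 90)

# TRUE until the first delay above 60 minutes, FALSE from then on
before_first_long_delay <- cumsum(arr_delay > 60) == 0
before_first_long_delay

# Number of flights flown before that first long delay
n_before <- sum(before_first_long_delay)
n_before
```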
You will turn in your work by first posting it to RPubs. You will probably need to create an account; accounts are free.
The work will be due by midnight on November 21.
You will get another project starting next week which will be due after Thanksgiving.