Code
# Loading package(s) and data
library(tidyverse)
library(nycflights13)
library(methods)
data("flights")Data Science 1 with R (STAT 301-1)
# Loading package(s) and data
library(tidyverse)
library(nycflights13)
library(methods)
data("flights")Why does this code not work?
my_variable <- 10
my_varıable
#> Error: object 'my_varıable' not foundThe two names are not the same. In the second “my_variable”, the “i” has been written as “1” due to a typo, and the smallest of typos will make the code not work. Paying very close attention to the exact writing of your code is critical, because it needs to be 100% accurate for it to run.
Tweak each of the following R commands so that they run correctly:
# Command 1
ggplot(dota = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# Command 2
fliter(mpg, cyl = 8)
# Command 3
filter(diamond, carat > 3)# Command 1
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))# Command 2
filter(mpg, cyl == 8)# A tibble: 70 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
2 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
3 chevrolet c1500 sub… 5.3 2008 8 auto… r 11 15 e suv
4 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
5 chevrolet c1500 sub… 5.7 1999 8 auto… r 13 17 r suv
6 chevrolet c1500 sub… 6 2008 8 auto… r 12 17 r suv
7 chevrolet corvette 5.7 1999 8 manu… r 16 26 p 2sea…
8 chevrolet corvette 5.7 1999 8 auto… r 15 23 p 2sea…
9 chevrolet corvette 6.2 2008 8 manu… r 16 26 p 2sea…
10 chevrolet corvette 6.2 2008 8 auto… r 15 25 p 2sea…
# … with 60 more rows
# Command 3
filter(diamonds, carat > 3)# A tibble: 32 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67
2 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
3 3.01 Premium F I1 62.2 56 9925 9.24 9.13 5.73
4 3.05 Premium E I1 60.9 58 10453 9.26 9.25 5.66
5 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
6 3.01 Fair H I1 56.1 62 10761 9.54 9.38 5.31
7 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
8 3.24 Premium H I1 62.1 58 12300 9.44 9.4 5.85
9 3.22 Ideal I I1 62.6 55 12545 9.49 9.42 5.92
10 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
# … with 22 more rows
For command #1, data is accidentally spelled as “dota”. For command #2, filter is spelled incorrectly, meaning that RStudio cannot infer what we are asking it to do. For command #3, the dataset should be labeled as “diamonds”, not “diamond”.
Find all flights that:
IAH or HOU)# a
flights %>%
filter(arr_delay >= 2) %>%
select(arr_delay, everything())# A tibble: 127,929 × 19
arr_delay year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ carrier
<dbl> <int> <int> <int> <int> <int> <dbl> <int> <int> <chr>
1 11 2013 1 1 517 515 2 830 819 UA
2 20 2013 1 1 533 529 4 850 830 UA
3 33 2013 1 1 542 540 2 923 850 AA
4 12 2013 1 1 554 558 -4 740 728 UA
5 19 2013 1 1 555 600 -5 913 854 B6
6 8 2013 1 1 558 600 -2 753 745 AA
7 7 2013 1 1 558 600 -2 924 917 UA
8 31 2013 1 1 559 600 -1 941 910 AA
9 12 2013 1 1 600 600 0 837 825 MQ
10 16 2013 1 1 602 605 -3 821 805 MQ
# … with 127,919 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
# b
flights %>%
filter(dest %in% c("HOU", "IAH")) %>%
select(dest, origin, everything())# A tibble: 9,313 × 19
dest origin year month day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴
<chr> <chr> <int> <int> <int> <int> <int> <dbl> <int> <int>
1 IAH EWR 2013 1 1 517 515 2 830 819
2 IAH LGA 2013 1 1 533 529 4 850 830
3 IAH LGA 2013 1 1 623 627 -4 933 932
4 IAH LGA 2013 1 1 728 732 -4 1041 1038
5 IAH EWR 2013 1 1 739 739 0 1104 1038
6 IAH EWR 2013 1 1 908 908 0 1228 1219
7 IAH LGA 2013 1 1 1028 1026 2 1350 1339
8 IAH EWR 2013 1 1 1044 1045 -1 1352 1351
9 IAH LGA 2013 1 1 1114 900 134 1447 1222
10 IAH EWR 2013 1 1 1205 1200 5 1503 1505
# … with 9,303 more rows, 9 more variables: arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
# c
flights %>%
filter(carrier %in% c("UA", "AA" , "DL")) %>%
select(carrier, everything())# A tibble: 139,504 × 19
carrier year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
<chr> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 UA 2013 1 1 517 515 2 830 819 11
2 UA 2013 1 1 533 529 4 850 830 20
3 AA 2013 1 1 542 540 2 923 850 33
4 DL 2013 1 1 554 600 -6 812 837 -25
5 UA 2013 1 1 554 558 -4 740 728 12
6 AA 2013 1 1 558 600 -2 753 745 8
7 UA 2013 1 1 558 600 -2 924 917 7
8 UA 2013 1 1 558 600 -2 923 937 -14
9 AA 2013 1 1 559 600 -1 941 910 31
10 UA 2013 1 1 559 600 -1 854 902 -8
# … with 139,494 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# d
flights %>%
filter(month >= 7, month <= 9)%>%
select(month, everything())# A tibble: 86,326 × 19
month year day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 7 2013 1 1 2029 212 236 2359 157 B6
2 7 2013 1 2 2359 3 344 344 0 B6
3 7 2013 1 29 2245 104 151 1 110 B6
4 7 2013 1 43 2130 193 322 14 188 B6
5 7 2013 1 44 2150 174 300 100 120 AA
6 7 2013 1 46 2051 235 304 2358 186 B6
7 7 2013 1 48 2001 287 308 2305 243 VX
8 7 2013 1 58 2155 183 335 43 172 B6
9 7 2013 1 100 2146 194 327 30 177 B6
10 7 2013 1 100 2245 135 337 135 122 B6
# … with 86,316 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# e
flights %>%
filter(arr_delay >=2, dep_delay <= 0) %>%
select(arr_delay, dep_delay, everything())# A tibble: 37,527 × 19
arr_delay dep_delay year month day dep_t…¹ sched…² arr_t…³ sched…⁴ carrier
<dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <chr>
1 12 -4 2013 1 1 554 558 740 728 UA
2 19 -5 2013 1 1 555 600 913 854 B6
3 8 -2 2013 1 1 558 600 753 745 AA
4 7 -2 2013 1 1 558 600 924 917 UA
5 31 -1 2013 1 1 559 600 941 910 AA
6 12 0 2013 1 1 600 600 837 825 MQ
7 16 -3 2013 1 1 602 605 821 805 MQ
8 3 -8 2013 1 1 622 630 1017 1014 US
9 29 -6 2013 1 1 624 630 909 840 EV
10 10 -6 2013 1 1 624 630 840 830 MQ
# … with 37,517 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names ¹dep_time,
# ²sched_dep_time, ³arr_time, ⁴sched_arr_time
# f
flights %>%
filter(dep_delay >= 1, arr_delay <= -30) %>%
select(dep_delay, arr_delay, everything())# A tibble: 2,059 × 19
dep_delay arr_delay year month day dep_t…¹ sched…² arr_t…³ sched…⁴ carrier
<dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <chr>
1 1 -31 2013 1 1 701 700 1123 1154 UA
2 5 -44 2013 1 2 605 600 851 935 UA
3 1 -31 2013 1 2 647 646 738 809 UA
4 3 -36 2013 1 2 854 851 1146 1222 UA
5 1 -44 2013 1 2 1048 1047 1321 1405 UA
6 1 -41 2013 1 2 1201 1200 1504 1545 VX
7 1 -41 2013 1 2 1811 1810 2051 2132 B6
8 17 -33 2013 1 2 2002 1945 2256 2329 B6
9 1 -36 2013 1 2 2036 2035 2137 2213 9E
10 7 -30 2013 1 2 2047 2040 2329 2359 B6
# … with 2,049 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names ¹dep_time,
# ²sched_dep_time, ³arr_time, ⁴sched_arr_time
# g
flights %>%
filter(dep_time >= 6, dep_time <= 12) %>%
select(dep_time, everything())# A tibble: 167 × 19
dep_time year month day sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 8 2013 1 9 2359 9 432 437 -5 B6
2 11 2013 1 11 2359 12 436 444 -8 B6
3 10 2013 1 13 2135 155 305 36 149 B6
4 10 2013 1 28 2359 11 454 437 17 B6
5 7 2013 1 31 2359 8 453 437 16 B6
6 12 2013 1 31 2250 82 132 7 85 B6
7 7 2013 10 4 2359 8 350 350 0 B6
8 6 2013 10 7 2159 127 57 2306 111 EV
9 9 2013 10 7 2030 219 109 2205 184 WN
10 10 2013 10 7 2137 153 318 27 171 B6
# … with 157 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
A useful dplyr filtering helper is between(). Rewrite your code to Exercise 3 (d) using between()? If you did use it, then indicate that you did.
flights %>%
filter(month == between(3,7,9)) %>%
select(month, everything())# A tibble: 0 × 19
# … with 19 variables: month <int>, year <int>, day <int>, dep_time <int>,
# sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
colSums(is.na(flights)) year month day dep_time sched_dep_time
0 0 0 8255 0
dep_delay arr_time sched_arr_time arr_delay carrier
8255 8713 0 9430 0
flight tailnum origin dest air_time
0 2512 0 0 9430
distance hour minute time_hour
0 0 0 0
8255 flights have a missing departure time. Meanwhile, five other variables also have missing values (arr_time, tailnum, arr_delay, air_time, dep_delay). Data is often not clean in terms of every value being there, and,there could have been some sort of inputing error here. Most likely, though, the N/As signal flights that were cancelled, and, thus, didn’t have data for those specific variables.
How could you use arrange() to sort the dataset so that flights missing arr_delay are at the start/top of the tibble/table? Hint: Use is.na()
flights %>%
arrange(desc(is.na(arr_delay))) %>%
select(arr_delay, everything())# A tibble: 336,776 × 19
arr_delay year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ carrier
<dbl> <int> <int> <int> <int> <int> <dbl> <int> <int> <chr>
1 NA 2013 1 1 1525 1530 -5 1934 1805 MQ
2 NA 2013 1 1 1528 1459 29 2002 1647 EV
3 NA 2013 1 1 1740 1745 -5 2158 2020 MQ
4 NA 2013 1 1 1807 1738 29 2251 2103 UA
5 NA 2013 1 1 1939 1840 59 29 2151 9E
6 NA 2013 1 1 1952 1930 22 2358 2207 EV
7 NA 2013 1 1 2016 1930 46 NA 2220 EV
8 NA 2013 1 1 NA 1630 NA NA 1815 EV
9 NA 2013 1 1 NA 1935 NA NA 2240 AA
10 NA 2013 1 1 NA 1500 NA NA 1825 AA
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
Find the flights that left earliest. Name the flight at top of the list (e.g. New York City JFK to Chicago ORD).
flights %>%
select(origin, dest, dep_delay) %>%
arrange(dep_delay)# A tibble: 336,776 × 3
origin dest dep_delay
<chr> <chr> <dbl>
1 JFK DEN -43
2 LGA MSY -33
3 LGA IAD -32
4 LGA TPA -30
5 LGA DEN -27
6 LGA DTW -26
7 EWR TYS -25
8 LGA DTW -25
9 JFK BUF -24
10 LGA FLL -24
# … with 336,766 more rows
The flight that left the earliest was New York City JFK to Denver, which led 43 minutes before expected.
Sort flights to find the fastest flights. Name the flight at top of the list (e.g. New York City JFK to Chicago ORD).
flights %>%
select(origin, dest, air_time) %>%
arrange(air_time)# A tibble: 336,776 × 3
origin dest air_time
<chr> <chr> <dbl>
1 EWR BDL 20
2 EWR BDL 20
3 EWR BDL 21
4 EWR PHL 21
5 EWR BDL 21
6 EWR PHL 21
7 LGA BOS 21
8 JFK PHL 21
9 EWR BDL 21
10 EWR BDL 21
# … with 336,766 more rows
The fastest flight was Newark EWR to Hartford Conneticut BDL.
Which flights traveled the longest? Which traveled the shortest? Name the flight at top of the list (e.g. New York City JFK to Chicago ORD).
flights %>%
select(origin, dest, distance) %>%
arrange(desc(distance))# A tibble: 336,776 × 3
origin dest distance
<chr> <chr> <dbl>
1 JFK HNL 4983
2 JFK HNL 4983
3 JFK HNL 4983
4 JFK HNL 4983
5 JFK HNL 4983
6 JFK HNL 4983
7 JFK HNL 4983
8 JFK HNL 4983
9 JFK HNL 4983
10 JFK HNL 4983
# … with 336,766 more rows
flights %>%
select(origin, dest, distance) %>%
arrange(distance)# A tibble: 336,776 × 3
origin dest distance
<chr> <chr> <dbl>
1 EWR LGA 17
2 EWR PHL 80
3 EWR PHL 80
4 EWR PHL 80
5 EWR PHL 80
6 EWR PHL 80
7 EWR PHL 80
8 EWR PHL 80
9 EWR PHL 80
10 EWR PHL 80
# … with 336,766 more rows
The flight that traveled the longest was New York City JFK to Honolulu HNL. The flight that traveled the shortest distance was Newark EWR to Queens LGA
Brainstorm at least 3 ways to select dep_time, dep_delay, arr_time, and arr_delay from flights. Hint: Use helper functions
select(flights, dep_time, dep_delay, arr_time, arr_delay)# A tibble: 336,776 × 4
dep_time dep_delay arr_time arr_delay
<int> <dbl> <int> <dbl>
1 517 2 830 11
2 533 4 850 20
3 542 2 923 33
4 544 -1 1004 -18
5 554 -6 812 -25
6 554 -4 740 12
7 555 -5 913 19
8 557 -3 709 -14
9 557 -3 838 -8
10 558 -2 753 8
# … with 336,766 more rows
select(flights, starts_with("arr") | starts_with("dep"))# A tibble: 336,776 × 4
arr_time arr_delay dep_time dep_delay
<int> <dbl> <int> <dbl>
1 830 11 517 2
2 850 20 533 4
3 923 33 542 2
4 1004 -18 544 -1
5 812 -25 554 -6
6 740 12 554 -4
7 913 19 555 -5
8 709 -14 557 -3
9 838 -8 557 -3
10 753 8 558 -2
# … with 336,766 more rows
Ex10data <- c("dep_time", "arr_time", "dep_delay", "arr_delay")
flights %>% select_at(all_of(Ex10data))# A tibble: 336,776 × 4
dep_time arr_time dep_delay arr_delay
<int> <int> <dbl> <dbl>
1 517 830 2 11
2 533 850 4 20
3 542 923 2 33
4 544 1004 -1 -18
5 554 812 -6 -25
6 554 740 -4 12
7 555 913 -5 19
8 557 709 -3 -14
9 557 838 -3 -8
10 558 753 -2 8
# … with 336,766 more rows
What does the any_of() function do? Why might it be helpful in conjunction with the vector below?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(any_of(vars))# A tibble: 336,776 × 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# … with 336,766 more rows
Whereas the all_of function requires all variables to be true, the any_of function will work if any of the variables are present for a specific event (a flight, in this case). In this case, the all_of function would not be able to include any flight with a missing variable, but the any_of function can.
Does the result of running the code below surprise you? Which default setting for contains() causes this to happen? What should happen if you run the code, but with the default setting changed?
select(flights, contains("TIME"))# A tibble: 336,776 × 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00
7 555 600 913 854 158 2013-01-01 06:00:00
8 557 600 709 723 53 2013-01-01 06:00:00
9 557 600 838 846 140 2013-01-01 06:00:00
10 558 600 753 745 138 2013-01-01 06:00:00
# … with 336,766 more rows
The result of running this code below includes any column that contains the word “time”, but does not adjust for the fact that “TIME” is in all caps; the default setting for contains() is to ignore case. To change this, we would write this code, which will bring back a tibble with zero columns, as there are no variables containing “TIME” in all uppercase letters.
select(flights, contains("TIME", ignore.case = FALSE))# A tibble: 336,776 × 0
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
flights %>%
transmute(dep_time, hour = dep_time %/% 100, minute = dep_time %% 100) %>%
transmute(dep_time =(hour * 60 + minute))# A tibble: 336,776 × 1
dep_time
<dbl>
1 317
2 333
3 342
4 344
5 354
6 354
7 355
8 357
9 357
10 358
# … with 336,766 more rows
flights %>%
transmute(sched_dep_time, hour = sched_dep_time %/% 100, minute = sched_dep_time %% 100) %>%
transmute(sched_dep_time =(hour * 60 + minute)) # A tibble: 336,776 × 1
sched_dep_time
<dbl>
1 315
2 329
3 340
4 345
5 360
6 358
7 360
8 360
9 360
10 360
# … with 336,766 more rows
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
flights %>%
transmute(air_time, hour = air_time %/% 100, minute = air_time %% 100) %>%
transmute(air_time =(hour * 60 + minute)) # A tibble: 336,776 × 1
air_time
<dbl>
1 147
2 147
3 120
4 143
5 76
6 110
7 118
8 53
9 100
10 98
# … with 336,766 more rows
flights %>%
mutate(gain = arr_time - dep_time) %>%
summarise(difference = gain - air_time, year, month, day, origin, dest)# A tibble: 336,776 × 6
difference year month day origin dest
<dbl> <int> <int> <int> <chr> <chr>
1 86 2013 1 1 EWR IAH
2 90 2013 1 1 LGA IAH
3 221 2013 1 1 JFK MIA
4 277 2013 1 1 JFK BQN
5 142 2013 1 1 LGA ATL
6 36 2013 1 1 EWR ORD
7 200 2013 1 1 EWR FLL
8 99 2013 1 1 LGA IAD
9 141 2013 1 1 JFK MCO
10 57 2013 1 1 LGA ORD
# … with 336,766 more rows
First, we need to convert air_time to minutes after midnight, or less the calculation would be severely wrong. What I would expect to see is a slight gap between air_time and “arr_time - dep_time”, as it’s not accounting for the time when on the ground and trying to get to your gate. Still, since some days started before mignight and ended the next day, the calculation is incorrect. To fix this, we may need to convert the variables into “minutes after midnight on January 1st”. Therefore, this would make changing days a non-factor.
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
flights %>%
mutate(difference = dep_time - sched_dep_time) %>%
summarise(dep_delay, difference)# A tibble: 336,776 × 2
dep_delay difference
<dbl> <int>
1 2 2
2 4 4
3 2 2
4 -1 -1
5 -6 -46
6 -4 -4
7 -5 -45
8 -3 -43
9 -3 -43
10 -2 -42
# … with 336,766 more rows
All three of these variables should be related. Departure time minus scheduled departure time should equal the departure delay. Meanwhile, departure time minus the departure delay would equal scheduled departure time, and scheduled departure time combined would equal the departure time.
Find the 10 most delayed flights using the min_rank() function.
flights %>%
summarise(delay = dep_delay, year, month, day, origin, dest) %>%
arrange(min_rank(desc(delay)))# A tibble: 336,776 × 6
delay year month day origin dest
<dbl> <int> <int> <int> <chr> <chr>
1 1301 2013 1 9 JFK HNL
2 1137 2013 6 15 JFK CMH
3 1126 2013 1 10 EWR ORD
4 1014 2013 9 20 JFK SFO
5 1005 2013 7 22 JFK CVG
6 960 2013 4 10 JFK TPA
7 911 2013 3 17 LGA MSP
8 899 2013 6 27 JFK PDX
9 898 2013 7 22 LGA ATL
10 896 2013 12 5 EWR MIA
# … with 336,766 more rows
The most delayed flight was New York City JFK to Honolulu HNL on January 9th, with a departure delay of 1301 minutes.
Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()). Dataset not_cancelled was created in this section of the book, but for your convenience the code is provided.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))not_cancelled %>%
group_by(dest) %>%
summarise(n())# A tibble: 104 × 2
dest `n()`
<chr> <int>
1 ABQ 254
2 ACK 264
3 ALB 418
4 ANC 8
5 ATL 16837
6 AUS 2411
7 AVL 261
8 BDL 412
9 BGR 358
10 BHM 269
# … with 94 more rows
not_cancelled %>%
group_by(tailnum) %>%
summarise(n= sum(distance))# A tibble: 4,037 × 2
tailnum n
<chr> <dbl>
1 D942DN 3418
2 N0EGMQ 239143
3 N10156 109664
4 N102UW 25722
5 N103US 24619
6 N104UW 24616
7 N10575 139903
8 N105UW 23618
9 N107US 21677
10 N108UW 32070
# … with 4,027 more rows
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
Ex18 <-
flights %>%
mutate(cancelled_flights = is.na(arr_delay) | is.na(dep_delay)) %>%
group_by(year, month, day) %>%
summarise(total_flights = n(), total_cancelled = sum(cancelled_flights), average_arr_delay = mean(arr_delay, na.rm = TRUE), prop_cancelled = (total_cancelled / total_flights))Ex18# A tibble: 365 × 7
# Groups: year, month [12]
year month day total_flights total_cancelled average_arr_delay prop_canc…¹
<int> <int> <int> <int> <int> <dbl> <dbl>
1 2013 1 1 842 11 12.7 0.0131
2 2013 1 2 943 15 12.7 0.0159
3 2013 1 3 914 14 5.73 0.0153
4 2013 1 4 915 7 -1.93 0.00765
5 2013 1 5 720 3 -1.53 0.00417
6 2013 1 6 832 3 4.24 0.00361
7 2013 1 7 933 3 -4.95 0.00322
8 2013 1 8 899 7 -3.23 0.00779
9 2013 1 9 902 9 -0.264 0.00998
10 2013 1 10 932 3 -5.90 0.00322
# … with 355 more rows, and abbreviated variable name ¹prop_cancelled
ggplot() +
geom_point(data = Ex18, mapping = aes(x = average_arr_delay, y = prop_cancelled))There does appear the be a correlation to the proportion of cancelled flights and the length of the average delay; days with longer delays are more likely to have more cancelled flights. There is a correlation between the total number of flights and the number of cancelled flights, though you would certainly expect that to be the case.
Which plane (tailnum) has the worst on-time record?
flights %>%
group_by(tailnum) %>%
filter(n() >= 10) %>%
summarise(arr_delay = mean(arr_delay)) %>%
arrange(desc(arr_delay))# A tibble: 3,432 × 2
tailnum arr_delay
<chr> <dbl>
1 N354AT 68.1
2 N337AT 66.5
3 N203FR 59.1
4 N176DN 46.2
5 N366AA 43.8
6 N184DN 43.6
7 N923FJ 43.1
8 N521VA 42.2
9 N353AT 41.2
10 N942AT 41.2
# … with 3,422 more rows
Among planes with at least 10 flights,plane number N354AT has the worst on-time record, with an average arrival delay of 68.1 minutes.
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(dest) %>%
mutate(sum_arr_delay= sum(arr_delay), prop_arr_delay = arr_delay / sum_arr_delay) %>%
select(year:day, origin, dest, carrier, arr_delay, sum_arr_delay, prop_arr_delay) %>%
filter(arr_delay > 0) %>%
arrange(desc(prop_arr_delay))# A tibble: 133,004 × 9
# Groups: dest [103]
year month day origin dest carrier arr_delay sum_arr_delay prop_arr_delay
<int> <int> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 3 30 EWR MTJ UA 101 25 4.04
2 2013 12 19 EWR SLC DL 847 432 1.96
3 2013 3 16 EWR HDN UA 43 30 1.43
4 2013 3 16 EWR MTJ UA 31 25 1.24
5 2013 12 21 EWR HDN UA 32 30 1.07
6 2013 12 21 EWR MTJ UA 24 25 0.96
7 2013 3 3 JFK SLC DL 387 432 0.896
8 2013 3 12 JFK SLC B6 356 432 0.824
9 2013 11 22 LGA SBN EV 53 65 0.815
10 2013 1 2 EWR MYR EV 207 267 0.775
# … with 132,994 more rows
A flight from Newark EWR to Montrose, Colorado MTJ on United Airlines had the greatest proportion of total delay for flights to Montrose, Colorado.
For each plane, count the number of flights before the first delay of greater than 1 hour.
flights %>%
filter(!is.na(dep_delay)) %>%
select(tailnum, year, month, day, dep_delay) %>%
group_by(tailnum) %>%
mutate(hour_delayed = cumsum(dep_delay > 60)) %>%
summarise(total = sum(hour_delayed < 1 )) %>%
arrange(desc(total))# A tibble: 4,037 × 2
tailnum total
<chr> <int>
1 N952UW 215
2 N315NB 161
3 N705TW 160
4 N706TW 149
5 N961UW 139
6 N713TW 128
7 N346NB 127
8 N765US 122
9 N721TW 120
10 N5FAAA 117
# … with 4,027 more rows
Flight number N952UW had 215 flights before the first delay of greater than 1 hour.