L03 Data Transformation

Data Science 1 with R (STAT 301-1)

Author

Justin Dunbar

Load packages

Code

# Loading package(s) and data

library(tidyverse)
library(nycflights13)
library(methods)
data("flights")

Exercise 1

Why does this code not work?

Code

my_variable <- 10
my_varıable

#> Error: object 'my_varıable' not found

Solution

The two names are not the same. In the second “my_variable”, the “i” has been written as “1” due to a typo, and the smallest of typos will make the code not work. Paying very close attention to the exact writing of your code is critical, because it needs to be 100% accurate for it to run.

Exercise 2

Tweak each of the following R commands so that they run correctly:

Code

# Command 1
ggplot(dota = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

# Command 2
fliter(mpg, cyl = 8)

# Command 3
filter(diamond, carat > 3)

Code

# Command 1
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Code

# Command 2
filter(mpg, cyl == 8)

# A tibble: 70 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a6 quattro   4.2  2008     8 auto… 4        16    23 p     mids…
 2 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
 3 chevrolet    c1500 sub…   5.3  2008     8 auto… r        11    15 e     suv  
 4 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
 5 chevrolet    c1500 sub…   5.7  1999     8 auto… r        13    17 r     suv  
 6 chevrolet    c1500 sub…   6    2008     8 auto… r        12    17 r     suv  
 7 chevrolet    corvette     5.7  1999     8 manu… r        16    26 p     2sea…
 8 chevrolet    corvette     5.7  1999     8 auto… r        15    23 p     2sea…
 9 chevrolet    corvette     6.2  2008     8 manu… r        16    26 p     2sea…
10 chevrolet    corvette     6.2  2008     8 auto… r        15    25 p     2sea…
# … with 60 more rows

Code

# Command 3
filter(diamonds, carat > 3)

# A tibble: 32 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  3.01 Premium I     I1       62.7    58  8040  9.1   8.97  5.67
 2  3.11 Fair    J     I1       65.9    57  9823  9.15  9.02  5.98
 3  3.01 Premium F     I1       62.2    56  9925  9.24  9.13  5.73
 4  3.05 Premium E     I1       60.9    58 10453  9.26  9.25  5.66
 5  3.02 Fair    I     I1       65.2    56 10577  9.11  9.02  5.91
 6  3.01 Fair    H     I1       56.1    62 10761  9.54  9.38  5.31
 7  3.65 Fair    H     I1       67.1    53 11668  9.53  9.48  6.38
 8  3.24 Premium H     I1       62.1    58 12300  9.44  9.4   5.85
 9  3.22 Ideal   I     I1       62.6    55 12545  9.49  9.42  5.92
10  3.5  Ideal   H     I1       62.8    57 12587  9.65  9.59  6.03
# … with 22 more rows

Solution

For command #1, data is accidentally spelled as “dota”. For command #2, filter is spelled incorrectly, meaning that RStudio cannot infer what we are asking it to do. For command #3, the dataset should be labeled as “diamonds”, not “diamond”.

Exercise 3

Find all flights that:

Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Were operated by United, American, or Delta
Departed in summer (July, August, and September)
Arrived more than two hours late, but didn’t leave late
Were delayed by at least an hour, but made up over 30 minutes in flight
Departed between midnight and 6am (inclusive)

Solution

Code

# a
flights %>%
  filter(arr_delay >= 2) %>%
  select(arr_delay, everything())

# A tibble: 127,929 × 19
   arr_delay  year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ carrier
       <dbl> <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int> <chr>  
 1        11  2013     1     1      517      515       2     830     819 UA     
 2        20  2013     1     1      533      529       4     850     830 UA     
 3        33  2013     1     1      542      540       2     923     850 AA     
 4        12  2013     1     1      554      558      -4     740     728 UA     
 5        19  2013     1     1      555      600      -5     913     854 B6     
 6         8  2013     1     1      558      600      -2     753     745 AA     
 7         7  2013     1     1      558      600      -2     924     917 UA     
 8        31  2013     1     1      559      600      -1     941     910 AA     
 9        12  2013     1     1      600      600       0     837     825 MQ     
10        16  2013     1     1      602      605      -3     821     805 MQ     
# … with 127,919 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time

Code

# b
flights %>%
  filter(dest %in% c("HOU", "IAH")) %>%
  select(dest, origin, everything())

# A tibble: 9,313 × 19
   dest  origin  year month   day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴
   <chr> <chr>  <int> <int> <int>    <int>         <int>   <dbl>   <int>   <int>
 1 IAH   EWR     2013     1     1      517           515       2     830     819
 2 IAH   LGA     2013     1     1      533           529       4     850     830
 3 IAH   LGA     2013     1     1      623           627      -4     933     932
 4 IAH   LGA     2013     1     1      728           732      -4    1041    1038
 5 IAH   EWR     2013     1     1      739           739       0    1104    1038
 6 IAH   EWR     2013     1     1      908           908       0    1228    1219
 7 IAH   LGA     2013     1     1     1028          1026       2    1350    1339
 8 IAH   EWR     2013     1     1     1044          1045      -1    1352    1351
 9 IAH   LGA     2013     1     1     1114           900     134    1447    1222
10 IAH   EWR     2013     1     1     1205          1200       5    1503    1505
# … with 9,303 more rows, 9 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time

Code

# c
flights %>%
  filter(carrier %in% c("UA", "AA" , "DL")) %>%
  select(carrier, everything())

# A tibble: 139,504 × 19
   carrier  year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
   <chr>   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl>
 1 UA       2013     1     1      517        515       2     830     819      11
 2 UA       2013     1     1      533        529       4     850     830      20
 3 AA       2013     1     1      542        540       2     923     850      33
 4 DL       2013     1     1      554        600      -6     812     837     -25
 5 UA       2013     1     1      554        558      -4     740     728      12
 6 AA       2013     1     1      558        600      -2     753     745       8
 7 UA       2013     1     1      558        600      -2     924     917       7
 8 UA       2013     1     1      558        600      -2     923     937     -14
 9 AA       2013     1     1      559        600      -1     941     910      31
10 UA       2013     1     1      559        600      -1     854     902      -8
# … with 139,494 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Code

# d
flights %>%
  filter(month >= 7, month <= 9)%>%
  select(month, everything())

# A tibble: 86,326 × 19
   month  year   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1     7  2013     1        1       2029     212     236    2359     157 B6     
 2     7  2013     1        2       2359       3     344     344       0 B6     
 3     7  2013     1       29       2245     104     151       1     110 B6     
 4     7  2013     1       43       2130     193     322      14     188 B6     
 5     7  2013     1       44       2150     174     300     100     120 AA     
 6     7  2013     1       46       2051     235     304    2358     186 B6     
 7     7  2013     1       48       2001     287     308    2305     243 VX     
 8     7  2013     1       58       2155     183     335      43     172 B6     
 9     7  2013     1      100       2146     194     327      30     177 B6     
10     7  2013     1      100       2245     135     337     135     122 B6     
# … with 86,316 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Code

# e
flights %>%
  filter(arr_delay >=2, dep_delay <= 0) %>%
  select(arr_delay, dep_delay, everything())

# A tibble: 37,527 × 19
   arr_delay dep_delay  year month   day dep_t…¹ sched…² arr_t…³ sched…⁴ carrier
       <dbl>     <dbl> <int> <int> <int>   <int>   <int>   <int>   <int> <chr>  
 1        12        -4  2013     1     1     554     558     740     728 UA     
 2        19        -5  2013     1     1     555     600     913     854 B6     
 3         8        -2  2013     1     1     558     600     753     745 AA     
 4         7        -2  2013     1     1     558     600     924     917 UA     
 5        31        -1  2013     1     1     559     600     941     910 AA     
 6        12         0  2013     1     1     600     600     837     825 MQ     
 7        16        -3  2013     1     1     602     605     821     805 MQ     
 8         3        -8  2013     1     1     622     630    1017    1014 US     
 9        29        -6  2013     1     1     624     630     909     840 EV     
10        10        -6  2013     1     1     624     630     840     830 MQ     
# … with 37,517 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names ¹dep_time,
#   ²sched_dep_time, ³arr_time, ⁴sched_arr_time

Code

# f
flights %>%
  filter(dep_delay >= 1, arr_delay <= -30) %>%
  select(dep_delay, arr_delay, everything())

# A tibble: 2,059 × 19
   dep_delay arr_delay  year month   day dep_t…¹ sched…² arr_t…³ sched…⁴ carrier
       <dbl>     <dbl> <int> <int> <int>   <int>   <int>   <int>   <int> <chr>  
 1         1       -31  2013     1     1     701     700    1123    1154 UA     
 2         5       -44  2013     1     2     605     600     851     935 UA     
 3         1       -31  2013     1     2     647     646     738     809 UA     
 4         3       -36  2013     1     2     854     851    1146    1222 UA     
 5         1       -44  2013     1     2    1048    1047    1321    1405 UA     
 6         1       -41  2013     1     2    1201    1200    1504    1545 VX     
 7         1       -41  2013     1     2    1811    1810    2051    2132 B6     
 8        17       -33  2013     1     2    2002    1945    2256    2329 B6     
 9         1       -36  2013     1     2    2036    2035    2137    2213 9E     
10         7       -30  2013     1     2    2047    2040    2329    2359 B6     
# … with 2,049 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names ¹dep_time,
#   ²sched_dep_time, ³arr_time, ⁴sched_arr_time

Code

# g
flights %>%
  filter(dep_time >= 6, dep_time <= 12) %>%
  select(dep_time, everything())

# A tibble: 167 × 19
   dep_time  year month   day sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
      <int> <int> <int> <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1        8  2013     1     9       2359       9     432     437      -5 B6     
 2       11  2013     1    11       2359      12     436     444      -8 B6     
 3       10  2013     1    13       2135     155     305      36     149 B6     
 4       10  2013     1    28       2359      11     454     437      17 B6     
 5        7  2013     1    31       2359       8     453     437      16 B6     
 6       12  2013     1    31       2250      82     132       7      85 B6     
 7        7  2013    10     4       2359       8     350     350       0 B6     
 8        6  2013    10     7       2159     127      57    2306     111 EV     
 9        9  2013    10     7       2030     219     109    2205     184 WN     
10       10  2013    10     7       2137     153     318      27     171 B6     
# … with 157 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Exercise 4

A useful dplyr filtering helper is between(). Rewrite your code to Exercise 3 (d) using between()? If you did use it, then indicate that you did.

Code

flights %>%
  filter(month == between(3,7,9)) %>%
  select(month, everything())

# A tibble: 0 × 19
# … with 19 variables: month <int>, year <int>, day <int>, dep_time <int>,
#   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Exercise 5

How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

Code

colSums(is.na(flights))

          year          month            day       dep_time sched_dep_time 
             0              0              0           8255              0 
     dep_delay       arr_time sched_arr_time      arr_delay        carrier 
          8255           8713              0           9430              0 
        flight        tailnum         origin           dest       air_time 
             0           2512              0              0           9430 
      distance           hour         minute      time_hour 
             0              0              0              0

Solution

8255 flights have a missing departure time. Meanwhile, five other variables also have missing values (arr_time, tailnum, arr_delay, air_time, dep_delay). Data is often not clean in terms of every value being there, and,there could have been some sort of inputing error here. Most likely, though, the N/As signal flights that were cancelled, and, thus, didn’t have data for those specific variables.

Exercise 6

How could you use arrange() to sort the dataset so that flights missing arr_delay are at the start/top of the tibble/table? Hint: Use is.na()

Code

flights %>%
arrange(desc(is.na(arr_delay))) %>%
select(arr_delay, everything())

# A tibble: 336,776 × 19
   arr_delay  year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ carrier
       <dbl> <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int> <chr>  
 1        NA  2013     1     1     1525     1530      -5    1934    1805 MQ     
 2        NA  2013     1     1     1528     1459      29    2002    1647 EV     
 3        NA  2013     1     1     1740     1745      -5    2158    2020 MQ     
 4        NA  2013     1     1     1807     1738      29    2251    2103 UA     
 5        NA  2013     1     1     1939     1840      59      29    2151 9E     
 6        NA  2013     1     1     1952     1930      22    2358    2207 EV     
 7        NA  2013     1     1     2016     1930      46      NA    2220 EV     
 8        NA  2013     1     1       NA     1630      NA      NA    1815 EV     
 9        NA  2013     1     1       NA     1935      NA      NA    2240 AA     
10        NA  2013     1     1       NA     1500      NA      NA    1825 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time

Exercise 7

Find the flights that left earliest. Name the flight at top of the list (e.g. New York City JFK to Chicago ORD).

Code

flights %>%
select(origin, dest, dep_delay) %>%
arrange(dep_delay)

# A tibble: 336,776 × 3
   origin dest  dep_delay
   <chr>  <chr>     <dbl>
 1 JFK    DEN         -43
 2 LGA    MSY         -33
 3 LGA    IAD         -32
 4 LGA    TPA         -30
 5 LGA    DEN         -27
 6 LGA    DTW         -26
 7 EWR    TYS         -25
 8 LGA    DTW         -25
 9 JFK    BUF         -24
10 LGA    FLL         -24
# … with 336,766 more rows

Solution

The flight that left the earliest was New York City JFK to Denver, which led 43 minutes before expected.

Exercise 8

Sort flights to find the fastest flights. Name the flight at top of the list (e.g. New York City JFK to Chicago ORD).

Code

flights %>%
select(origin, dest, air_time) %>%
arrange(air_time)

# A tibble: 336,776 × 3
   origin dest  air_time
   <chr>  <chr>    <dbl>
 1 EWR    BDL         20
 2 EWR    BDL         20
 3 EWR    BDL         21
 4 EWR    PHL         21
 5 EWR    BDL         21
 6 EWR    PHL         21
 7 LGA    BOS         21
 8 JFK    PHL         21
 9 EWR    BDL         21
10 EWR    BDL         21
# … with 336,766 more rows

Solution

The fastest flight was Newark EWR to Hartford Conneticut BDL.

Exercise 9

Which flights traveled the longest? Which traveled the shortest? Name the flight at top of the list (e.g. New York City JFK to Chicago ORD).

Code

flights %>%
select(origin, dest, distance) %>%
arrange(desc(distance))

# A tibble: 336,776 × 3
   origin dest  distance
   <chr>  <chr>    <dbl>
 1 JFK    HNL       4983
 2 JFK    HNL       4983
 3 JFK    HNL       4983
 4 JFK    HNL       4983
 5 JFK    HNL       4983
 6 JFK    HNL       4983
 7 JFK    HNL       4983
 8 JFK    HNL       4983
 9 JFK    HNL       4983
10 JFK    HNL       4983
# … with 336,766 more rows

Code

flights %>%
select(origin, dest, distance) %>%
arrange(distance)

# A tibble: 336,776 × 3
   origin dest  distance
   <chr>  <chr>    <dbl>
 1 EWR    LGA         17
 2 EWR    PHL         80
 3 EWR    PHL         80
 4 EWR    PHL         80
 5 EWR    PHL         80
 6 EWR    PHL         80
 7 EWR    PHL         80
 8 EWR    PHL         80
 9 EWR    PHL         80
10 EWR    PHL         80
# … with 336,766 more rows

Solution

The flight that traveled the longest was New York City JFK to Honolulu HNL. The flight that traveled the shortest distance was Newark EWR to Queens LGA

Exercise 10

Brainstorm at least 3 ways to select dep_time, dep_delay, arr_time, and arr_delay from flights. Hint: Use helper functions

Code

select(flights, dep_time, dep_delay, arr_time, arr_delay)

# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# … with 336,766 more rows

Code

select(flights, starts_with("arr") | starts_with("dep"))

# A tibble: 336,776 × 4
   arr_time arr_delay dep_time dep_delay
      <int>     <dbl>    <int>     <dbl>
 1      830        11      517         2
 2      850        20      533         4
 3      923        33      542         2
 4     1004       -18      544        -1
 5      812       -25      554        -6
 6      740        12      554        -4
 7      913        19      555        -5
 8      709       -14      557        -3
 9      838        -8      557        -3
10      753         8      558        -2
# … with 336,766 more rows

Code

Ex10data <- c("dep_time", "arr_time", "dep_delay", "arr_delay")
flights %>% select_at(all_of(Ex10data))

# A tibble: 336,776 × 4
   dep_time arr_time dep_delay arr_delay
      <int>    <int>     <dbl>     <dbl>
 1      517      830         2        11
 2      533      850         4        20
 3      542      923         2        33
 4      544     1004        -1       -18
 5      554      812        -6       -25
 6      554      740        -4        12
 7      555      913        -5        19
 8      557      709        -3       -14
 9      557      838        -3        -8
10      558      753        -2         8
# … with 336,766 more rows

Exercise 11

What does the any_of() function do? Why might it be helpful in conjunction with the vector below?

Code

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

Code

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(any_of(vars))

# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# … with 336,766 more rows

Solution

Whereas the all_of function requires all variables to be true, the any_of function will work if any of the variables are present for a specific event (a flight, in this case). In this case, the all_of function would not be able to include any flight with a missing variable, but the any_of function can.

Exercise 12

Does the result of running the code below surprise you? Which default setting for contains() causes this to happen? What should happen if you run the code, but with the default setting changed?

Code

select(flights, contains("TIME"))

# A tibble: 336,776 × 6
   dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
      <int>          <int>    <int>          <int>    <dbl> <dttm>             
 1      517            515      830            819      227 2013-01-01 05:00:00
 2      533            529      850            830      227 2013-01-01 05:00:00
 3      542            540      923            850      160 2013-01-01 05:00:00
 4      544            545     1004           1022      183 2013-01-01 05:00:00
 5      554            600      812            837      116 2013-01-01 06:00:00
 6      554            558      740            728      150 2013-01-01 05:00:00
 7      555            600      913            854      158 2013-01-01 06:00:00
 8      557            600      709            723       53 2013-01-01 06:00:00
 9      557            600      838            846      140 2013-01-01 06:00:00
10      558            600      753            745      138 2013-01-01 06:00:00
# … with 336,766 more rows

Solution

The result of running this code below includes any column that contains the word “time”, but does not adjust for the fact that “TIME” is in all caps; the default setting for contains() is to ignore case. To change this, we would write this code, which will bring back a tibble with zero columns, as there are no variables containing “TIME” in all uppercase letters.

Code

select(flights, contains("TIME", ignore.case = FALSE))

# A tibble: 336,776 × 0

Exercise 13

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

Code

flights %>%
  transmute(dep_time, hour = dep_time %/% 100, minute = dep_time %% 100) %>%
  transmute(dep_time =(hour * 60 + minute))

# A tibble: 336,776 × 1
   dep_time
      <dbl>
 1      317
 2      333
 3      342
 4      344
 5      354
 6      354
 7      355
 8      357
 9      357
10      358
# … with 336,766 more rows

Code

flights %>%
  transmute(sched_dep_time, hour = sched_dep_time %/% 100,   minute = sched_dep_time %% 100) %>%
  transmute(sched_dep_time =(hour * 60 + minute))

# A tibble: 336,776 × 1
   sched_dep_time
            <dbl>
 1            315
 2            329
 3            340
 4            345
 5            360
 6            358
 7            360
 8            360
 9            360
10            360
# … with 336,766 more rows

Exercise 14

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

Code

flights %>%
  transmute(air_time, hour = air_time %/% 100, minute = air_time %% 100) %>%
  transmute(air_time =(hour * 60 + minute))

# A tibble: 336,776 × 1
   air_time
      <dbl>
 1      147
 2      147
 3      120
 4      143
 5       76
 6      110
 7      118
 8       53
 9      100
10       98
# … with 336,766 more rows

Code

flights %>%
  mutate(gain = arr_time - dep_time) %>%
  summarise(difference = gain - air_time, year, month, day, origin, dest)

# A tibble: 336,776 × 6
   difference  year month   day origin dest 
        <dbl> <int> <int> <int> <chr>  <chr>
 1         86  2013     1     1 EWR    IAH  
 2         90  2013     1     1 LGA    IAH  
 3        221  2013     1     1 JFK    MIA  
 4        277  2013     1     1 JFK    BQN  
 5        142  2013     1     1 LGA    ATL  
 6         36  2013     1     1 EWR    ORD  
 7        200  2013     1     1 EWR    FLL  
 8         99  2013     1     1 LGA    IAD  
 9        141  2013     1     1 JFK    MCO  
10         57  2013     1     1 LGA    ORD  
# … with 336,766 more rows

Solution

First, we need to convert air_time to minutes after midnight, or less the calculation would be severely wrong. What I would expect to see is a slight gap between air_time and “arr_time - dep_time”, as it’s not accounting for the time when on the ground and trying to get to your gate. Still, since some days started before mignight and ended the next day, the calculation is incorrect. To fix this, we may need to convert the variables into “minutes after midnight on January 1st”. Therefore, this would make changing days a non-factor.

Exercise 15

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

Code

flights %>%
  mutate(difference = dep_time - sched_dep_time) %>%
  summarise(dep_delay, difference)

# A tibble: 336,776 × 2
   dep_delay difference
       <dbl>      <int>
 1         2          2
 2         4          4
 3         2          2
 4        -1         -1
 5        -6        -46
 6        -4         -4
 7        -5        -45
 8        -3        -43
 9        -3        -43
10        -2        -42
# … with 336,766 more rows

Solution

All three of these variables should be related. Departure time minus scheduled departure time should equal the departure delay. Meanwhile, departure time minus the departure delay would equal scheduled departure time, and scheduled departure time combined would equal the departure time.

Exercise 16

Find the 10 most delayed flights using the min_rank() function.

Code

flights %>% 
  summarise(delay = dep_delay, year, month, day, origin, dest) %>% 
  arrange(min_rank(desc(delay)))

# A tibble: 336,776 × 6
   delay  year month   day origin dest 
   <dbl> <int> <int> <int> <chr>  <chr>
 1  1301  2013     1     9 JFK    HNL  
 2  1137  2013     6    15 JFK    CMH  
 3  1126  2013     1    10 EWR    ORD  
 4  1014  2013     9    20 JFK    SFO  
 5  1005  2013     7    22 JFK    CVG  
 6   960  2013     4    10 JFK    TPA  
 7   911  2013     3    17 LGA    MSP  
 8   899  2013     6    27 JFK    PDX  
 9   898  2013     7    22 LGA    ATL  
10   896  2013    12     5 EWR    MIA  
# … with 336,766 more rows

Solution

The most delayed flight was New York City JFK to Honolulu HNL on January 9th, with a departure delay of 1301 minutes.

Exercise 17

Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()). Dataset not_cancelled was created in this section of the book, but for your convenience the code is provided.

Code

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

Code

not_cancelled %>%
  group_by(dest) %>%
  summarise(n())

# A tibble: 104 × 2
   dest  `n()`
   <chr> <int>
 1 ABQ     254
 2 ACK     264
 3 ALB     418
 4 ANC       8
 5 ATL   16837
 6 AUS    2411
 7 AVL     261
 8 BDL     412
 9 BGR     358
10 BHM     269
# … with 94 more rows

Code

not_cancelled %>% 
  group_by(tailnum) %>%
  summarise(n= sum(distance))

# A tibble: 4,037 × 2
   tailnum      n
   <chr>    <dbl>
 1 D942DN    3418
 2 N0EGMQ  239143
 3 N10156  109664
 4 N102UW   25722
 5 N103US   24619
 6 N104UW   24616
 7 N10575  139903
 8 N105UW   23618
 9 N107US   21677
10 N108UW   32070
# … with 4,027 more rows

Exercise 18

Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

Code

Ex18 <-
flights %>%
  mutate(cancelled_flights = is.na(arr_delay) | is.na(dep_delay)) %>% 
  group_by(year, month, day) %>% 
  summarise(total_flights = n(), total_cancelled = sum(cancelled_flights), average_arr_delay = mean(arr_delay, na.rm = TRUE), prop_cancelled = (total_cancelled / total_flights))

Code

Ex18

# A tibble: 365 × 7
# Groups:   year, month [12]
    year month   day total_flights total_cancelled average_arr_delay prop_canc…¹
   <int> <int> <int>         <int>           <int>             <dbl>       <dbl>
 1  2013     1     1           842              11            12.7       0.0131 
 2  2013     1     2           943              15            12.7       0.0159 
 3  2013     1     3           914              14             5.73      0.0153 
 4  2013     1     4           915               7            -1.93      0.00765
 5  2013     1     5           720               3            -1.53      0.00417
 6  2013     1     6           832               3             4.24      0.00361
 7  2013     1     7           933               3            -4.95      0.00322
 8  2013     1     8           899               7            -3.23      0.00779
 9  2013     1     9           902               9            -0.264     0.00998
10  2013     1    10           932               3            -5.90      0.00322
# … with 355 more rows, and abbreviated variable name ¹prop_cancelled

Code

ggplot() +
  geom_point(data = Ex18, mapping = aes(x = average_arr_delay, y = prop_cancelled))

Solution

There does appear the be a correlation to the proportion of cancelled flights and the length of the average delay; days with longer delays are more likely to have more cancelled flights. There is a correlation between the total number of flights and the number of cancelled flights, though you would certainly expect that to be the case.

Exercise 19

Which plane (tailnum) has the worst on-time record?

Code

flights %>% 
  group_by(tailnum) %>% 
  filter(n() >= 10) %>% 
  summarise(arr_delay = mean(arr_delay)) %>% 
  arrange(desc(arr_delay))

# A tibble: 3,432 × 2
   tailnum arr_delay
   <chr>       <dbl>
 1 N354AT       68.1
 2 N337AT       66.5
 3 N203FR       59.1
 4 N176DN       46.2
 5 N366AA       43.8
 6 N184DN       43.6
 7 N923FJ       43.1
 8 N521VA       42.2
 9 N353AT       41.2
10 N942AT       41.2
# … with 3,422 more rows

Solution

Among planes with at least 10 flights,plane number N354AT has the worst on-time record, with an average arrival delay of 68.1 minutes.

Exercise 20

For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

Code

flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>% 
  group_by(dest) %>% 
  mutate(sum_arr_delay= sum(arr_delay), prop_arr_delay = arr_delay / sum_arr_delay) %>% 
  select(year:day, origin, dest, carrier, arr_delay, sum_arr_delay, prop_arr_delay) %>% 
  filter(arr_delay > 0) %>%
  arrange(desc(prop_arr_delay))

# A tibble: 133,004 × 9
# Groups:   dest [103]
    year month   day origin dest  carrier arr_delay sum_arr_delay prop_arr_delay
   <int> <int> <int> <chr>  <chr> <chr>       <dbl>         <dbl>          <dbl>
 1  2013     3    30 EWR    MTJ   UA            101            25          4.04 
 2  2013    12    19 EWR    SLC   DL            847           432          1.96 
 3  2013     3    16 EWR    HDN   UA             43            30          1.43 
 4  2013     3    16 EWR    MTJ   UA             31            25          1.24 
 5  2013    12    21 EWR    HDN   UA             32            30          1.07 
 6  2013    12    21 EWR    MTJ   UA             24            25          0.96 
 7  2013     3     3 JFK    SLC   DL            387           432          0.896
 8  2013     3    12 JFK    SLC   B6            356           432          0.824
 9  2013    11    22 LGA    SBN   EV             53            65          0.815
10  2013     1     2 EWR    MYR   EV            207           267          0.775
# … with 132,994 more rows

Solution

A flight from Newark EWR to Montrose, Colorado MTJ on United Airlines had the greatest proportion of total delay for flights to Montrose, Colorado.

Exercise 21

For each plane, count the number of flights before the first delay of greater than 1 hour.

Code

flights %>%
  filter(!is.na(dep_delay)) %>%
   select(tailnum, year, month, day, dep_delay) %>%
    group_by(tailnum) %>%
  mutate(hour_delayed = cumsum(dep_delay > 60)) %>%
  summarise(total = sum(hour_delayed < 1 )) %>%
  arrange(desc(total))

# A tibble: 4,037 × 2
   tailnum total
   <chr>   <int>
 1 N952UW    215
 2 N315NB    161
 3 N705TW    160
 4 N706TW    149
 5 N961UW    139
 6 N713TW    128
 7 N346NB    127
 8 N765US    122
 9 N721TW    120
10 N5FAAA    117
# … with 4,027 more rows

Solution

Flight number N952UW had 215 flights before the first delay of greater than 1 hour.