Task 1

Change the author name in the YAML block at the top of this page to your name.
Install nycflights13 if necessary.

library(nycflights13)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)

nycflights13

Note: you will probably have to install nycflights13 using install.packages() and then load it with the library() command. nycflights13 is a relational database containing the following tables (data frames).

data frame	description
airlines	Airline names
airports	Airport metadata
flights	Flights data
planes	Plane metadata
weather	Hourly weather data

These data cover all airline flights that departed New York City airports (JFK, LGA, and EWR) in 2013. This project parallels Chapter 5 of R for Data Science by Wickham and Grolemund. You should start reading this chapter now and try to finish it by the end of this week.

flights

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Do the exercises in section 5.2.4 of the Wickham book. You will have to read the material before the exercises, though you have probably seen all of it in the preceding sections.

Write up your solutions to the exercises in 5.2.4 in this document, including the code chunks you use to determine the answer.

1. Find all flights that:

1. Had an arrival delay of two or more hours

filter(flights, arr_delay >= 120)

## # A tibble: 10,200 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # … with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2. Flew to Houston (IAH or HOU)

filter(flights, dest %in% c('IAH', 'HOU'))

## # A tibble: 9,313 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # … with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

3. Were operated by United, American, or Delta

filter(flights, carrier %in% c('UA','AA','DL'))

## # A tibble: 139,504 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # … with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

4. Departed in summer (July, August, and September)

filter(flights, month %in% c(7, 8, 9))

## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # … with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

5. Arrived more than two hours late, but didn’t leave late

filter(flights, arr_delay > 120, dep_delay <= 0)

## # A tibble: 29 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    27     1419           1420        -1     1754           1550
##  2  2013    10     7     1350           1350         0     1736           1526
##  3  2013    10     7     1357           1359        -2     1858           1654
##  4  2013    10    16      657            700        -3     1258           1056
##  5  2013    11     1      658            700        -2     1329           1015
##  6  2013     3    18     1844           1847        -3       39           2219
##  7  2013     4    17     1635           1640        -5     2049           1845
##  8  2013     4    18      558            600        -2     1149            850
##  9  2013     4    18      655            700        -5     1213            950
## 10  2013     5    22     1827           1830        -3     2217           2010
## # … with 19 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

6. Were delayed by at least an hour, but made up over 30 minutes in flight

filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)

## # A tibble: 1,844 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     2205           1720       285       46           2040
##  2  2013     1     1     2326           2130       116      131             18
##  3  2013     1     3     1503           1221       162     1803           1555
##  4  2013     1     3     1839           1700        99     2056           1950
##  5  2013     1     3     1850           1745        65     2148           2120
##  6  2013     1     3     1941           1759       102     2246           2139
##  7  2013     1     3     1950           1845        65     2228           2227
##  8  2013     1     3     2015           1915        60     2135           2111
##  9  2013     1     3     2257           2000       177       45           2224
## 10  2013     1     4     1917           1700       137     2135           1950
## # … with 1,834 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

7. Departed between midnight and 6am (inclusive)

filter(flights, dep_time <= 600 | dep_time == 2400)

## # A tibble: 9,373 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

between(x, left, right) is a dplyr helper that tests whether each value of x falls in the range from left to right, with both bounds included. It is a shortcut for x >= left & x <= right. It simplifies the summer-months and midnight-to-6am filters from the previous exercise:

filter(flights, between(month, 7, 9))

## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # … with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

filter(flights, !between(dep_time, 601, 2359))

## # A tibble: 9,373 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
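
To make the inclusive bounds concrete, here is a minimal sketch on a small made-up vector (the vector x and the result names are illustrative, not part of the exercise). It compares between() with the comparison written out by hand.

```r
library(dplyr)

# Illustrative values straddling the range 7..9
x <- c(6, 7, 8, 9, 10)

# between() includes both endpoints
in_range <- between(x, 7, 9)

# The same test written as two comparisons
by_hand <- x >= 7 & x <= 9

print(in_range)
print(identical(in_range, by_hand))
```

Because both bounds are included, between(month, 7, 9) keeps July and September as well as August.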

3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

8255 flights have a missing dep_time. dep_delay is also missing for 8255 flights, arr_time for 8713, and arr_delay and air_time for 9430 each. Rows with a missing dep_time most likely represent cancelled flights that never departed. Rows that have a departure time but no arrival data may be flights that were diverted, or flights whose arrival was not recorded.

summary(flights)

##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                  
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00  
##
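
The NA counts can also be pulled out directly instead of being read off the summary() output. The sketch below uses a small made-up data frame (toy and its values are hypothetical stand-ins for flights) so that it runs without the package.

```r
# Hypothetical stand-in for flights with a few missing values
toy <- data.frame(
  dep_time = c(517, NA, 542),
  arr_time = c(830, NA, NA),
  carrier  = c("UA", "AA", "DL")
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually contain missing values
print(na_counts[na_counts > 0])
```

The same call on the real data would be colSums(is.na(flights)).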

4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

NA ^ 0 equals 1 because any number raised to the power 0 is 1, so the result is the same whatever value the NA stands for.

NA | TRUE equals TRUE because the | operator returns TRUE if either term is TRUE. The right side is TRUE, so the result is TRUE whether the missing value is TRUE or FALSE.

FALSE & NA equals FALSE because & returns TRUE only when both terms are TRUE. The left side is FALSE, so the result is FALSE whatever the missing value is.

The general rule: an expression involving NA is not missing when every possible value of the NA gives the same result. NA * 0 is the counterexample. If the unknown number is finite, the product is 0, but if it is Inf or -Inf, the product is NaN. Because the result depends on the unknown value, R returns NA.

NA ^ 0

## [1] 1

NA | TRUE

## [1] TRUE

FALSE & NA

## [1] FALSE

NA*0

## [1] NA

Inf*0

## [1] NaN
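
The rule can be checked by substituting a few candidate values for the unknown and seeing whether the answers agree. This is only a sketch: the candidates vector is an arbitrary illustrative sample, not an exhaustive proof.

```r
# Candidate values the unknown number behind NA could take
candidates <- c(-Inf, -2, 0, 3.5, Inf)

# Every candidate gives 1, so NA ^ 0 is known
power_results <- candidates ^ 0

# Finite candidates give 0 but infinite ones give NaN, so NA * 0 stays NA
product_results <- candidates * 0

# A logical NA can only be TRUE or FALSE
or_results  <- c(TRUE, FALSE) | TRUE    # always TRUE
and_results <- FALSE & c(TRUE, FALSE)   # always FALSE

print(power_results)
print(product_results)
```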

Task 2

Do the exercises in 5.3.1

Write up your solutions to the exercises in 5.3.1 in this document, including the code chunks you use to determine the answer.

1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

df <- tibble(x = c(5, 2, NA))
arrange(df, desc(is.na(x)))

## # A tibble: 3 × 1
##       x
##   <dbl>
## 1    NA
## 2     5
## 3     2

arrange(df, -(is.na(x)))

## # A tibble: 3 × 1
##       x
##   <dbl>
## 1    NA
## 2     5
## 3     2

2. Sort flights to find the most delayed flights. Find the flights that left earliest.

arrange(flights, desc(dep_delay))

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

arrange(flights, dep_delay)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

3. Sort flights to find the fastest (highest speed) flights.

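# Note: this ranks flights by the smallest clock-time difference
# (arr_time - dep_time, wrapped past midnight), a rough proxy for the shortest
# trip rather than the highest speed. HHMM differences are not true minutes.
flights %>% mutate(travel_time = ifelse((arr_time - dep_time < 0),
                                        2400+(arr_time - dep_time),
                                        arr_time - dep_time)) %>% 
  arrange(travel_time) %>% select(arr_time, dep_time, travel_time)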

## # A tibble: 336,776 × 3
##    arr_time dep_time travel_time
##       <int>    <int>       <dbl>
##  1     1358     1323          35
##  2     1347     1312          35
##  3     1238     1203          35
##  4      758      722          36
##  5      758      722          36
##  6      754      718          36
##  7     1455     1418          37
##  8       53       16          37
##  9      754      717          37
## 10     1353     1315          38
## # … with 336,766 more rows
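
Since the question asks for the highest speed, an alternative is to rank by distance divided by air_time. The sketch below assumes distance is in miles and air_time in minutes, as the nycflights13 documentation describes them. The helper rank_by_speed() and the toy tibble are hypothetical names and values introduced only for illustration.

```r
library(dplyr)

# Hypothetical helper: add average speed in miles per hour and sort fastest first
rank_by_speed <- function(df) {
  df %>%
    mutate(speed_mph = distance / (air_time / 60)) %>%
    arrange(desc(speed_mph))
}

# Made-up rows using the same column names as flights
toy <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(500, 1000, 200),  # miles
  air_time = c(60, 100, 40)      # minutes
)

fastest <- rank_by_speed(toy)
print(fastest)
```

On the real data the call would be rank_by_speed(flights), followed by a select() of the columns of interest.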

4. Which flights travelled the farthest? Which travelled the shortest?

arrange(flights, desc(distance)) %>% select(1:5, distance)

## # A tibble: 336,776 × 6
##     year month   day dep_time sched_dep_time distance
##    <int> <int> <int>    <int>          <int>    <dbl>
##  1  2013     1     1      857            900     4983
##  2  2013     1     2      909            900     4983
##  3  2013     1     3      914            900     4983
##  4  2013     1     4      900            900     4983
##  5  2013     1     5      858            900     4983
##  6  2013     1     6     1019            900     4983
##  7  2013     1     7     1042            900     4983
##  8  2013     1     8      901            900     4983
##  9  2013     1     9      641            900     4983
## 10  2013     1    10      859            900     4983
## # … with 336,766 more rows

arrange(flights, distance) %>% select(1:5, distance)

## # A tibble: 336,776 × 6
##     year month   day dep_time sched_dep_time distance
##    <int> <int> <int>    <int>          <int>    <dbl>
##  1  2013     7    27       NA            106       17
##  2  2013     1     3     2127           2129       80
##  3  2013     1     4     1240           1200       80
##  4  2013     1     4     1829           1615       80
##  5  2013     1     4     2128           2129       80
##  6  2013     1     5     1155           1200       80
##  7  2013     1     6     2125           2129       80
##  8  2013     1     7     2124           2129       80
##  9  2013     1     8     2127           2130       80
## 10  2013     1     9     2126           2129       80
## # … with 336,766 more rows

Task 3

Do the exercises 5.4.1

Write up your solutions to the exercises in 5.4.1 in this document, including the code chunks you use to determine the answer.

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, dep_time,  dep_delay, arr_time, arr_delay)

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows

select(flights, c(dep_time,  dep_delay, arr_time, arr_delay))

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows

flights %>% select(dep_time,  dep_delay, arr_time, arr_delay)

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows

head(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

2. What happens if you include the name of a variable multiple times in a select() call?

Each variable appears only once in the result; select() ignores the repeated names.

flights %>% select(arr_delay, arr_delay, arr_delay)

## # A tibble: 336,776 × 1
##    arr_delay
##        <dbl>
##  1        11
##  2        20
##  3        33
##  4       -18
##  5       -25
##  6        12
##  7        19
##  8       -14
##  9        -8
## 10         8
## # … with 336,766 more rows

3. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

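any_of() selects every column whose name appears in a character vector and silently skips names that are not in the data frame. It is helpful with the vector below because the column names can be stored once and reused, without an error if one of them is missing.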

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
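
A minimal sketch of how any_of() behaves with this vector. The toy tibble is a made-up stand-in for flights that deliberately lacks arr_delay, so the difference from the stricter all_of() is visible.

```r
library(dplyr)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# Made-up one-row table that is missing the arr_delay column
toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2, carrier = "UA")

# any_of() keeps the names it finds and skips the missing one
picked <- select(toy, any_of(vars))
print(names(picked))

# all_of() is strict: a missing name is an error
strict_failed <- inherits(try(select(toy, all_of(vars)), silent = TRUE), "try-error")
print(strict_failed)
```

On the full data, select(flights, any_of(vars)) returns all five columns because every name in vars exists in flights.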

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

No column name contains uppercase “TIME”, yet every column whose name contains “time” was selected, because the select helpers ignore case by default. The default can be changed by setting ignore.case = FALSE.

select(flights, contains("TIME"))

## # A tibble: 336,776 × 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # … with 336,766 more rows

select(flights, contains("TIME", ignore.case = FALSE))

## # A tibble: 336,776 × 0

head(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Task 4

Do the exercises 5.5.2

Write up your solutions to the exercises in 5.5.2 in this document, including the code chunks you use to determine the answer.

1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

# Convert HHMM to minutes since midnight; %% 1440 maps midnight (2400) to 0.
mutate(flights,
       dep_time = ((dep_time %/% 100) * 60 + (dep_time %% 100)) %% 1440,
       sched_dep_time = ((sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100)) %% 1440)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <dbl>          <dbl>     <dbl>    <int>          <int>
##  1  2013     1     1      317            315         2      830            819
##  2  2013     1     1      333            329         4      850            830
##  3  2013     1     1      342            340         2      923            850
##  4  2013     1     1      344            345        -1     1004           1022
##  5  2013     1     1      354            360        -6      812            837
##  6  2013     1     1      354            358        -4      740            728
##  7  2013     1     1      355            360        -5      913            854
##  8  2013     1     1      357            360        -3      709            723
##  9  2013     1     1      357            360        -3      838            846
## 10  2013     1     1      358            360        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
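
One edge case the conversion above does not cover is midnight: a value recorded as 2400 would map to 1440 rather than 0. A minimal sketch of a reusable helper, assuming the name to_minutes (hypothetical, not from the book) and a final %% 1440 to wrap midnight back to 0:

```r
# Hypothetical helper: convert HHMM clock values to minutes since midnight.
to_minutes <- function(hhmm) {
  minutes <- (hhmm %/% 100) * 60 + (hhmm %% 100)
  minutes %% 1440  # 24 * 60, so 2400 becomes 0 instead of 1440
}

to_minutes(c(517, 1004, 2400))  # returns 317, 604 and 0
```

The same helper could be applied to dep_time, sched_dep_time, arr_time and sched_arr_time inside mutate().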

2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

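In the original flights dataset, air_time is measured in minutes, but arr_time and dep_time are clock times in HHMM or HMM format, so they must be converted to minutes since midnight before they can be compared.

I would expect arr_time - dep_time to equal air_time. Computed on the raw values, some differences are large negative numbers. This occurs when a flight sets off before midnight but arrives after it, which the %% (60*24) term in the code corrects.

Even after conversion, arr_time - dep_time differs substantially from air_time. Two likely reasons are that arr_time is recorded in the destination's local time zone and that air_time excludes time spent taxiing, so a full fix would also need each airport's time zone.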

flights %>% 
  mutate(dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100),
         sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100),
         arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
         sched_arr_time = (sched_arr_time %/% 100) * 60 + (sched_arr_time %% 100)) %>%
  transmute((arr_time - dep_time) %% (60*24) - air_time)

## # A tibble: 336,776 × 1
##    `(arr_time - dep_time)%%(60 * 24) - air_time`
##                                            <dbl>
##  1                                           -34
##  2                                           -30
##  3                                            61
##  4                                            77
##  5                                            22
##  6                                           -44
##  7                                            40
##  8                                            19
##  9                                            21
## 10                                           -23
## # … with 336,766 more rows
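
The wrap-around step can be checked by hand. A small sketch using a made-up overnight flight (the times are illustrative, not taken from the data):

```r
# Times in minutes since midnight for a hypothetical overnight flight.
dep_min <- 23 * 60 + 21   # departs 23:21
arr_min <- 1 * 60 + 35    # arrives 01:35 the next day

raw_diff <- arr_min - dep_min            # -1306: negative across midnight
wrapped_diff <- raw_diff %% (60 * 24)    # 134 minutes elapsed

c(raw = raw_diff, wrapped = wrapped_diff)
```

In R, %% returns a result with the sign of the divisor, which is why the negative difference wraps to a positive duration.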

3. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

I would expect sched_dep_time + dep_delay == dep_time. After converting the times to minutes since midnight and wrapping around midnight with %% (60*24), the check returns TRUE for almost all flights.

flights %>% 
  mutate(dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100),
         sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100),
         arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
         sched_arr_time = (sched_arr_time %/% 100) * 60 + (sched_arr_time %% 100)) %>%
  transmute(near((sched_dep_time + dep_delay) %% (60*24), dep_time, tol=1))

## # A tibble: 336,776 × 1
##    `near((sched_dep_time + dep_delay)%%(60 * 24), dep_time, tol = 1)`
##    <lgl>                                                             
##  1 TRUE                                                              
##  2 TRUE                                                              
##  3 TRUE                                                              
##  4 TRUE                                                              
##  5 TRUE                                                              
##  6 TRUE                                                              
##  7 TRUE                                                              
##  8 TRUE                                                              
##  9 TRUE                                                              
## 10 TRUE                                                              
## # … with 336,766 more rows
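
Printing only the first 10 rows does not show how often the relationship holds. A minimal sketch that summarises the check over all flights with a recorded departure; the helper to_minutes and the column names dep_min, sched_dep_min and consistent are hypothetical names introduced here:

```r
library(dplyr)
library(nycflights13)

# Hypothetical helper: HHMM clock value to minutes since midnight.
to_minutes <- function(hhmm) ((hhmm %/% 100) * 60 + hhmm %% 100) %% 1440

dep_check <- flights %>%
  filter(!is.na(dep_time), !is.na(dep_delay)) %>%
  mutate(
    dep_min = to_minutes(dep_time),
    sched_dep_min = to_minutes(sched_dep_time),
    # TRUE when scheduled time plus delay lands on the actual departure time
    consistent = near((sched_dep_min + dep_delay) %% 1440, dep_min)
  ) %>%
  summarise(n = n(), share_consistent = mean(consistent))

dep_check
```

share_consistent is the proportion of flights for which the three columns agree.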

4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

There were no ties in the top 10 most delayed flights for departure and arrival. The min_rank() function does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). By default it gives the smallest values the smallest ranks; use desc(x) to give the largest values the smallest ranks. Because tied values share a rank, a tie for 10th place would have made the filter below return more than 10 rows. The arrange() function could also be used to break a tie. Instead of selecting rows, it changes their order: it takes a data frame and a set of column names (or more complicated expressions) to order by, and each additional column is used to break ties in the values of the preceding columns.

filter(flights, min_rank(desc(dep_delay))<=10)

## # A tibble: 10 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     1    10     1121           1635      1126     1239           1810
##  3  2013    12     5      756           1700       896     1058           2020
##  4  2013     3    17     2321            810       911      135           1020
##  5  2013     4    10     1100           1900       960     1342           2211
##  6  2013     6    15     1432           1935      1137     1607           2120
##  7  2013     6    27      959           1900       899     1236           2226
##  8  2013     7    22      845           1600      1005     1044           1815
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013     9    20     1139           1845      1014     1457           2210
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% top_n(n = 10, wt = dep_delay)

## # A tibble: 10 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     1    10     1121           1635      1126     1239           1810
##  3  2013    12     5      756           1700       896     1058           2020
##  4  2013     3    17     2321            810       911      135           1020
##  5  2013     4    10     1100           1900       960     1342           2211
##  6  2013     6    15     1432           1935      1137     1607           2120
##  7  2013     6    27      959           1900       899     1236           2226
##  8  2013     7    22      845           1600      1005     1044           1815
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013     9    20     1139           1845      1014     1457           2210
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
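
Since dplyr 1.0.0, top_n() is superseded by slice_max(), which makes the handling of ties explicit through its with_ties argument. A small sketch on toy data (the tibble toy is made up for illustration and contains a deliberate tie):

```r
library(dplyr)

# Toy data (not from nycflights13) with a tie for the largest delay.
toy <- tibble(id = c("a", "b", "c", "d"), delay = c(5, 30, 30, 12))

ranks <- min_rank(desc(toy$delay))          # 4 1 1 3: the two 30s share rank 1
with_ties <- slice_max(toy, delay, n = 1)   # keeps both tied rows
no_ties <- slice_max(toy, delay, n = 1, with_ties = FALSE)  # keeps exactly one row
```

On the flights data, the equivalent call would be slice_max(flights, dep_delay, n = 10).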

5. What does 1:3 + 1:10 return? Why?

1:3 + 1:10 produces a length 10 vector and a warning message. This is because the shorter vector is recycled to the length of the longer one. Since 10 is not a multiple of 3, the last repetition is incomplete, so R issues a warning (not an error) and still returns the result.

1:3 + 1:10

## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
## length

##  [1]  2  4  6  5  7  9  8 10 12 11
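
The recycling can be reproduced explicitly. A short sketch using base R's rep_len() to show what R does implicitly:

```r
# Recycle 1:3 by hand to length 10, then add elementwise.
recycled <- rep_len(1:3, 10)   # 1 2 3 1 2 3 1 2 3 1
manual <- recycled + 1:10

# The implicit version gives the same values, plus the warning.
implicit <- suppressWarnings(1:3 + 1:10)
identical(manual, implicit)    # TRUE
```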

6. What trigonometric functions does R provide?

R provides the standard trigonometric functions cos(), sin() and tan(), their inverses acos(), asin() and atan(), and the two-argument arc-tangent atan2(y, x).

cospi(x), sinpi(x), and tanpi(x) compute cos(pi*x), sin(pi*x), and tan(pi*x).

Usage:

cos(x) sin(x) tan(x)

acos(x) asin(x) atan(x) atan2(y, x)

cospi(x) sinpi(x) tanpi(x)

?Trig
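
A few calls illustrate why the pi-scaled variants and atan2() exist; this is a small sketch using only base R:

```r
sin(pi) == 0      # FALSE: pi is only approximated in floating point
sinpi(1) == 0     # TRUE: sinpi() is accurate at multiples of a half
cospi(0.5) == 0   # TRUE

# atan2(y, x) uses the signs of both arguments to pick the quadrant,
# whereas atan(y / x) cannot tell (-1, -1) from (1, 1).
atan(-1 / -1)     # pi / 4
atan2(-1, -1)     # -3 * pi / 4
```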

Task 5

Do the exercises 5.6.7

Write up your solutions to the exercises in 5.6.7 in this document, including the code chunks you use to determine the answer.

1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

str(flights)

## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

head(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flight_delay_summary <- flights %>%
  group_by(flight) %>%
  summarise(
    num_flights = n(),
    percentage_on_time = sum(arr_time == sched_arr_time) / num_flights,
    percentage_early = sum(arr_time < sched_arr_time) / num_flights,
    percentage_15_mins_early = sum(sched_arr_time - arr_time == 15) / num_flights,
    percentage_30_mins_early = sum(sched_arr_time - arr_time == 30) / num_flights,
    percentage_late = sum(arr_time > sched_arr_time) / num_flights,
    percentage_10_mins_late = sum(arr_time - sched_arr_time == 10) / num_flights,
    percentage_15_mins_late = sum(arr_time - sched_arr_time == 15) / num_flights,
    percentage_30_mins_late = sum(arr_time - sched_arr_time == 30) / num_flights,
    percentage_2_hours_late = sum(arr_time - sched_arr_time == 120) / num_flights
  )
flight_delay_summary

## # A tibble: 3,844 × 11
##    flight num_flights percentage_on_time percentage_early percentage_15_mins_ea…
##     <int>       <int>              <dbl>            <dbl>                  <dbl>
##  1      1         701           NA                 NA                   NA      
##  2      2          51            0.0392             0.725                0.0392 
##  3      3         631           NA                 NA                   NA      
##  4      4         393           NA                 NA                   NA      
##  5      5         324            0.00617            0.716                0.00926
##  6      6         210           NA                 NA                   NA      
##  7      7         237           NA                 NA                   NA      
##  8      8         236           NA                 NA                   NA      
##  9      9         153           NA                 NA                   NA      
## 10     10          61            0.0164             0.721                0.0164 
## # … with 3,834 more rows, and 6 more variables: percentage_30_mins_early <dbl>,
## #   percentage_late <dbl>, percentage_10_mins_late <dbl>,
## #   percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## #   percentage_2_hours_late <dbl>
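
The NA entries in the summary above arise because sum() returns NA whenever a group contains a missing arr_time, and the comparisons operate on HHMM clock values rather than on minutes. A minimal sketch of an alternative, assuming arr_delay (already in minutes) is an acceptable measure and that flights with a missing arr_delay are dropped; the name delay_profile and the 15-minute threshold are illustrative choices, not part of the exercise:

```r
library(dplyr)
library(nycflights13)

delay_profile <- flights %>%
  filter(!is.na(arr_delay)) %>%          # drop flights with no arrival delay recorded
  group_by(flight) %>%
  summarise(
    n = n(),
    mean_delay = mean(arr_delay),                        # sensitive to extreme delays
    median_delay = median(arr_delay),                    # typical delay
    p90_delay = quantile(arr_delay, 0.9, names = FALSE), # a bad-but-plausible day
    share_on_time = mean(arr_delay <= 0),                # proportion on time or early
    share_15_late = mean(arr_delay >= 15)                # proportion at least 15 min late
  )

delay_profile
```

Each column is one way of describing typical delay, and the shares are proportions between 0 and 1.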

A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

filter(flight_delay_summary, percentage_15_mins_early == 0.5 & percentage_15_mins_late == 0.5)

## # A tibble: 0 × 11
## # … with 11 variables: flight <int>, num_flights <int>,
## #   percentage_on_time <dbl>, percentage_early <dbl>,
## #   percentage_15_mins_early <dbl>, percentage_30_mins_early <dbl>,
## #   percentage_late <dbl>, percentage_10_mins_late <dbl>,
## #   percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## #   percentage_2_hours_late <dbl>

A flight is always 10 minutes late.

filter(flight_delay_summary, percentage_10_mins_late == 1.00)

## # A tibble: 3 × 11
##   flight num_flights percentage_on_time percentage_early percentage_15_mins_ear…
##    <int>       <int>              <dbl>            <dbl>                   <dbl>
## 1   2254           1                  0                0                       0
## 2   3880           1                  0                0                       0
## 3   5854           1                  0                0                       0
## # … with 6 more variables: percentage_30_mins_early <dbl>,
## #   percentage_late <dbl>, percentage_10_mins_late <dbl>,
## #   percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## #   percentage_2_hours_late <dbl>

A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

filter(flight_delay_summary, percentage_30_mins_early == 0.5 & percentage_30_mins_late == 0.5)

## # A tibble: 0 × 11
## # … with 11 variables: flight <int>, num_flights <int>,
## #   percentage_on_time <dbl>, percentage_early <dbl>,
## #   percentage_15_mins_early <dbl>, percentage_30_mins_early <dbl>,
## #   percentage_late <dbl>, percentage_10_mins_late <dbl>,
## #   percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## #   percentage_2_hours_late <dbl>

99% of the time a flight is on time. 1% of the time it’s 2 hours late.

filter(flight_delay_summary, percentage_on_time == 0.99 & percentage_2_hours_late == 0.01)

## # A tibble: 0 × 11
## # … with 11 variables: flight <int>, num_flights <int>,
## #   percentage_on_time <dbl>, percentage_early <dbl>,
## #   percentage_15_mins_early <dbl>, percentage_30_mins_early <dbl>,
## #   percentage_late <dbl>, percentage_10_mins_late <dbl>,
## #   percentage_15_mins_late <dbl>, percentage_30_mins_late <dbl>,
## #   percentage_2_hours_late <dbl>

Which is more important: arrival delay or departure delay?

Unfortunately, the data alone cannot tell us whether arrival delay or departure delay is more important; that depends on how the data will be used. An airline, for example, might use it for accounting, to improve its routes, or to improve the service it provides to its customers.

2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  group_by(dest) %>%
  tally()

## # A tibble: 104 × 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # … with 94 more rows

not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

## # A tibble: 4,037 × 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # … with 4,027 more rows
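
To confirm that these alternatives really match count(), the results can be compared directly. A small sketch, assuming that comparing the results as plain data frames with all.equal() is a sufficient check; the names by_dest and by_tail are introduced here for illustration:

```r
library(dplyr)
library(nycflights13)

not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))

by_dest <- not_cancelled %>% group_by(dest) %>% summarise(n = n())
by_tail <- not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))

# Compare against the count() versions from the exercise.
same_dest <- all.equal(as.data.frame(by_dest),
                       as.data.frame(count(not_cancelled, dest)))
same_tail <- all.equal(as.data.frame(by_tail),
                       as.data.frame(count(not_cancelled, tailnum, wt = distance)))
c(same_dest, same_tail)
```

Both comparisons are expected to report TRUE if the approaches are equivalent.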

3. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

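Departure, because a flight cannot arrive unless it has departed. The table below supports this: every flight with a missing dep_delay also has a missing arr_delay, so is.na(dep_delay) alone identifies flights that never left. The 1,175 flights with a dep_delay but no arr_delay did depart, so they were probably diverted rather than cancelled.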

flights %>%
    group_by(departed = !is.na(dep_delay), arrived = !is.na(arr_delay)) %>%
    summarise(n=n())

## `summarise()` has grouped output by 'departed'. You can override using the `.groups` argument.

## # A tibble: 3 × 3
## # Groups:   departed [2]
##   departed arrived      n
##   <lgl>    <lgl>    <int>
## 1 FALSE    FALSE     8255
## 2 TRUE     FALSE     1175
## 3 TRUE     TRUE    327346
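
The overlap between the two conditions can also be checked by counting rows. A minimal sketch; per the table above, is.na(dep_delay) should select 8,255 flights and the combined condition 8,255 + 1,175 = 9,430 (the variable names are illustrative):

```r
library(dplyr)
library(nycflights13)

n_either <- nrow(filter(flights, is.na(dep_delay) | is.na(arr_delay)))
n_dep <- nrow(filter(flights, is.na(dep_delay)))
n_arr <- nrow(filter(flights, is.na(arr_delay)))

# If n_either equals n_arr, the is.na(dep_delay) test adds nothing to the
# combined condition, which is why the original definition is redundant.
c(either = n_either, dep_only = n_dep, arr_only = n_arr)
```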

4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

Based on the chart, there is not a strong relationship between cancellations and delays.

flights %>%
  mutate(dep_date = lubridate::make_datetime(year, month, day)) %>%
  group_by(dep_date) %>%
  summarise(cancelled = sum(is.na(dep_delay)), 
            n = n(),
            mean_dep_delay = mean(dep_delay,na.rm=TRUE),
            mean_arr_delay = mean(arr_delay,na.rm=TRUE)) %>%
    ggplot(aes(x= cancelled/n)) + 
    geom_point(aes(y=mean_dep_delay), colour='blue', alpha=0.5) + 
    geom_point(aes(y=mean_arr_delay), colour='red', alpha=0.5) + 
    ylab('mean delay (minutes)')
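
A chart can be complemented with a single number. One possible sketch computes the correlation between the daily proportion of cancelled flights and the daily mean departure delay; the choice of Pearson correlation and the names daily and delay_cor are assumptions made here, not part of the exercise:

```r
library(dplyr)
library(nycflights13)

daily <- flights %>%
  group_by(year, month, day) %>%
  summarise(
    prop_cancelled = mean(is.na(dep_delay)),          # share of flights with no departure
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),   # average delay of flights that left
    .groups = "drop"
  )

delay_cor <- cor(daily$prop_cancelled, daily$mean_dep_delay, use = "complete.obs")
delay_cor
```

A value near 0 would support the reading of the chart, while a value closer to 1 would suggest that days with long delays also tend to have more cancellations.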

5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

The carrier with the worst delays is SkyWest Airlines Inc. (OO), with an average arrival delay of 60.6 minutes among its flights that arrived late.

It is difficult to separate the effects of bad airports from those of bad carriers. In this dataset there are 16 carriers, 3 origin airports, and 105 destination airports. Many airports act as hubs, so many destinations are served by only one or two carriers, and it is hard to tell how much of a delay is due to the carrier and how much is due to the airport. We also have to consider route delays, weather, and technical problems with individual planes, which can also cause delays.

flights %>%
    filter(arr_delay > 0) %>%
    group_by(carrier) %>%
    summarise(average_arr_delay = mean(arr_delay, na.rm=TRUE)) %>%
    arrange(desc(average_arr_delay))

## # A tibble: 16 × 2
##    carrier average_arr_delay
##    <chr>               <dbl>
##  1 OO                   60.6
##  2 YV                   51.1
##  3 9E                   49.3
##  4 EV                   48.3
##  5 F9                   47.6
##  6 VX                   43.8
##  7 FL                   41.1
##  8 WN                   40.7
##  9 B6                   40.0
## 10 AA                   38.3
## 11 MQ                   37.9
## 12 DL                   37.7
## 13 UA                   36.7
## 14 HA                   35.0
## 15 AS                   34.4
## 16 US                   29.0

flights %>%
  summarise(n_distinct(carrier),
            n_distinct(origin),
            n_distinct(dest))

## # A tibble: 1 × 3
##   `n_distinct(carrier)` `n_distinct(origin)` `n_distinct(dest)`
##                   <int>                <int>              <int>
## 1                    16                    3                105
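
One partial way to separate the two effects is to compare each flight with the average delay at its own destination, so that a carrier is judged relative to the airports it serves. A rough sketch under that assumption; the column names dest_mean, n_carriers and delay_vs_dest are hypothetical, and destinations served by a single carrier are dropped because they carry no information about the carrier alone:

```r
library(dplyr)
library(nycflights13)

carrier_vs_dest <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest) %>%
  mutate(
    dest_mean = mean(arr_delay),          # typical delay at this destination
    n_carriers = n_distinct(carrier)      # how many carriers serve it
  ) %>%
  ungroup() %>%
  filter(n_carriers > 1) %>%
  group_by(carrier) %>%
  summarise(n = n(), delay_vs_dest = mean(arr_delay - dest_mean)) %>%
  arrange(desc(delay_vs_dest))

carrier_vs_dest
```

A positive delay_vs_dest means the carrier tends to arrive later than other carriers flying to the same airports. This does not fully disentangle the effects, since origin, season and time of day are ignored.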

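6. What does the sort argument to count() do? When might you use it?

Setting sort = TRUE makes count() return the groups in descending order of n, so the largest groups appear first. This is useful when you want to see the most common values at a glance. tally() accepts the same argument, and the code below uses it to list the planes with the most flights before their first arrival delay of more than an hour.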

flights %>%
    mutate(dep_date = lubridate::make_datetime(year, month, day)) %>%
    group_by(tailnum) %>%
    arrange(dep_date) %>%
    filter(!cumany(arr_delay>60)) %>%
    tally(sort = TRUE)

## # A tibble: 3,748 × 2
##    tailnum     n
##    <chr>   <int>
##  1 N705TW     97
##  2 N765US     97
##  3 N12125     94
##  4 N320AA     94
##  5 N13110     91
##  6 N3763D     82
##  7 N58101     82
##  8 N17122     81
##  9 N961UW     80
## 10 N950UW     79
## # … with 3,738 more rows
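
The effect of sort is easiest to see on a tiny example. A small sketch on made-up data (the tibble toy is illustrative only):

```r
library(dplyr)

# Toy data: UA appears 3 times, AA twice, B6 once.
toy <- tibble(carrier = c("UA", "AA", "UA", "B6", "UA", "AA"))

unsorted <- count(toy, carrier)                # ordered by carrier: AA, B6, UA
sorted <- count(toy, carrier, sort = TRUE)     # ordered by n: UA, AA, B6
```

count(carrier, sort = TRUE) is shorthand for count(carrier) followed by arrange(desc(n)).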

Turning in your work and announcements for the remainder of the Semester

You will turn in your work by first posting it to RPubs. You will probably need to create an account; accounts are free.
The work will be due by midnight on November 21.
You will get another project starting next week which will be due after Thanksgiving.

Final Project

Then there will be a final project. There will be no final exam. I will provide a list of three projects from which you may choose the project you will do.
I will provide all the data necessary.
You will also post your final project on RPubs.
The final project will be due at midnight on the day the final is scheduled.

DATA 101: Project 2 - Part 1

Jannety Mosley

11/6/2021