Data101 NYC Flights 2013

Author

Duchelle K

Exercise 3.2.5 of R4DS

# Load the library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
  1. In a single pipeline for each condition, find all flights that meet the condition:

    • Had an arrival delay of two or more hours
    flights |> filter(arr_delay >= 120)
    # A tibble: 10,200 × 19
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
     1  2013     1     1      811            630       101     1047            830
     2  2013     1     1      848           1835       853     1001           1950
     3  2013     1     1      957            733       144     1056            853
     4  2013     1     1     1114            900       134     1447           1222
     5  2013     1     1     1505           1310       115     1638           1431
     6  2013     1     1     1525           1340       105     1831           1626
     7  2013     1     1     1549           1445        64     1912           1656
     8  2013     1     1     1558           1359       119     1718           1515
     9  2013     1     1     1732           1630        62     2028           1825
    10  2013     1     1     1803           1620       103     2008           1750
    # ℹ 10,190 more rows
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
    • Flew to Houston (IAH or HOU)
    flights |> filter(dest %in% c("IAH","HOU"))
    # A tibble: 9,313 × 19
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
     1  2013     1     1      517            515         2      830            819
     2  2013     1     1      533            529         4      850            830
     3  2013     1     1      623            627        -4      933            932
     4  2013     1     1      728            732        -4     1041           1038
     5  2013     1     1      739            739         0     1104           1038
     6  2013     1     1      908            908         0     1228           1219
     7  2013     1     1     1028           1026         2     1350           1339
     8  2013     1     1     1044           1045        -1     1352           1351
     9  2013     1     1     1114            900       134     1447           1222
    10  2013     1     1     1205           1200         5     1503           1505
    # ℹ 9,303 more rows
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
    • Were operated by United, American, or Delta
    flights |> filter(carrier %in% c("UA", "AA", "DL"))
    # A tibble: 139,504 × 19
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
     1  2013     1     1      517            515         2      830            819
     2  2013     1     1      533            529         4      850            830
     3  2013     1     1      542            540         2      923            850
     4  2013     1     1      554            600        -6      812            837
     5  2013     1     1      554            558        -4      740            728
     6  2013     1     1      558            600        -2      753            745
     7  2013     1     1      558            600        -2      924            917
     8  2013     1     1      558            600        -2      923            937
     9  2013     1     1      559            600        -1      941            910
    10  2013     1     1      559            600        -1      854            902
    # ℹ 139,494 more rows
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
    • Departed in summer (July, August, and September)
    flights |> filter(month %in% c(7, 8, 9))
    # A tibble: 86,326 × 19
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
     1  2013     7     1        1           2029       212      236           2359
     2  2013     7     1        2           2359         3      344            344
     3  2013     7     1       29           2245       104      151              1
     4  2013     7     1       43           2130       193      322             14
     5  2013     7     1       44           2150       174      300            100
     6  2013     7     1       46           2051       235      304           2358
     7  2013     7     1       48           2001       287      308           2305
     8  2013     7     1       58           2155       183      335             43
     9  2013     7     1      100           2146       194      327             30
    10  2013     7     1      100           2245       135      337            135
    # ℹ 86,316 more rows
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
    • Arrived more than two hours late, but didn’t leave late
    flights |> filter(arr_delay > 120) |> filter(dep_delay <= 0)
    # A tibble: 29 × 19
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
     1  2013     1    27     1419           1420        -1     1754           1550
     2  2013    10     7     1350           1350         0     1736           1526
     3  2013    10     7     1357           1359        -2     1858           1654
     4  2013    10    16      657            700        -3     1258           1056
     5  2013    11     1      658            700        -2     1329           1015
     6  2013     3    18     1844           1847        -3       39           2219
     7  2013     4    17     1635           1640        -5     2049           1845
     8  2013     4    18      558            600        -2     1149            850
     9  2013     4    18      655            700        -5     1213            950
    10  2013     5    22     1827           1830        -3     2217           2010
    # ℹ 19 more rows
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
    • Were delayed by at least an hour, but made up over 30 minutes in flight
    flights |> filter(dep_delay >= 60) |> filter(arr_delay <= 30)
    # A tibble: 239 × 19
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
     1  2013     1     3     1850           1745        65     2148           2120
     2  2013     1     3     1950           1845        65     2228           2227
     3  2013     1     3     2015           1915        60     2135           2111
     4  2013     1     6     1019            900        79     1558           1530
     5  2013     1     7     1543           1430        73     1758           1735
     6  2013     1    11     1020            920        60     1311           1245
     7  2013     1    12     1706           1600        66     1949           1927
     8  2013     1    12     1953           1845        68     2154           2137
     9  2013     1    19     1456           1355        61     1636           1615
    10  2013     1    21     1531           1430        61     1843           1815
    # ℹ 229 more rows
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
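The last filter above keeps flights that arrived no more than 30 minutes late, which is stricter than "made up over 30 minutes": a flight that left 120 minutes late and arrived 80 minutes late made up 40 minutes but is excluded. A minimal sketch of the more literal condition follows, using a small hypothetical tibble (`toy` and its values are made up for illustration) with the same column names as flights:

```r
library(dplyr)

# Hypothetical rows; only the column names match flights
toy <- tibble(
  dep_delay = c(65, 120, 90, 30),
  arr_delay = c(28, 80, 75, -10)
)

# Time made up in the air = departure delay minus arrival delay
made_up <- toy |>
  filter(dep_delay >= 60, dep_delay - arr_delay > 30)

made_up
```

On the real data the same condition would read flights |> filter(dep_delay >= 60, dep_delay - arr_delay > 30), which should return more rows than the 239 shown above.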
  2. Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

flights |> arrange(desc(dep_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> arrange(dep_time)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    13        1           2249        72      108           2357
 2  2013     1    31        1           2100       181      124           2225
 3  2013    11    13        1           2359         2      442            440
 4  2013    12    16        1           2359         2      447            437
 5  2013    12    20        1           2359         2      430            440
 6  2013    12    26        1           2359         2      437            440
 7  2013    12    30        1           2359         2      441            437
 8  2013     2    11        1           2100       181      111           2225
 9  2013     2    24        1           2245        76      121           2354
10  2013     3     8        1           2355         6      431            440
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

3. Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

flights |> mutate(speed = distance / air_time * 60) |> arrange(desc(speed)) |> relocate(speed, distance, air_time)
# A tibble: 336,776 × 20
   speed distance air_time  year month   day dep_time sched_dep_time dep_delay
   <dbl>    <dbl>    <dbl> <int> <int> <int>    <int>          <int>     <dbl>
 1  703.      762       65  2013     5    25     1709           1700         9
 2  650.     1008       93  2013     7     2     1558           1513        45
 3  648       594       55  2013     5    13     2040           2025        15
 4  641.      748       70  2013     3    23     1914           1910         4
 5  591.     1035      105  2013     1    12     1559           1600        -1
 6  564      1598      170  2013    11    17      650            655        -5
 7  557.     1598      172  2013     2    21     2355           2358        -3
 8  556.     1623      175  2013    11    17      759            800        -1
 9  554.     1598      173  2013    11    16     2003           1925        38
10  554.     1598      173  2013    11    16     2349           2359       -10
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

4. Was there a flight on every day of 2013?

Yes, there was a flight on every day of 2013: keeping only the distinct combinations of year, month, and day leaves 365 rows, one for each day of the year.

flights |> distinct(year, month , day)
# A tibble: 365 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     2
 3  2013     1     3
 4  2013     1     4
 5  2013     1     5
 6  2013     1     6
 7  2013     1     7
 8  2013     1     8
 9  2013     1     9
10  2013     1    10
# ℹ 355 more rows

5. Which flights traveled the farthest distance? Which traveled the least distance?

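The JFK-HNL flights traveled the farthest. The shortest route is EWR-LGA, but that flight (the first row of the second output below) has missing departure and arrival times, so it appears never to have flown; the shortest flights that actually departed are EWR-PHL.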

flights |> arrange(desc(distance)) 
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      857            900        -3     1516           1530
 2  2013     1     2      909            900         9     1525           1530
 3  2013     1     3      914            900        14     1504           1530
 4  2013     1     4      900            900         0     1516           1530
 5  2013     1     5      858            900        -2     1519           1530
 6  2013     1     6     1019            900        79     1558           1530
 7  2013     1     7     1042            900       102     1620           1530
 8  2013     1     8      901            900         1     1504           1530
 9  2013     1     9      641            900      1301     1242           1530
10  2013     1    10      859            900        -1     1449           1530
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> arrange(distance)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     7    27       NA            106        NA       NA            245
 2  2013     1     3     2127           2129        -2     2222           2224
 3  2013     1     4     1240           1200        40     1333           1306
 4  2013     1     4     1829           1615       134     1937           1721
 5  2013     1     4     2128           2129        -1     2218           2224
 6  2013     1     5     1155           1200        -5     1241           1306
 7  2013     1     6     2125           2129        -4     2224           2224
 8  2013     1     7     2124           2129        -5     2212           2224
 9  2013     1     8     2127           2130        -3     2304           2225
10  2013     1     9     2126           2129        -3     2217           2224
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
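The printed tibbles above hide the origin and dest columns, so the routes are easier to read if the data is first reduced to one row per route. A minimal sketch, assuming a small illustrative tibble `toy` (the name and the rows are stand-ins, not the full dataset):

```r
library(dplyr)

# Illustrative rows with the same column names as flights
toy <- tibble(
  origin   = c("JFK", "EWR", "EWR", "JFK", "EWR"),
  dest     = c("HNL", "PHL", "LGA", "HNL", "PHL"),
  distance = c(4983, 80, 17, 4983, 80)
)

# One row per route, longest first
routes <- toy |>
  distinct(origin, dest, distance) |>
  arrange(desc(distance))

routes
```

Replacing `toy` with flights gives the full list of routes; the first and last rows then answer both parts of the question at once.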

6. Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

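The final result is the same either way, because filter() keeps the relative order of rows and arrange() sorts whatever rows it receives. The order still matters for the amount of work: filtering first means arrange() only has to sort the rows that survive the filter, whereas arranging first sorts every row, including those that are then thrown away.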
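A quick way to check how the two orderings compare is to run both on a small example. This is a sketch on a hypothetical tibble (`toy` and its values are made up); with a filter condition that does not depend on row order, both pipelines should return the same rows in the same order:

```r
library(dplyr)

# Hypothetical data with two flights-style columns
toy <- tibble(
  carrier   = c("UA", "AA", "UA", "DL", "UA"),
  dep_delay = c(15, 3, -2, 40, 7)
)

# Filter, then sort only the remaining rows
filter_first <- toy |> filter(carrier == "UA") |> arrange(dep_delay)

# Sort everything, then filter
arrange_first <- toy |> arrange(dep_delay) |> filter(carrier == "UA")

all.equal(filter_first, arrange_first)
```

The difference is in cost rather than output: in the second pipeline arrange() sorts all five rows, in the first only three.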

Exercise 3.3.5

1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

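dep_delay is the difference between dep_time and sched_dep_time, in minutes: dep_delay = dep_time - sched_dep_time. The later a flight departs relative to its schedule, the larger dep_delay is; a negative value means an early departure. Because both times are stored as HHMM integers, subtracting them directly does not always give the delay: in the first row below, 2040 - 2123 = -83, but the actual delay is -43 minutes (8:40 pm versus 9:23 pm).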

flights |> select(dep_delay , dep_time , sched_dep_time) |> arrange(dep_delay)
# A tibble: 336,776 × 3
   dep_delay dep_time sched_dep_time
       <dbl>    <int>          <int>
 1       -43     2040           2123
 2       -33     2022           2055
 3       -32     1408           1440
 4       -30     1900           1930
 5       -27     1703           1730
 6       -26      729            755
 7       -25     1907           1932
 8       -25     2030           2055
 9       -24     1431           1455
10       -24      934            958
# ℹ 336,766 more rows
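Since dep_time and sched_dep_time are stored as HHMM integers, the relationship only holds after converting both to minutes since midnight. A minimal sketch, using two rows copied from the output above; the helper name `hhmm_to_min` is an assumption introduced for illustration:

```r
library(dplyr)

# Two rows taken from the output above
toy <- tibble(
  dep_time       = c(2040L, 729L),
  sched_dep_time = c(2123L, 755L),
  dep_delay      = c(-43, -26)
)

# Hypothetical helper: convert an HHMM integer to minutes since midnight
hhmm_to_min <- function(x) (x %/% 100) * 60 + x %% 100

checked <- toy |>
  mutate(
    naive_diff  = dep_time - sched_dep_time,  # wrong whenever the hour differs
    minute_diff = hhmm_to_min(dep_time) - hhmm_to_min(sched_dep_time)
  )

checked
```

This sketch ignores flights whose actual departure falls on the next day; those would need an extra adjustment of 24 * 60 minutes.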

2. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights |> select(dep_time, dep_delay, arr_time, arr_delay)
# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows
select(flights, dep_time, dep_delay, arr_time, arr_delay)
# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows
flights %>% select(dep_time, dep_delay, arr_time, arr_delay)
# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows
flights[, c(4, 6, 7, 9)]
# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows
flights |> select(starts_with("dep"), starts_with("arr"), ends_with("delay"))
# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows

3. What happens if you specify the name of the same variable multiple times in a select() call?

In dplyr, if we specify the same variable name multiple times in a select() call, only one copy of the variable will appear in the resulting data frame.

flights |> select(dep_time, dep_time, dep_time, arr_time, arr_time)
# A tibble: 336,776 × 2
   dep_time arr_time
      <int>    <int>
 1      517      830
 2      533      850
 3      542      923
 4      544     1004
 5      554      812
 6      554      740
 7      555      913
 8      557      709
 9      557      838
10      558      753
# ℹ 336,766 more rows

4. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

any_of() selects every column whose name appears in a character vector and silently ignores names that are not in the data frame, whereas all_of() raises an error for a missing name. With this vector, select(any_of(variables)) picks year, month, day, dep_delay, and arr_delay from flights in one call, and the same code keeps working if one of those columns has already been dropped. That also makes it handy for removing columns, for example select(-any_of(variables)), because running it a second time does not cause an error.

variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |> select(any_of(variables))
# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows
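The difference between any_of() and its strict counterpart all_of() shows up when the vector contains a name that is not a column. A small sketch on a hypothetical tibble (`toy`, `wanted`, and the name "not_a_column" are made up for illustration):

```r
library(dplyr)

# Hypothetical one-row tibble
toy <- tibble(year = 2013L, month = 1L, dep_delay = 2)

# One of these names does not exist in toy
wanted <- c("year", "dep_delay", "not_a_column")

# any_of() skips the missing name
lenient <- toy |> select(any_of(wanted))
names(lenient)

# all_of() raises an error for the missing name; capture it instead of stopping
strict_failed <- tryCatch(
  {
    toy |> select(all_of(wanted))
    FALSE
  },
  error = function(e) TRUE
)
strict_failed
```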

5. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

flights |> select(contains("TIME"))

No, it is not surprising: the output contains every column whose name includes "time". By default, contains() and the other select helpers ignore case. To change that, set the argument ignore.case = FALSE; the call then returns no columns, because no column name in the dataset contains uppercase "TIME".

flights |> select(contains("TIME"))
# A tibble: 336,776 × 6
   dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
      <int>          <int>    <int>          <int>    <dbl> <dttm>             
 1      517            515      830            819      227 2013-01-01 05:00:00
 2      533            529      850            830      227 2013-01-01 05:00:00
 3      542            540      923            850      160 2013-01-01 05:00:00
 4      544            545     1004           1022      183 2013-01-01 05:00:00
 5      554            600      812            837      116 2013-01-01 06:00:00
 6      554            558      740            728      150 2013-01-01 05:00:00
 7      555            600      913            854      158 2013-01-01 06:00:00
 8      557            600      709            723       53 2013-01-01 06:00:00
 9      557            600      838            846      140 2013-01-01 06:00:00
10      558            600      753            745      138 2013-01-01 06:00:00
# ℹ 336,766 more rows
flights |> select(contains("TIME" , ignore.case = FALSE))
# A tibble: 336,776 × 0

6. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

flights |> rename(air_time_min = air_time) |> relocate(air_time_min)
# A tibble: 336,776 × 19
   air_time_min  year month   day dep_time sched_dep_time dep_delay arr_time
          <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1          227  2013     1     1      517            515         2      830
 2          227  2013     1     1      533            529         4      850
 3          160  2013     1     1      542            540         2      923
 4          183  2013     1     1      544            545        -1     1004
 5          116  2013     1     1      554            600        -6      812
 6          150  2013     1     1      554            558        -4      740
 7          158  2013     1     1      555            600        -5      913
 8           53  2013     1     1      557            600        -3      709
 9          140  2013     1     1      557            600        -3      838
10          138  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

7. Why doesn’t the following work, and what does the error mean?

flights |> select(tailnum) |> arrange(arr_delay)
#> Error in `arrange()`:
#> ℹ In argument: `..1 = arr_delay`.
#> Caused by error:
#> ! object 'arr_delay' not found

# flights |> select(tailnum) |> arrange(arr_delay)

The error occurs because select(tailnum) keeps only the tailnum column, so arr_delay no longer exists when arrange() runs. To fix it, either include arr_delay in the selection or arrange before selecting.

flights |> select(tailnum, arr_delay) |> arrange(arr_delay)
# A tibble: 336,776 × 2
   tailnum arr_delay
   <chr>       <dbl>
 1 N843VA        -86
 2 N840VA        -79
 3 N851UA        -75
 4 N3KCAA        -75
 5 N551AS        -74
 6 N24212        -73
 7 N3760C        -71
 8 N806UA        -71
 9 N805JB        -71
10 N855VA        -70
# ℹ 336,766 more rows

Exercise 3.5.7

1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

flights |> group_by(carrier, dest) |> summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |> arrange(desc(avg_delay))
`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.
# A tibble: 314 × 3
# Groups:   carrier [16]
   carrier dest  avg_delay
   <chr>   <chr>     <dbl>
 1 UA      STL        77.5
 2 OO      ORD        67  
 3 OO      DTW        61  
 4 UA      RDU        60  
 5 EV      PBI        48.7
 6 EV      TYS        41.8
 7 EV      CAE        36.7
 8 EV      TUL        34.9
 9 9E      BGR        34  
10 WN      MSY        33.4
# ℹ 304 more rows

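In this output, United (UA) flights to STL have the worst average departure delay (77.5 minutes), but the table ranks carrier-destination pairs rather than carriers: UA appears only twice in the top 10, while ExpressJet (EV) appears four times. Ranking the carriers themselves requires grouping by carrier alone. Disentangling bad airports from bad carriers is difficult because the two are confounded: a destination may be served by only a few carriers, and a carrier-destination pair may contain very few flights, so its average can be dominated by one or two extreme delays.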
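To rank carriers rather than carrier-destination pairs, the grouping has to be by carrier only, and keeping the group size next to the average shows how much data each mean rests on. A minimal sketch on a hypothetical tibble (`toy` and its values are invented, not taken from flights):

```r
library(dplyr)

# Hypothetical flights-style rows
toy <- tibble(
  carrier   = c("UA", "UA", "UA", "EV", "EV", "OO"),
  dest      = c("STL", "ORD", "ORD", "PBI", "TYS", "ORD"),
  dep_delay = c(80, 5, NA, 50, 40, 67)
)

# Average departure delay per carrier, with the number of flights behind it
by_carrier <- toy |>
  group_by(carrier) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(avg_delay))

by_carrier
```

In this toy data the top carrier has a single flight, which illustrates why small groups can dominate such a ranking. The same pipeline applied to flights would answer the first part of the question.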

2. Find the flights that are most delayed upon departure from each destination.

flights |> group_by(dest) |> slice_max(dep_delay, n = 1, with_ties = FALSE) |> relocate(dest) |> arrange(desc(dep_delay))
# A tibble: 105 × 19
# Groups:   dest [105]
   dest   year month   day dep_time sched_dep_time dep_delay arr_time
   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1 HNL    2013     1     9      641            900      1301     1242
 2 CMH    2013     6    15     1432           1935      1137     1607
 3 ORD    2013     1    10     1121           1635      1126     1239
 4 SFO    2013     9    20     1139           1845      1014     1457
 5 CVG    2013     7    22      845           1600      1005     1044
 6 TPA    2013     4    10     1100           1900       960     1342
 7 MSP    2013     3    17     2321            810       911      135
 8 PDX    2013     6    27      959           1900       899     1236
 9 ATL    2013     7    22     2257            759       898      121
10 MIA    2013    12     5      756           1700       896     1058
# ℹ 95 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
3. How do delays vary over the course of the day? Illustrate your answer with a plot.

There are fewer delays before 5 am. A few delays are far larger than the rest; the longest belongs to a flight that departed at about 7 am (dep_time 641). Delays increase slightly over the day, and there are more delays at night.

ggplot(data = flights) +
  # na.rm belongs in geom_point(), not aes(); it drops the 8,255 rows with missing values without a warning
  geom_point(mapping = aes(x = dep_time, y = dep_delay), na.rm = TRUE) +
  labs(x = "Departure Time", y = "Departure Delay (minutes)")
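A scatterplot of every flight is heavily overplotted, so summarizing the delay per hour may show the trend more clearly. The following is only a sketch on a hypothetical tibble (`toy` and its values are invented); in flights, the hour column holds the scheduled departure hour and could be used the same way:

```r
library(dplyr)
library(ggplot2)

# Hypothetical rows with the flights column names hour and dep_delay
toy <- tibble(
  hour      = c(6, 6, 12, 12, 19, 19),
  dep_delay = c(-2, 4, 10, NA, 30, 20)
)

# Average departure delay for each scheduled hour
by_hour <- toy |>
  group_by(hour) |>
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))

# Line plot of the hourly averages
p <- ggplot(by_hour, aes(x = hour, y = avg_delay)) +
  geom_line() +
  geom_point() +
  labs(x = "Scheduled departure hour", y = "Average departure delay (minutes)")

p
```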

4. What happens if you supply a negative n to slice_min() and friends?

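A negative n is subtracted from the group size, so slice_min(dep_time, n = -5) asks for all rows except the 5 with the largest dep_time. Here all 336,776 rows still come back, presumably because with_ties = TRUE by default and the rows at the cutoff are tied (many flights have a missing dep_time), so none of them is dropped.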

flights |> slice_min(dep_time , n = -5)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    13        1           2249        72      108           2357
 2  2013     1    31        1           2100       181      124           2225
 3  2013    11    13        1           2359         2      442            440
 4  2013    12    16        1           2359         2      447            437
 5  2013    12    20        1           2359         2      430            440
 6  2013    12    26        1           2359         2      437            440
 7  2013    12    30        1           2359         2      441            437
 8  2013     2    11        1           2100       181      111           2225
 9  2013     2    24        1           2245        76      121           2354
10  2013     3     8        1           2355         6      431            440
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
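The effect of a negative n is easier to see on data without ties or missing values. A minimal sketch, assuming dplyr 1.1 or later and a hypothetical five-row tibble `toy`:

```r
library(dplyr)

# Hypothetical tibble with distinct values and no NAs
toy <- tibble(x = c(10, 20, 30, 40, 50))

# n = -2 is subtracted from the number of rows: 5 - 2 = 3 smallest values kept
kept <- toy |> slice_min(x, n = -2)
kept$x

# slice_head() treats a negative n the same way: all but the last 2 rows
head_kept <- toy |> slice_head(n = -2)
head_kept$x
```

Both calls keep the three rows with x equal to 10, 20, and 30.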

5. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

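count() is shorthand for group_by() followed by summarize(n = n()): it returns one row per unique combination of the specified variables, with the number of rows in a column named n. With sort = TRUE the result is arranged in descending order of n, so the largest groups come first; the default is sort = FALSE.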
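The relationship between count() and the other verbs can be checked by writing the same computation both ways. A sketch on a hypothetical tibble (`toy` and its values are made up for illustration):

```r
library(dplyr)

# Hypothetical carrier codes
toy <- tibble(carrier = c("UA", "AA", "UA", "DL", "UA", "AA"))

# Using count() with sort = TRUE
with_count <- toy |> count(carrier, sort = TRUE)

# The same result spelled out with group_by(), summarize(), and arrange()
by_hand <- toy |>
  group_by(carrier) |>
  summarize(n = n()) |>
  arrange(desc(n))

all.equal(with_count, by_hand)
```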

6. df <- tibble(x = 1:5, y = c("a", "b", "a", "a", "b"), z = c("K", "K", "L", "L", "K"))

a. Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

df |> group_by(y)

The output will be the same 5 × 3 tibble (5 rows, 3 columns), with x holding the numbers 1 to 5, y holding a, b, a, a, b, and z holding K, K, L, L, K. group_by() does not change or reorder the data; it only records that the rows belong to two groups defined by y (printed as "# Groups: y [2]"), so that later verbs operate group by group.

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K"))
df |> group_by(y)
# A tibble: 5 × 3
# Groups:   y [2]
      x y     z    
  <int> <chr> <chr>
1     1 a     K    
2     2 b     K    
3     3 a     L    
4     4 a     L    
5     5 b     K    

b. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a)?

df |> arrange(y)

arrange() will order the rows of the 5 × 3 tibble by the value of the y column.

df |> arrange(y)
# A tibble: 5 × 3
      x y     z    
  <int> <chr> <chr>
1     1 a     K    
2     3 a     L    
3     4 a     L    
4     2 b     K    
5     5 b     K    

It sorts rows in ascending order (smallest to largest) by default; wrapping the column in desc() sorts in descending order instead. Unlike group_by() in part (a), arrange() changes the row order but does not create groups.

c. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

df |>   group_by(y) |>   summarize(mean_x = mean(x))

The output will be a 2x2 tibble with the mean of x for each group of y.

df |> group_by(y) |>  summarize(mean_x = mean(x))
# A tibble: 2 × 2
  y     mean_x
  <chr>  <dbl>
1 a       2.67
2 b       3.5 

The pipeline first groups df by the variable y and then computes the mean of x within each group, collapsing each group to a single row. Group a contains x = 1, 3, 4, so its mean is 2.67; group b contains x = 2, 5, so its mean is 3.5. The mean of x is therefore lower for group a than for group b.
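
For comparison, dplyr 1.1.0 and later also accept a per-operation `.by` argument in summarize(). The sketch below assumes such a version (the session above loads dplyr 1.1.4) and checks that it gives the same group means as the group_by() pipeline.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# Grouping as a separate step, as in the exercise
by_group <- df |>
  group_by(y) |>
  summarize(mean_x = mean(x))

# Grouping for this one operation only (assumes dplyr >= 1.1.0)
by_arg <- df |>
  summarize(mean_x = mean(x), .by = y)

# Hand check: group a holds x = 1, 3, 4; group b holds x = 2, 5
mean(c(1, 3, 4))
mean(c(2, 5))
```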

d. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

df |>   group_by(y, z) |>   summarize(mean_x = mean(x))

The output will be a 3x3 tibble containing the mean x for each combination of y and z.

df |>   group_by(y, z) |>   summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups:   y [2]
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5

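The pipeline groups df by the values of y and z, then computes the mean of x for each combination of y and z. The message says that summarize() dropped only the last grouping variable (z), so the result is still grouped by y, and that this behavior can be changed with the .groups argument.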

e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?

df |>   group_by(y, z) |>   summarize(mean_x = mean(x), .groups = "drop")

The output will be the same as (d), but the group message will be suppressed.

df |>   group_by(y, z) |>   summarize(mean_x = mean(x), .groups = "drop")
# A tibble: 3 × 3
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5

The pipeline groups df by the values of y and z, computes the mean of x for each combination, and then drops all grouping. The summary values are the same as in (d), but the result is an ungrouped tibble: the "Groups: y [2]" header line is gone, the message is not printed, and later verbs will operate on the whole table rather than per group of y.
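
A small sketch of how the .groups argument changes the grouping of the result, using dplyr's group_vars() helper to inspect it. The three values shown ("drop_last", "drop", "keep") are documented options of summarize(); the variable names are only for illustration.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

grouped <- df |> group_by(y, z)

# Same grouping as the default in part (d), but stated explicitly (no message)
drop_last <- grouped |> summarize(mean_x = mean(x), .groups = "drop_last")

# Part (e): all grouping removed
dropped <- grouped |> summarize(mean_x = mean(x), .groups = "drop")

# Keep both grouping variables on the result
kept <- grouped |> summarize(mean_x = mean(x), .groups = "keep")

group_vars(drop_last)  # "y"
group_vars(dropped)    # character(0)
group_vars(kept)       # "y" "z"
```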

f. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

df |>   group_by(y, z) |>   summarize(mean_x = mean(x))  
df |>   group_by(y, z) |>   mutate(mean_x = mean(x))

The first output will be a 3x3 tibble with the mean x for each combination of y and z (same as (d)).

The second output will be a 5x4 tibble with an extra column titled mean_x in addition to the variable x.

df |>   group_by(y, z) |>   summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups:   y [2]
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5

The pipeline groups df by y and z and then computes the mean of x for each group, returning one row per group.

df |>   group_by(y, z) |>   mutate(mean_x = mean(x))
# A tibble: 5 × 4
# Groups:   y, z [3]
      x y     z     mean_x
  <int> <chr> <chr>  <dbl>
1     1 a     K        1  
2     2 b     K        3.5
3     3 a     L        3.5
4     4 a     L        3.5
5     5 b     K        3.5

The pipeline groups df by y and z, then adds a new column, producing a 5x4 tibble in which each row carries the mean of x for its combination of y and z.

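This output is different because mutate() keeps all five original rows and the original x column and adds mean_x as a new column, repeating each group's mean on every row of that group, whereas summarize() collapses each group to a single row. The mutate() result also stays grouped by both y and z, while the summarize() result is grouped by y only.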
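
One way to see the relationship between the two results is to join the summarized group means back onto df, which should reproduce the mutate() output. The sketch below uses left_join(), which is not part of this exercise, purely as an illustration.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# One row per (y, z) group
summarized <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# One row per original observation, with the group mean attached
mutated <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x)) |>
  ungroup()

# Joining the group means back onto df gives the same table as mutate()
joined <- df |> left_join(summarized, by = c("y", "z"))

nrow(summarized)  # number of (y, z) groups
nrow(mutated)     # number of rows in df
```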