library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
planes
## # A tibble: 3,322 × 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # ℹ 3,312 more rows
airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # ℹ 1,448 more rows
weather
## # A tibble: 26,115 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

Exercises 4.2.5

All flights that had an arrival delay of two or more hours

flights_arr2 <- flights |> 
  filter(arr_delay >= 120)

Flew to Houston (IAH or HOU)

flights_houston <- flights |>
  filter(dest %in% c("IAH", "HOU"))

Were operated by United, American, or Delta

flights_UAD <- flights |>
  filter(carrier %in% c("UA", "AA", "DL"))

Departed in summer (July, August, and September)

flights_summer <- flights |>
  filter(month %in% c(7, 8, 9))

Arrived more than two hours late, but didn’t leave late

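flights |>
  # Note: dep_delay == 0 keeps only flights that left exactly on schedule;
  # dep_delay <= 0 would also include flights that left early.
  filter(arr_delay > 120 & dep_delay == 0)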
## # A tibble: 3 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013    10     7     1350           1350         0     1736           1526
## 2  2013     5    23     1810           1810         0     2208           2000
## 3  2013     7     1      905            905         0     1443           1223
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Were delayed by at least an hour, but made up over 30 minutes in flight

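flights |>
  # Note: arr_delay <= 30 keeps flights that arrived at most 30 minutes late;
  # it does not measure how much time was made up in the air.
  filter(dep_delay >= 60 & arr_delay <= 30)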
## # A tibble: 239 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     3     1850           1745        65     2148           2120
##  2  2013     1     3     1950           1845        65     2228           2227
##  3  2013     1     3     2015           1915        60     2135           2111
##  4  2013     1     6     1019            900        79     1558           1530
##  5  2013     1     7     1543           1430        73     1758           1735
##  6  2013     1    11     1020            920        60     1311           1245
##  7  2013     1    12     1706           1600        66     1949           1927
##  8  2013     1    12     1953           1845        68     2154           2137
##  9  2013     1    19     1456           1355        61     1636           1615
## 10  2013     1    21     1531           1430        61     1843           1815
## # ℹ 229 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
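
Time made up in flight can be read as the departure delay minus the arrival delay. The following is a minimal sketch of that reading, not the only possible interpretation of the exercise; the name `made_up` is introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Keep flights that left at least an hour late and whose arrival delay
# is more than 30 minutes smaller than their departure delay.
made_up <- flights |>
  filter(dep_delay >= 60, dep_delay - arr_delay > 30)

nrow(made_up)
```

filter() drops rows where the condition is NA, so cancelled flights and flights without a recorded arrival delay are excluded automatically.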
  1. Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.
flights |> 
  arrange(desc(dep_delay), dep_time) 
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
  # Sorts by scheduled departure time; arrange(dep_time) would sort by actual departure time.
  arrange(sched_dep_time)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7    27       NA            106        NA       NA            245
##  2  2013     1     2      458            500        -2      703            650
##  3  2013     1     3      458            500        -2      650            650
##  4  2013     1     4      456            500        -4      631            650
##  5  2013     1     5      458            500        -2      640            650
##  6  2013     1     6      458            500        -2      718            650
##  7  2013     1     7      454            500        -6      637            648
##  8  2013     1     8      454            500        -6      625            648
##  9  2013     1     9      457            500        -3      647            648
## 10  2013     1    10      450            500       -10      634            648
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

flights |>
  # Note: arrange() sorts in ascending order, so this lists the slowest flights first.
  arrange(distance/air_time)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    28     1917           1825        52     2118           1935
##  2  2013     6    29      755            800        -5     1035            909
##  3  2013     8    28      932            940        -8     1116           1051
##  4  2013     1    30     1037            955        42     1221           1100
##  5  2013    11    27      556            600        -4      727            658
##  6  2013     5    21      558            600        -2      721            657
##  7  2013    12     9     1540           1535         5     1720           1656
##  8  2013     6    10     1356           1300        56     1646           1414
##  9  2013     7    28     1322           1325        -3     1612           1432
## 10  2013     4    11     1349           1345         4     1542           1453
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
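
To put the fastest flights at the top, the speed expression can be wrapped in desc(). This is a short sketch; the name `fastest` and the choice of columns to display are illustrative assumptions.

```r
library(dplyr)
library(nycflights13)

# distance is in miles and air_time in minutes, so the ratio is miles per minute.
fastest <- flights |>
  arrange(desc(distance / air_time))

# Show only the columns needed to judge the speed.
fastest |>
  select(origin, dest, distance, air_time)
```

arrange() places rows with a missing sort value last, so flights without a recorded air_time end up at the bottom.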

Was there a flight on every day of 2013?

flights |> 
  distinct(year, month, day)
## # A tibble: 365 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     2
##  3  2013     1     3
##  4  2013     1     4
##  5  2013     1     5
##  6  2013     1     6
##  7  2013     1     7
##  8  2013     1     8
##  9  2013     1     9
## 10  2013     1    10
## # ℹ 355 more rows

Answer: yes. distinct() returns 365 rows, one for each day of 2013 (which was not a leap year).

Which flights traveled the farthest distance? Which traveled the least distance?

flights |>
  arrange(desc(distance))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      857            900        -3     1516           1530
##  2  2013     1     2      909            900         9     1525           1530
##  3  2013     1     3      914            900        14     1504           1530
##  4  2013     1     4      900            900         0     1516           1530
##  5  2013     1     5      858            900        -2     1519           1530
##  6  2013     1     6     1019            900        79     1558           1530
##  7  2013     1     7     1042            900       102     1620           1530
##  8  2013     1     8      901            900         1     1504           1530
##  9  2013     1     9      641            900      1301     1242           1530
## 10  2013     1    10      859            900        -1     1449           1530
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Farthest: JFK-HNL

flights |>
  arrange(distance)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7    27       NA            106        NA       NA            245
##  2  2013     1     3     2127           2129        -2     2222           2224
##  3  2013     1     4     1240           1200        40     1333           1306
##  4  2013     1     4     1829           1615       134     1937           1721
##  5  2013     1     4     2128           2129        -1     2218           2224
##  6  2013     1     5     1155           1200        -5     1241           1306
##  7  2013     1     6     2125           2129        -4     2224           2224
##  8  2013     1     7     2124           2129        -5     2212           2224
##  9  2013     1     8     2127           2130        -3     2304           2225
## 10  2013     1     9     2126           2129        -3     2217           2224
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Shortest: EWR-LGA and EWR-PHL

Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

The result is the same either way, but the order matters for efficiency. If you filter first, there is less work because fewer rows need to be arranged. If you arrange before filtering, arrange() has to sort rows that are then thrown away.
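
A quick way to check that only the amount of work changes is to run both orders and compare the results. This is a sketch; the destination "IAH" and the names `filter_first` and `arrange_first` are chosen here for illustration.

```r
library(dplyr)
library(nycflights13)

# Filter first: only the Houston (IAH) flights are sorted.
filter_first <- flights |>
  filter(dest == "IAH") |>
  arrange(dep_delay)

# Arrange first: all flights are sorted, then most rows are discarded.
arrange_first <- flights |>
  arrange(dep_delay) |>
  filter(dest == "IAH")

# Compare the two results.
all.equal(filter_first, arrange_first)
```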

Exercises 4.3.5

  1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
flights |> 
  select(dep_time, sched_dep_time, dep_delay)
## # A tibble: 336,776 × 3
##    dep_time sched_dep_time dep_delay
##       <int>          <int>     <dbl>
##  1      517            515         2
##  2      533            529         4
##  3      542            540         2
##  4      544            545        -1
##  5      554            600        -6
##  6      554            558        -4
##  7      555            600        -5
##  8      557            600        -3
##  9      557            600        -3
## 10      558            600        -2
## # ℹ 336,766 more rows

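I expect dep_delay to equal dep_time - sched_dep_time, measured in minutes. Because both times are stored as HHMM integers, plain subtraction only matches dep_delay when the two times fall within the same hour: in row 5, 554 - 600 = -46, while dep_delay is -6. As dep_time increases and/or sched_dep_time decreases, dep_delay increases.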
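
One way to make the relationship explicit is to convert the HHMM values to minutes since midnight before subtracting. The helper `hhmm_to_min()` below is hypothetical (it is not part of dplyr or nycflights13), and the sketch ignores flights whose delay carries them past midnight, for which the difference would be off by a whole day.

```r
library(dplyr)
library(nycflights13)

# Hypothetical helper: 554 means 5:54, i.e. 5 * 60 + 54 minutes after midnight.
hhmm_to_min <- function(x) (x %/% 100) * 60 + x %% 100

flights |>
  mutate(diff_min = hhmm_to_min(dep_time) - hhmm_to_min(sched_dep_time)) |>
  select(dep_time, sched_dep_time, dep_delay, diff_min)
```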

  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
flights |> 
  select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows
flights |>
  select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows
flights |>
  select(contains("arr"), contains("dep")) # Selects too much: sched_arr_time and carrier contain "arr", and sched_dep_time contains "dep"
## # A tibble: 336,776 × 7
##    arr_time sched_arr_time arr_delay carrier dep_time sched_dep_time dep_delay
##       <int>          <int>     <dbl> <chr>      <int>          <int>     <dbl>
##  1      830            819        11 UA           517            515         2
##  2      850            830        20 UA           533            529         4
##  3      923            850        33 AA           542            540         2
##  4     1004           1022       -18 B6           544            545        -1
##  5      812            837       -25 DL           554            600        -6
##  6      740            728        12 UA           554            558        -4
##  7      913            854        19 B6           555            600        -5
##  8      709            723       -14 EV           557            600        -3
##  9      838            846        -8 B6           557            600        -3
## 10      753            745         8 AA           558            600        -2
## # ℹ 336,766 more rows
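
Two further ways to select the same four columns are sketched below: a regular expression with matches(), and a column range with the scheduled times removed. The names `by_regex` and `by_range` are illustrative.

```r
library(dplyr)
library(nycflights13)

# Anchoring the pattern with ^ excludes sched_dep_time and sched_arr_time.
by_regex <- flights |>
  select(matches("^(dep|arr)_(time|delay)$"))

# Take the range dep_time ... arr_delay, then drop the scheduled columns.
by_range <- flights |>
  select(dep_time:arr_delay, -starts_with("sched"))

names(by_regex)
names(by_range)
```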
  1. What happens if you specify the name of the same variable multiple times in a select() call?
flights |> 
  select(carrier, carrier, arr_time)
## # A tibble: 336,776 × 2
##    carrier arr_time
##    <chr>      <int>
##  1 UA           830
##  2 UA           850
##  3 AA           923
##  4 B6          1004
##  5 DL           812
##  6 UA           740
##  7 B6           913
##  8 EV           709
##  9 B6           838
## 10 AA           753
## # ℹ 336,766 more rows

The repeated variable appears only once; select() ignores the duplicate.

  1. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |> 
  select(any_of(variables))
## # A tibble: 336,776 × 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ℹ 336,766 more rows

any_of() selects the columns named in a character vector and silently skips any names that are not present in the data. Used inside select(), it lets you pick a predefined set of columns by passing the vector instead of typing every name. It is also useful for negative selections, where it makes sure a variable is removed without raising an error if the variable is already missing.
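
The difference from the stricter all_of() shows up when the vector names a column that does not exist. In this sketch, `wanted` and the column name "not_a_column" are made up for illustration.

```r
library(dplyr)
library(nycflights13)

# "not_a_column" is deliberately absent from flights.
wanted <- c("year", "month", "not_a_column")

# any_of() keeps the names that exist and skips the rest.
picked <- flights |>
  select(any_of(wanted))
names(picked)

# all_of() raises an error for the missing name; capture it instead of stopping.
strict <- tryCatch(
  flights |> select(all_of(wanted)),
  error = function(e) e
)
inherits(strict, "error")

# Negative selection: drop the columns if present, without failing on the missing one.
dropped <- flights |>
  select(-any_of(wanted))
ncol(dropped)
```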

  1. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

flights |> select(contains("TIME"))

flights |> select(contains("TIME"))
## # A tibble: 336,776 × 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # ℹ 336,766 more rows

The select helpers are not case-sensitive by default: contains("TIME") matched every column whose name contains "time".

flights |> select(contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 × 0

ignore.case is TRUE by default; set it to FALSE to make the match case-sensitive. No columns are returned because no column name contains uppercase "TIME".

  1. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
flights |>
  rename(air_time_min = air_time) |>
  relocate(air_time_min)
## # A tibble: 336,776 × 19
##    air_time_min  year month   day dep_time sched_dep_time dep_delay arr_time
##           <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1          227  2013     1     1      517            515         2      830
##  2          227  2013     1     1      533            529         4      850
##  3          160  2013     1     1      542            540         2      923
##  4          183  2013     1     1      544            545        -1     1004
##  5          116  2013     1     1      554            600        -6      812
##  6          150  2013     1     1      554            558        -4      740
##  7          158  2013     1     1      555            600        -5      913
##  8           53  2013     1     1      557            600        -3      709
##  9          140  2013     1     1      557            600        -3      838
## 10          138  2013     1     1      558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
  1. Why doesn’t the following work, and what does the error mean?

flights |>
  select(tailnum) |>
  arrange(arr_delay)
#> Error in arrange():
#> ℹ In argument: ..1 = arr_delay.
#> Caused by error:
#> ! object ‘arr_delay’ not found

flights |>
  select(tailnum)
## # A tibble: 336,776 × 1
##    tailnum
##    <chr>  
##  1 N14228 
##  2 N24211 
##  3 N619AA 
##  4 N804JB 
##  5 N668DN 
##  6 N39463 
##  7 N516JB 
##  8 N829AS 
##  9 N593JB 
## 10 N3ALAA 
## # ℹ 336,766 more rows

arr_delay was not selected. After select(tailnum), the tibble contains only tailnum, so arrange() cannot find a column named arr_delay to sort by.
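
One possible fix, sketched here, is to sort while arr_delay is still available and select afterwards. The name `by_delay` is illustrative.

```r
library(dplyr)
library(nycflights13)

# Sort first, while arr_delay still exists, then keep only tailnum.
by_delay <- flights |>
  arrange(arr_delay) |>
  select(tailnum)

by_delay
```

Keeping both columns with select(tailnum, arr_delay) before arrange() would work as well, at the cost of an extra column in the result.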

Exercises 4.5.7

  1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
flights |>
  group_by(carrier, dest) |>
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(desc(avg_delay))
## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.
## # A tibble: 314 × 3
## # Groups:   carrier [16]
##    carrier dest  avg_delay
##    <chr>   <chr>     <dbl>
##  1 UA      STL        77.5
##  2 OO      ORD        67  
##  3 OO      DTW        61  
##  4 UA      RDU        60  
##  5 EV      PBI        48.7
##  6 EV      TYS        41.8
##  7 EV      CAE        36.7
##  8 EV      TUL        34.9
##  9 9E      BGR        34  
## 10 WN      MSY        33.4
## # ℹ 304 more rows

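No airport pattern stands out among the 10 highest average delays. However, 4 of the top 10 carrier-destination pairs were operated by ExpressJet (EV), while no other airline appears more than twice. These averages describe carrier-destination pairs rather than carriers as a whole, and some pairs may rest on very few flights, so they do not fully separate carrier effects from airport effects.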
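
To answer the carrier question directly, the average can also be computed per carrier, and a flight count helps judge how much weight each average deserves. This is a sketch using departure delay, as above; the names `by_carrier` and `by_pair` are illustrative.

```r
library(dplyr)
library(nycflights13)

# Average departure delay per carrier, with the number of flights behind it.
by_carrier <- flights |>
  group_by(carrier) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(avg_delay))

# The same per carrier-destination pair; a small n marks an average to treat with care.
by_pair <- flights |>
  group_by(carrier, dest) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |>
  arrange(desc(avg_delay))

by_carrier
by_pair
```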

  1. Find the flights that are most delayed upon departure from each destination.
flights |>
  filter(dep_delay > 0) |>
  group_by(dest) |> 
  slice_max(dep_delay, n = 1) |> 
  relocate(dest)
## # A tibble: 103 × 19
## # Groups:   dest [103]
##    dest   year month   day dep_time sched_dep_time dep_delay arr_time
##    <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1 ABQ    2013    12    14     2223           2001       142      133
##  2 ACK    2013     7    23     1139            800       219     1250
##  3 ALB    2013     1    25      123           2000       323      229
##  4 ANC    2013     8    17     1740           1625        75     2042
##  5 ATL    2013     7    22     2257            759       898      121
##  6 AUS    2013     7    10     2056           1505       351     2347
##  7 AVL    2013     6    14     1158            816       222     1335
##  8 BDL    2013     2    21     1728           1316       252     1839
##  9 BGR    2013    12     1     1504           1056       248     1628
## 10 BHM    2013     4    10       25           1900       325      136
## # ℹ 93 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
  1. How do delays vary over the course of the day? Illustrate your answer with a plot.
flights |>
  ggplot(aes(x = dep_time, y = dep_delay)) +
  geom_point()
## Warning: Removed 8255 rows containing missing values (`geom_point()`).

Delays increase from about 5 to 10 am and continue to increase a little over the course of the day. There are more delays at night.
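
With more than 300,000 points, a scatterplot is hard to read, so an alternative is to plot the average delay per scheduled departure hour. This is a sketch of that summary, not the plot used above; `hourly` and `p` are illustrative names.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Average departure delay for each scheduled departure hour.
hourly <- flights |>
  filter(!is.na(dep_delay)) |>   # drop flights without a recorded departure delay
  group_by(hour) |>
  summarize(avg_delay = mean(dep_delay), n = n())

p <- ggplot(hourly, aes(x = hour, y = avg_delay)) +
  geom_line() +
  labs(x = "Scheduled departure hour", y = "Average departure delay (min)")

p
```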

  1. What happens if you supply a negative n to slice_min() and friends?
flights |>
  slice_min(dep_delay, n = -1)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

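The result still contains all 336,776 rows, sorted by dep_delay in ascending order, rather than only the rows with the lowest values. A negative n is subtracted from the number of rows, so n = -1 asks for every row except the one with the largest dep_delay; all rows come back here most likely because the missing dep_delay values tie for last place and ties are kept by default.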
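
The effect of a negative n is easier to see on a tiny data frame without ties or missing values. The tibble `tiny` below is made up for illustration, and the sketch assumes a recent dplyr (1.1 or later).

```r
library(dplyr)

# Three rows, no ties, no missing values.
tiny <- tibble(x = c(3, 1, 2))

# n = -1 means "all rows except one": the row with the largest x is dropped
# and the remaining rows are returned in ascending order of x.
kept <- tiny |>
  slice_min(x, n = -1)

kept
```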

  1. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

count() counts the rows for each unique combination of the variables you give it (e.g., number of flights per origin, per destination, or per origin-destination pair). In terms of the verbs above, it is roughly group_by() followed by summarize(n = n()). By default the result is ordered by the counted variables; sort = TRUE arranges it by n in descending order, so the most common values appear at the top.
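
The correspondence with the other verbs can be checked directly. This is a sketch; `with_count`, `by_hand`, and `sorted` are illustrative names.

```r
library(dplyr)
library(nycflights13)

# count() ...
with_count <- flights |>
  count(origin)

# ... and the same result built from group_by() and summarize().
by_hand <- flights |>
  group_by(origin) |>
  summarize(n = n())

all.equal(with_count, by_hand)

# sort = TRUE puts the largest counts first.
sorted <- flights |>
  count(origin, sort = TRUE)

sorted
```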

  1. Suppose we have the following tiny data frame:

df <- tibble(x = 1:5, y = c("a", "b", "a", "a", "b"), z = c("K", "K", "L", "L", "K"))

  1. Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

It will create a 5x3 tibble (5 rows, 3 columns) in the environment with variables x, y, and z. x will be the numbers 1-5, y will be the characters a, b, a, a, b, and z will be the characters K, K, L, L, K, each in corresponding order with x.

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

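group_by() does not change the data itself: the output is still the same 5x3 tibble, but it is now marked as grouped by the values of y (Groups: y [2]), so later verbs operate on each group separately.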

df |> group_by(y)

df |>
  group_by(y)
## # A tibble: 5 × 3
## # Groups:   y [2]
##       x y     z    
##   <int> <chr> <chr>
## 1     1 a     K    
## 2     2 b     K    
## 3     3 a     L    
## 4     4 a     L    
## 5     5 b     K
  1. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a)?

df |> arrange(y)

It will arrange observations based on the value of y (a and b) in a 5x3 tibble.

df |> 
  arrange(y)
## # A tibble: 5 × 3
##       x y     z    
##   <int> <chr> <chr>
## 1     1 a     K    
## 2     3 a     L    
## 3     4 a     L    
## 4     2 b     K    
## 5     5 b     K

arrange() sorts rows in ascending order by default and can sort in descending order with desc(). Unlike group_by(), it changes the order of the rows but does not group the output by a variable’s value.

  1. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

df |> group_by(y) |> summarize(mean_x = mean(x))

The output will produce a 2x2 tibble with the mean x for each group of y.

df |>
  group_by(y) |>
  summarize(mean_x = mean(x))
## # A tibble: 2 × 2
##   y     mean_x
##   <chr>  <dbl>
## 1 a       2.67
## 2 b       3.5

The pipe passes the data frame on its left to the verb on its right. The pipeline first groups the data in df by the variable y, then computes the mean of x within each group of y. The mean x of a (2.67) is lower than the mean x of b (3.5).

  1. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

df |> group_by(y, z) |> summarize(mean_x = mean(x))

The output will be a 3x3 tibble containing the mean x for each combination of y and z.

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))
## `summarise()` has grouped output by 'y'. You can override using the `.groups`
## argument.
## # A tibble: 3 × 3
## # Groups:   y [2]
##   y     z     mean_x
##   <chr> <chr>  <dbl>
## 1 a     K        1  
## 2 a     L        3.5
## 3 b     K        3.5

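The pipeline groups the information in df by the values of y and z, then produces the mean x for each combination of y and z values. The message says that the result is still grouped by y: summarize() drops only the last grouping variable (z), and the .groups argument can override this.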

  1. df |> group_by(y, z) |> summarize(mean_x = mean(x), .groups = "drop")

Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?

The values will be the same as in (d), but the result will no longer be grouped.

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")
## # A tibble: 3 × 3
##   y     z     mean_x
##   <chr> <chr>  <dbl>
## 1 a     K        1  
## 2 a     L        3.5
## 3 b     K        3.5

The pipeline groups the information in df by the values of y and z, then produces the mean x for each combination of y and z values, and ungroups the result. The values are the same as in (d), but the output differs in its grouping: (d) is still grouped by y (Groups: y [2]), whereas .groups = "drop" returns an ungrouped tibble and no message is printed.
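
The grouping of the two results can be inspected with group_vars(). This is a small sketch; `part_d` and `part_e` are illustrative names for the outputs of parts (d) and (e).

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# Part (d): the default drops only the last grouping variable, z.
part_d <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))

# Part (e): .groups = "drop" removes all grouping.
part_e <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

group_vars(part_d)  # still grouped by y
group_vars(part_e)  # no grouping variables left
```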

  1. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

df |> group_by(y, z) |> summarize(mean_x = mean(x))

The output will be a 3x3 tibble with the mean x for each combination of y and z (same as (d)).

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))
## `summarise()` has grouped output by 'y'. You can override using the `.groups`
## argument.
## # A tibble: 3 × 3
## # Groups:   y [2]
##   y     z     mean_x
##   <chr> <chr>  <dbl>
## 1 a     K        1  
## 2 a     L        3.5
## 3 b     K        3.5

The pipeline groups df by y and z and then produces the mean x.

df |> group_by(y, z) |> mutate(mean_x = mean(x))

The output will be a 5x4 tibble with an extra column titled mean_x in addition to the variable x.

df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x))
## # A tibble: 5 × 4
## # Groups:   y, z [3]
##       x y     z     mean_x
##   <int> <chr> <chr>  <dbl>
## 1     1 a     K        1  
## 2     2 b     K        3.5
## 3     3 a     L        3.5
## 4     4 a     L        3.5
## 5     5 b     K        3.5

The pipeline groups df by y and z, then adds a new column, producing a 5x4 tibble in which mean_x holds the mean of x for each combination of y and z. This output is different because mutate() keeps every original row and the x column, whereas summarize() collapses each group to a single row. Rows in the same group share the same mean_x (3.5 for both the a-L and b-K groups), and mean_x equals x only in the single-row a-K group.
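
The difference between the two verbs can be confirmed by comparing row counts and by checking where mean_x coincides with x. This is a sketch; `summarized` and `mutated` are illustrative names.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# summarize() returns one row per (y, z) group.
summarized <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# mutate() keeps all five rows and repeats the group mean within each group.
mutated <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x)) |>
  ungroup()

nrow(summarized)
nrow(mutated)

# Rows where the group mean happens to equal the row's own x.
mutated |>
  filter(mean_x == x)
```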