First, I load the dplyr package and the nycflights13 package, which provides the flights dataset.

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(nycflights13)

## Warning: package 'nycflights13' was built under R version 4.3.3

Question 1: Find flights with an arrival delay of two or more hours.

flights |>
  filter(arr_delay >= 120) |>
  relocate(arr_delay)

## # A tibble: 10,200 × 19
##    arr_delay  year month   day dep_time sched_dep_time dep_delay arr_time
##        <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1       137  2013     1     1      811            630       101     1047
##  2       851  2013     1     1      848           1835       853     1001
##  3       123  2013     1     1      957            733       144     1056
##  4       145  2013     1     1     1114            900       134     1447
##  5       127  2013     1     1     1505           1310       115     1638
##  6       125  2013     1     1     1525           1340       105     1831
##  7       136  2013     1     1     1549           1445        64     1912
##  8       123  2013     1     1     1558           1359       119     1718
##  9       123  2013     1     1     1732           1630        62     2028
## 10       138  2013     1     1     1803           1620       103     2008
## # ℹ 10,190 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 2: Sort flights to find the fastest flights.

flights |>
  mutate(flight_speed = distance / air_time) |>
  arrange(desc(flight_speed)) |>
  relocate(flight_speed)

## # A tibble: 336,776 × 20
##    flight_speed  year month   day dep_time sched_dep_time dep_delay arr_time
##           <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1        11.7   2013     5    25     1709           1700         9     1923
##  2        10.8   2013     7     2     1558           1513        45     1745
##  3        10.8   2013     5    13     2040           2025        15     2225
##  4        10.7   2013     3    23     1914           1910         4     2045
##  5         9.86  2013     1    12     1559           1600        -1     1849
##  6         9.4   2013    11    17      650            655        -5     1059
##  7         9.29  2013     2    21     2355           2358        -3      412
##  8         9.27  2013    11    17      759            800        -1     1212
##  9         9.24  2013    11    16     2003           1925        38       17
## 10         9.24  2013    11    16     2349           2359       -10      402
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The new flight_speed column is in miles per minute, because distance is recorded in miles and air_time in minutes. The fastest flight of the year flew from LGA to ATL on May 25, with an average speed of 11.72 miles per minute.
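Miles per minute is not a very intuitive unit, so here is a minimal sketch of the same ranking in miles per hour. The column name speed_mph and the object name fastest are my own choices for illustration, and the select() call only keeps a few columns so the route is visible in the printout.

```r
library(dplyr)
library(nycflights13)

# Same ranking as above, but with speed converted to miles per hour.
# speed_mph is a hypothetical column name used only for this sketch.
fastest <- flights |>
  mutate(speed_mph = distance / air_time * 60) |>
  arrange(desc(speed_mph)) |>
  select(month, day, carrier, flight, origin, dest, distance, air_time, speed_mph)

# Show the three fastest flights with their origin and destination.
head(fastest, 3)
```

Multiplying by 60 turns 11.72 miles per minute into roughly 703 miles per hour for the top flight.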

Question 3: Was there a flight on every day of 2013?

flights |>
  distinct(month, day)

## # A tibble: 365 × 2
##    month   day
##    <int> <int>
##  1     1     1
##  2     1     2
##  3     1     3
##  4     1     4
##  5     1     5
##  6     1     6
##  7     1     7
##  8     1     8
##  9     1     9
## 10     1    10
## # ℹ 355 more rows

The distinct() function returns a tibble with 365 distinct month and day combinations. Since 2013 was not a leap year and had 365 days, yes, there was a flight on every day of 2013.
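Counting rows works here, but the same question can also be checked against an explicit calendar. The sketch below is one possible way to do that; the names all_days, flight_days, and missing_days are made up for this example.

```r
library(dplyr)
library(nycflights13)

# Every calendar date in 2013.
all_days <- tibble(
  date = seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")
)

# Every date that has at least one row in flights.
flight_days <- flights |>
  distinct(year, month, day) |>
  mutate(date = as.Date(sprintf("%d-%02d-%02d", year, month, day)))

# Calendar dates with no matching flight date; zero rows means no gaps.
missing_days <- anti_join(all_days, flight_days, by = "date")
nrow(missing_days)
```

The anti_join() approach has the advantage of naming any missing dates rather than just signaling that the count is off.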

Question 4: Which flights traveled the farthest distance? Which traveled the least distance?

flights |> 
  arrange(desc(distance)) |>
  head()

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      857            900        -3     1516           1530
## 2  2013     1     2      909            900         9     1525           1530
## 3  2013     1     3      914            900        14     1504           1530
## 4  2013     1     4      900            900         0     1516           1530
## 5  2013     1     5      858            900        -2     1519           1530
## 6  2013     1     6     1019            900        79     1558           1530
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> 
  arrange(desc(distance)) |>
  tail()

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     3     9     1959           2000        -1     2052           2054
## 2  2013     3    16     1947           1950        -3     2055           2044
## 3  2013     3    23     1946           1950        -4     2030           2044
## 4  2013     3    30     1942           1950        -8     2026           2044
## 5  2013     4     6     1948           1950        -2     2034           2044
## 6  2013     7    27       NA            106        NA       NA            245
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The flights that traveled the farthest were those from JFK to HNL (Honolulu, Hawaii), with a distance of 4,983 miles.

Strictly speaking, the shortest entry in the dataset is the irregular flight from EWR to LGA that was discussed in class, with a distance of 17 miles. It is the last row of the output above and has no recorded departure or arrival time, so I exclude it from my analysis. With that row set aside, the flights that traveled the shortest distance were those from EWR to PHL (Philadelphia, Pennsylvania), with a distance of 80 miles.
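Because the printed output above hides the origin, dest, and distance columns, a per-route summary makes these answers easier to read off. This is only a sketch of one way to do it; routes, n, and n_departed are names I chose for illustration.

```r
library(dplyr)
library(nycflights13)

# One row per route, shortest first.
# n counts all rows on the route; n_departed counts rows with a recorded
# departure time, which helps flag routes that never actually flew.
routes <- flights |>
  group_by(origin, dest, distance) |>
  summarize(
    n = n(),
    n_departed = sum(!is.na(dep_time)),
    .groups = "drop"
  ) |>
  arrange(distance)

head(routes, 3)  # shortest routes
tail(routes, 3)  # longest routes
```

A route whose n_departed is 0 is a reasonable candidate to exclude, which is the judgment call I made for the 17-mile flight.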

Question 5: Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

When using both filter() and arrange(), the order does not matter in terms of the tibble produced. You can either filter the rows by your conditions and then arrange the result, or arrange all of the rows and then filter them by your conditions. For example,

# Version 1: filter first, then arrange
flights |>
  filter(dest == "MSN") |>
  arrange(desc(arr_delay))

## # A tibble: 572 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     9    12     1841           1350       291     2135           1531
##  2  2013    12     5     2000           1420       340     2132           1555
##  3  2013    10     7     1912           1425       287     2048           1547
##  4  2013     3     8     1907           1405       302     2031           1533
##  5  2013     3    14     1845           1405       280     2026           1533
##  6  2013    11    17       45           1935       310      206           2132
##  7  2013     6    30     1855           1415       280     2013           1539
##  8  2013    10    11     1857           1425       272     2011           1547
##  9  2013     2     6     2242           1825       257       15           2008
## 10  2013    12    31     1819           1505       194     2023           1641
## # ℹ 562 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# Version 2: arrange first, then filter
flights |>
  arrange(desc(arr_delay)) |>
  filter(dest == "MSN")

## # A tibble: 572 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     9    12     1841           1350       291     2135           1531
##  2  2013    12     5     2000           1420       340     2132           1555
##  3  2013    10     7     1912           1425       287     2048           1547
##  4  2013     3     8     1907           1405       302     2031           1533
##  5  2013     3    14     1845           1405       280     2026           1533
##  6  2013    11    17       45           1935       310      206           2132
##  7  2013     6    30     1855           1415       280     2013           1539
##  8  2013    10    11     1857           1425       272     2011           1547
##  9  2013     2     6     2242           1825       257       15           2008
## 10  2013    12    31     1819           1505       194     2023           1641
## # ℹ 562 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

These two versions produce the same result. However, in terms of computational efficiency, the order does matter. When arrange() is used first, the code must sort all 336,776 rows, most of which filter() then throws away. When filter() is used first, the code first identifies the 572 rows of interest and then sorts only this much smaller subset. Therefore, using filter() and then arrange() is more efficient.
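Both claims can be checked with a few lines of code. The sketch below stores the two versions as v1 and v2 (names chosen for this example), compares them, and takes rough timings with base R's system.time(). The repetition count of 20 is arbitrary, and the exact timings will vary by machine.

```r
library(dplyr)
library(nycflights13)

# Version 1: filter first, then arrange.
v1 <- flights |>
  filter(dest == "MSN") |>
  arrange(desc(arr_delay))

# Version 2: arrange first, then filter.
v2 <- flights |>
  arrange(desc(arr_delay)) |>
  filter(dest == "MSN")

# Compare the two results; all.equal() returns TRUE when they match.
all.equal(v1, v2)

# Rough timing of each order, repeated 20 times to smooth out noise.
system.time(for (i in 1:20) {
  flights |> filter(dest == "MSN") |> arrange(desc(arr_delay))
})
system.time(for (i in 1:20) {
  flights |> arrange(desc(arr_delay)) |> filter(dest == "MSN")
})
```

I would expect the second timing to be noticeably larger, since it sorts the full table on every pass.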

Izzy Sunby - Homework 1 - STAT333

Izzy Sunby

2025-02-17
