Load the libraries and view the “flights” dataset

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(nycflights13)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
view(flights)
describe(flights)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
##                vars      n    mean      sd median trimmed     mad  min  max
## year              1 336776 2013.00    0.00   2013 2013.00    0.00 2013 2013
## month             2 336776    6.55    3.41      7    6.56    4.45    1   12
## day               3 336776   15.71    8.77     16   15.70   11.86    1   31
## dep_time          4 328521 1349.11  488.28   1401 1346.82  634.55    1 2400
## sched_dep_time    5 336776 1344.25  467.34   1359 1341.60  613.80  106 2359
## dep_delay         6 328521   12.64   40.21     -2    3.32    5.93  -43 1301
## arr_time          7 328063 1502.05  533.26   1535 1526.42  619.73    1 2400
## sched_arr_time    8 336776 1536.38  497.46   1556 1550.67  618.24    1 2359
## arr_delay         9 327346    6.90   44.63     -5   -1.03   20.76  -86 1272
## carrier*         10 336776    7.14    4.14      6    7.00    5.93    1   16
## flight           11 336776 1971.92 1632.47   1496 1830.51 1608.62    1 8500
## tailnum*         12 334264 1814.32 1199.75   1798 1778.21 1587.86    1 4043
## origin*          13 336776    1.95    0.82      2    1.94    1.48    1    3
## dest*            14 336776   50.03   28.12     50   49.56   32.62    1  105
## air_time         15 327346  150.69   93.69    129  140.03   75.61   20  695
## distance         16 336776 1039.91  733.23    872  955.27  569.32   17 4983
## hour             17 336776   13.18    4.66     13   13.15    5.93    1   23
## minute           18 336776   26.23   19.30     29   25.64   23.72    0   59
## time_hour        19 336776     NaN      NA     NA     NaN      NA  Inf -Inf
##                range  skew kurtosis   se
## year               0   NaN      NaN 0.00
## month             11 -0.01    -1.19 0.01
## day               30  0.01    -1.19 0.02
## dep_time        2399 -0.02    -1.09 0.85
## sched_dep_time  2253 -0.01    -1.20 0.81
## dep_delay       1344  4.80    43.95 0.07
## arr_time        2399 -0.47    -0.19 0.93
## sched_arr_time  2358 -0.35    -0.38 0.86
## arr_delay       1358  3.72    29.23 0.08
## carrier*          15  0.36    -1.21 0.01
## flight          8499  0.66    -0.85 2.81
## tailnum*        4042  0.17    -1.24 2.08
## origin*            2  0.09    -1.50 0.00
## dest*            104  0.13    -1.08 0.05
## air_time         675  1.07     0.86 0.16
## distance        4966  1.13     1.19 1.26
## hour              22  0.00    -1.21 0.01
## minute            59  0.09    -1.24 0.03
## time_hour       -Inf    NA       NA   NA
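The two warnings and the NaN/Inf values in the time_hour row above appear to come from time_hour being a date-time (POSIXct) column rather than a numeric one. A minimal sketch, assuming the goal is only to summarise the other 18 columns, drops that column before calling describe(). The name flights_no_time is introduced here for illustration and is not part of the original analysis.

```r
library(dplyr)
library(nycflights13)
library(psych)

# Drop the POSIXct column, which describe() does not summarise numerically
flights_no_time <- flights %>%
  select(-time_hour)

# Character columns (carrier, tailnum, origin, dest) are still included
# and are marked with * in the output, as in the table above
describe(flights_no_time)
```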

Analysis question: how does arrival delay differ between winter months like January and summer months like July?

Use str to examine the structure of the data

str(flights)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

Filter for the top 10 arrival delays in January of 1 minute or more. The threshold of >= 1 minute is used because the data include negative delays, meaning some flights arrived ahead of schedule.

# month is an integer column, so compare it with the number 1, not the string "1"
January_arrdelays <- flights %>%
  filter(month == 1, arr_delay >= 1) %>%
  arrange(desc(arr_delay)) %>%
  head(10)
print(January_arrdelays)
## # A tibble: 10 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     1    10     1121           1635      1126     1239           1810
##  3  2013     1     1      848           1835       853     1001           1950
##  4  2013     1    13     1809            810       599     2054           1042
##  5  2013     1    16     1622            800       502     1911           1054
##  6  2013     1    23     1551            753       478     1812           1006
##  7  2013     1     1     2343           1724       379      314           1938
##  8  2013     1    10     1525            900       385     1713           1039
##  9  2013     1    25       15           1815       360      208           1958
## 10  2013     1     2     1607           1030       337     2003           1355
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The data shows that the top 10 arrival delays in January ranged from 368 minutes to 1,272 minutes.
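Because arr_delay is among the columns hidden in the printed tibble, the range cannot be read directly from the output above. The sketch below is one way to show it, assuming dplyr 1.0.0 or later for slice_max(). The name january_top10 is introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Ten longest January arrival delays; with_ties = FALSE caps the result at 10 rows
january_top10 <- flights %>%
  filter(month == 1, arr_delay >= 1) %>%
  slice_max(arr_delay, n = 10, with_ties = FALSE)

# Keep only the columns needed to read off the delays
january_top10 %>%
  select(month, day, carrier, flight, arr_delay)

# Smallest and largest arrival delay among the ten rows
range(january_top10$arr_delay)
```

The same pipeline with month == 7 gives the corresponding figures for July.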

Filter for top 10 arrival delays in July of 1 minute or more

# Same filter as for January, with month compared as a number
July_arrdelays <- flights %>%
  filter(month == 7, arr_delay >= 1) %>%
  arrange(desc(arr_delay)) %>%
  head(10)
print(July_arrdelays)
## # A tibble: 10 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7    22      845           1600      1005     1044           1815
##  2  2013     7    22     2257            759       898      121           1026
##  3  2013     7     7     2059           1030       629      106           1350
##  4  2013     7    21     1555            615       580     1955            910
##  5  2013     7     7     2123           1030       653       17           1345
##  6  2013     7    27     1456            600       536     1649            712
##  7  2013     7     6      149           1600       589      456           1935
##  8  2013     7    10     2346           1410       576      141           1630
##  9  2013     7    28        6           1600       486      231           1815
## 10  2013     7    10     2238           1439       479       39           1644
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The data shows that the top 10 arrival delays in the month of July ranged from 475 minutes to 989 minutes.
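The top 10 lists describe only the extreme tail of each month. As a complement, the sketch below compares typical arrival delays in January and July using summary statistics. The column names n_flights, mean_delay, median_delay, and share_late are chosen here for illustration, and flights with a missing arr_delay are dropped as an assumption.

```r
library(dplyr)
library(nycflights13)

delay_summary <- flights %>%
  # Keep January and July, dropping flights with no recorded arrival delay
  filter(month %in% c(1, 7), !is.na(arr_delay)) %>%
  group_by(month) %>%
  summarise(
    n_flights    = n(),
    mean_delay   = mean(arr_delay),      # average delay in minutes
    median_delay = median(arr_delay),    # less sensitive to extreme delays
    share_late   = mean(arr_delay >= 1)  # proportion arriving 1+ minutes late
  )

delay_summary
```

The median and the share of late flights are less affected by a handful of very long delays than the mean or the maximum, so they help answer the analysis question for the bulk of flights.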

Plot arrival delays in minutes by month.

plot1 <- flights %>% 
  ggplot(aes(month, arr_delay)) +
  geom_point() + 
  ggtitle("Arrival delays in minutes by month")
plot1
## Warning: Removed 9430 rows containing missing values (geom_point).
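With more than 300,000 points, the scatter plot overplots heavily, which makes it hard to compare the bulk of each month's flights. A possible alternative, sketched below, draws one boxplot per month. Treating month as a factor and dropping rows with a missing arr_delay beforehand are choices made here, not part of the original analysis, and plot2 is an illustrative name.

```r
library(ggplot2)
library(nycflights13)

# Remove flights with no recorded arrival delay so ggplot does not warn
flights_arrived <- subset(flights, !is.na(arr_delay))

plot2 <- ggplot(flights_arrived, aes(x = factor(month), y = arr_delay)) +
  geom_boxplot() +
  labs(
    x = "Month",
    y = "Arrival delay (minutes)",
    title = "Arrival delays in minutes by month"
  )
plot2
```

The boxes show the median and quartiles for each month, while points beyond the whiskers mark the outlying delays.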

Comment on the data

This dataset records the scheduled and actual departure and arrival times of flights that departed from New York City's three airports (EWR, JFK, and LGA) in 2013, with departure and arrival delays given in minutes. Delayed flights are a major concern for passengers, and they tend to occur during busy travel periods such as summer or during periods of bad weather such as winter. The dataset contains 336,776 flights across the twelve months of 2013.

I decided to investigate how arrival delays differed between a winter month such as January and a summer month such as July. I filtered the data for the top 10 arrival delays in January and found that they ranged from 368 minutes to 1,272 minutes. I also filtered for the top 10 arrival delays in July and found that they ranged from 475 minutes to 989 minutes. Judged by the maximum alone, this suggested that the longest arrival delays occur in January rather than July, although the tenth-longest delay was higher in July (475 minutes) than in January (368 minutes).

To investigate further, I plotted arrival delay by month for the entire dataset. The plot shows that a similar proportion of flights in every month had a negative arrival delay, meaning they arrived ahead of schedule. It further shows that, excluding outliers, June and July had the longest arrival delays, followed by December and January. This suggests that the busy summer months and the bad-weather winter months have the longest arrival delays. Other areas to explore include departure delays, and arrival and departure delays by carrier and by destination.