NYC Flights Homework

Load the libraries and view the “flights” dataset

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(nycflights13)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
#view(flights)
describe(flights)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
##                vars      n    mean      sd median trimmed     mad  min  max
## year              1 336776 2013.00    0.00   2013 2013.00    0.00 2013 2013
## month             2 336776    6.55    3.41      7    6.56    4.45    1   12
## day               3 336776   15.71    8.77     16   15.70   11.86    1   31
## dep_time          4 328521 1349.11  488.28   1401 1346.82  634.55    1 2400
## sched_dep_time    5 336776 1344.25  467.34   1359 1341.60  613.80  106 2359
## dep_delay         6 328521   12.64   40.21     -2    3.32    5.93  -43 1301
## arr_time          7 328063 1502.05  533.26   1535 1526.42  619.73    1 2400
## sched_arr_time    8 336776 1536.38  497.46   1556 1550.67  618.24    1 2359
## arr_delay         9 327346    6.90   44.63     -5   -1.03   20.76  -86 1272
## carrier*         10 336776    7.14    4.14      6    7.00    5.93    1   16
## flight           11 336776 1971.92 1632.47   1496 1830.51 1608.62    1 8500
## tailnum*         12 334264 1814.32 1199.75   1798 1778.21 1587.86    1 4043
## origin*          13 336776    1.95    0.82      2    1.94    1.48    1    3
## dest*            14 336776   50.03   28.12     50   49.56   32.62    1  105
## air_time         15 327346  150.69   93.69    129  140.03   75.61   20  695
## distance         16 336776 1039.91  733.23    872  955.27  569.32   17 4983
## hour             17 336776   13.18    4.66     13   13.15    5.93    1   23
## minute           18 336776   26.23   19.30     29   25.64   23.72    0   59
## time_hour        19 336776     NaN      NA     NA     NaN      NA  Inf -Inf
##                range  skew kurtosis   se
## year               0   NaN      NaN 0.00
## month             11 -0.01    -1.19 0.01
## day               30  0.01    -1.19 0.02
## dep_time        2399 -0.02    -1.09 0.85
## sched_dep_time  2253 -0.01    -1.20 0.81
## dep_delay       1344  4.80    43.95 0.07
## arr_time        2399 -0.47    -0.19 0.93
## sched_arr_time  2358 -0.35    -0.38 0.86
## arr_delay       1358  3.72    29.23 0.08
## carrier*          15  0.36    -1.21 0.01
## flight          8499  0.66    -0.85 2.81
## tailnum*        4042  0.17    -1.24 2.08
## origin*            2  0.09    -1.50 0.00
## dest*            104  0.13    -1.08 0.05
## air_time         675  1.07     0.86 0.16
## distance        4966  1.13     1.19 1.26
## hour              22  0.00    -1.21 0.01
## minute            59  0.09    -1.24 0.03
## time_hour       -Inf    NA       NA   NA
summary(flights)
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                  
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 
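The two warnings printed by describe() above appear to come from the time_hour row, the only row showing NaN, Inf and -Inf: it is a date-time column, which describe() does not summarize the way it does numbers. A minimal sketch of one way to avoid them, assuming it is acceptable to drop the character and date-time columns from the overview (the name flights_numeric is made up for this illustration):

```r
library(dplyr)
library(nycflights13)
library(psych)

# Keep only the numeric columns so describe() never sees time_hour
# (flights_numeric is a hypothetical name used for this sketch)
flights_numeric <- flights %>%
  select(where(is.numeric))

# Same descriptive statistics as before, limited to numeric variables
describe(flights_numeric)
```

summary(flights) already handles time_hour correctly, so it can still be used for the date range.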

Now create one data visualization with this dataset

Your assignment is to create one plot to visualize one aspect of this dataset. The plot may be any type we have covered so far in this class (bar graphs, scatterplots, boxplots, histograms, treemaps, heatmaps, streamgraphs, or alluvials).

(dec23 <- filter(flights, month == 12, day == 23))
## # A tibble: 985 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    23       11           2110       181      244           2339
##  2  2013    12    23       29           2359        30      515            437
##  3  2013    12    23       30           2136       174      148           2259
##  4  2013    12    23       46           2330        76      544            409
##  5  2013    12    23       58           2359        59      550            440
##  6  2013    12    23      135           2250       165      251              8
##  7  2013    12    23      136           2359        97      616            445
##  8  2013    12    23      140           2245       175      241           2355
##  9  2013    12    23      454            500        -6      646            651
## 10  2013    12    23      539            540        -1      830            850
## # ... with 975 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
(dec24 <- filter(flights, month == 12, day == 24))
## # A tibble: 761 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    24        9           2359        10      444            445
##  2  2013    12    24      458            500        -2      652            651
##  3  2013    12    24      513            515        -2      813            814
##  4  2013    12    24      543            540         3      844            850
##  5  2013    12    24      546            550        -4     1032           1027
##  6  2013    12    24      555            600        -5      851            915
##  7  2013    12    24      556            600        -4      845            846
##  8  2013    12    24      557            600        -3      908            849
##  9  2013    12    24      558            600        -2      827            831
## 10  2013    12    24      558            600        -2      729            718
## # ... with 751 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
hist(dec23$arr_delay, main = "Arrival delays for Dec 23")

hist(dec24$arr_delay, main = "Arrival delays for Dec 24")

Looking at specific carriers with arrival delays of one hour or more around the holidays

(dec23 <- filter(flights, month == 12, day == 23, arr_delay >= 60))
## # A tibble: 193 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    23       11           2110       181      244           2339
##  2  2013    12    23       30           2136       174      148           2259
##  3  2013    12    23       46           2330        76      544            409
##  4  2013    12    23       58           2359        59      550            440
##  5  2013    12    23      135           2250       165      251              8
##  6  2013    12    23      136           2359        97      616            445
##  7  2013    12    23      140           2245       175      241           2355
##  8  2013    12    23      658            645        13     1110            955
##  9  2013    12    23      830            830         0     1112            958
## 10  2013    12    23      835            724        71     1131           1024
## # ... with 183 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
hist(dec23$arr_delay, main = "Arrival delays for Dec 23 one hour or more")

(dec24 <- filter(flights, month == 12, day == 24, arr_delay >= 60))
## # A tibble: 17 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    24      640            551        49     1004            900
##  2  2013    12    24      812            701        71     1122           1008
##  3  2013    12    24     1022            800       142     1345           1105
##  4  2013    12    24     1026            900        86     1141           1023
##  5  2013    12    24     1034            947        47     1537           1430
##  6  2013    12    24     1035            835       120     1243           1106
##  7  2013    12    24     1206           1100        66     1528           1410
##  8  2013    12    24     1349           1215        94     1559           1445
##  9  2013    12    24     1413           1310        63     1708           1606
## 10  2013    12    24     1630           1455        95     1941           1820
## 11  2013    12    24     1739           1600        99     1926           1802
## 12  2013    12    24     1750           1535       135     2038           1849
## 13  2013    12    24     1801           1350       251     2108           1705
## 14  2013    12    24     1932           1715       137     2153           1850
## 15  2013    12    24     2016           1530       286     2326           1915
## 16  2013    12    24     2059           1729       210     2339           2035
## 17  2013    12    24     2247           2141        66      139             37
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
hist(dec24$arr_delay, main = "Arrival delays for Dec 24 one hour or more")

(dec23 <- filter(flights, month == 12, day == 23, arr_delay >= 60))
## # A tibble: 193 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    23       11           2110       181      244           2339
##  2  2013    12    23       30           2136       174      148           2259
##  3  2013    12    23       46           2330        76      544            409
##  4  2013    12    23       58           2359        59      550            440
##  5  2013    12    23      135           2250       165      251              8
##  6  2013    12    23      136           2359        97      616            445
##  7  2013    12    23      140           2245       175      241           2355
##  8  2013    12    23      658            645        13     1110            955
##  9  2013    12    23      830            830         0     1112            958
## 10  2013    12    23      835            724        71     1131           1024
## # ... with 183 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
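# Each late flight's arr_delay is stacked, so bar height is the carrier's total delay in minutes
plot1 <- ggplot(dec23, aes(x = carrier, y = arr_delay, fill = carrier)) +
  ggtitle("Arrival delays per carrier on Dec 23") +
  geom_bar(stat = "identity") +
  labs(x = "Carriers", y = "Total arrival delay (minutes)")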
plot1

On Dec 23rd, two days before the holidays, B6 and EV clearly stand out. Because each bar stacks the arr_delay values of the flights that arrived an hour or more late, its height is the carrier's total minutes of arrival delay, not a count of delayed flights (only 193 flights are in this filtered set). With EV (ExpressJet Airlines Inc.) at over 7,000 minutes and B6 (JetBlue Airways) at almost 6,000 minutes, I would suggest avoiding both of these airlines if you are traveling two days before the holidays!
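To separate "how many flights were late" from "how many minutes of delay piled up", the same filtered data can be summarized per carrier. This is a minimal sketch, not part of the original assignment; the names dec23_late, dec23_by_carrier, late_flights and total_delay_min are made up for illustration:

```r
library(dplyr)
library(nycflights13)

# Flights on Dec 23 that arrived at least 60 minutes late (same filter as above)
dec23_late <- filter(flights, month == 12, day == 23, arr_delay >= 60)

# One row per carrier: number of late flights and their summed delay in minutes
dec23_by_carrier <- dec23_late %>%
  group_by(carrier) %>%
  summarise(
    late_flights = n(),
    total_delay_min = sum(arr_delay),
    .groups = "drop"
  ) %>%
  arrange(desc(total_delay_min))

dec23_by_carrier
```

The total_delay_min column corresponds to the bar heights in the plot, while late_flights is the count that the wording "number of delays" would suggest.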

(dec24 <- filter(flights, month == 12, day == 24, arr_delay >= 60))
## # A tibble: 17 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    24      640            551        49     1004            900
##  2  2013    12    24      812            701        71     1122           1008
##  3  2013    12    24     1022            800       142     1345           1105
##  4  2013    12    24     1026            900        86     1141           1023
##  5  2013    12    24     1034            947        47     1537           1430
##  6  2013    12    24     1035            835       120     1243           1106
##  7  2013    12    24     1206           1100        66     1528           1410
##  8  2013    12    24     1349           1215        94     1559           1445
##  9  2013    12    24     1413           1310        63     1708           1606
## 10  2013    12    24     1630           1455        95     1941           1820
## 11  2013    12    24     1739           1600        99     1926           1802
## 12  2013    12    24     1750           1535       135     2038           1849
## 13  2013    12    24     1801           1350       251     2108           1705
## 14  2013    12    24     1932           1715       137     2153           1850
## 15  2013    12    24     2016           1530       286     2326           1915
## 16  2013    12    24     2059           1729       210     2339           2035
## 17  2013    12    24     2247           2141        66      139             37
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
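# Each late flight's arr_delay is stacked, so bar height is the carrier's total delay in minutes
plot2 <- ggplot(dec24, aes(x = carrier, y = arr_delay, fill = carrier)) +
  ggtitle("Arrival delays per carrier on Dec 24") +
  geom_bar(stat = "identity") +
  labs(x = "Carriers", y = "Total arrival delay (minutes)")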
plot2

For Dec 24th, there are 9 carriers with arrival delays of an hour or longer. The bar graph makes it clear that AA (American Airlines Inc.) accumulated the most arrival delay, while the airlines with the least are US (US Airways Inc.), WN (Southwest Airlines Co.) and 9E (Endeavor Air Inc.). Oddly enough, I have never used or heard of US or 9E; however, I am glad to hear that Southwest is among the carriers with the least arrival delay the day before the holidays!
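Since carriers fly very different numbers of flights, a rate can be a fairer comparison than a total. The following is a rough sketch of how the share of late arrivals per carrier on Dec 24 could be computed; the names dec24_rates, flights_arrived, late_flights and late_share are assumptions introduced here, and cancelled flights (missing arr_delay) are dropped:

```r
library(dplyr)
library(nycflights13)

# Share of each carrier's Dec 24 arrivals that were 60+ minutes late
dec24_rates <- flights %>%
  filter(month == 12, day == 24, !is.na(arr_delay)) %>%
  group_by(carrier) %>%
  summarise(
    flights_arrived = n(),                  # flights with a recorded arrival
    late_flights = sum(arr_delay >= 60),    # of those, how many were 60+ min late
    late_share = mean(arr_delay >= 60),     # proportion between 0 and 1
    .groups = "drop"
  ) %>%
  arrange(late_share)

dec24_rates
```

Carriers with no flight an hour or more late appear here with a late_share of 0, whereas they are simply absent from the bar graph.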

Summary

After an initial look at histograms of arrival delays for Dec 23rd and Dec 24th, I figured this would be an interesting investigation! It turns out that entirely different airlines experienced extreme arrival delays on the two days. The two bar graphs show, per carrier, the total minutes of arrival delay among flights that arrived an hour or more late, one for Dec 23rd and the other for Dec 24th. I analyzed each graph separately to give some overall suggestions for flying around the holidays.

If you are planning on flying out to meet your family for dinner on Dec 23rd, I suggest not using EV (ExpressJet Airlines) or B6 (JetBlue Airways), because they accumulated the most arrival delay among flights that were an hour or more late. Instead, I would suggest YV (Mesa Airlines), FL (AirTran Airways), 9E (Endeavor Air), US (US Airways), or AA (American Airlines), as these all come in at roughly under 1,000 total minutes of arrival delay. That is still a lot, but not nearly as bad as ExpressJet and JetBlue.

What is most interesting is that the picture for Dec 24th is completely different. In fact, arrival delays are much lower overall: only 17 flights arrived an hour or more late, compared with 193 on Dec 23rd. American Airlines has the most total delay for Dec 24th at a little under 600 minutes, which doesn't even compare to the totals for Dec 23rd. US, WN, and 9E have the least total arrival delay for Dec 24th. Therefore, I would recommend US Airways, Southwest Airlines, or Endeavor Air if you are planning to fly on Dec 24th. I hope this helps when deciding which day and which airlines are best for your holiday flights!
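The comparison between the two days could also be drawn in a single figure. Below is a minimal sketch, assuming a count of late flights per carrier is the quantity of interest; holiday_late, late_flights and plot3 are hypothetical names, and the plot is an alternative view rather than a replacement for the two bar graphs above:

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Count flights arriving 60+ minutes late, per day and carrier
holiday_late <- flights %>%
  filter(month == 12, day %in% c(23, 24), arr_delay >= 60) %>%
  mutate(date = paste("Dec", day)) %>%
  count(date, carrier, name = "late_flights")

# One panel per day, sharing the y-axis so the two days are directly comparable
plot3 <- ggplot(holiday_late, aes(x = carrier, y = late_flights, fill = carrier)) +
  geom_col() +
  facet_wrap(~date) +
  labs(
    title = "Flights arriving 60+ minutes late, Dec 23 vs Dec 24",
    x = "Carrier",
    y = "Number of late flights",
    fill = "Carrier"
  )
plot3
```

Because the panels share one y-axis, the drop in late arrivals from Dec 23 to Dec 24 is visible at a glance.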

Final Checklist

Requirements for the plot:

  1. Include at least one dplyr command (filter, arrange, summarize, group_by, select, mutate, ...)
  2. Include labels for the x- and y-axes
  3. Include a title
  4. Your plot must incorporate at least 2 colors
  5. Include a legend that indicates what the colors represent
  6. Write a brief paragraph that describes the visualization you have created and at least one aspect of the plot that you would like to highlight.