library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(nycflights13)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
#view(flights)
describe(flights)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## vars n mean sd median trimmed mad min max
## year 1 336776 2013.00 0.00 2013 2013.00 0.00 2013 2013
## month 2 336776 6.55 3.41 7 6.56 4.45 1 12
## day 3 336776 15.71 8.77 16 15.70 11.86 1 31
## dep_time 4 328521 1349.11 488.28 1401 1346.82 634.55 1 2400
## sched_dep_time 5 336776 1344.25 467.34 1359 1341.60 613.80 106 2359
## dep_delay 6 328521 12.64 40.21 -2 3.32 5.93 -43 1301
## arr_time 7 328063 1502.05 533.26 1535 1526.42 619.73 1 2400
## sched_arr_time 8 336776 1536.38 497.46 1556 1550.67 618.24 1 2359
## arr_delay 9 327346 6.90 44.63 -5 -1.03 20.76 -86 1272
## carrier* 10 336776 7.14 4.14 6 7.00 5.93 1 16
## flight 11 336776 1971.92 1632.47 1496 1830.51 1608.62 1 8500
## tailnum* 12 334264 1814.32 1199.75 1798 1778.21 1587.86 1 4043
## origin* 13 336776 1.95 0.82 2 1.94 1.48 1 3
## dest* 14 336776 50.03 28.12 50 49.56 32.62 1 105
## air_time 15 327346 150.69 93.69 129 140.03 75.61 20 695
## distance 16 336776 1039.91 733.23 872 955.27 569.32 17 4983
## hour 17 336776 13.18 4.66 13 13.15 5.93 1 23
## minute 18 336776 26.23 19.30 29 25.64 23.72 0 59
## time_hour 19 336776 NaN NA NA NaN NA Inf -Inf
## range skew kurtosis se
## year 0 NaN NaN 0.00
## month 11 -0.01 -1.19 0.01
## day 30 0.01 -1.19 0.02
## dep_time 2399 -0.02 -1.09 0.85
## sched_dep_time 2253 -0.01 -1.20 0.81
## dep_delay 1344 4.80 43.95 0.07
## arr_time 2399 -0.47 -0.19 0.93
## sched_arr_time 2358 -0.35 -0.38 0.86
## arr_delay 1358 3.72 29.23 0.08
## carrier* 15 0.36 -1.21 0.01
## flight 8499 0.66 -0.85 2.81
## tailnum* 4042 0.17 -1.24 2.08
## origin* 2 0.09 -1.50 0.00
## dest* 104 0.13 -1.08 0.05
## air_time 675 1.07 0.86 0.16
## distance 4966 1.13 1.19 1.26
## hour 22 0.00 -1.21 0.01
## minute 59 0.09 -1.24 0.03
## time_hour -Inf NA NA NA
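The two warnings and the NaN/Inf values in the time_hour row appear to come from time_hour being a date-time column rather than a number. A minimal sketch of one way to avoid them, assuming it is acceptable to leave out the character columns as well (the ones describe() marks with *): keep only the numeric columns before calling describe(). The name numeric_summary is made up for illustration.

```r
library(dplyr)
library(nycflights13)

# Keep only numeric columns; this drops time_hour (date-time) and the
# character columns carrier, tailnum, origin, and dest.
numeric_summary <- flights %>%
  select(where(is.numeric)) %>%
  psych::describe()

numeric_summary
```

Calling psych::describe() with the package prefix, instead of attaching psych, also avoids masking alpha and %+% from ggplot2.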
summary(flights)
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :29.00 Median :2013-07-03 10:00:00
## Mean :26.23 Mean :2013-07-03 05:22:54
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :59.00 Max. :2013-12-31 23:00:00
##
Your assignment is to create one plot to visualize one aspect of this dataset. The plot may be any type we have covered so far in this class (bar graphs, scatterplots, boxplots, histograms, treemaps, heatmaps, streamgraphs, or alluvials).
(dec23 <- filter(flights, month == 12, day == 23))
## # A tibble: 985 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 23 11 2110 181 244 2339
## 2 2013 12 23 29 2359 30 515 437
## 3 2013 12 23 30 2136 174 148 2259
## 4 2013 12 23 46 2330 76 544 409
## 5 2013 12 23 58 2359 59 550 440
## 6 2013 12 23 135 2250 165 251 8
## 7 2013 12 23 136 2359 97 616 445
## 8 2013 12 23 140 2245 175 241 2355
## 9 2013 12 23 454 500 -6 646 651
## 10 2013 12 23 539 540 -1 830 850
## # ... with 975 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
(dec24 <- filter(flights, month == 12, day == 24))
## # A tibble: 761 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 24 9 2359 10 444 445
## 2 2013 12 24 458 500 -2 652 651
## 3 2013 12 24 513 515 -2 813 814
## 4 2013 12 24 543 540 3 844 850
## 5 2013 12 24 546 550 -4 1032 1027
## 6 2013 12 24 555 600 -5 851 915
## 7 2013 12 24 556 600 -4 845 846
## 8 2013 12 24 557 600 -3 908 849
## 9 2013 12 24 558 600 -2 827 831
## 10 2013 12 24 558 600 -2 729 718
## # ... with 751 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
hist(dec23$arr_delay, main = "Arrival delays for Dec 23")
hist(dec24$arr_delay, main = "Arrival delays for Dec 24")
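The two base-R histograms pick their own axes and bins, which makes the days hard to compare by eye. A rough sketch of an alternative, assuming a 15-minute bin width is a reasonable choice: draw both days with ggplot2 in stacked panels that share the same x-axis. The names dec_23_24, day_label, and delay_hist are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Both days in one data frame; flights with no recorded arrival delay are dropped.
dec_23_24 <- flights %>%
  filter(month == 12, day %in% c(23, 24), !is.na(arr_delay)) %>%
  mutate(day_label = paste("Dec", day))

# One panel per day, stacked vertically so the shared x-axis lines up.
delay_hist <- ggplot(dec_23_24, aes(x = arr_delay)) +
  geom_histogram(binwidth = 15) +
  facet_wrap(~ day_label, ncol = 1) +
  labs(
    title = "Arrival delays on Dec 23 and Dec 24",
    x = "Arrival delay (minutes)",
    y = "Number of flights"
  )

delay_hist
```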
(dec23 <- filter(flights, month == 12, day == 23, arr_delay >= 60))
## # A tibble: 193 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 23 11 2110 181 244 2339
## 2 2013 12 23 30 2136 174 148 2259
## 3 2013 12 23 46 2330 76 544 409
## 4 2013 12 23 58 2359 59 550 440
## 5 2013 12 23 135 2250 165 251 8
## 6 2013 12 23 136 2359 97 616 445
## 7 2013 12 23 140 2245 175 241 2355
## 8 2013 12 23 658 645 13 1110 955
## 9 2013 12 23 830 830 0 1112 958
## 10 2013 12 23 835 724 71 1131 1024
## # ... with 183 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
hist(dec23$arr_delay, main = "Arrival delays for Dec 23 one hour or more")
(dec24 <- filter(flights, month == 12, day == 24, arr_delay >= 60))
## # A tibble: 17 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 24 640 551 49 1004 900
## 2 2013 12 24 812 701 71 1122 1008
## 3 2013 12 24 1022 800 142 1345 1105
## 4 2013 12 24 1026 900 86 1141 1023
## 5 2013 12 24 1034 947 47 1537 1430
## 6 2013 12 24 1035 835 120 1243 1106
## 7 2013 12 24 1206 1100 66 1528 1410
## 8 2013 12 24 1349 1215 94 1559 1445
## 9 2013 12 24 1413 1310 63 1708 1606
## 10 2013 12 24 1630 1455 95 1941 1820
## 11 2013 12 24 1739 1600 99 1926 1802
## 12 2013 12 24 1750 1535 135 2038 1849
## 13 2013 12 24 1801 1350 251 2108 1705
## 14 2013 12 24 1932 1715 137 2153 1850
## 15 2013 12 24 2016 1530 286 2326 1915
## 16 2013 12 24 2059 1729 210 2339 2035
## 17 2013 12 24 2247 2141 66 139 37
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
hist(dec24$arr_delay, main = "Arrival delays for Dec 24 one hour or more")
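# stat = "identity" stacks each flight's arr_delay, so a bar's height is the carrier's total delay in minutes
plot1 <- ggplot(dec23, aes(x = carrier, y = arr_delay, fill = carrier)) +
  geom_bar(stat = "identity") +
  ggtitle("Total arrival delay per carrier on Dec 23 (flights 60+ minutes late)") +
  labs(x = "Carrier", y = "Total arrival delay (minutes)")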
plot1
On Dec 23rd, two days before the holidays, B6 and EV clearly stand out. Because the bars stack the arr_delay of every flight that arrived at least an hour late, each bar's height is that carrier's total arrival delay in minutes, not a count of delayed flights. EV (ExpressJet Airlines Inc.) accumulated over 7,000 minutes and B6 (JetBlue Airways) almost 6,000, so I would suggest not flying either of these airlines if you are traveling two days before the holidays!
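Since stat = "identity" sums arr_delay within each carrier, the bar heights are totals in minutes. A small sketch that puts those totals next to the actual number of late flights, so the two are not confused; the names dec23_by_carrier, n_late, and total_delay_min are made up for illustration.

```r
library(dplyr)
library(nycflights13)

# Per carrier: how many flights arrived 60+ minutes late on Dec 23,
# and how many minutes of arrival delay those flights add up to.
dec23_by_carrier <- flights %>%
  filter(month == 12, day == 23, arr_delay >= 60) %>%
  group_by(carrier) %>%
  summarise(
    n_late = n(),
    total_delay_min = sum(arr_delay),
    .groups = "drop"
  ) %>%
  arrange(desc(total_delay_min))

dec23_by_carrier
```

The total_delay_min column corresponds to the bar heights in the plot, while n_late is the count of delayed flights.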
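# stat = "identity" stacks each flight's arr_delay, so a bar's height is the carrier's total delay in minutes
plot2 <- ggplot(dec24, aes(x = carrier, y = arr_delay, fill = carrier)) +
  geom_bar(stat = "identity") +
  ggtitle("Total arrival delay per carrier on Dec 24 (flights 60+ minutes late)") +
  labs(x = "Carrier", y = "Total arrival delay (minutes)")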
plot2
For Dec 24th, 9 carriers have arrival delays of an hour or longer. The bar graph makes it clear that AA (American Airlines Inc.) has the most total arrival delay, while US (US Airways Inc.), WN (Southwest Airlines Co.), and 9E (Endeavor Air Inc.) have the least. Oddly enough, I haven’t ever used or heard of US or 9E; however, I am glad to hear Southwest is among the carriers with the least arrival delay the day before the holidays!
After first comparing Dec 23rd and Dec 24th with histograms to get a general sense of the data, I figured this would be an interesting investigation! It turns out that entirely different airlines experienced extreme arrival delays on the two days. The two bar graphs show each carrier's total arrival delay, in minutes, across flights that arrived at least an hour late, one graph for Dec 23rd and the other for Dec 24th. I analyzed each graph separately to give some overall suggestions for flying around the holidays!
If you are planning on flying out to meet your family for dinner on Dec 23rd, I suggest not using EV (ExpressJet Airlines) or B6 (JetBlue), because they are the carriers with the most accumulated delay from flights that were an hour or more late. Instead, I would suggest YV (Mesa Airlines), FL (AirTran Airways), 9E (Endeavor Air), US (US Airways), or AA (American Airlines), as each of these totals roughly under 1,000 minutes of arrival delay. That is still a lot, but not nearly as bad as ExpressJet and JetBlue.
What is most interesting is how different Dec 24th looks. In fact, arrival delay is much lower overall: only 17 flights arrived an hour or more late, compared with 193 on Dec 23rd. American Airlines has the highest total for Dec 24th at a little under 600 minutes, which doesn’t even compare to the totals for Dec 23rd. US, WN, and 9E have the lowest totals for Dec 24th. Therefore, I would recommend US Airways, Southwest Airlines, or Endeavor Air if you are planning to fly on Dec 24th. I hope this helps when deciding which day and which airlines are best for your holiday flights!
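Total delay minutes also depend on how many flights each carrier operates, so a busy carrier can look worse simply because it flies more. As a hedged follow-up sketch, not part of the original analysis, the share of each carrier's flights that arrived 60+ minutes late can be computed per day, with full carrier names looked up from the airlines table that ships with nycflights13. The names late_share, n_flights, n_late, and share_late are hypothetical.

```r
library(dplyr)
library(nycflights13)

# Share of arriving flights that were 60+ minutes late, per day and carrier.
# Flights without a recorded arrival delay are excluded from the denominator.
late_share <- flights %>%
  filter(month == 12, day %in% c(23, 24), !is.na(arr_delay)) %>%
  group_by(day, carrier) %>%
  summarise(
    n_flights = n(),
    n_late = sum(arr_delay >= 60),
    share_late = n_late / n_flights,
    .groups = "drop"
  ) %>%
  left_join(airlines, by = "carrier") %>%  # adds the full carrier name
  arrange(day, desc(share_late))

late_share
```

A carrier with only a handful of flights can show an extreme share from one or two late arrivals, so n_flights is kept in the table for context.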