NYC Flights Homework

Load the libraries and view the “flights” dataset

library(tidyverse)
library(nycflights13)
head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
data("flights")

My Breakdown

I want to look at a specific airport and determine which of the top 3 airlines is the best. I’ll measure this by examining delay times.

Now let’s identify all the carrier codes

unique(flights$carrier)
##  [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV"
## [16] "OO"
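The output above lists two-letter carrier codes rather than airline names. As a minimal sketch, the codes can be matched to full names with the `airlines` lookup table that ships with nycflights13. The object name `carrier_names` is my own choice for illustration.

```r
library(dplyr)
library(nycflights13)

# Match each carrier code that appears in flights to its full name
# using the airlines lookup table (columns: carrier, name)
carrier_names <- flights %>%
  distinct(carrier) %>%
  left_join(airlines, by = "carrier") %>%
  arrange(carrier)

carrier_names
```

This makes it easier to confirm that "AA", "DL", and "UA" are the codes for American, Delta, and United before filtering on them.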

Next, let’s choose one airport and filter the data down to the big 3 airlines

I’ve been to JFK, so I’ll choose that one. To pick the top 3 airlines, I used this source: Top 10 Largest Airlines in the United States by Capacity

According to that source, United, American, and Delta were the top 3 airlines in 2023. I want to look at their performance in 2013 to see whether any of them had significant delays.

# Keep only flights departing JFK on American (AA), Delta (DL), or United (UA)
JFK <- flights %>%
  filter(origin == "JFK", carrier %in% c("AA", "DL", "UA"))
JFK
## # A tibble: 39,018 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      558            600        -2      924            917
##  3  2013     1     1      606            610        -4      837            845
##  4  2013     1     1      611            600        11      945            931
##  5  2013     1     1      628            630        -2     1137           1140
##  6  2013     1     1      655            655         0     1021           1030
##  7  2013     1     1      655            700        -5     1037           1045
##  8  2013     1     1      656            659        -3      949            959
##  9  2013     1     1      712            715        -3     1023           1035
## 10  2013     1     1      743            730        13     1107           1100
## # ℹ 39,008 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
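Before comparing delays, it may help to check how many flights each carrier contributes, since a carrier with more flights will tend to show more points and more outliers in any plot. A minimal sketch, rebuilding the same filter so the chunk runs on its own; `flights_per_carrier` and `n_flights` are names I picked for illustration.

```r
library(dplyr)
library(nycflights13)

# Same filter as above, repeated so this chunk is self-contained
JFK <- flights %>%
  filter(origin == "JFK", carrier %in% c("AA", "DL", "UA"))

# Number of JFK departures per carrier, largest first
flights_per_carrier <- JFK %>%
  count(carrier, sort = TRUE, name = "n_flights")

flights_per_carrier
```

If the counts turn out to be very uneven, raw numbers of delayed flights are not directly comparable between carriers, and rates or averages would be a fairer measure.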

Now I’ll sort the JFK flights by departure delay, longest first, to find the most delayed flights

dep_delays <- JFK %>%
  arrange(desc(dep_delay))
head(dep_delays)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    20     1139           1845      1014     1457           2210
## 2  2013     4    10     1100           1900       960     1342           2211
## 3  2013     6    27      959           1900       899     1236           2226
## 4  2013     5    19      713           1700       853     1007           1955
## 5  2013    12    14      830           1845       825     1210           2154
## 6  2013     3    18     1020           2100       800     1336             32
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
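Sorting shows the most extreme flights, but judging which airline is "best" probably needs a per-carrier summary as well. The sketch below is one possible way to do that, not the only one. The 15-minute cutoff for calling a flight "late" is an assumption on my part, and `delay_summary` and its column names are made up for this example. Rows with a missing `dep_delay` (likely cancelled flights) are dropped first.

```r
library(dplyr)
library(nycflights13)

# Same filter as above, repeated so this chunk is self-contained
JFK <- flights %>%
  filter(origin == "JFK", carrier %in% c("AA", "DL", "UA"))

delay_summary <- JFK %>%
  filter(!is.na(dep_delay)) %>%             # drop flights with no recorded delay
  group_by(carrier) %>%
  summarise(
    n_flights    = n(),                     # flights with a recorded delay
    mean_delay   = mean(dep_delay),         # average delay in minutes
    median_delay = median(dep_delay),       # typical delay, robust to outliers
    max_delay    = max(dep_delay),          # single worst delay
    share_late   = mean(dep_delay > 15),    # assumed cutoff: more than 15 minutes
    .groups = "drop"
  ) %>%
  arrange(mean_delay)

delay_summary
```

The median is included alongside the mean because a handful of very long delays, like the ones at the top of the sorted table, can pull the mean up considerably.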

I want to quickly look at all the columns within the dataset

names(flights)
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

Now that I have created and found the data I need, it’s time to plot!

# Turn month numbers into labeled factor levels (Jan, Feb, ...) for the x axis
delays <- dep_delays %>%
  mutate(month = factor(month, levels = 1:12, labels = month.abb))

# Plot from delays (not dep_delays) so the month labels are used;
# no extra x scale is needed because the factor already carries the labels
flight_plot <- delays %>%
  ggplot(aes(x = month, y = dep_delay, color = carrier)) +
  geom_point(alpha = 0.5) +
  # Labels follow the alphabetical order of the carrier codes: AA, DL, UA
  scale_color_discrete(name = "Airlines", labels = c('American Airlines', 'Delta Airlines', 'United Airlines')) +
  xlab("Month") +
  ylab("Departure Delay (minutes)") +
  ggtitle("JFK 2013 Airline Departure Delays")

flight_plot

My Discoveries

This visualization shows the departure delays for American, Delta, and United Airlines flights out of JFK in 2013. Each point is one flight, placed by month and by how many minutes late it departed. After analyzing the plot, I would like to highlight that Delta Airlines appears to have a much wider spread of delay times. It also has many more outliers, which suggests that its passengers faced some of the longest waits. There are many more green dots bunched up together, which points to repeated delays across the year, particularly during the summer. A very interesting note is that the single longest delay belongs to American Airlines: a September flight that left 1,014 minutes late, or almost 17 hours. Another note is that United Airlines did not show any extreme delay times or much spread that year. In fact, during March, April, July, and August, the plot shows United Airlines flights leaving earlier than their scheduled times.
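Because the points in a scatter plot overlap heavily, monthly patterns like the ones described above are easier to check with a small summary table. The sketch below is one possible check, not part of the original analysis. It computes the median departure delay and the share of flights that left ahead of schedule for each carrier and month. The names `monthly_delay`, `median_delay`, and `share_early` are made up for this example.

```r
library(dplyr)
library(nycflights13)

# Same filter as above, repeated so this chunk is self-contained
JFK <- flights %>%
  filter(origin == "JFK", carrier %in% c("AA", "DL", "UA"))

monthly_delay <- JFK %>%
  filter(!is.na(dep_delay)) %>%            # drop flights with no recorded delay
  group_by(carrier, month) %>%
  summarise(
    median_delay = median(dep_delay),      # typical delay in minutes
    share_early  = mean(dep_delay < 0),    # fraction departing before schedule
    .groups = "drop"
  )

# Look at one carrier at a time, for example United
monthly_delay %>% filter(carrier == "UA")
```

A negative `median_delay` in a given month would mean that more than half of that carrier's flights left early, which is a stricter test than spotting a few early points on the plot.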

Hope you enjoyed! Thank you!