# Load the tidyverse (which already attaches dplyr and ggplot2) and the flight data.
library(tidyverse)
library(nycflights13)
library(lubridate)

##Introduction

This report investigates the relationship between the day a flight departs from a New York City airport and that flight's departure status. The data sets below involve several variables, some of which are shared between them.

#Percentage of Flights Canceled and Delayed

This data set lists, for each day of 2013, the month and day, the total number of flights, how many of those flights were canceled or delayed by more than an hour, and that count as a percentage of the day's total. It keeps only the days on which more than 35% of flights were canceled or delayed.

flights_can_or_del <- flights %>%
  group_by(month, day) %>% 
  summarize(total_flights = n(),
            can_or_del = sum(is.na(dep_time) | dep_delay > 60)) %>% 
  mutate(percent = (can_or_del / total_flights) * 100) %>% 
  filter(percent > 35)
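
The condition `is.na(dep_time) | dep_delay > 60` depends on how R treats missing values. In `flights`, a missing `dep_time` comes with a missing `dep_delay`, and `TRUE | NA` evaluates to `TRUE`, so a canceled flight is still counted and the sum never becomes `NA`. A minimal sketch on made-up rows (the `toy` tibble and its values are hypothetical, not taken from `flights`):

```r
library(dplyr)

# Hypothetical rows: one canceled flight (NA departure), one long delay,
# one short delay, and one early departure.
toy <- tibble(
  dep_time  = c(NA, 900L, 1015L, 1155L),
  dep_delay = c(NA, 75, 15, -5)
)

# Same flag as in the summary above: canceled OR delayed more than 60 minutes.
toy_flags <- toy %>%
  mutate(can_or_del = is.na(dep_time) | dep_delay > 60)

# Number of flagged flights in the toy data.
toy_count <- sum(toy_flags$can_or_del)
print(toy_flags)
```

Only the canceled flight and the 75-minute delay are flagged; the 15-minute delay and the early departure are not.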

Only 7 of the 12 months contain such days, with one to three days each. The dates are: February 8th and 9th, March 8th, May 23rd, June 24th and 28th, July 1st, 10th, and 23rd, September 2nd and 12th, and December 5th. Of these dates, the ones with the highest percentages fall in the colder part of the year (February and March), and the ones in the warmer part of the year (June and July) have lower percentages. That said, the total number of flights also has to be taken into account in this comparison, since there are more flights in the warmer months than in the colder months.
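
The point about flight volume can be checked directly. The sketch below counts departures per month so the flagged days can be read against how busy each month was; the name `flights_per_month` is introduced here only for illustration.

```r
library(dplyr)
library(nycflights13)

# Count all scheduled departures in each month of 2013.
flights_per_month <- flights %>%
  count(month, name = "total_flights") %>%
  arrange(month)

print(flights_per_month)
```

Comparing the `total_flights` column across months shows whether the warmer months really carry more traffic before percentages are compared between seasons.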

#Weather Conditions in NYC 2013

Source: https://weatherspark.com/h/y/23912/2013/Historical-Weather-during-2013-in-New-York-City-New-York-United-States

-December 5: cold, overcast, haze, smoke, light drizzle

In general, the weather lines up with the number of cancellations on these dates. In the winter months, February and March, the dates had snow, which would cause cancellations. In the spring and summer months, May, June, and July, the dates had rain and thunderstorms, which would account for a high number of cancellations. The September and December dates also had weather that would cause cancellations: thunderstorms, and haze or smoke. All of these conditions reduce visibility or make flying unsafe, leading to a high number of cancellations.
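
The weather notes above come from an external site. As a rough cross-check, the `weather` table that ships with nycflights13 can be joined to the flagged days. This is only a sketch: it assumes the `visib` (visibility, in miles) and `precip` (precipitation, in inches) columns are adequate stand-ins for the conditions described, and the names `bad_days`, `daily_weather`, and `bad_day_weather` are introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Days on which more than 35% of flights were canceled or delayed over an hour.
bad_days <- flights %>%
  group_by(month, day) %>%
  summarize(percent = mean(is.na(dep_time) | dep_delay > 60) * 100,
            .groups = "drop") %>%
  filter(percent > 35)

# Daily weather pooled over the three airports: mean visibility and the sum of
# hourly precipitation readings. na.rm = TRUE skips missing readings.
daily_weather <- weather %>%
  group_by(month, day) %>%
  summarize(mean_visib = mean(visib, na.rm = TRUE),
            total_precip = sum(precip, na.rm = TRUE),
            .groups = "drop")

# Attach the weather summary to each flagged day.
bad_day_weather <- bad_days %>%
  left_join(daily_weather, by = c("month", "day"))

print(bad_day_weather)
```

Low `mean_visib` or high `total_precip` on a flagged day would be consistent with the weather explanation, though it would not by itself show that weather caused the cancellations.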

#Flights Canceled or Delayed More Than an Hour

This data set shows, for each scheduled departure hour, the total number of flights and the number canceled or delayed by more than an hour. It also gives that number as a percentage of the hour's total.

flights_perc <- flights %>% 
  mutate(hour = sched_dep_time %/% 100) %>% 
  group_by(hour) %>% 
  summarize(total_flights = n(),
            can_del_flights = sum(is.na(dep_time) | dep_delay > 60)) %>% 
  mutate(per_can_or_del = (can_del_flights / total_flights) * 100)

These data show that the later in the day a flight is scheduled, the more likely it is to be canceled or delayed by more than an hour. When reading this table, two columns need to be considered together: the total number of flights and the number of canceled or delayed flights. There is only 1 flight scheduled at 1 am, and that flight was canceled or delayed, so the percentage for this hour is 100%. This makes the hour an outlier in the set. Aside from this outlier, the other hours contain enough flights to give a reasonable estimate of the chance that a flight will be canceled or delayed.
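
One way to keep a single-flight hour from distorting the picture is to drop hours with very few flights before comparing percentages. A minimal sketch follows; the cutoff `min_flights` and the name `flights_perc_reliable` are assumptions made for illustration, not part of the original analysis.

```r
library(dplyr)
library(nycflights13)

min_flights <- 100  # assumed cutoff, chosen only for illustration

# Same hourly summary as above, then keep only hours with enough flights
# for the percentage to be meaningful.
flights_perc_reliable <- flights %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(total_flights = n(),
            can_del_flights = sum(is.na(dep_time) | dep_delay > 60),
            .groups = "drop") %>%
  mutate(per_can_or_del = (can_del_flights / total_flights) * 100) %>%
  filter(total_flights >= min_flights)

print(flights_perc_reliable)
```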

#Departure Hour and Percentage Canceled or Delayed

This plot shows the relationship between the hour and the percentage of flights canceled or delayed more than an hour. The departure hour is shown on the x-axis, and the percentage of flights canceled or delayed more than an hour is shown on the y-axis.

ggplot(flights_perc) +
  geom_point(mapping = aes(x = hour, y = per_can_or_del)) +
  labs(x = "Departure Hour",
       y = "Percentage of Flights Canceled or Delayed",
       title = "Departure Hour and Their Canceled or Delayed Flights")

As discussed previously, flights are generally more likely to be canceled or delayed as the day goes on. The outlier at hour 1 appears at 100%. All other hours run from hour 5 to hour 23 (11 pm). The percentage rises slowly but clearly until about hour 21. It then starts to fall, but, as the table shows, the total number of flights also falls at those hours. The riskiest time of day to fly is around 7 to 9 pm (hours 19-21). Possible causes include air traffic, cancellations or delays carried over from other airports, and the weather delays described earlier. To avoid a delay or cancellation, the best time to fly is early in the morning, around hours 5-7.
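
Because the number of flights differs so much between hours, it can help to show that number on the plot itself. The sketch below is one possible variant, not the original figure: it maps point size to `total_flights`, so sparsely scheduled hours such as hour 1 appear as small points. The names `hourly` and `p` are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Hourly summary: total flights and percentage canceled or delayed over an hour.
hourly <- flights %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(total_flights = n(),
            per_can_or_del = mean(is.na(dep_time) | dep_delay > 60) * 100,
            .groups = "drop")

# Point size reflects how many flights each percentage is based on.
p <- ggplot(hourly, aes(x = hour, y = per_can_or_del)) +
  geom_point(aes(size = total_flights)) +
  labs(x = "Departure Hour",
       y = "Percentage of Flights Canceled or Delayed",
       size = "Total Flights")

print(p)
```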

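#Flights Early or On Time

This data set builds on the days singled out in “Percentage of Flights Canceled and Delayed”. Instead of the percentage canceled or delayed, it shows the percentage of flights on those days that were not, stored in the percent_early column and computed as 100 minus the percentage canceled or delayed by more than an hour. This column is treated here as early or on time, although it also includes flights delayed by an hour or less.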

flights_ear_on_time <- flights %>%
  group_by(month, day) %>% 
  summarize(total_flights = n(),
            can_or_del = sum(is.na(dep_time) | dep_delay > 60)) %>% 
  mutate(percent_can = (can_or_del / total_flights) * 100) %>% 
  filter(percent_can > 35) %>% 
  mutate(percent_early = 100 - percent_can) %>%
  select(month, day, total_flights, percent_early) 
print(flights_ear_on_time)

These data show that the higher percentages fall in the spring and summer months, while the lower percentages fall in the winter months. This makes sense, since the percentage canceled or delayed from earlier showed the opposite pattern. Given the weather information, it also makes sense because it is easier to stay on schedule around a summer thunderstorm, which can often be handled by rerouting, than in winter snow, which adds de-icing time.
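
Since `percent_early` is defined as 100 minus the canceled-or-delayed percentage, it also counts flights that left up to an hour late. A stricter early-or-on-time share can be computed from `dep_delay <= 0`. The sketch below puts both figures side by side for the same days; the names `bad_days`, `on_time_share`, `pct_not_disrupted`, and `pct_early_on_time` are introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Days on which more than 35% of flights were canceled or delayed over an hour.
bad_days <- flights %>%
  group_by(month, day) %>%
  summarize(percent_can = mean(is.na(dep_time) | dep_delay > 60) * 100,
            .groups = "drop") %>%
  filter(percent_can > 35)

# For those days, compare "not canceled or delayed over an hour" with
# "departed early or exactly on time" (dep_delay <= 0).
on_time_share <- flights %>%
  semi_join(bad_days, by = c("month", "day")) %>%
  group_by(month, day) %>%
  summarize(total_flights = n(),
            pct_not_disrupted = mean(!(is.na(dep_time) | dep_delay > 60)) * 100,
            pct_early_on_time = sum(dep_delay <= 0, na.rm = TRUE) / n() * 100,
            .groups = "drop")

print(on_time_share)
```

The gap between the two columns is the share of flights that departed late, but by no more than an hour.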

#Average Scheduled Departure Hour

This data set shows the average scheduled departure hour for each of the days singled out in “Percentage of Flights Canceled and Delayed”.

flights_early <- flights %>%
  group_by(month, day) %>% 
  summarize(total_flights = n(),
            can_or_del = sum(is.na(dep_time) | dep_delay > 60)) %>% 
  mutate(percent_can = (can_or_del / total_flights) * 100) %>% 
  filter(percent_can > 35) %>% 
  select(month, day)
avg_dep <- flights %>% 
  group_by(month, day) %>% 
  summarize(avg_sched_hour = mean(sched_dep_time %/% 100)) %>% 
  filter(month == 2 & day == 8 | month == 2 & day == 9 | month == 3 & day == 8 | month == 5 & day == 23 | month == 6 & day == 24 | month == 6 & day == 28 | month == 7 & day == 1 | month == 7 & day == 10 | month == 7 & day == 23 | month == 9 & day == 2 | month == 9 & day == 12 | month == 12 & day == 5) 
print(avg_dep)
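
The hard-coded list of dates in the filter above has to be kept in sync with the 35% rule by hand. A possible alternative, sketched here under the assumption that the intent is "the same days as the 35% filter", is to reuse the table of flagged days with `semi_join()`. The names `bad_days` and `avg_dep_joined` are introduced for illustration.

```r
library(dplyr)
library(nycflights13)

# Days on which more than 35% of flights were canceled or delayed over an hour.
bad_days <- flights %>%
  group_by(month, day) %>%
  summarize(percent_can = mean(is.na(dep_time) | dep_delay > 60) * 100,
            .groups = "drop") %>%
  filter(percent_can > 35) %>%
  select(month, day)

# Keep only flights on those days, then average the scheduled departure hour.
avg_dep_joined <- flights %>%
  semi_join(bad_days, by = c("month", "day")) %>%
  group_by(month, day) %>%
  summarize(avg_sched_hour = mean(sched_dep_time %/% 100),
            .groups = "drop")

print(avg_dep_joined)
```

With this approach, changing the 35% threshold automatically changes the set of days being averaged.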

These data show that the average scheduled departure hour on these days is between 12 and 13, that is, between 12 pm and 1 pm. In general this is consistent with the plot, where these hours are toward the lower end for cancellations or delays. It also makes sense in that flights could wait out bad weather in the morning and leave later, or leave earlier to avoid weather later in the day. Since this is also around the middle of the day, it is reasonable for the averages to fall there. The average scheduled hour is a little later than one might have expected, but it makes sense once these comparisons are made.

#Conclusion

This report investigated the relationship between the day a flight departs from a New York City airport and that flight's departure status. The data show that flights were canceled or delayed by more than an hour in the highest percentages on days with heavy snow. On summer days with thunderstorms, the percentage canceled or delayed was still high, but a higher percentage of flights left early or on time, possibly because thunderstorms are easier to avoid. The last finding was that the average scheduled departure time on these days is between 12 pm and 1 pm. This may simply be because those hours fall in the middle of the flying day, or it may reflect scheduling around the weather. Overall, there aren’t many days of the year on which to avoid flying, but if there is bad weather, expect a delay.