Title: “Flights scheduled departure data”

Author: Jack Lustig

Date: 10/1/2025

—- Introduction —-

In this report we try to explain why some days had an unusually high share of canceled or heavily delayed flights. We also look at the scheduled departure hours across the whole year to find the best and worst times to depart. The variables used to answer these questions are month, day, scheduel_departure_hour, Total_flights, CD (the total number of canceled flights plus flights delayed by an hour or more), and Flight_ontime_Earlier. All the data comes from the flights table in nycflights13. The first data set has 12 rows and 5 columns. The second data set has 20 rows and 4 columns. The third data set has 12 rows and 6 columns.

—- Data set 1 —-

The data set below lists the days on which flights that were either canceled or delayed by an hour or more made up at least 35% of that day's total flights.

library(nycflights13)
library(tidyverse)

flights %>%
  group_by(month, day) %>%
  summarise(
    # every row is one scheduled flight
    Total_flights = n(),
    # canceled = no departure time recorded; delayed = left 60+ minutes late
    CD = sum(is.na(dep_time)) + sum(dep_delay >= 60, na.rm = TRUE),
    Canceled_delayed_onehour = CD / Total_flights
  ) %>%
  filter(Canceled_delayed_onehour >= 0.35)
## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day Total_flights    CD Canceled_delayed_onehour
##    <int> <int>         <int> <int>                    <dbl>
##  1     2     8           930   508                    0.546
##  2     2     9           684   421                    0.615
##  3     3     8           979   578                    0.590
##  4     5    23           988   436                    0.441
##  5     6    24           994   352                    0.354
##  6     6    28           994   372                    0.374
##  7     7     1           966   412                    0.427
##  8     7    10          1004   364                    0.363
##  9     7    23           997   365                    0.366
## 10     9     2           929   327                    0.352
## 11     9    12           992   404                    0.407
## 12    12     5           969   386                    0.398
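
As a check on how the CD share is defined, the sketch below applies the same rule to five made-up flights. The function name cd_share and the toy values are assumptions for illustration and are not part of nycflights13. Only the rule itself (a missing dep_time counts as canceled, a dep_delay of 60 minutes or more counts as delayed) comes from the code above.

```r
# Hypothetical helper: share of flights canceled or delayed by 60+ minutes.
cd_share <- function(dep_time, dep_delay) {
  canceled <- sum(is.na(dep_time))               # no departure time recorded
  delayed  <- sum(dep_delay >= 60, na.rm = TRUE) # left an hour or more late
  (canceled + delayed) / length(dep_time)
}

# Made-up day with five scheduled flights (HHMM times, delays in minutes).
toy_dep_time  <- c(517, NA, 830, 1200, NA)
toy_dep_delay <- c(2, NA, 75, -3, NA)

toy_share <- cd_share(toy_dep_time, toy_dep_delay)
toy_share >= 0.35  # would this toy day pass the 35% filter?
```

Two of the five toy flights are canceled and one is delayed by more than an hour, so this made-up day would be kept by the filter.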

To explain these days, we looked up the weather in New York on each date where at least 35% of flights were canceled or delayed by an hour or more. The weather data comes from 2013 Weather History in New York City New York, United States. On February 8th and 9th there was heavy snowfall. On March 8th there was a lot of snow, ice and fog. On May 23rd there was a thunderstorm with heavy rain. On June 24th and 28th there was light rain. On July 1st and 23rd there were thunderstorms with heavy rain. On July 10th there was light rain. On September 2nd there was a thunderstorm with light rain. On September 12th there was a thunderstorm with heavy rain. On December 5th there was mist, fog and light rain. The weather on most of these days matches up with the cancellations and delays. Three of the days (June 24th, June 28th and July 10th), however, had only light rain and no major weather conditions that would explain the delays or cancellations.

—- Data set 2 —-

The data set below shows how many flights were scheduled in each departure hour, so we can look for a pattern in when flights are scheduled to depart. It is followed by a graph of the relationship between scheduled departure hour (X) and the share of flights canceled or delayed by an hour or more (Y).
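
The scheduled hour in this section comes from sched_dep_time, which nycflights13 stores as a whole number in HHMM form (for example 1345 for 1:45pm). A minimal sketch of the split, using made-up times rather than rows from the data:

```r
# Made-up scheduled departure times in HHMM form.
sched <- c(515, 1345, 2359)

sched_hour   <- sched %/% 100  # integer division keeps the hour
sched_minute <- sched %% 100   # the remainder keeps the minutes

data.frame(sched, sched_hour, sched_minute)
```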

Hour_flight <- flights %>%
  # sched_dep_time is stored as HHMM, so integer division by 100 gives the hour
  group_by(scheduel_departure_hour = sched_dep_time %/% 100) %>%
  summarise(
    Total_flights = n(),
    CD = sum(is.na(dep_time)) + sum(dep_delay >= 60, na.rm = TRUE),
    Canceled_delayed_onehour = CD / Total_flights
  )
arrange(Hour_flight, scheduel_departure_hour)
## # A tibble: 20 × 4
##    scheduel_departure_hour Total_flights    CD Canceled_delayed_onehour
##                      <dbl>         <int> <int>                    <dbl>
##  1                       1             1     1                   1     
##  2                       5          1953    38                   0.0195
##  3                       6         25951  1007                   0.0388
##  4                       7         22821   803                   0.0352
##  5                       8         27242  1367                   0.0502
##  6                       9         20312  1062                   0.0523
##  7                      10         16708  1106                   0.0662
##  8                      11         16033  1066                   0.0665
##  9                      12         18181  1402                   0.0771
## 10                      13         19956  1817                   0.0911
## 11                      14         21706  2325                   0.107 
## 12                      15         23888  2926                   0.122 
## 13                      16         23002  3402                   0.148 
## 14                      17         24426  3645                   0.149 
## 15                      18         21783  3352                   0.154 
## 16                      19         21441  4051                   0.189 
## 17                      20         16739  3181                   0.190 
## 18                      21         10933  2226                   0.204 
## 19                      22          2639   426                   0.161 
## 20                      23          1061   111                   0.105
ggplot(data = Hour_flight, mapping = aes(x = scheduel_departure_hour, y = Canceled_delayed_onehour, color = Total_flights)) +
  geom_point() +
  geom_line() +
  labs(x = "Scheduled departure hour", y = "CD share (proportion of flights)", title = "Relationship between departure hour and canceled or delayed share")

The plot rises almost steadily from hour 5 (about 2% of flights canceled or delayed by an hour or more) to a peak at hour 21 (about 20%), then falls back at hours 22 and 23. A likely reason the share grows through the day is air traffic. Many flights are scheduled close together, especially in the early morning (hours 6 to 8) and the afternoon (hours 15 to 17), and only so many planes can take off and land at the same time. This congestion can keep planes from departing on time, and each small setback can affect all the flights after it, so delays build up as the day goes on. There is an outlier at hour 1. It has only 1 scheduled flight, which ended up being canceled, so its share is 100%. Setting that outlier aside, the riskiest hours to be scheduled to depart are 19, 20 and 21, and the safest is hour 5.
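
The best and worst hours can also be read from the table rather than from the plot. The sketch below retypes the Total_flights and CD columns from Data set 2 and skips hours with very few flights, since a single canceled flight makes hour 1 look like 100%. The cutoff of 100 flights and the variable names are assumptions chosen for illustration.

```r
# Values retyped from the Data set 2 table (hours 1 and 5 to 23).
hour <- c(1, 5:23)
total_flights <- c(1, 1953, 25951, 22821, 27242, 20312, 16708, 16033, 18181,
                   19956, 21706, 23888, 23002, 24426, 21783, 21441, 16739,
                   10933, 2639, 1061)
cd <- c(1, 38, 1007, 803, 1367, 1062, 1106, 1066, 1402, 1817, 2325, 2926,
        3402, 3645, 3352, 4051, 3181, 2226, 426, 111)

cd_share <- cd / total_flights

# Assumed cutoff: ignore hours with fewer than 100 scheduled flights.
keep <- total_flights >= 100

worst_hour <- hour[keep][which.max(cd_share[keep])]
best_hour  <- hour[keep][which.min(cd_share[keep])]
# Three riskiest hours, highest share first.
top_three  <- hour[keep][order(cd_share[keep], decreasing = TRUE)][1:3]

worst_hour
best_hour
top_three
```

With these numbers the worst hour is 21, the best is 5, and the three riskiest hours are 21, 20 and 19.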

—- Data set 3 —-

The data set below shows the total number of flights, and the number and percentage that left on time or early, for the dates found in Data set 1. It also shows the average scheduled departure hour for each of those days. We use this data to see when the average flight was scheduled to depart on those dates and whether there is a pattern.

flights %>%
  group_by(month, day) %>%
  summarise(
    Total_flights = n(),
    CD = sum(is.na(dep_time)) + sum(dep_delay >= 60, na.rm = TRUE),
    Canceled_delayed_onehour = CD / Total_flights,
    avg_scheduel_departure_hour = mean(sched_dep_time %/% 100),
    # flights that left on time or early (canceled flights are not counted)
    Flight_ontime_Earlier = sum(dep_delay <= 0, na.rm = TRUE)
  ) %>%
  filter(Canceled_delayed_onehour >= 0.35) %>%
  mutate(On_time_or_earlier_percent = Flight_ontime_Earlier / Total_flights) %>%
  select(month, day, avg_scheduel_departure_hour, Total_flights, Flight_ontime_Earlier, On_time_or_earlier_percent)
## # A tibble: 12 × 6
## # Groups:   month [7]
##    month   day avg_scheduel_departure_hour Total_flights Flight_ontime_Earlier
##    <int> <int>                       <dbl>         <int>                 <int>
##  1     2     8                        13.1           930                   219
##  2     2     9                        12.7           684                   125
##  3     3     8                        13.3           979                   146
##  4     5    23                        13.2           988                   329
##  5     6    24                        13.2           994                   369
##  6     6    28                        13.2           994                   309
##  7     7     1                        13.0           966                   229
##  8     7    10                        13.2          1004                   360
##  9     7    23                        13.2           997                   257
## 10     9     2                        13.4           929                   386
## 11     9    12                        13.1           992                   386
## 12    12     5                        13.2           969                   326
## # ℹ 1 more variable: On_time_or_earlier_percent <dbl>
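
The printed table cuts off the On_time_or_earlier_percent column. As a sketch, it can be recomputed from the two visible columns. The values below are retyped from the table, and the variable names are only for this example.

```r
# Values retyped from the Data set 3 table.
month <- c(2, 2, 3, 5, 6, 6, 7, 7, 7, 9, 9, 12)
day   <- c(8, 9, 8, 23, 24, 28, 1, 10, 23, 2, 12, 5)
total_flights   <- c(930, 684, 979, 988, 994, 994, 966, 1004, 997, 929, 992, 969)
ontime_or_early <- c(219, 125, 146, 329, 369, 309, 229, 360, 257, 386, 386, 326)

# Share of each day's flights that left on time or early.
ontime_share <- round(ontime_or_early / total_flights, 3)
data.frame(month, day, ontime_share)

# Date with the smallest share of on-time or early departures.
lowest <- which.min(ontime_share)
c(month = month[lowest], day = day[lowest])
```

Computed this way, fewer than half of the flights left on time or early on every one of these dates, and March 8th has the smallest share.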

Taken together, the data sets suggest that the flights that left on time or early on those days may have been the ones scheduled early in the morning, before the bad weather and the build-up of delays, although this data alone does not confirm it. On all of the dates the average scheduled departure hour is around 13 (1pm), ranging from 12.7 to 13.4. Because flights are scheduled from about hour 5 to hour 23, with busy periods in both the morning and the afternoon, the average sits in the middle of the day. This can be problematic because the many flights scheduled after that point have a higher chance of being delayed by an hour or more or being canceled.
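
One way to check whether an average near 13:00 simply reflects the normal daily schedule is to compute the flight-weighted average hour for the whole year from the Data set 2 table. This is a sketch using values retyped from that table, and the variable names are illustrative.

```r
# Values retyped from the Data set 2 table (hours 1 and 5 to 23).
hour <- c(1, 5:23)
total_flights <- c(1, 1953, 25951, 22821, 27242, 20312, 16708, 16033, 18181,
                   19956, 21706, 23888, 23002, 24426, 21783, 21441, 16739,
                   10933, 2639, 1061)

# Average scheduled hour, weighting each hour by its number of flights.
year_avg_hour <- weighted.mean(hour, total_flights)
round(year_avg_hour, 1)
```

The year-wide value comes out at about 13.2, close to the daily averages in Data set 3, which suggests that an average near 13 is typical of any day and not special to the bad-weather dates.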

In conclusion, we identified the dates on which at least 35% of flights were canceled or delayed by an hour or more. We also found a likely cause, the weather, for most of the cancellations and delays on those dates. Finally, we found the best and worst hours to depart throughout the day and night.