In this report we investigate why some days had unusually high numbers of canceled or delayed flights. We also look at scheduled departure hours across all days of the year to find the best and worst times to depart. The variables used to answer these questions are month, day, scheduel_departure_hour, Total_flights, CD (the total number of canceled and delayed flights), and Flight_ontime_Earlier. All the data comes from nycflights13. The first data set has 12 rows and 5 columns, the second has 20 rows and 4 columns, and the third has 12 rows and 6 columns.
The data set below shows the dates on which flights that were either canceled or delayed by an hour or more made up at least 35% of that day's total flights. A flight is counted as canceled when its dep_time is missing.
library(nycflights13)
library(dplyr)
# Per day: total flights and flights canceled (missing dep_time) or delayed 60+ minutes
flights %>%
  group_by(month, day) %>%
  mutate(Total_flights = case_when(dep_delay >= 0 ~ 1, dep_delay < 0 ~ 1, is.na(dep_delay) ~ 1)) %>%  # 1 for every flight
  summarise(Total_flights = sum(Total_flights),
            CD = sum(Canceled = sum(is.na(dep_time)), Delayed = sum(case_when(dep_delay >= 60 ~ 1, dep_delay < 60 ~ 0, is.na(dep_delay) ~ 0))),
            Canceled_delayed_onehour = CD / Total_flights) %>%
  filter(Canceled_delayed_onehour >= 0.35)
## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day Total_flights    CD Canceled_delayed_onehour
##    <int> <int>         <dbl> <dbl>                    <dbl>
##  1     2     8           930   508                    0.546
##  2     2     9           684   421                    0.615
##  3     3     8           979   578                    0.590
##  4     5    23           988   436                    0.441
##  5     6    24           994   352                    0.354
##  6     6    28           994   372                    0.374
##  7     7     1           966   412                    0.427
##  8     7    10          1004   364                    0.363
##  9     7    23           997   365                    0.366
## 10     9     2           929   327                    0.352
## 11     9    12           992   404                    0.407
## 12    12     5           969   386                    0.398
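The daily metric above can be hard to read off the nested case_when calls, so here is a minimal sketch of the same calculation on a small made-up data frame. The toy_flights values are invented for illustration only, and the n() and sum(..., na.rm = TRUE) forms are a suggested simplification, not the code that produced the table above.

```r
library(dplyr)

# Toy stand-in for nycflights13::flights (all values are made up).
toy_flights <- data.frame(
  month     = c(1, 1, 1, 1, 1, 1),
  day       = c(1, 1, 1, 1, 2, 2),
  dep_time  = c(600, NA, 905, 1230, 700, 815),
  dep_delay = c(-5, NA, 65, 10, 120, 0)
)

daily <- toy_flights %>%
  group_by(month, day) %>%
  summarise(
    Total_flights = n(),                                  # every scheduled flight
    Canceled      = sum(is.na(dep_time)),                 # no departure recorded
    Delayed       = sum(dep_delay >= 60, na.rm = TRUE),   # 60+ minutes late
    CD            = Canceled + Delayed,
    Canceled_delayed_onehour = CD / Total_flights,
    .groups = "drop"
  )

print(daily)
```

On the toy data, day 1 has 4 flights with 1 canceled and 1 delayed by 65 minutes, so its share is 2 / 4 = 0.5. Counting this way returns integer columns rather than the doubles shown in the report's output.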
Next we look at the weather in New York on each of the days where cancellations and delays of an hour or more made up 35% or more of total flights. The weather data comes from 2013 Weather History in New York City, New York, United States. On February 8th and 9th there was heavy snowfall. On March 8th there was a lot of snow, ice, and fog. On May 23rd there was a thunderstorm with heavy rain. On June 24th and 28th there was light rain. On July 1st and 23rd there was a thunderstorm with heavy rain. On July 10th there was light rain. On September 2nd there was a thunderstorm with light rain. On September 12th there was a thunderstorm with heavy rain. On December 5th there was mist, fog, and light rain. The weather on most of these days is consistent with the high number of cancellations and delays. Three of the days (June 24th, June 28th, and July 10th), however, had no major weather conditions that would explain a delay or cancellation.
The data set below shows how many flights were scheduled in each departure hour, so we can look for a pattern in when flights were scheduled to depart. A graph then shows the relationship between departure hour (X) and the share of flights canceled or delayed by an hour or more (Y).
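The grouping that follows derives the departure hour from sched_dep_time, which nycflights13 stores as an HHMM integer. As a small sketch of that step, integer division by 100 keeps the hour and the remainder keeps the minutes. The four sample times are made up for illustration.

```r
# sched_dep_time is an HHMM integer, e.g. 1359 means 13:59.
sched_dep_time <- c(106, 515, 1359, 2359)

hour   <- sched_dep_time %/% 100   # integer division keeps the hour
minute <- sched_dep_time %% 100    # remainder keeps the minutes

print(data.frame(sched_dep_time, hour, minute))
```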
# Per scheduled departure hour: total flights and flights canceled or delayed 60+ minutes
Hour_flight <- flights %>%
  group_by(scheduel_departure_hour = sched_dep_time %/% 100) %>%  # HHMM %/% 100 gives the hour
  mutate(Total_flights = case_when(dep_time >= 0 ~ 1, dep_time < 0 ~ 1, is.na(dep_time) ~ 1)) %>%  # 1 for every flight
  summarise(Total_flights = sum(Total_flights),
            CD = sum(Canceled = sum(is.na(dep_time)), Delayed = sum(case_when(dep_delay >= 60 ~ 1, dep_delay < 60 ~ 0, is.na(dep_delay) ~ 0))),
            Canceled_delayed_onehour = CD / Total_flights)
# Print the hourly table sorted by hour
arrange(Hour_flight, scheduel_departure_hour)
## # A tibble: 20 × 4
##    scheduel_departure_hour Total_flights    CD Canceled_delayed_onehour
##                      <dbl>         <dbl> <dbl>                    <dbl>
##  1                       1             1     1                        1
##  2                       5          1953    38                   0.0195
##  3                       6         25951  1007                   0.0388
##  4                       7         22821   803                   0.0352
##  5                       8         27242  1367                   0.0502
##  6                       9         20312  1062                   0.0523
##  7                      10         16708  1106                   0.0662
##  8                      11         16033  1066                   0.0665
##  9                      12         18181  1402                   0.0771
## 10                      13         19956  1817                   0.0911
## 11                      14         21706  2325                    0.107
## 12                      15         23888  2926                    0.122
## 13                      16         23002  3402                    0.148
## 14                      17         24426  3645                    0.149
## 15                      18         21783  3352                    0.154
## 16                      19         21441  4051                    0.189
## 17                      20         16739  3181                    0.190
## 18                      21         10933  2226                    0.204
## 19                      22          2639   426                    0.161
## 20                      23          1061   111                    0.105
library(ggplot2)
# Share of canceled or long-delayed flights by scheduled hour, colored by flight volume
ggplot(data = Hour_flight, mapping = aes(x = scheduel_departure_hour, y = Canceled_delayed_onehour, color = Total_flights)) +
  geom_point() +
  geom_line() +
  labs(x = "Departure Hour", y = "CD (proportion of flights)", title = "Relationship between departure hour and canceled or delayed flights")
The plot rises almost steadily from hour 5 to hour 21 and then falls at hours 22 and 23. A likely reason flights are increasingly delayed or canceled as the day goes on is air traffic. Every hour from 6 to 20 has more than 16,000 scheduled flights, so many planes are departing at similar times, and only so many planes can take off and land at once. Each small setback can affect all the flights after it, so delays build up through the day. There is an outlier at hour 1 because only 1 flight was scheduled at that hour, and it was canceled. Leaving that outlier aside, the riskiest hours for a flight to be canceled or delayed by an hour or more are 19, 20, and 21, and the safest are 5, 6, and 7.
The data set below shows, for the dates identified above, the total number of flights and the number that left on time or early, along with the percentage. It also shows the average scheduled departure hour for each day. We use this to see when the average flight was scheduled on those dates and whether there is a pattern.
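To make the ranking of hours explicit, the sketch below re-enters the hourly counts from the table above and sorts them by rate after dropping hours with too few flights to be meaningful. The min_flights cutoff of 100 is a hypothetical choice made for this illustration, not something the original analysis used.

```r
# Hourly counts copied from the table above.
hourly <- data.frame(
  hour = c(1, 5:23),
  Total_flights = c(1, 1953, 25951, 22821, 27242, 20312, 16708, 16033, 18181, 19956,
                    21706, 23888, 23002, 24426, 21783, 21441, 16739, 10933, 2639, 1061),
  CD = c(1, 38, 1007, 803, 1367, 1062, 1106, 1066, 1402, 1817,
         2325, 2926, 3402, 3645, 3352, 4051, 3181, 2226, 426, 111)
)
hourly$rate <- hourly$CD / hourly$Total_flights

# Hypothetical cutoff: ignore hours with fewer than 100 flights (drops the hour 1 outlier).
min_flights <- 100
reliable <- hourly[hourly$Total_flights >= min_flights, ]

ranked      <- reliable[order(reliable$rate), ]
best_hours  <- head(ranked$hour, 3)        # lowest rates first
worst_hours <- rev(tail(ranked$hour, 3))   # highest rates first

print(best_hours)
print(worst_hours)
```

With this cutoff the three lowest-rate hours are 5, 7, and 6, and the three highest-rate hours are 21, 20, and 19.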
# Same daily summary as before, plus the average scheduled hour and the on-time-or-early count
flights %>%
  group_by(month, day) %>%
  mutate(Total_flights = case_when(dep_delay >= 0 ~ 1, dep_delay < 0 ~ 1, is.na(dep_delay) ~ 1)) %>%  # 1 for every flight
  summarise(Total_flights = sum(Total_flights),
            CD = sum(Canceled = sum(is.na(dep_time)), Delayed = sum(case_when(dep_delay >= 60 ~ 1, dep_delay < 60 ~ 0, is.na(dep_delay) ~ 0))),
            Canceled_delayed_onehour = CD / Total_flights,
            avg_scheduel_departure_hour = mean(sched_dep_time %/% 100),
            Flight_ontime_Earlier = sum(case_when(dep_delay <= 0 ~ 1, dep_delay > 0 ~ 0, is.na(dep_delay) ~ 0))) %>%
  filter(Canceled_delayed_onehour >= 0.35) %>%
  # Regroup so each row keeps its date and average hour, then add the on-time share
  group_by(month, day, avg_scheduel_departure_hour) %>%
  summarise(Total_flights, Flight_ontime_Earlier, On_time_or_earlier_percent = Flight_ontime_Earlier / Total_flights)
## # A tibble: 12 × 6
## # Groups:   month, day [12]
##    month   day avg_scheduel_departure_hour Total_flights Flight_ontime_Earlier
##    <int> <int>                       <dbl>         <dbl>                 <dbl>
##  1     2     8                        13.1           930                   219
##  2     2     9                        12.7           684                   125
##  3     3     8                        13.3           979                   146
##  4     5    23                        13.2           988                   329
##  5     6    24                        13.2           994                   369
##  6     6    28                        13.2           994                   309
##  7     7     1                        13.0           966                   229
##  8     7    10                        13.2          1004                   360
##  9     7    23                        13.2           997                   257
## 10     9     2                        13.4           929                   386
## 11     9    12                        13.1           992                   386
## 12    12     5                        13.2           969                   326
## # ℹ 1 more variable: On_time_or_earlier_percent <dbl>
Taken together, the data sets suggest that the flights that left on time or early on those days may have been the ones scheduled early in the morning, before the bad weather and the build-up of delays. All of the dates have an average scheduled departure hour of around 13 (1 pm), ranging from 12.7 to 13.4. Because flights are scheduled from hour 5 through hour 23, the average sits in the middle of the day. This can be problematic because the hourly data shows that flights scheduled in the afternoon and evening have a higher chance of being delayed by more than an hour or canceled.
In conclusion, we identified the dates on which 35% or more of flights were canceled or delayed by an hour or more. We also found that bad weather explains most of the cancellations and delays on those dates. Finally, we found the best and worst hours to leave throughout the day and night: hours 5 to 7 had the lowest share of canceled or long-delayed flights, and hours 19 to 21 had the highest.
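The printed table above cuts off the On_time_or_earlier_percent column, so as a rough sketch it can be recomputed from the two columns that are shown. The counts below are copied from that table, and the division mirrors the Flight_ontime_Earlier / Total_flights expression in the code.

```r
# Daily counts copied from the table above.
on_time <- data.frame(
  month = c(2, 2, 3, 5, 6, 6, 7, 7, 7, 9, 9, 12),
  day   = c(8, 9, 8, 23, 24, 28, 1, 10, 23, 2, 12, 5),
  Total_flights         = c(930, 684, 979, 988, 994, 994, 966, 1004, 997, 929, 992, 969),
  Flight_ontime_Earlier = c(219, 125, 146, 329, 369, 309, 229, 360, 257, 386, 386, 326)
)

# Share of each day's flights that left on time or early.
on_time$On_time_or_earlier_percent <-
  on_time$Flight_ontime_Earlier / on_time$Total_flights

# Sort from the lowest on-time share to the highest.
print(on_time[order(on_time$On_time_or_earlier_percent), ])
```

In this recomputation March 8th has the lowest on-time share and September 2nd the highest.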