This report seeks to answer the following questions:
- Create a data set that lists the days during 2013 on which the percentage of flights that were either canceled or delayed on departure by more than one hour exceeded 35%.
- Do a Google search to see what the weather was like in New York on each of the days from the previous problem in 2013. Do your findings explain the large number of delays/cancellations?
- Create a data set that shows, for each scheduled departure hour during the day, the percentage of flights that were canceled or delayed by more than one hour.
- Obtain a plot that shows the relationship between the hour and the percentage of flights canceled or delayed by more than an hour.
- Explain the relationship in the previous problem.
- Create a data set that contains the flights that left on time or early on the above dates.
- For each date in the data set from the previous problem, find the average scheduled hour of departure. Do your findings mostly confirm this guess?
- Did any of the problematic dates have an average departure hour in the afternoon? If so, why might that have been?
We will be using a data set called flights, which comes from
the nycflights13 R package. It contains on-time data for all flights that
departed from New York City in 2013. There are 19 variables for each
flight; the relevant ones in this report are dep_delay
(departure delay in minutes; negative values represent early
departures), month (month of departure), hour
(hour of scheduled departure), and day (day of departure).
Also, throughout this report, the following variables are created:
dep_status (states whether the flight is late,
canceled, or not late), percentage (the
percentage of flights that were canceled or delayed), and
avg_hour (the average scheduled hour of
departure).
Throughout, we will need the functionality of the
tidyverse package, both to manipulate the data and to create
visualizations, and we will need the nycflights13 package in order to use
the flights data.
library(tidyverse)
library(nycflights13)
The first task that I am assigned is to find the days on which
canceled/delayed flights made up more than 35% of the total flights. In
order to do this, I will need to create a new data set, which I will
call flights_w_arr_status2. From there, I am going to use the
case_when function to set the new dep_status variable
to one of its three values: canceled, late, or not late. Then I
need to find the percentage of flights that were canceled/delayed on
each day, and keep only the days with a percentage above 35.
In order to ensure that I look at the data on a day-by-day
basis, I will use the group_by function and group by month and day.
After I put all of these steps together, I end up with the following
code:
# Label each flight by departure status, then compute the daily percentage
# of flights that were canceled or delayed by more than 60 minutes.
flights_w_arr_status2 <- flights %>%
  mutate(dep_status = case_when(is.na(dep_delay) ~ "canceled",
                                dep_delay <= 60 ~ "not late",
                                dep_delay > 60 ~ "late")) %>%
  group_by(month, day) %>%
  # Every flight has exactly one status, so the mean of this logical vector
  # is the share of late or canceled flights on that day.
  summarize(percentage = mean(dep_status %in% c("late", "canceled")) * 100) %>%
  filter(percentage > 35)
view(flights_w_arr_status2)
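As a quick sanity check of the logic above, here is a minimal sketch on a small made-up data set. The toy_flights table and its values are invented for illustration and are not part of the flights data; the point is only that the percentage can be worked out by hand and compared with what the pipeline returns.

```r
library(dplyr)

# Hypothetical toy data (not real flights): two days with four flights each.
toy_flights <- tibble(
  month     = c(1, 1, 1, 1, 2, 2, 2, 2),
  day       = c(1, 1, 1, 1, 8, 8, 8, 8),
  dep_delay = c(-3, 10, 61, 5, NA, 90, 0, 120)
)

toy_summary <- toy_flights %>%
  mutate(dep_status = case_when(
    is.na(dep_delay) ~ "canceled",   # missing delay is treated as a cancellation
    dep_delay > 60   ~ "late",       # more than one hour late
    TRUE             ~ "not late"
  )) %>%
  group_by(month, day) %>%
  summarize(percentage = mean(dep_status != "not late") * 100,
            .groups = "drop")

# Keep only the days above the 35% threshold.
toy_bad_days <- toy_summary %>% filter(percentage > 35)
```

By hand, 1/1 has one late flight out of four (25%) and 2/8 has one canceled and two late flights out of four (75%), so only 2/8 should survive the filter. The toy data only checks the logic; the results discussed in this report come from the real flights data.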
The resulting flights_w_arr_status2 data set contains 12 different dates on which the percentage of delayed/canceled flights exceeded 35%. The next question that I was asked to answer pertains to these dates. I was instructed to research the weather on each of the dates that were found, because the most common reason for mass flight cancellations is weather. What I found for each date is as follows:
- On February 8 and 9, there was a large winter storm, which saw up to 11.4 inches of snow.
- On March 8, there was 2 to 3 feet of snow.
- On May 23, there was 1.81” of rain.
- On June 24, the day was pretty sunny. However, there were jet accidents on the runway that caused these cancellations.
- On June 28, the weather in New York was not terrible, but there were severe storms across the Eastern Seaboard region resulting in delays.
- On July 1, there were heavy thunderstorms. In fact, 0.52 inches came down between 9:00 AM and 9:30 AM alone.
- On July 10, the city had 0.53 inches of precipitation between midnight and daybreak.
- On July 23, the weather was sunny, similar to June 24. But there was a Southwest Airlines emergency landing which resulted in a runway closure.
- On September 2, there were severe thunderstorms with heavy rain.
- On September 12, New York experienced a mix of haze, mist, and thunderstorms.
- On December 5, there was a significant winter storm that brought heavy snow, sleet, and freezing rain.
All in all, the theory that weather causes the majority of mass delays and cancellations held up: weather explains the problems on 10 of the 12 days. The two exceptions are June 24 and July 23, which both had runway incidents that led to the large number of canceled flights.
The next task that I am assigned is to view the percentage of flights that were canceled/delayed at each scheduled departure hour of the day. This is rather easy to do because of what we did in our first code chunk. The only difference between this code chunk and the first one is that the percentage is grouped by hour this time (and not filtered), in order to view the percentage of delayed/canceled flights at each hour of the day.
# Same status labels as before, but summarized by scheduled departure hour.
flights_w_arr_status3 <- flights %>%
  mutate(dep_status = case_when(is.na(dep_delay) ~ "canceled",
                                dep_delay <= 60 ~ "not late",
                                dep_delay > 60 ~ "late")) %>%
  group_by(hour) %>%
  # Share of flights in each hour that were late or canceled, as a percentage.
  summarize(percentage = mean(dep_status %in% c("late", "canceled")) * 100)
view(flights_w_arr_status3)
After running the code, two things stood out to me. First, it appears that as the day goes on, the percentage of canceled/delayed flights generally increases until 21:00 (9:00 PM), when it starts to decrease again. Second, for hour 1 the percentage is 100. I thought that this might be because only a few flights were scheduled at 1, so I tested my theory by running the following code:
flights%>%
filter(hour==1)%>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 1
I was correct: only one flight was scheduled to leave at 1, and it was canceled.
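The same check can be made for every hour at once by reporting the number of flights next to the percentage. The sketch below assumes a hypothetical helper, hour_summary, which is not part of the original analysis, and demonstrates it on made-up data so that the result can be verified by hand.

```r
library(dplyr)

# Hypothetical helper (not from the original report): for each scheduled
# hour, returns the number of flights and the percentage that were canceled
# (missing dep_delay) or delayed by more than 60 minutes.
hour_summary <- function(df) {
  df %>%
    group_by(hour) %>%
    summarize(
      n_flights  = n(),
      percentage = mean(is.na(dep_delay) | dep_delay > 60) * 100,
      .groups = "drop"
    )
}

# Made-up toy data: one canceled flight at hour 1, four flights at hour 6.
toy <- tibble(
  hour      = c(1, 6, 6, 6, 6),
  dep_delay = c(NA, -2, 75, 0, 12)
)

toy_hours <- hour_summary(toy)
```

In the toy data, hour 1 shows 100% based on a single flight, while hour 6 shows 25% based on four flights. With the packages from this report loaded, hour_summary(flights) should give the same kind of table for the real data, which makes it easy to spot percentages that rest on very few flights.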
Now that I have explained that oddity in the data, I was asked to create a plot of the data that I found in the previous question. I am going to create a scatter plot with a connecting line in order to view the trend of the data. The trend I expect to see is the one I just mentioned: an increase in canceled/delayed flights until 9:00 PM, which is when the percentage finally starts to decrease.
ggplot(data = flights_w_arr_status3, aes(x = hour, y = percentage)) +
  geom_point() +
  geom_line() +
  labs(
    title = "Percentage of Flights Canceled or Delayed",
    x = "Hour",
    y = "Percentage"
  )
After graphing the data, it once again appears that my prediction was correct. The plot shows a steady increase until around 21:00, followed by a steady decrease through the rest of the day. My theory for this shape is that bad weather tends to arrive towards the middle of the day and at night, which causes an increase in delays and cancellations at these times. One thing to note when looking at this shape is that the first 4 hours can be ignored, because only one flight was scheduled to take off at 1:00 (this is what creates the outlier in the plot), and no flights were scheduled to take off from 2:00 to 4:00.
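If the outlier at 1:00 makes the trend harder to read, one option is to drop the early-morning hours before plotting. The following is a minimal sketch on made-up percentages; toy_hourly stands in for flights_w_arr_status3, and the cutoff of hour 5 is an assumption based on the observation that hours 1 to 4 carry essentially no flights.

```r
library(dplyr)
library(ggplot2)

# Made-up hourly percentages standing in for flights_w_arr_status3.
toy_hourly <- tibble(
  hour       = c(1, 5, 6, 7, 8),
  percentage = c(100, 2, 3, 4, 5)
)

# Keep only hours from 5:00 onward (assumed cutoff) so the single
# canceled flight at 1:00 does not stretch the y-axis.
daytime <- toy_hourly %>% filter(hour >= 5)

p <- ggplot(daytime, aes(x = hour, y = percentage)) +
  geom_point() +
  geom_line() +
  labs(x = "Hour", y = "Percentage")
```

Printing p draws the plot. With the outlier removed, the y-axis covers only the range of the remaining hours, so smaller changes from hour to hour become visible.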
For the last section, I am going to look into the flights that left on time or early despite being scheduled on the problematic days. My first order of business is to create a data set that only contains the days that saw mass cancellations (the days found in the first section), and only the flights with a departure delay of less than or equal to zero. The following code chunk does just that:
# Flights that left on time or early on the 12 problematic dates.
on_time_flights <- flights %>%
  filter(dep_delay <= 0 &
           ((month == 2 & day == 8) |
            (month == 2 & day == 9) |
            (month == 3 & day == 8) |
            (month == 5 & day == 23) |
            (month == 6 & day == 24) |
            (month == 6 & day == 28) |
            (month == 7 & day == 1) |
            (month == 7 & day == 10) |
            (month == 7 & day == 23) |
            (month == 9 & day == 2) |
            (month == 9 & day == 12) |
            (month == 12 & day == 5)))
view(on_time_flights)
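Typing the dates by hand works, but it is easy to mistype one. An alternative sketch, which is a swapped-in technique and not what this report actually ran, uses a filtering join (semi_join from dplyr) against a table of the problematic dates. The toy_flights and bad_days tables below are made up for illustration.

```r
library(dplyr)

# Made-up stand-ins for the flights data and the table of problematic dates.
toy_flights <- tibble(
  month     = c(2, 2, 2, 3, 3),
  day       = c(8, 8, 9, 1, 1),
  dep_delay = c(-5, 30, 0, -1, NA)
)
bad_days <- tibble(month = c(2, 2), day = c(8, 9))

toy_on_time <- toy_flights %>%
  filter(dep_delay <= 0) %>%                   # on time or early (drops NA)
  semi_join(bad_days, by = c("month", "day"))  # keep only the listed dates
```

Only the flights on 2/8 and 2/9 with a delay of zero or less remain. In this report, the table of dates from the first section (flights_w_arr_status2) could presumably play the role of bad_days, so the list of dates would never need to be copied by hand.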
With the on_time_flights data set in hand, I am tasked with finding the average scheduled departure hour on each date. I expect the average to generally fall in the morning, because this would agree with my previous hypothesis that the bad weather comes later in the day. I will use the mean function to calculate the average hour on each date.
# Average scheduled departure hour of the on-time flights, per date.
avg_hour <- on_time_flights %>%
  group_by(month, day) %>%
  summarize(avg_hour = mean(hour, na.rm = TRUE))
view(avg_hour)
After running the code, my expectation was once again confirmed, with one outlier. Every date in question had an average scheduled departure hour before 11:00 AM, except for one: 2/9 had an average of 16.96 (close to 5:00 PM). I think there is a plausible explanation for this outlier. A winter storm caused mass cancellations on 2/8 and 2/9, and my theory is that it took until late on 2/9 to clear the snow from the runways. So the idea that the weather comes later in the day still holds; it arrived late on 2/8 and carried over into 2/9, until the runways were finally cleared later that day.
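To make an average such as 16.96 easier to read, the decimal hour can be converted into a clock time. The helper below, decimal_hour_to_clock, is hypothetical and not part of the original analysis. Since the hour variable holds only the scheduled hour without the minutes, the result is a rough reading of the average, not an exact time.

```r
# Hypothetical helper: formats a decimal hour (e.g. 16.96) as "HH:MM".
decimal_hour_to_clock <- function(x) {
  total_minutes <- as.integer(round(x * 60))  # 16.96 h -> 1018 min
  sprintf("%02d:%02d", total_minutes %/% 60L, total_minutes %% 60L)
}

clock_0209 <- decimal_hour_to_clock(16.96)
```

Here 16.96 hours is 16 hours plus 0.96 × 60 = 57.6 minutes, which rounds to 16:58, or just before 5:00 PM.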
In summary, flight cancellations and delays in New York City in 2013 can largely be attributed to severe weather events that kept planes from departing. This weather typically arrives later in the day, which is why flights are more likely to take off as scheduled in the morning. When booking a flight, it is smart to check the weather beforehand and, if bad weather is unavoidable, to book a flight early in the morning to reduce the risk of cancellation.