This report is going to focus on much of the data that was ignored in the Lecture notes - namely the cancelled flights. The diverted flights were also of interest, but had a much smaller data set.
I thought it would be very interesting to look at the cancelled flights. In terms of frustration with flying, I can think of nothing worse than being on a cancelled flight. While we have already discussed previously why these observations are not the best due to missing data, it is a very important factor to passengers, and we still have plenty of information at our fingertips to draw some valuable conclusions.
To get started, I pulled in the data set, while at the same time maintaining an “original” data set, as I knew I would be needing it later, due to modifications to the “flights” data. At the same time, I pulled in all packages (not shown).
#Read data file
flights_original <- read.csv("domestic_flights_jan_2016_agg.csv", stringsAsFactors = FALSE)
flights <- flights_original
I then performed numerous format changes and calculations, a la the Lecture Notes for this assignment.
#Format FlightDate to date class
flights$FlightDate <- as.Date(flights$FlightDate, format = "%m/%d/%Y")
#Format CRSDepTime
flights <- flights %>% mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))
flights$new_CRSDepTime <- as.POSIXct(flights$new_CRSDepTime, format = "%Y-%m-%d %H%M")
#Format CRSArrTime
flights <- flights %>% mutate(new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
flights$new_CRSArrTime <- as.POSIXct(flights$new_CRSArrTime, format = "%Y-%m-%d %H%M")
#Format day variables
flights <- flights %>% filter(Cancelled == 0) %>% mutate(new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)),new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime)), new_WheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)), new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)))
#Format other date variables
flights$new_DepTime <- as.POSIXct(flights$new_DepTime, format = "%Y-%m-%d %H%M")
flights$new_ArrTime <- as.POSIXct(flights$new_ArrTime, format = "%Y-%m-%d %H%M")
flights$new_WheelsOff <- as.POSIXct(flights$new_WheelsOff, format = "%Y-%m-%d %H%M")
flights$new_WheelsOn <- as.POSIXct(flights$new_WheelsOn, format = "%Y-%m-%d %H%M")
#Calculate Departure Delay, Minutes, >15 Minutes
flights <- flights %>% filter(Cancelled == 0) %>% mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units = "mins")))
flights <- flights %>% filter(Cancelled == 0) %>% mutate(DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay), DepDel15 = ifelse(DepDelay >= 15, 1, 0))
flights %>% select(DepDelay, DepDelayMinutes, DepDel15) %>% slice (1:5)
#Calculate TaxiOut, TaxiIn, ArrDelay, Minutes, >15 Minutes
flights <- flights %>% filter(Cancelled == 0) %>% mutate(TaxiOut = as.integer(difftime(new_WheelsOff, new_DepTime, units = "mins")), TaxiIn = as.integer(difftime(new_ArrTime, new_WheelsOn, units = "mins")), ArrDelay = as.integer(difftime(new_ArrTime, new_CRSArrTime, units = "mins")), ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay), ArrDel15 = ifelse(ArrDelay >= 15, 1, 0))
#Calculate AirTime, AirSpeed
flights <- flights %>% filter(Cancelled == 0) %>% mutate(AirTime = ActualElapsedTime - TaxiOut - TaxiIn) %>% mutate(AirSpeed = Distance / (AirTime / 60))
For the first analysis here, I looked at the cancelled flights by airline. While I already suspect that more troubled regions in terms of weather would weigh heavily on the regional airlines, we can discuss that after seeing what we find.
First I pulled out only the cancelled flights, of which there were 11,665. I then grouped the data by carrier and summarized their cancelled flights.
As an aside, I considered doing a similar analysis of diverted flights, but with less than 900 observations, I figured the data set was getting relateively small.
#Filter Cancelled
flights_can <- flights_original %>% group_by(Carrier) %>% summarize(Percent_Cancelled = sum(Cancelled)/n() * 100)
#Cancelled Flights by Carrier - Charted
flights_can %>% ggvis(~Carrier, ~Percent_Cancelled) %>% layer_bars(fill := 'blue')
From this we see that B6, Jet blue (hence the blue bars) has the most cancelled flights. It does make sense on a simple level that the low cost carrier is also the one with the highest cancellations. What was surprising, was seeing American Airlines in second place. I would have assumed smaller operations would have higher cancellations, with less redundancies and support. Alternatively, I suppose with higher flight volume the large air carriers have the ability to more easily consolidate two under booked flights into one to save on operating costs.
Since I made a statement regarding weather or other circumstances, I thought it only fair to consider the cancellations by state. While this too may correlate with the carrier densities, I thought it would be valuable to see if any one state/region had particularly high cancellations. As such I performed the follow analysis:
#Cancelled flights by State
flights_can_OST <- flights_original %>% group_by(OriginState) %>% summarize(Percent_Cancelled = sum(Cancelled)/n() * 100)
flights_can_DST <- flights_original %>% group_by(DestState) %>% summarize(Percent_Cancelled = sum(Cancelled)/n() * 100)
#Cancelled flights by State - List
top_Ostates <- flights_can_OST %>% arrange(desc(Percent_Cancelled)) %>% slice(1:10)
top_Dstates <- flights_can_DST %>% arrange(desc(Percent_Cancelled)) %>% slice(1:10)
kable(top_Ostates, digits = 1)
| OriginState | Percent_Cancelled |
|---|---|
| NJ | 9.4 |
| VA | 9.0 |
| MD | 8.8 |
| VT | 7.2 |
| NC | 6.3 |
| NY | 5.9 |
| ME | 5.7 |
| WV | 5.6 |
| PA | 5.1 |
| NH | 4.3 |
kable(top_Dstates, digits = 1)
| DestState | Percent_Cancelled |
|---|---|
| NJ | 9.5 |
| VA | 8.9 |
| MD | 8.6 |
| VT | 7.6 |
| NC | 6.5 |
| ME | 6.3 |
| NY | 5.9 |
| PA | 5.2 |
| WV | 4.6 |
| NH | 4.3 |
So what does this say? Looking at these two lists side by side the top 10 states for cancelled flights both in and out are the exact same states - the only movement is numbers 8 and 9, WV and PA, which switch positions. One thing this might indicate, perhaps not surprisingly, is that cancelled flights have a reactionary effect down the line. That is, a cancelled flight to a destination state will frequently result in a cancelled flight from the same state, now its origin state.
Interestingly, the majority of these states are all in the Northeast. While I suppose these could be poorly run airports, there could be other factors in play, such as abnormal challenging weather, so we shall ask another question.
The chart below shows us is what many others reported on when evaluating cancelled flights. There appear to be serious weather related issues. To confirm this hypothesis, I looked back to the news from the 23rd - 25th. According to my sources ,a blizzard resulted in thousands of cancelled flights in late January 2016.
This supports the idea that weather plays a large part in overall cancelled flights, and explains why certain states, particularly in the the Northeast had so many of the cancellations.
#Flights by FlightDate
flights_original$FlightDate <- as.Date(flights_original$FlightDate, format = "%m/%d/%Y")
flights_can_jan <- flights_original %>% filter(Cancelled == 1) %>% group_by(FlightDate) %>% summarize(Flights_Cancelled = sum(Cancelled))
#Flights by FlightDate - Scatter
flights_can_jan %>% ggvis(~FlightDate, ~Flights_Cancelled) %>% layer_points(fill := 'blue') %>% add_axis("x", properties = axis_props(labels = list(angle = 45)), tick_padding = 20, title_offset = 60) %>% add_axis("y", title_offset = 50)
Had some fun on formatting this chart
Lastly, I looked at a more specific range of time periods. Previously, we looked at delays and two time segments of the day. For this analysis, I wanted to break the time period even more so, including early morning, mid-day, and evening. This also looked at only the cancelled flights. From this analysis we can see that the later in the day, the greater chance of a cancelled flight. This would suggest that what we found in previous analysis that both delays and cancellations are more likely later in the day. While that is a valid statement, the overall percentage of cancelled flights is not drastically different from one time period to another, suggesting that cancellations are more likely later in the day, but not substantially more.
Note: I really wanted to chart this by carrier - to have all carriers listed for each Day Part to see if certain carriers were better or worse given the time of day, but was unable to get that feature to work after substantial effort
flights_can_Arr <- flights_original %>% select(Carrier, CRSArrTime, Cancelled) %>% mutate(DayPart = ifelse(CRSArrTime < 800, "Early AM", ifelse(CRSArrTime > 1600, "Night", "Mid-Day"))) %>% group_by(DayPart) %>% summarize(Percent_Cancelled = sum(Cancelled) / n() * 100)
flights_can_Arr %>% ggvis(~DayPart, ~Percent_Cancelled) %>% layer_points(fill := 'red')
The conclusions and recommendations that can be drawn from this analysis include: