In project 2, we’re looking at some time-series data as they relate to airline performance. Our data set includes domestic flights reported for on-time performance in the US for January 2016.
Given that our data set included tail numbers for the aircraft operating each flight, this student-analyst was interested in whether we could discern any relationships between on-time performance and aircraft age or manufacturer.
To join the aircraft data to the tail number, this writer consulted the FAA’s aircraft registration database, downloadable here.
We begin by setting up our environment and reading in our values:
#Set Working Directory and Load the Requisite Libraries
setwd("~/Documents/MBA 676/Unit 6 Stuff")
getwd()
library (dplyr)
library (lubridate)
library (chron)
library (quantmod)
library (dygraphs)
library(mice)
library(ggvis)
library(chron)
library(knitr)
#Load Project CSV
US_Flights <- read.csv("domestic_flights_jan_2016.csv", header = TRUE, stringsAsFactors = FALSE)
#Read in N_Number Reference (Tail Numbers)
N_Num_Ref <- read.csv("N_Number_Lookup.csv", header = TRUE, stringsAsFactors = FALSE)
Next, we drop records for canceled and diverted flights. We also drop records for AA and MQ carriers. While this is unfortunate in terms of reducing our data set, these carriers use Fleet IDs rather than standard N_Numbers. If we didn’t eliminate these carriers, our join between the data sets would result in erroneous matches. See Hadley’s article, page 4 FMI regarding tail numbers and fleet IDs. This activity reduces the size of our frame from the original 445,827, to 360,562 records.
#Drop Diverted and Cancelled Flights
US_Flights_2 <- US_Flights %>% filter(Diverted == 0) %>% filter(Cancelled == 0)
#Create US_Flights_3 because we can't match AA or MQ tail numbers per Hadley
US_Flights_3 <- US_Flights_2 %>% filter(Carrier != "AA" & Carrier != "MQ")
Next, we join our flights and aircraft data. The summary command helps find missing values in the result set.
We drop rows where we are missing year manufactured data and perform some cleanup to remove the empty cancelled and diverted flags. These steps bring our row count down by about 9,000 records, close to our final working value.
#Join plane data to flight data
US_Flights_Planes <- US_Flights_3 %>% inner_join(N_Num_Ref)
#Look for missing values
summary(US_Flights_Planes)
## FlightDate Carrier TailNum FlightNum
## Length:358128 Length:358128 Length:358128 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 725
## Mode :character Mode :character Mode :character Median :1678
## Mean :2238
## 3rd Qu.:3231
## Max. :7438
##
## Origin OriginCityName OriginState
## Length:358128 Length:358128 Length:358128
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Dest DestCityName DestState CRSDepTime
## Length:358128 Length:358128 Length:358128 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 924
## Mode :character Mode :character Mode :character Median :1329
## Mean :1334
## 3rd Qu.:1731
## Max. :2359
##
## DepTime WheelsOff WheelsOn CRSArrTime
## Min. : 1 Min. : 1 Min. : 1 Min. : 1
## 1st Qu.: 927 1st Qu.: 941 1st Qu.:1104 1st Qu.:1120
## Median :1335 Median :1346 Median :1518 Median :1526
## Mean :1338 Mean :1360 Mean :1483 Mean :1503
## 3rd Qu.:1737 3rd Qu.:1751 3rd Qu.:1913 3rd Qu.:1920
## Max. :2400 Max. :2400 Max. :2400 Max. :2359
##
## ArrTime Cancelled Diverted CRSElapsedTime ActualElapsedTime
## Min. : 1 Min. :0 Min. :0 Min. : 21 Min. : 17.0
## 1st Qu.:1109 1st Qu.:0 1st Qu.:0 1st Qu.: 86 1st Qu.: 82.0
## Median :1521 Median :0 Median :0 Median :124 Median :117.0
## Mean :1488 Mean :0 Mean :0 Mean :142 Mean :135.6
## 3rd Qu.:1918 3rd Qu.:0 3rd Qu.:0 3rd Qu.:175 3rd Qu.:167.0
## Max. :2400 Max. :0 Max. :0 Max. :705 Max. :721.0
##
## Distance N.NUMBER SERIAL.NUMBER MFR.MDL.CODE
## Min. : 31.0 Length:358128 Length:358128 Length:358128
## 1st Qu.: 372.0 Class :character Class :character Class :character
## Median : 645.0 Mode :character Mode :character Mode :character
## Mean : 811.6
## 3rd Qu.:1050.0
## Max. :4983.0
##
## MFR Model YEAR.MFR
## Length:358128 Length:358128 Min. :1984
## Class :character Class :character 1st Qu.:2000
## Mode :character Mode :character Median :2003
## Mean :2003
## 3rd Qu.:2007
## Max. :2015
## NA's :8968
#Drop rows where YEAR.MFR = NA and the empty Cancelled and Diverted columns
US_Flights_Planes <- US_Flights_Planes %>% subset(select = -Cancelled)
US_Flights_Planes <- US_Flights_Planes %>% subset(select = -Diverted)
US_Flights_Planes <- US_Flights_Planes %>% filter(complete.cases(.))
Next, we execute a number of steps to define our calculated variables and convert them to date-time objects.
#Store flight date as Date object
US_Flights_Planes$FlightDate <- as.Date(US_Flights_Planes$FlightDate, format = "%m/%d/%Y")
US_Flights_Planes <- US_Flights_Planes %>% mutate(newCRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))
US_Flights_Planes$newCRSDepTime <- as.POSIXct(US_Flights_Planes$newCRSDepTime, format = "%Y-%m-%d %H%M")
US_Flights_Planes <- US_Flights_Planes %>% mutate(newCRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
US_Flights_Planes$newCRSArrTime <- as.POSIXct(US_Flights_Planes$newCRSArrTime, format = "%Y-%m-%d %H%M")
US_Flights_Planes <- US_Flights_Planes %>% mutate(NewDepTime = paste(FlightDate, sprintf("%04d", DepTime)), NewWheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)), NewWheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)), NewArrTime = paste(FlightDate, sprintf("%04d", ArrTime)))
US_Flights_Planes$NewDepTime <- as.POSIXct(US_Flights_Planes$NewDepTime, format = "%Y-%m-%d %H%M")
US_Flights_Planes$NewWheelsOff <- as.POSIXct(US_Flights_Planes$NewWheelsOff, format = "%Y-%m-%d %H%M")
US_Flights_Planes$NewWheelsOn <- as.POSIXct(US_Flights_Planes$NewWheelsOn, format = "%Y-%m-%d %H%M")
US_Flights_Planes$NewArrTime <- as.POSIXct(US_Flights_Planes$NewArrTime, format = "%Y-%m-%d %H%M")
#Calculate Departure Delay, Positive Departure Delay, and Major Departure Delay
US_Flights_Planes <- US_Flights_Planes %>% mutate(DepDelay = as.integer(difftime(NewDepTime, newCRSDepTime, units = "mins")))
US_Flights_Planes <- US_Flights_Planes %>% mutate(PosDepDelay = ifelse(DepDelay > 0, DepDelay,0), MajDepDelay = ifelse(PosDepDelay >= 15, 1, 0))
head(US_Flights_Planes %>% select(DepDelay, PosDepDelay, MajDepDelay))
## DepDelay PosDepDelay MajDepDelay
## 1 -9 0 0
## 2 -10 0 0
## 3 -17 0 0
## 4 -9 0 0
## 5 25 25 1
## 6 1 1 0
#Calculate Taxi Out, Taxi In, Arrival Delay, Positive Arrival Delay, Major Arrival Delay, and Flight Time Buffer
US_Flights_Planes <- US_Flights_Planes %>% mutate(TaxiOut = as.integer(difftime(NewWheelsOff, NewDepTime, units = "mins")), TaxiIn = as.integer(difftime(NewArrTime, NewWheelsOn, units = "mins")), ArrDelay = as.integer(difftime(NewArrTime, newCRSArrTime, units = "mins")), PosArrDelay = ifelse(ArrDelay < 0, 0, ArrDelay), MajArrDelay = ifelse(ArrDelay >= 15, 1, 0), FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime)
#Calculate AirTime and AirSpeed
US_Flights_Planes <- US_Flights_Planes %>% mutate(AirTime = ActualElapsedTime - TaxiOut - TaxiIn)
US_Flights_Planes <- US_Flights_Planes %>% mutate(AirSpeed = Distance / (AirTime / 60))
summary(US_Flights_Planes)
## FlightDate Carrier TailNum FlightNum
## Min. :2016-01-01 Length:349160 Length:349160 Min. : 1
## 1st Qu.:2016-01-08 Class :character Class :character 1st Qu.: 722
## Median :2016-01-15 Mode :character Mode :character Median :1670
## Mean :2016-01-15 Mean :2228
## 3rd Qu.:2016-01-24 3rd Qu.:3159
## Max. :2016-01-31 Max. :7438
## Origin OriginCityName OriginState
## Length:349160 Length:349160 Length:349160
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Dest DestCityName DestState CRSDepTime
## Length:349160 Length:349160 Length:349160 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 924
## Mode :character Mode :character Mode :character Median :1329
## Mean :1334
## 3rd Qu.:1731
## Max. :2359
## DepTime WheelsOff WheelsOn CRSArrTime
## Min. : 1 Min. : 1 Min. : 1 Min. : 1
## 1st Qu.: 927 1st Qu.: 941 1st Qu.:1104 1st Qu.:1120
## Median :1334 Median :1346 Median :1518 Median :1526
## Mean :1338 Mean :1360 Mean :1483 Mean :1503
## 3rd Qu.:1737 3rd Qu.:1751 3rd Qu.:1914 3rd Qu.:1920
## Max. :2400 Max. :2400 Max. :2400 Max. :2359
## ArrTime CRSElapsedTime ActualElapsedTime Distance
## Min. : 1 Min. : 21.0 Min. : 17.0 Min. : 31.0
## 1st Qu.:1108 1st Qu.: 86.0 1st Qu.: 82.0 1st Qu.: 373.0
## Median :1522 Median :124.0 Median :118.0 Median : 646.0
## Mean :1488 Mean :142.3 Mean :135.9 Mean : 814.3
## 3rd Qu.:1918 3rd Qu.:175.0 3rd Qu.:167.0 3rd Qu.:1050.0
## Max. :2400 Max. :705.0 Max. :721.0 Max. :4983.0
## N.NUMBER SERIAL.NUMBER MFR.MDL.CODE
## Length:349160 Length:349160 Length:349160
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## MFR Model YEAR.MFR
## Length:349160 Length:349160 Min. :1984
## Class :character Class :character 1st Qu.:2000
## Mode :character Mode :character Median :2003
## Mean :2003
## 3rd Qu.:2007
## Max. :2015
## newCRSDepTime newCRSArrTime
## Min. :2016-01-01 00:15:00 Min. :2016-01-01 00:01:00
## 1st Qu.:2016-01-08 11:30:00 1st Qu.:2016-01-08 13:30:00
## Median :2016-01-15 21:25:00 Median :2016-01-15 22:56:00
## Mean :2016-01-16 08:05:51 Mean :2016-01-16 09:48:26
## 3rd Qu.:2016-01-24 08:57:00 3rd Qu.:2016-01-24 10:35:00
## Max. :2016-01-31 23:59:00 Max. :2016-01-31 23:59:00
## NewDepTime NewWheelsOff
## Min. :2016-01-01 00:01:00 Min. :2016-01-01 00:01:00
## 1st Qu.:2016-01-08 11:36:45 1st Qu.:2016-01-08 11:54:00
## Median :2016-01-15 21:42:00 Median :2016-01-15 21:54:30
## Mean :2016-01-16 08:09:42 Mean :2016-01-16 08:22:16
## 3rd Qu.:2016-01-24 08:56:00 3rd Qu.:2016-01-24 09:10:00
## Max. :2016-02-01 00:00:00 Max. :2016-01-31 23:59:00
## NewWheelsOn NewArrTime
## Min. :2016-01-01 00:01:00 Min. :2016-01-01 00:01:00
## 1st Qu.:2016-01-08 13:24:00 1st Qu.:2016-01-08 13:27:00
## Median :2016-01-15 22:52:00 Median :2016-01-15 22:54:00
## Mean :2016-01-16 09:36:38 Mean :2016-01-16 09:39:20
## 3rd Qu.:2016-01-24 10:21:00 3rd Qu.:2016-01-24 10:26:00
## Max. :2016-02-01 00:00:00 Max. :2016-02-01 00:00:00
## DepDelay PosDepDelay MajDepDelay TaxiOut
## Min. :-1438.00 Min. : 0.00 Min. :0.0000 Min. :-1433.00
## 1st Qu.: -5.00 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 11.00
## Median : -2.00 Median : 0.00 Median :0.0000 Median : 14.00
## Mean : 3.85 Mean : 10.37 Mean :0.1629 Mean : 12.57
## 3rd Qu.: 5.00 3rd Qu.: 5.00 3rd Qu.:0.0000 3rd Qu.: 19.00
## Max. : 1438.00 Max. :1438.00 Max. :1.0000 Max. : 157.00
## TaxiIn ArrDelay PosArrDelay
## Min. :-1438.000 Min. :-1438.000 Min. : 0.00
## 1st Qu.: 4.000 1st Qu.: -16.000 1st Qu.: 0.00
## Median : 6.000 Median : -7.000 Median : 0.00
## Mean : 2.701 Mean : -9.099 Mean : 12.58
## 3rd Qu.: 8.000 3rd Qu.: 5.000 3rd Qu.: 5.00
## Max. : 155.000 Max. : 1439.000 Max. :1439.00
## MajArrDelay FlightTimeBuffer AirTime AirSpeed
## Min. :0.0000 Min. :-169.000 Min. : 8.0 Min. : 2.757
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 60.0 1st Qu.:361.935
## Median :0.0000 Median : 7.000 Median : 94.0 Median :410.435
## Mean :0.1556 Mean : 6.462 Mean : 120.6 Mean :404.935
## 3rd Qu.:0.0000 3rd Qu.: 14.000 3rd Qu.: 144.0 3rd Qu.:454.529
## Max. :1.0000 Max. : 78.000 Max. :2927.0 Max. :781.154
Now, we check the data frame’s summary (above), to look for suspicious data. In this case, we remove rows where the times cross a date line and introduce issues with the calculated delay metrics. This brings us to our final row count of 334,670 observations.
#Filter out flights that cross datelines and appear to arrive prior to departure
US_Flights_Planes <- US_Flights_Planes %>% filter(NewArrTime > NewDepTime & CRSArrTime > CRSDepTime)
Now that we have a clean data frame, it’s time to investigate some possible reasons for flight delays. One questions was, “Does the age of the aircraft flying result in greater numbers of delays?”
Graphic_1 <- US_Flights_Planes %>% group_by("Manufacturing Year" = YEAR.MFR, label = TRUE) %>% summarize("Percent Major Delay" = (sum(MajDepDelay)/ n())*100, "Number of Flights in Cohort" = n())
Graphic_1 %>% ggvis(~`Manufacturing Year`, ~`Percent Major Delay`, fill = ~`Number of Flights in Cohort`, size := 200, opacity :=.8) %>% layer_points
One could expect an increase in delays for older aircraft, due to increased mechanical issues causing delays. Our chart doesn’t seem to support that theory though!
Per Patrick Smith’s Ask The Pilot, aged aircraft are subjected to increasingly rigorous maintenance routines. Perhaps this increased attention actually keeps maintenance delays under control for older aircraft.
In the chart, we actually see an upward trend in delays for newer aircraft - perhaps due to issues incumbent with newer, more complex systems?
Next, it was time to investigate if we could discern a pattern between certain aircraft manufacturers and major delays in our data. Many are well aware of the rivalry between Airbus and Boeing, and perhaps we can see if one manufacturer’s aircraft experience fewer delays than the other.
#Is there a relationship between manufacturer and major departure delays?
Graphic_2 <-US_Flights_Planes %>% group_by("Manufacturer" = MFR, label = TRUE) %>% summarize("Percent Major Delay" = (sum(MajDepDelay)/ n())*100, "Number of Flights in Cohort" = n())
Graphic_2 %>% ggvis(~`Manufacturer`, ~`Percent Major Delay`, fill = ~`Number of Flights in Cohort`) %>% layer_bars(width = .5) %>% add_axis("x", tick_padding = 35, properties = axis_props(axis = list(stroke = "blue", strokeWidth = 2), ticks = list(stroke = "blue", strokeWidth = 2), labels = list(angle = 45, align = "left", fontSize = 10)))
While there are some variances apparent, we did not find a remarkable difference in delays between Airbus and Boeing. Cessna was notable in this view for experiencing major delays almost twice as frequently as Boeing/Airbus.
We’ve all heard some variation of, “Well folks, we got a bit of a late start coming out of [your city here], but we’re going to try and make up most of it in the air.” In this exploration, we’ll look at the prevalence of arrival delays by distance between airports. The theory we’re testing is for a negative relationship between long flights and arrival delays. For simplicity, we’ve filtered this visualization to the top five manufacturers and arrival delays under ten hours.
#Store the top 5 MFR's by number of flights
Top_5_MFR <- US_Flights_Planes %>% group_by(MFR = MFR) %>% summarize("Flights" = n()) %>% top_n(5)
#Is there a relationship between flight distance and arrival delay?
US_Flights_Planes %>% inner_join(Top_5_MFR) %>% filter(PosArrDelay < 600) %>% ggvis(~Distance, ~PosArrDelay, fill= ~ MFR, opacity :=.7)
From this visualization, we can confirm that pilots are often able to reduce arrival delays on longer flights, because we’re seeing Arrival Delays (in minutes) decreasing with length of flight.
An interesting additional finding - from this visualization, we gain some insight into how carriers deploy their various planes across their routes: