MBA 676 Unit 6, Fall 2016

In project 2, we’re looking at some time-series data as they relate to airline performance. Our data set includes domestic flights reported for on-time performance in the US for January 2016.

Given that our data set included tail numbers for the aircraft operating each flight, this student-analyst was interested in whether we could discern any relationships between on-time performance and aircraft age or manufacturer.

To join the aircraft data to the tail number, this writer consulted the FAA’s aircraft registration database, downloadable here.

Some Data Wrangling:

We begin by setting up our environment and reading in our values:

#Set Working Directory and Load the Requisite Libraries
setwd("~/Documents/MBA 676/Unit 6 Stuff")
getwd()
library (dplyr)
library (lubridate)
library (chron)
library (quantmod)
library (dygraphs)
library(mice)
library(ggvis)
library(chron)
library(knitr)
#Load Project CSV
US_Flights <- read.csv("domestic_flights_jan_2016.csv", header = TRUE, stringsAsFactors = FALSE)
#Read in N_Number Reference (Tail Numbers)
N_Num_Ref <- read.csv("N_Number_Lookup.csv", header = TRUE, stringsAsFactors = FALSE)

Next, we drop records for canceled and diverted flights. We also drop records for AA and MQ carriers. While this is unfortunate in terms of reducing our data set, these carriers use Fleet IDs rather than standard N_Numbers. If we didn’t eliminate these carriers, our join between the data sets would result in erroneous matches. See Hadley’s article, page 4 FMI regarding tail numbers and fleet IDs. This activity reduces the size of our frame from the original 445,827, to 360,562 records.

#Drop Diverted and Cancelled Flights
US_Flights_2 <- US_Flights %>% filter(Diverted == 0) %>% filter(Cancelled == 0)
#Create US_Flights_3 because we can't match AA or MQ tail numbers per Hadley
US_Flights_3 <- US_Flights_2 %>% filter(Carrier != "AA" & Carrier != "MQ")

Next, we join our flights and aircraft data. The summary command helps find missing values in the result set.

We drop rows where we are missing year manufactured data and perform some cleanup to remove the empty cancelled and diverted flags. These steps bring our row count down by about 9,000 records, close to our final working value.

#Join plane data to flight data
US_Flights_Planes <- US_Flights_3 %>% inner_join(N_Num_Ref)
#Look for missing values
summary(US_Flights_Planes)
##   FlightDate          Carrier            TailNum            FlightNum   
##  Length:358128      Length:358128      Length:358128      Min.   :   1  
##  Class :character   Class :character   Class :character   1st Qu.: 725  
##  Mode  :character   Mode  :character   Mode  :character   Median :1678  
##                                                           Mean   :2238  
##                                                           3rd Qu.:3231  
##                                                           Max.   :7438  
##                                                                         
##     Origin          OriginCityName     OriginState       
##  Length:358128      Length:358128      Length:358128     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      Dest           DestCityName        DestState           CRSDepTime  
##  Length:358128      Length:358128      Length:358128      Min.   :   1  
##  Class :character   Class :character   Class :character   1st Qu.: 924  
##  Mode  :character   Mode  :character   Mode  :character   Median :1329  
##                                                           Mean   :1334  
##                                                           3rd Qu.:1731  
##                                                           Max.   :2359  
##                                                                         
##     DepTime       WheelsOff       WheelsOn      CRSArrTime  
##  Min.   :   1   Min.   :   1   Min.   :   1   Min.   :   1  
##  1st Qu.: 927   1st Qu.: 941   1st Qu.:1104   1st Qu.:1120  
##  Median :1335   Median :1346   Median :1518   Median :1526  
##  Mean   :1338   Mean   :1360   Mean   :1483   Mean   :1503  
##  3rd Qu.:1737   3rd Qu.:1751   3rd Qu.:1913   3rd Qu.:1920  
##  Max.   :2400   Max.   :2400   Max.   :2400   Max.   :2359  
##                                                             
##     ArrTime       Cancelled    Diverted CRSElapsedTime ActualElapsedTime
##  Min.   :   1   Min.   :0   Min.   :0   Min.   : 21    Min.   : 17.0    
##  1st Qu.:1109   1st Qu.:0   1st Qu.:0   1st Qu.: 86    1st Qu.: 82.0    
##  Median :1521   Median :0   Median :0   Median :124    Median :117.0    
##  Mean   :1488   Mean   :0   Mean   :0   Mean   :142    Mean   :135.6    
##  3rd Qu.:1918   3rd Qu.:0   3rd Qu.:0   3rd Qu.:175    3rd Qu.:167.0    
##  Max.   :2400   Max.   :0   Max.   :0   Max.   :705    Max.   :721.0    
##                                                                         
##     Distance        N.NUMBER         SERIAL.NUMBER      MFR.MDL.CODE      
##  Min.   :  31.0   Length:358128      Length:358128      Length:358128     
##  1st Qu.: 372.0   Class :character   Class :character   Class :character  
##  Median : 645.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 811.6                                                           
##  3rd Qu.:1050.0                                                           
##  Max.   :4983.0                                                           
##                                                                           
##      MFR               Model              YEAR.MFR   
##  Length:358128      Length:358128      Min.   :1984  
##  Class :character   Class :character   1st Qu.:2000  
##  Mode  :character   Mode  :character   Median :2003  
##                                        Mean   :2003  
##                                        3rd Qu.:2007  
##                                        Max.   :2015  
##                                        NA's   :8968
#Drop rows where YEAR.MFR = NA and the empty Cancelled and Diverted columns
US_Flights_Planes <- US_Flights_Planes %>% subset(select = -Cancelled)
US_Flights_Planes <- US_Flights_Planes %>% subset(select = -Diverted)
US_Flights_Planes <- US_Flights_Planes %>% filter(complete.cases(.))

Next, we execute a number of steps to define our calculated variables and convert them to date-time objects.

#Store flight date as Date object
US_Flights_Planes$FlightDate <- as.Date(US_Flights_Planes$FlightDate, format = "%m/%d/%Y")

US_Flights_Planes <- US_Flights_Planes %>% mutate(newCRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))

US_Flights_Planes$newCRSDepTime <- as.POSIXct(US_Flights_Planes$newCRSDepTime, format = "%Y-%m-%d %H%M")

US_Flights_Planes <- US_Flights_Planes %>% mutate(newCRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))

US_Flights_Planes$newCRSArrTime <- as.POSIXct(US_Flights_Planes$newCRSArrTime, format = "%Y-%m-%d %H%M")

US_Flights_Planes <- US_Flights_Planes %>% mutate(NewDepTime = paste(FlightDate, sprintf("%04d", DepTime)), NewWheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)), NewWheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)), NewArrTime = paste(FlightDate, sprintf("%04d", ArrTime)))

US_Flights_Planes$NewDepTime <- as.POSIXct(US_Flights_Planes$NewDepTime, format = "%Y-%m-%d %H%M")

US_Flights_Planes$NewWheelsOff <- as.POSIXct(US_Flights_Planes$NewWheelsOff, format = "%Y-%m-%d %H%M")

US_Flights_Planes$NewWheelsOn <- as.POSIXct(US_Flights_Planes$NewWheelsOn, format = "%Y-%m-%d %H%M")

US_Flights_Planes$NewArrTime <- as.POSIXct(US_Flights_Planes$NewArrTime, format = "%Y-%m-%d %H%M")
#Calculate Departure Delay, Positive Departure Delay, and Major Departure Delay
US_Flights_Planes <- US_Flights_Planes %>% mutate(DepDelay = as.integer(difftime(NewDepTime, newCRSDepTime, units = "mins")))

US_Flights_Planes <- US_Flights_Planes %>% mutate(PosDepDelay = ifelse(DepDelay > 0, DepDelay,0), MajDepDelay = ifelse(PosDepDelay >= 15, 1, 0))
head(US_Flights_Planes %>% select(DepDelay, PosDepDelay, MajDepDelay))
##   DepDelay PosDepDelay MajDepDelay
## 1       -9           0           0
## 2      -10           0           0
## 3      -17           0           0
## 4       -9           0           0
## 5       25          25           1
## 6        1           1           0
#Calculate Taxi Out, Taxi In, Arrival Delay, Positive Arrival Delay, Major Arrival Delay, and Flight Time Buffer
US_Flights_Planes <- US_Flights_Planes %>% mutate(TaxiOut = as.integer(difftime(NewWheelsOff, NewDepTime, units = "mins")), TaxiIn = as.integer(difftime(NewArrTime, NewWheelsOn, units = "mins")), ArrDelay = as.integer(difftime(NewArrTime, newCRSArrTime, units = "mins")), PosArrDelay = ifelse(ArrDelay < 0, 0, ArrDelay), MajArrDelay = ifelse(ArrDelay >= 15, 1, 0), FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime)

#Calculate AirTime and AirSpeed
US_Flights_Planes <- US_Flights_Planes %>% mutate(AirTime = ActualElapsedTime - TaxiOut - TaxiIn)

US_Flights_Planes <- US_Flights_Planes %>% mutate(AirSpeed = Distance / (AirTime / 60))
summary(US_Flights_Planes)
##    FlightDate           Carrier            TailNum            FlightNum   
##  Min.   :2016-01-01   Length:349160      Length:349160      Min.   :   1  
##  1st Qu.:2016-01-08   Class :character   Class :character   1st Qu.: 722  
##  Median :2016-01-15   Mode  :character   Mode  :character   Median :1670  
##  Mean   :2016-01-15                                         Mean   :2228  
##  3rd Qu.:2016-01-24                                         3rd Qu.:3159  
##  Max.   :2016-01-31                                         Max.   :7438  
##     Origin          OriginCityName     OriginState       
##  Length:349160      Length:349160      Length:349160     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##      Dest           DestCityName        DestState           CRSDepTime  
##  Length:349160      Length:349160      Length:349160      Min.   :   1  
##  Class :character   Class :character   Class :character   1st Qu.: 924  
##  Mode  :character   Mode  :character   Mode  :character   Median :1329  
##                                                           Mean   :1334  
##                                                           3rd Qu.:1731  
##                                                           Max.   :2359  
##     DepTime       WheelsOff       WheelsOn      CRSArrTime  
##  Min.   :   1   Min.   :   1   Min.   :   1   Min.   :   1  
##  1st Qu.: 927   1st Qu.: 941   1st Qu.:1104   1st Qu.:1120  
##  Median :1334   Median :1346   Median :1518   Median :1526  
##  Mean   :1338   Mean   :1360   Mean   :1483   Mean   :1503  
##  3rd Qu.:1737   3rd Qu.:1751   3rd Qu.:1914   3rd Qu.:1920  
##  Max.   :2400   Max.   :2400   Max.   :2400   Max.   :2359  
##     ArrTime     CRSElapsedTime  ActualElapsedTime    Distance     
##  Min.   :   1   Min.   : 21.0   Min.   : 17.0     Min.   :  31.0  
##  1st Qu.:1108   1st Qu.: 86.0   1st Qu.: 82.0     1st Qu.: 373.0  
##  Median :1522   Median :124.0   Median :118.0     Median : 646.0  
##  Mean   :1488   Mean   :142.3   Mean   :135.9     Mean   : 814.3  
##  3rd Qu.:1918   3rd Qu.:175.0   3rd Qu.:167.0     3rd Qu.:1050.0  
##  Max.   :2400   Max.   :705.0   Max.   :721.0     Max.   :4983.0  
##    N.NUMBER         SERIAL.NUMBER      MFR.MDL.CODE      
##  Length:349160      Length:349160      Length:349160     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##      MFR               Model              YEAR.MFR   
##  Length:349160      Length:349160      Min.   :1984  
##  Class :character   Class :character   1st Qu.:2000  
##  Mode  :character   Mode  :character   Median :2003  
##                                        Mean   :2003  
##                                        3rd Qu.:2007  
##                                        Max.   :2015  
##  newCRSDepTime                 newCRSArrTime                
##  Min.   :2016-01-01 00:15:00   Min.   :2016-01-01 00:01:00  
##  1st Qu.:2016-01-08 11:30:00   1st Qu.:2016-01-08 13:30:00  
##  Median :2016-01-15 21:25:00   Median :2016-01-15 22:56:00  
##  Mean   :2016-01-16 08:05:51   Mean   :2016-01-16 09:48:26  
##  3rd Qu.:2016-01-24 08:57:00   3rd Qu.:2016-01-24 10:35:00  
##  Max.   :2016-01-31 23:59:00   Max.   :2016-01-31 23:59:00  
##    NewDepTime                   NewWheelsOff                
##  Min.   :2016-01-01 00:01:00   Min.   :2016-01-01 00:01:00  
##  1st Qu.:2016-01-08 11:36:45   1st Qu.:2016-01-08 11:54:00  
##  Median :2016-01-15 21:42:00   Median :2016-01-15 21:54:30  
##  Mean   :2016-01-16 08:09:42   Mean   :2016-01-16 08:22:16  
##  3rd Qu.:2016-01-24 08:56:00   3rd Qu.:2016-01-24 09:10:00  
##  Max.   :2016-02-01 00:00:00   Max.   :2016-01-31 23:59:00  
##   NewWheelsOn                    NewArrTime                 
##  Min.   :2016-01-01 00:01:00   Min.   :2016-01-01 00:01:00  
##  1st Qu.:2016-01-08 13:24:00   1st Qu.:2016-01-08 13:27:00  
##  Median :2016-01-15 22:52:00   Median :2016-01-15 22:54:00  
##  Mean   :2016-01-16 09:36:38   Mean   :2016-01-16 09:39:20  
##  3rd Qu.:2016-01-24 10:21:00   3rd Qu.:2016-01-24 10:26:00  
##  Max.   :2016-02-01 00:00:00   Max.   :2016-02-01 00:00:00  
##     DepDelay         PosDepDelay       MajDepDelay        TaxiOut        
##  Min.   :-1438.00   Min.   :   0.00   Min.   :0.0000   Min.   :-1433.00  
##  1st Qu.:   -5.00   1st Qu.:   0.00   1st Qu.:0.0000   1st Qu.:   11.00  
##  Median :   -2.00   Median :   0.00   Median :0.0000   Median :   14.00  
##  Mean   :    3.85   Mean   :  10.37   Mean   :0.1629   Mean   :   12.57  
##  3rd Qu.:    5.00   3rd Qu.:   5.00   3rd Qu.:0.0000   3rd Qu.:   19.00  
##  Max.   : 1438.00   Max.   :1438.00   Max.   :1.0000   Max.   :  157.00  
##      TaxiIn             ArrDelay          PosArrDelay     
##  Min.   :-1438.000   Min.   :-1438.000   Min.   :   0.00  
##  1st Qu.:    4.000   1st Qu.:  -16.000   1st Qu.:   0.00  
##  Median :    6.000   Median :   -7.000   Median :   0.00  
##  Mean   :    2.701   Mean   :   -9.099   Mean   :  12.58  
##  3rd Qu.:    8.000   3rd Qu.:    5.000   3rd Qu.:   5.00  
##  Max.   :  155.000   Max.   : 1439.000   Max.   :1439.00  
##   MajArrDelay     FlightTimeBuffer      AirTime          AirSpeed      
##  Min.   :0.0000   Min.   :-169.000   Min.   :   8.0   Min.   :  2.757  
##  1st Qu.:0.0000   1st Qu.:   0.000   1st Qu.:  60.0   1st Qu.:361.935  
##  Median :0.0000   Median :   7.000   Median :  94.0   Median :410.435  
##  Mean   :0.1556   Mean   :   6.462   Mean   : 120.6   Mean   :404.935  
##  3rd Qu.:0.0000   3rd Qu.:  14.000   3rd Qu.: 144.0   3rd Qu.:454.529  
##  Max.   :1.0000   Max.   :  78.000   Max.   :2927.0   Max.   :781.154

Now, we check the data frame’s summary (above), to look for suspicious data. In this case, we remove rows where the times cross a date line and introduce issues with the calculated delay metrics. This brings us to our final row count of 334,670 observations.

#Filter out flights that cross datelines and appear to arrive prior to departure
US_Flights_Planes <- US_Flights_Planes %>% filter(NewArrTime > NewDepTime & CRSArrTime > CRSDepTime)

The Fun Part

Year of Aircraft Manufacture and Delays

Now that we have a clean data frame, it’s time to investigate some possible reasons for flight delays. One questions was, “Does the age of the aircraft flying result in greater numbers of delays?”

Graphic_1 <- US_Flights_Planes %>% group_by("Manufacturing Year" = YEAR.MFR, label = TRUE) %>% summarize("Percent Major Delay" = (sum(MajDepDelay)/ n())*100, "Number of Flights in Cohort" = n())
Graphic_1 %>% ggvis(~`Manufacturing Year`, ~`Percent Major Delay`, fill = ~`Number of Flights in Cohort`, size := 200, opacity :=.8) %>% layer_points

One could expect an increase in delays for older aircraft, due to increased mechanical issues causing delays. Our chart doesn’t seem to support that theory though!

Per Patrick Smith’s Ask The Pilot, aged aircraft are subjected to increasingly rigorous maintenance routines. Perhaps this increased attention actually keeps maintenance delays under control for older aircraft.

In the chart, we actually see an upward trend in delays for newer aircraft - perhaps due to issues incumbent with newer, more complex systems?

Aircraft Manufacturer and Major Delays

Next, it was time to investigate if we could discern a pattern between certain aircraft manufacturers and major delays in our data. Many are well aware of the rivalry between Airbus and Boeing, and perhaps we can see if one manufacturer’s aircraft experience fewer delays than the other.

#Is there a relationship between manufacturer and major departure delays?
Graphic_2 <-US_Flights_Planes %>% group_by("Manufacturer" = MFR, label = TRUE) %>% summarize("Percent Major Delay" = (sum(MajDepDelay)/ n())*100, "Number of Flights in Cohort" = n())
Graphic_2 %>% ggvis(~`Manufacturer`, ~`Percent Major Delay`, fill = ~`Number of Flights in Cohort`) %>% layer_bars(width = .5) %>% add_axis("x", tick_padding = 35, properties = axis_props(axis = list(stroke = "blue", strokeWidth = 2), ticks = list(stroke = "blue", strokeWidth = 2), labels = list(angle = 45, align = "left", fontSize = 10)))

While there are some variances apparent, we did not find a remarkable difference in delays between Airbus and Boeing. Cessna was notable in this view for experiencing major delays almost twice as frequently as Boeing/Airbus.

Can Pilots Reduce Arrival Delays on Longer Flights?

We’ve all heard some variation of, “Well folks, we got a bit of a late start coming out of [your city here], but we’re going to try and make up most of it in the air.” In this exploration, we’ll look at the prevalence of arrival delays by distance between airports. The theory we’re testing is for a negative relationship between long flights and arrival delays. For simplicity, we’ve filtered this visualization to the top five manufacturers and arrival delays under ten hours.

#Store the top 5 MFR's by number of flights
Top_5_MFR <- US_Flights_Planes %>% group_by(MFR = MFR) %>% summarize("Flights" = n()) %>% top_n(5)
#Is there a relationship between flight distance and arrival delay?
US_Flights_Planes %>% inner_join(Top_5_MFR) %>% filter(PosArrDelay < 600) %>% ggvis(~Distance, ~PosArrDelay, fill= ~ MFR, opacity :=.7)

From this visualization, we can confirm that pilots are often able to reduce arrival delays on longer flights, because we’re seeing Arrival Delays (in minutes) decreasing with length of flight.

An interesting additional finding - from this visualization, we gain some insight into how carriers deploy their various planes across their routes:

  1. Boeing / Airbus: Shared responsibility on both short- and long-haul flights, but carriers deploy these aircraft exclusively on their longest routes.
  2. Embraer / Bombardier: Regional Jets (RJ’s) limited to short-medium haul routes, generally under 1,500 miles.
  3. McDonnell Douglas: These venerable craft are some of the oldest in the fleet (though the absolute oldest are aged Boeing 757 and 737’s). This manufacturer produced aircraft that could fly trans-continental routes in their heyday, but are largely limited to medium-length routes for economic, rather than technical reasons today.