By Michael Shepherd
Oct. 9, 2016

Introduction

This is an analysis of data from January 2016 collected by the U.S. Department of Transportation. It contains time data on nearly 446,000 scheduled U.S. flights.

My analysis looks to see how Maine’s largest airport, the Portland International Jetport (PWM), compares to other large U.S. airports when it comes to its rate of ahead-of-schedule flights.

It stands to reason that because of Maine’s harsh January weather, flights could be slower. This analysis will test that hypothesis, looking at aggregate numbers and the performance of the five carriers that serve PWM: American Airlines (AA), Jetblue(B6), Delta (DL), ExpressJet (EV) and Southwest Airlines (WN).

The three questions I’m looking to answer are:

  1. What percentage of flights from PWM arrived most ahead of schedule and how did it break down by carrier?
  2. Which most arrived 10 minutes or more ahead of schedule and how did it break down by carrier?
  3. Did this adhere to national norms?

Process

First, I loaded the data packages used in our class’ last lesson and loaded the full flight dataset.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggvis)
library(tidyr)
library(knitr)
AllFlights <- read.csv("http://media.usm.maine.edu/~suleiman/mba676/domestic_flights_jan_2016.csv", header = TRUE, stringsAsFactors = FALSE)

A quick look at the data shows that flights that have “NA” time data have either been cancelled or diverted. I’ll filter those out before running calculations.

Before that, I need to transform the category that defines the dates of flights to a month-day-year format. It will figure into other calculations later on.

AllFlights$FlightDate <- as.Date(AllFlights$FlightDate, format = "%m/%d/%Y")

Next, I cut those cancelled and diverted flights from the dataset and did a quick check on the data to see the latest time that a flight left — midnight. Since my analysis only focuses on the arrival time of flights leaving Portland, this doesn’t pose an issue.

AllFlights %>% filter(Cancelled == 0, Diverted == 0) %>% summarize(maxcrsdep = max(CRSDepTime), maxdep = max(DepTime))
##   maxcrsdep maxdep
## 1      2359   2400

Next, I created the new variables from the class notes that allow us to manipulate more time data. I don’t need all of them since my analysis is focused on the arrival side, but I made them anyway.

AllFlights <- AllFlights %>% 
  mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))
AllFlights$new_CRSDepTime <- as.POSIXct(AllFlights$new_CRSDepTime, format="%Y-%m-%d %H%M")
AllFlights <- AllFlights %>% 
  mutate(new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
AllFlights$new_CRSArrTime <- as.POSIXct(AllFlights$new_CRSArrTime, format="%Y-%m-%d %H%M")
AllFlights <- AllFlights %>% filter(Cancelled == 0, Diverted == 0) %>% 
  mutate(new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)), new_WheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)),
  new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)), new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime)))
AllFlights$new_WheelsOff <- as.POSIXct(AllFlights$new_WheelsOff, format="%Y-%m-%d %H%M")
AllFlights$new_WheelsOn <- as.POSIXct(AllFlights$new_WheelsOn, format="%Y-%m-%d %H%M")
AllFlights$new_DepTime <- as.POSIXct(AllFlights$new_DepTime, format="%Y-%m-%d %H%M")
AllFlights$new_ArrTime <- as.POSIXct(AllFlights$new_ArrTime, format="%Y-%m-%d %H%M")

Now, I start to make the unique variables that I need to calculate the differences between Portland flights’ estimated and actual arrival times, starting by simply subtracting the estimated time from the actual time.

AllFlights <- AllFlights %>% filter(Cancelled == 0, Diverted == 0) %>%
  mutate(ArrAhead = as.integer(difftime(new_CRSArrTime, new_ArrTime, units = "mins")))

But that creates a problem with negative numbers for flights that are delayed. So, I create three more variables.
The first one gets rid of the negative values for those flights and only counts the difference for the on-time ones. The second is a dummy variable marked with a “1” if flights are ahead of schedule by 10 minutes; the third is a dummy variable for flights ahead of schedule at all.

AllFlights <- AllFlights %>% filter(Cancelled == 0, Diverted == 0) %>% 
  mutate(ArrAheadReal = ifelse(ArrAhead < 0, 0, ArrAhead),
         ArrAhead10 = ifelse(ArrAhead >= 10, 1, 0),
         ArrAheadDummy = ifelse(ArrAhead > 0, 1, 0))

Then, I finally create the subset of data isolating PWM flights.

PWMFlightsOut <- AllFlights %>% filter(Origin == "PWM")

Question 1

Calculations from that newly created dataset show that nearly 72 percent of flights from Portland arrived ahead of schedule.

PWMFlightsOut %>% select(ArrAheadDummy) %>% summarize(perc_ArrAheadDummy = sum(ArrAheadDummy) / n())
##   perc_ArrAheadDummy
## 1           0.715493

Four of the five carriers that served Portland were above 70 percent, but American Airlines lagged, at less than 54 percent.

Q1A <- PWMFlightsOut %>% group_by(Carrier) %>% select(ArrAheadDummy) %>% summarize(perc_ArrAheadDummy = sum(ArrAheadDummy) / n())
## Adding missing grouping variables: `Carrier`
kable(Q1A)
Carrier perc_ArrAheadDummy
AA 0.5357143
B6 0.7349398
DL 0.7777778
EV 0.7058824
WN 0.7656250
Q1A %>% ggvis(~Carrier, ~perc_ArrAheadDummy) %>% layer_bars()

#### Question 2
Just over 55 percent of flights from Portland arrived 10 minutes or more ahead of schedule.

PWMFlightsOut %>% select(ArrAhead10) %>% summarize(perc_ArrAhead10 = sum(ArrAhead10) / n())
##   perc_ArrAhead10
## 1       0.5549296

Breaking that down by carrier, American Airlines is again the laggard, with only 25 percent of flights arriving 10 minutes or more ahead of schedule. The others are over our average.

Q2A <- PWMFlightsOut %>% group_by(Carrier) %>% select(ArrAhead10) %>% summarize(perc_ArrAhead10 = sum(ArrAhead10) / n())
## Adding missing grouping variables: `Carrier`
kable(Q2A)
Carrier perc_ArrAhead10
AA 0.2500000
B6 0.5662651
DL 0.6296296
EV 0.5686275
WN 0.6093750
Q2A %>% ggvis(~Carrier, ~perc_ArrAhead10) %>% layer_bars()

## Question 3
Now, we see if PWM’s overall and carrier numbers hold to national norms, finding first that it narrowly beat the national average for 12 carriers on ahead-of-schedule flights, 72 percent to 66 percent.

AllFlights %>% select(ArrAheadDummy) %>% summarize(perc_ArrAheadDummy = sum(ArrAheadDummy) / n())
##   perc_ArrAheadDummy
## 1          0.6581406

However, PWM beats the national average more dramatically when it comes to flights that are 10 minutes or more early, 55 percent to 42 percent.

AllFlights %>% select(ArrAhead10) %>% summarize(perc_ArrAhead10 = sum(ArrAhead10) / n())
##   perc_ArrAhead10
## 1       0.4207289

For all national ahead-of-schedule flights, we see that American Airlines did 10 percentage points better nationally that it did at PWM. The others do worse, with Jetblue nearly 20 points behind. That helps explain the narrow gap between the PWM and national averages.

Q3B <- AllFlights %>% group_by(Carrier) %>% select(ArrAheadDummy) %>% summarize(perc_ArrAheadDummy = sum(ArrAheadDummy) / n())
## Adding missing grouping variables: `Carrier`
kable(Q3B)
Carrier perc_ArrAheadDummy
AA 0.6380059
AS 0.7002711
B6 0.5436132
DL 0.7089688
EV 0.6747676
F9 0.7410943
HA 0.6634769
NK 0.4888620
OO 0.6034969
UA 0.7337173
VX 0.5845151
WN 0.6627029

On flights that arrive more than 10 minutes early, we see that ExpressJet is the only PWM carrier beating the national average, although American Airlines does far better than it does at PWM.

Q3C <- AllFlights %>% group_by(Carrier) %>% select(ArrAhead10) %>% summarize(perc_ArrAhead10 = sum(ArrAhead10) / n())
## Adding missing grouping variables: `Carrier`
kable(Q3C)
Carrier perc_ArrAhead10
AA 0.4078861
AS 0.4659675
B6 0.3574698
DL 0.4795117
EV 0.4468206
F9 0.5565688
HA 0.1990431
NK 0.2793364
OO 0.3769299
UA 0.5553963
VX 0.3553085
WN 0.3789609

Conclusion

The analysis confirms that PWM beat the national average when it comes to both categories of ahead-of-schedule flights in January 2016. Since the data is limited, this could be due to luck or natural variances if the national data is a fair indicator of normal performance from each carrier.