By Michael Shepherd
Oct. 9, 2016
This is an analysis of data from January 2016 collected by the U.S. Department of Transportation. It contains time data on nearly 446,000 scheduled U.S. flights.
My analysis looks to see how Maine’s largest airport, the Portland International Jetport (PWM), compares to other large U.S. airports when it comes to its rate of ahead-of-schedule flights.
It stands to reason that because of Maine’s harsh January weather, flights could be slower. This analysis will test that hypothesis, looking at aggregate numbers and the performance of the five carriers that serve PWM: American Airlines (AA), Jetblue(B6), Delta (DL), ExpressJet (EV) and Southwest Airlines (WN).
The three questions I’m looking to answer are:
First, I loaded the data packages used in our class’ last lesson and loaded the full flight dataset.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggvis)
library(tidyr)
library(knitr)
AllFlights <- read.csv("http://media.usm.maine.edu/~suleiman/mba676/domestic_flights_jan_2016.csv", header = TRUE, stringsAsFactors = FALSE)
A quick look at the data shows that flights that have “NA” time data have either been cancelled or diverted. I’ll filter those out before running calculations.
Before that, I need to transform the category that defines the dates of flights to a month-day-year format. It will figure into other calculations later on.
AllFlights$FlightDate <- as.Date(AllFlights$FlightDate, format = "%m/%d/%Y")
Next, I cut those cancelled and diverted flights from the dataset and did a quick check on the data to see the latest time that a flight left — midnight. Since my analysis only focuses on the arrival time of flights leaving Portland, this doesn’t pose an issue.
AllFlights %>% filter(Cancelled == 0, Diverted == 0) %>% summarize(maxcrsdep = max(CRSDepTime), maxdep = max(DepTime))
## maxcrsdep maxdep
## 1 2359 2400
Next, I created the new variables from the class notes that allow us to manipulate more time data. I don’t need all of them since my analysis is focused on the arrival side, but I made them anyway.
AllFlights <- AllFlights %>%
mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))
AllFlights$new_CRSDepTime <- as.POSIXct(AllFlights$new_CRSDepTime, format="%Y-%m-%d %H%M")
AllFlights <- AllFlights %>%
mutate(new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
AllFlights$new_CRSArrTime <- as.POSIXct(AllFlights$new_CRSArrTime, format="%Y-%m-%d %H%M")
AllFlights <- AllFlights %>% filter(Cancelled == 0, Diverted == 0) %>%
mutate(new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)), new_WheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)),
new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)), new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime)))
AllFlights$new_WheelsOff <- as.POSIXct(AllFlights$new_WheelsOff, format="%Y-%m-%d %H%M")
AllFlights$new_WheelsOn <- as.POSIXct(AllFlights$new_WheelsOn, format="%Y-%m-%d %H%M")
AllFlights$new_DepTime <- as.POSIXct(AllFlights$new_DepTime, format="%Y-%m-%d %H%M")
AllFlights$new_ArrTime <- as.POSIXct(AllFlights$new_ArrTime, format="%Y-%m-%d %H%M")
Now, I start to make the unique variables that I need to calculate the differences between Portland flights’ estimated and actual arrival times, starting by simply subtracting the estimated time from the actual time.
AllFlights <- AllFlights %>% filter(Cancelled == 0, Diverted == 0) %>%
mutate(ArrAhead = as.integer(difftime(new_CRSArrTime, new_ArrTime, units = "mins")))
But that creates a problem with negative numbers for flights that are delayed. So, I create three more variables.
The first one gets rid of the negative values for those flights and only counts the difference for the on-time ones. The second is a dummy variable marked with a “1” if flights are ahead of schedule by 10 minutes; the third is a dummy variable for flights ahead of schedule at all.
AllFlights <- AllFlights %>% filter(Cancelled == 0, Diverted == 0) %>%
mutate(ArrAheadReal = ifelse(ArrAhead < 0, 0, ArrAhead),
ArrAhead10 = ifelse(ArrAhead >= 10, 1, 0),
ArrAheadDummy = ifelse(ArrAhead > 0, 1, 0))
Then, I finally create the subset of data isolating PWM flights.
PWMFlightsOut <- AllFlights %>% filter(Origin == "PWM")
Calculations from that newly created dataset show that nearly 72 percent of flights from Portland arrived ahead of schedule.
PWMFlightsOut %>% select(ArrAheadDummy) %>% summarize(perc_ArrAheadDummy = sum(ArrAheadDummy) / n())
## perc_ArrAheadDummy
## 1 0.715493
Four of the five carriers that served Portland were above 70 percent, but American Airlines lagged, at less than 54 percent.
Q1A <- PWMFlightsOut %>% group_by(Carrier) %>% select(ArrAheadDummy) %>% summarize(perc_ArrAheadDummy = sum(ArrAheadDummy) / n())
## Adding missing grouping variables: `Carrier`
kable(Q1A)
Carrier | perc_ArrAheadDummy |
---|---|
AA | 0.5357143 |
B6 | 0.7349398 |
DL | 0.7777778 |
EV | 0.7058824 |
WN | 0.7656250 |
Q1A %>% ggvis(~Carrier, ~perc_ArrAheadDummy) %>% layer_bars()
PWMFlightsOut %>% select(ArrAhead10) %>% summarize(perc_ArrAhead10 = sum(ArrAhead10) / n())
## perc_ArrAhead10
## 1 0.5549296
Breaking that down by carrier, American Airlines is again the laggard, with only 25 percent of flights arriving 10 minutes or more ahead of schedule. The others are over our average.
Q2A <- PWMFlightsOut %>% group_by(Carrier) %>% select(ArrAhead10) %>% summarize(perc_ArrAhead10 = sum(ArrAhead10) / n())
## Adding missing grouping variables: `Carrier`
kable(Q2A)
Carrier | perc_ArrAhead10 |
---|---|
AA | 0.2500000 |
B6 | 0.5662651 |
DL | 0.6296296 |
EV | 0.5686275 |
WN | 0.6093750 |
Q2A %>% ggvis(~Carrier, ~perc_ArrAhead10) %>% layer_bars()
AllFlights %>% select(ArrAheadDummy) %>% summarize(perc_ArrAheadDummy = sum(ArrAheadDummy) / n())
## perc_ArrAheadDummy
## 1 0.6581406
However, PWM beats the national average more dramatically when it comes to flights that are 10 minutes or more early, 55 percent to 42 percent.
AllFlights %>% select(ArrAhead10) %>% summarize(perc_ArrAhead10 = sum(ArrAhead10) / n())
## perc_ArrAhead10
## 1 0.4207289
For all national ahead-of-schedule flights, we see that American Airlines did 10 percentage points better nationally that it did at PWM. The others do worse, with Jetblue nearly 20 points behind. That helps explain the narrow gap between the PWM and national averages.
Q3B <- AllFlights %>% group_by(Carrier) %>% select(ArrAheadDummy) %>% summarize(perc_ArrAheadDummy = sum(ArrAheadDummy) / n())
## Adding missing grouping variables: `Carrier`
kable(Q3B)
Carrier | perc_ArrAheadDummy |
---|---|
AA | 0.6380059 |
AS | 0.7002711 |
B6 | 0.5436132 |
DL | 0.7089688 |
EV | 0.6747676 |
F9 | 0.7410943 |
HA | 0.6634769 |
NK | 0.4888620 |
OO | 0.6034969 |
UA | 0.7337173 |
VX | 0.5845151 |
WN | 0.6627029 |
On flights that arrive more than 10 minutes early, we see that ExpressJet is the only PWM carrier beating the national average, although American Airlines does far better than it does at PWM.
Q3C <- AllFlights %>% group_by(Carrier) %>% select(ArrAhead10) %>% summarize(perc_ArrAhead10 = sum(ArrAhead10) / n())
## Adding missing grouping variables: `Carrier`
kable(Q3C)
Carrier | perc_ArrAhead10 |
---|---|
AA | 0.4078861 |
AS | 0.4659675 |
B6 | 0.3574698 |
DL | 0.4795117 |
EV | 0.4468206 |
F9 | 0.5565688 |
HA | 0.1990431 |
NK | 0.2793364 |
OO | 0.3769299 |
UA | 0.5553963 |
VX | 0.3553085 |
WN | 0.3789609 |
The analysis confirms that PWM beat the national average when it comes to both categories of ahead-of-schedule flights in January 2016. Since the data is limited, this could be due to luck or natural variances if the national data is a fair indicator of normal performance from each carrier.