Is there evidence to suggest that some airline carriers gain time once their flight is delayed? An official flight delay is one that departs more than 15 minutes after the scheduled departure time. This analysis compares the time gain (i.e., the difference between scheduled flight time and actual flight time) for delayed flights against various predictors. A simplistic linear model suggests that longer travel distances are associated with higher gains, as are a few specific carriers.
This analysis depends on data stored in a database, specifically Amazon's cloud-based Redshift data warehouse. Furthermore, this script relies on dplyr to connect to the data warehouse and to translate R commands into PostgreSQL. For more details on Amazon Redshift or dplyr, please refer to the appendix at the bottom of this page.
The Air On Time data consist of flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. There are nearly 120 million records in total, taking up 1.6 gigabytes of space compressed and 12 gigabytes uncompressed. The data source can be found here.
Markus Schmidberger wrote a guide for uploading the Air On Time data into a Redshift database. For more information see his post on the AWS Big Data Blog titled Connecting R With Amazon Redshift.
library(dplyr)
library(ggplot2)
library(dygraphs)
library(xts)
library(DT)
library(gridExtra)
(air <- src_postgres(dbname = Sys.getenv('dbname'),
                     host = Sys.getenv('host'),
                     port = Sys.getenv('port'),
                     user = Sys.getenv('user'),
                     password = Sys.getenv('password')))
## src: postgres 8.0.2 [redshift_user@sol-eng.cjku7otn8uia.us-west-2.redshift.amazonaws.com:5439/airontime]
## tbls: carriers, flights
flights <- tbl(air, "flights")
colnames(flights)
## [1] "year" "month" "dayofmonth"
## [4] "dayofweek" "deptime" "crsdeptime"
## [7] "arrtime" "crsarrtime" "uniquecarrier"
## [10] "flightnum" "tailnum" "actualelapsedtime"
## [13] "crselapsedtime" "airtime" "arrdelay"
## [16] "depdelay" "origin" "dest"
## [19] "distance" "taxiin" "taxiout"
## [22] "cancelled" "cancellationcode" "diverted"
## [25] "carrierdelay" "weatherdelay" "nasdelay"
## [28] "securitydelay" "lateaircraftdelay"
carriers <- tbl(air, "carriers")
glimpse(carriers)
## Observations: 1,491
## Variables: 2
## $ code (chr) "04Q", "06Q", "09Q", "0CQ", "0GQ", "0J", "0KQ", "0...
## $ description (chr) "Tradewind Aviation", "Master Top Linhas Aereas Lt...
Before building the model, we need to get familiar with the data. We will conduct a brief exploratory data analysis on a small subset of the data.
We can reduce our data by extracting a subset of columns and rows. There are 123,534,969 records in the database. We want to extract a 1% sample so that we can explore patterns in the data before building a predictive model. We pull only the columns that we believe will be useful in the analysis.
We draw the sample by creating a random number for every record in the database, then keeping any record whose random number is less than 0.01 (i.e., 1%).
We will also create our target variable, “gain”, which is the difference between departure delay and arrival delay. A positive gain means that the duration of the plane flight was shorter than expected. Conversely, a negative gain means that the duration of the plane flight was longer than expected.
query1 <- flights %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
  mutate(gain = depdelay - arrdelay) %>%  # target variable: minutes made up
  mutate(x = random()) %>%                # one random number per record
  collapse() %>%                          # force a subquery so x is computed once
  filter(x < params$query1_frac)          # keep roughly a 1% sample
samp1 <- collect(query1)                  # execute on Redshift and pull into R
show_query(query1)
## <SQL>
## SELECT "year", "month", "arrdelay", "depdelay", "distance", "uniquecarrier", "gain", "x"
## FROM (SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", "depdelay" - "arrdelay" AS "gain", RANDOM() AS "x"
## FROM "flights") AS "zzz1"
## WHERE "x" < 0.01
The sample data has 1,237,144 rows and 8 columns. These sample data will be used to conduct an exploratory data analysis.
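These counts can be verified directly from the collected sample (assuming samp1 from the chunk above):
dim(samp1)  # rows and columns in the 1% sample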
NAs (i.e., missing values) are a constant nuisance in analyses. We will throw out any record containing an NA.
apply(is.na(samp1), 2, sum)
## year month arrdelay depdelay distance
## 0 0 26132 23226 2003
## uniquecarrier gain x
## 0 26132 0
samp2 <- samp1 %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance))
An official delay is one that is at least 15 minutes. We will filter out all records with delays under 15 minutes. Both departure and arrival delays are heavily skewed, with outliers at both extremes. We will arbitrarily limit our analysis to departure delays between +15 minutes and +4 hours, and arrival delays between -1 hour and +6 hours.
p1 <- ggplot(samp2, aes(depdelay)) +
  geom_density(fill = "lightblue") +
  geom_vline(xintercept = c(15, 240), col = 'tomato') +
  xlim(-15, 240)
p2 <- ggplot(samp2, aes(depdelay, arrdelay)) +
  geom_hex() +
  geom_vline(xintercept = c(15, 240), col = 'tomato') +
  geom_hline(yintercept = c(-60, 360), col = 'tomato')
grid.arrange(p1, p2, ncol = 2)
samp3 <- samp2 %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360)
We don't need to examine every year in the data to build a good model. In fact, more data can often be worse when the underlying process has changed, and gains changed drastically after 9/11. We will pull data only from 2003 to 2007, leaving 2008 as our forecast (i.e., test) year.
samp3_by_year <- samp3 %>%
  group_by(year) %>%
  summarize(gain = mean(gain)) %>%
  mutate(year = as.Date(paste(year, '01-01', sep = '-')))
with(samp3_by_year, as.xts(gain, year)) %>%
  dygraph(main = "Gain") %>%
  dyShading("2003-01-01", "2007-01-01") %>%
  dyRangeSelector() %>%
  dyEvent(date = "2001-09-11", "9/11/2001", labelLoc = "bottom")
samp4 <- samp3 %>%
  filter(year >= 2003 & year <= 2007)
The carrier table maps carrier codes to carrier descriptions. For example, the carrier code "WN" maps to "Southwest Airlines". This table will be useful for interpreting our data.
Even without building a model we can see which airlines have the largest average gains in our sampled data. Notice that we extract the carrier table using the "copy = TRUE" option in the "left_join" function. Small tables can easily be extracted from or copied to the database using this feature.
samp5 <- samp4 %>%
  left_join(carriers, by = c('uniquecarrier' = 'code'), copy = TRUE) %>%
  mutate(description = substr(description, 1, 25))
samp5 %>%
  group_by(description) %>%
  summarise(gain = mean(gain)) %>%
  ggplot(aes(gain, reorder(description, gain))) +
  geom_point() +
  labs(title = 'Gain', x = 'Minutes', y = '')
We will build a simplistic linear model to explain and predict time gains. We will predict gain as a function of flight distance, departure delay, and carrier. We want to know whether certain carriers show positive gains.
Using the filtering criteria derived in the exploratory data analysis, we pull a new random sample. We split this sample into two equally sized sets for training and validating the model.
query2 <- flights %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360) %>%
  filter(year >= 2003 & year <= 2007) %>%
  mutate(x = random()) %>%
  collapse() %>%                   # force a subquery so x is computed once
  filter(x < params$query2_frac) %>%
  left_join(carriers, by = c("uniquecarrier" = "code")) %>%
  mutate(gain = depdelay - arrdelay) %>%
  mutate(data = if(x < params$query2_frac / 2) 'train' else 'valid') %>%  # 50/50 split
  select(-x)
samp2 <- collect(query2)
samp2 <- mutate(samp2, uniquecarrier = factor(uniquecarrier))
train1 <- filter(samp2, data == 'train')
valid1 <- filter(samp2, data == 'valid')
show_query(query2)
show_query(query2)
## <SQL>
## SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", "code" AS "code", "description" AS "description", "depdelay" - "arrdelay" AS "gain", CASE WHEN "x" < 0.1 / 2.0 THEN 'train' ELSE 'valid' END AS "data"
## FROM (SELECT * FROM (SELECT "year", "month", "arrdelay", "depdelay", "distance", "uniquecarrier", "x"
## FROM (SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", RANDOM() AS "x"
## FROM "flights"
## WHERE NOT("arrdelay"IS NULL) AND NOT("depdelay"IS NULL) AND NOT("distance"IS NULL) AND "depdelay" > 15.0 AND "depdelay" < 240.0 AND "arrdelay" > - 60.0 AND "arrdelay" < 360.0 AND "year" >= 2003.0 AND "year" <= 2007.0) AS "zzz2"
## WHERE "x" < 0.1) AS "zzz3"
##
## LEFT JOIN
##
## (SELECT "code", "description"
## FROM "carriers") AS "zzz4"
##
## ON ("uniquecarrier" = "code")) AS "zzz5"
The training data has 292,808 rows and 10 columns. The validation data has 292,574 rows and 10 columns.
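As a quick check of those counts (assuming train1 and valid1 from the chunk above):
dim(train1)  # rows and columns in the training data
dim(valid1)  # rows and columns in the validation data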
We build a simplistic model to explain gain. Distance and departure delay enter as slopes (i.e., quantitative predictors) and carrier enters as an intercept (i.e., a qualitative predictor). Much more sophisticated models with more complex structures would outperform this model.
lm1 <- lm(gain ~ distance + depdelay + uniquecarrier, train1)
The F-value in the analysis of variance tells us that distance is highly significant, and that every additional mile in flight results in a slight time savings; longer flights have a greater opportunity to make up time. The ANOVA also tells us that departure delay and carrier are significant, but not as impactful as distance.
anova(lm1)
## Analysis of Variance Table
##
## Response: gain
## Df Sum Sq Mean Sq F value Pr(>F)
## distance 1 512825 512825 1620.65 < 2.2e-16 ***
## depdelay 1 184812 184812 584.05 < 2.2e-16 ***
## uniquecarrier 22 1617430 73520 232.34 < 2.2e-16 ***
## Residuals 292783 92645948 316
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table of coefficients tells us which carriers are most significant and whether their effects are positive or negative. Notice that not all carriers have significant effects; however, some carriers significantly outperform others. For example, WN (Southwest) and AS (Alaska) make up time, whereas DL (Delta) and AA (American) do the opposite.
data.frame(round(summary(lm1)$coefficients, 4)) %>%
  add_rownames() %>%
  mutate(code = substring(rowname, 14)) %>%
  left_join(carriers, copy = TRUE) %>%
  select(rowname, description, Estimate, Std..Error, t.value, Pr...t..) %>%
  datatable()
## Joining by: "code"
A quick examination of model residuals and fit statistics reveals that the model is underspecified. Residuals are bell shaped, but not normally distributed (i.e., skewed). The R-squared value of 0.0243791 is evidence that this model explains only a small percentage of the overall variation we see in the data.
train1 <- mutate(train1, resid = resid(lm1), pred = predict(lm1))
ind <- sample.int(nrow(lm1$model), 10000)
p3 <- qplot(resid, pred, data = train1[ind, ], geom = 'hex', main = 'Fitted vs Residuals')
p5 <- qplot(sample = resid, data = train1[ind, ], stat = 'qq', main = 'Residuals QQ Plot')
grid.arrange(p3, p5, ncol = 2)
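The R-squared quoted above comes directly from the model summary; a quick check, assuming lm1 is still in scope:
summary(lm1)$r.squared  # about 0.024: only a small share of the variation is explained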
Fortunately, the model performance on the validation data is similar to the performance on the training data. We did not regularize the model, but it seems to predict fairly well against the validation data: the RMSE is similar, and the performance by predicted decile is similar.
p6 <- data.frame(
  data = c('train', 'valid'),
  rmse = sqrt(c(mean(train1$resid^2), mean((valid1$gain - predict(lm1, valid1))^2)))
) %>%
  ggplot(aes(data, rmse, fill = data)) +
  geom_bar(stat = 'identity') +
  labs(title = 'RMSE', x = '', y = 'Minutes')
p7 <- samp2 %>%
  mutate(pred = predict(lm1, samp2)) %>%
  mutate(decile = ntile(pred, 10)) %>%
  group_by(data, decile) %>%
  summarize(gain = mean(gain)) %>%
  collect() %>%
  ggplot(aes(factor(decile), gain, fill = data)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(title = 'Average gain by predicted decile', x = 'Decile', y = 'Minutes')
grid.arrange(p6, p7, ncol = 2)
We now wish to assess our predictions on future, "out of time" data. We will use the carrier model to score all of the 2008 data in the database and then summarize actual and predicted gains by carrier.
We will use dplyr syntax to build the SQL, following this process: copy the model coefficients into the database as a lookup table, score every 2008 record against that table, summarize the gains by carrier, and collect the results into R.
Some data manipulation is required to translate the model coefficients into a lookup table. This lookup table will be copied to the database using the "left_join" function with the "copy" option.
coefs <- dummy.coef(lm1)          # model coefficients on the original scale
k <- length(coefs$uniquecarrier)  # number of carrier levels
coefs_lkp <- data.frame(
  uniquecarrier = names(coefs$uniquecarrier),
  carrier_score = coefs$uniquecarrier,
  int_score = rep(coefs$`(Intercept)`, k),
  dist_score = rep(coefs$distance, k),
  delay_score = rep(coefs$depdelay, k),
  row.names = NULL,
  stringsAsFactors = FALSE
)
head(coefs_lkp)
## uniquecarrier carrier_score int_score dist_score delay_score
## 1 9E 0.000000 -1.207566 0.003286263 -0.01646424
## 2 AA -2.330609 -1.207566 0.003286263 -0.01646424
## 3 AQ 2.287510 -1.207566 0.003286263 -0.01646424
## 4 AS 1.073973 -1.207566 0.003286263 -0.01646424
## 5 B6 -2.118141 -1.207566 0.003286263 -0.01646424
## 6 CO -2.515113 -1.207566 0.003286263 -0.01646424
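As a sanity check, the lookup table should reproduce the model predictions exactly. Below is a minimal sketch (assuming train1 and coefs_lkp from the chunks above; chk is an illustrative name):
chk <- train1[1:5, ] %>%
  left_join(coefs_lkp, by = 'uniquecarrier') %>%
  mutate(score = int_score + carrier_score +
           dist_score * distance + delay_score * depdelay)
all.equal(chk$score, unname(predict(lm1, train1[1:5, ])))  # should be TRUE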
Here we see the average gains by carrier for our forecast period (2008). Carriers Aloha and Southwest were accurately forecast to have the highest gains. Continental was erroneously forecast to have negative gains. Most of the prediction errors were within a few minutes. In general, the time lost or gained by the carriers is not large.
samp3 <- flights %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360) %>%
  filter(year == 2008) %>%
  mutate(gain = depdelay - arrdelay) %>%
  left_join(coefs_lkp, copy = TRUE) %>%  # copy the lookup table into the database
  left_join(carriers, by = c('uniquecarrier' = 'code')) %>%
  mutate(pred = int_score + carrier_score + dist_score * distance + delay_score * depdelay) %>%
  mutate(decile = ntile(pred, 10L)) %>%  # calculate predicted deciles
  #filter(decile == 10L) %>%             # filter by top decile
  group_by(description) %>%
  summarize(gain = mean(1.0 * gain), pred = mean(pred)) %>%
  collect()
ggplot(samp3, aes(gain, pred)) +
  geom_point(alpha = 0.75, color = 'red', shape = 3) +
  geom_abline(intercept = 0, slope = 1, alpha = 0.15, color = 'blue') +
  geom_text(aes(label = substr(description, 1, 20)), size = 3, alpha = 0.75, vjust = -1) +
  labs(title = 'Average Gains Forecast', x = 'Actual', y = 'Predicted')
The objective of our analysis was to investigate whether certain carriers gained time after a departure delay. We built a simplistic linear model against a subset of the data and then validated the model against a hold-out group. Finally, we assessed the model against an out-of-time sample within the database.
We found that distance was the most significant predictor of gained time. The longer the flight, the more time gained. We also found that carrier effects were significant, but less so than distance. Certain carriers had significant positive effects, which is evidence that some carriers gain time after a departure delay.
This model included only a few predictors and used a simple form. More sophisticated models using more predictors (e.g., adjusting for weather) might lead to more conclusive results. However, given the weak fit of this model, the size of the opportunity is questionable.
Benjamin Montet analyzed a similar data set with fewer carriers and fewer flights in this FiveThirtyEight.com blog post. He found that carriers do indeed make up small gains on some delayed flights, but did not find any significant differences between select carriers. In his blog post on Slate.com, Bryan Lowder makes the point that airlines tend to make up only a few minutes due to flight rules and fuel economics. Finally, Ritchie King and Nate Silver visualize the data in a general flight analysis. They identify the fastest routes and carriers, but without respect to flight delays.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is built on top of technology from the massively parallel processing (MPP) data warehouse ParAccel by Actian. Redshift differs from Amazon's other hosted database offering, Amazon RDS, in its ability to handle analytics workloads on large-scale datasets, which it stores using a column-oriented DBMS design.
An Amazon Redshift cluster is a set of nodes, which consists of a leader node and one or more compute nodes. The type and number of compute nodes that you need depends on the size of your data, the number of queries you will execute, and the query execution performance that you need.
Amazon Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Because it addresses very different requirements, the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. For example, where online transaction processing (OLTP) applications typically store data in rows, Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance. See more information here.
As well as working with local in-memory data like data frames and data tables, dplyr also works with remote on-disk data stored in databases. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle. The reason you’d want to use dplyr with a database is because either your data is already in a database (and you don’t want to work with static CSV files that someone else has dumped out for you), or you have so much data that it does not fit in memory and you have to use a database. Currently dplyr supports the three most popular open source databases (SQLite, MySQL and PostgreSQL), and Google’s BigQuery. See more information here.
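As a small illustration of this workflow, here is a minimal sketch using a local in-memory SQLite database (it requires the RSQLite package; the table, column names, and values are made up for the example):
library(dplyr)
db <- src_sqlite(':memory:', create = TRUE)
toy <- copy_to(db, data.frame(carrier = c('AA', 'WN'), gain = c(-2.3, 1.5)),
               name = 'toy_flights')
qry <- toy %>%
  filter(gain > 0) %>%  # translated to SQL, not yet executed
  select(carrier, gain)
show_query(qry)  # inspect the generated SQL
collect(qry)     # run the query and pull the result into R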