Introduction

Objective

Is there evidence to suggest that some airline carriers make up time once their flight is delayed? An official flight delay is one that departs more than 15 minutes after the scheduled departure time. This analysis compares the difference between scheduled flight time and actual flight time for delayed flights against various predictors.

About the data

The data consist of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. There are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed. The data source can be found here.

Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is built on top of technology from the massive parallel processing (MPP) data warehouse ParAccel by Actian. Redshift differs from Amazon’s other hosted database offering, Amazon RDS, by being able to handle analytics workloads on large scale datasets stored by a column-oriented DBMS principle.

An Amazon Redshift cluster is a set of nodes, which consists of a leader node and one or more compute nodes. The type and number of compute nodes that you need depends on the size of your data, the number of queries you will execute, and the query execution performance that you need.

Amazon Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Because it addresses very different requirements, the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. For example, where online transaction processing (OLTP) applications typically store data in rows, Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance.

Markus Schmidberger wrote a nice guide for uploading the Air On Time data into a Redshift database. For more information see his post on the AWS Big Data Blog titled Connecting R With Amazon Redshift.

dplyr

As well as working with local in-memory data like data frames and data tables, dplyr also works with remote on-disk data stored in databases. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle. The reason you’d want to use dplyr with a database is because either your data is already in a database (and you don’t want to work with static csv files that someone else has dumped out for you), or you have so much data that it does not fit in memory and you have to use a database. Currently dplyr supports the three most popular open source databases (sqlite, mysql and postgresql), and google’s bigquery. See more information here.

Summary

We will use dplyr to connect to Amazon Redshift. Tools used in this exercise:

  • Amazon Redshift (Data Warehouse)
  • R (Analytic engine)
  • dplyr + DBI (Connectors to Redshift)
  • ggplot2 (Visualization)
  • rmarkdown (Publishing)

Setup

Required Packages

library(dplyr)
library(ggplot2)
library(dygraphs)
library(xts)
library(knitr)
library(DT)
library(gridExtra)
rm(list=ls())
gc()
op <- list(paging = F, info = F, ordering = F, searching = F, bLengthChange = F)

Connect to the database

air <- src_postgres(dbname = 'mydb', 
                    host = 'strata1.cjku7otn8uia.us-west-2.redshift.amazonaws.com', 
                    port = 5439, 
                    user = 'guest', 
                    password = 'ABCd4321')

Connect to tables

airontime3 <- tbl(air, "airontime3")
dim(airontime3)
## [1] 123534969        29
colnames(airontime3)
##  [1] "year"              "month"             "dayofmonth"       
##  [4] "dayofweek"         "deptime"           "crsdeptime"       
##  [7] "arrtime"           "crsarrtime"        "uniquecarrier"    
## [10] "flightnum"         "tailnum"           "actualelapsedtime"
## [13] "crselapsedtime"    "airtime"           "arrdelay"         
## [16] "depdelay"          "origin"            "dest"             
## [19] "distance"          "taxiin"            "taxiout"          
## [22] "cancelled"         "cancellationcode"  "diverted"         
## [25] "carrierdelay"      "weatherdelay"      "nasdelay"         
## [28] "securitydelay"     "lateaircraftdelay"
carriers3 <- tbl(air, "carriers3")
head(carriers3)
##   code                          description
## 1  02Q                        Titan Airways
## 2  07Q                  Flair Airlines Ltd.
## 3  0FQ Maine Aviation Aircraft Charter, LLC
## 4  0JQ                      Vision Airlines
## 5   0Q                  Flying Service N.V.
## 6   2E                  Smokey Bay Air Inc.

Random Sample

There are 123,534,969 records in the database. We desire to extract a 1% sample so that we can explore patterns in the data prior to building a predictive model. The columns we are most intersted in are:

We draw the sample by creating random numbers [0, 1] for all records in the database, then taking any random number that is less than 0.01.

query1 <- airontime3 %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
  mutate(diff = arrdelay - depdelay) %>%
  mutate(x = random()) %>%
  collapse() %>%
  filter(x < 0.01)
show_query(query1)
## <SQL>
## SELECT "year", "month", "arrdelay", "depdelay", "distance", "uniquecarrier", "diff", "x"
## FROM (SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", "arrdelay" - "depdelay" AS "diff", RANDOM() AS "x"
## FROM "airontime3") AS "zzz1"
## WHERE "x" < 0.01
samp1 <- collect(query1)
dim(samp1)
## [1] 1237077       8

Exploratory Data Analysis

Filter NA’s

apply(is.na(samp1), 2, sum)
##          year         month      arrdelay      depdelay      distance 
##             0             0         25905         23053          2050 
## uniquecarrier          diff             x 
##             0         25905             0
samp2 <- samp1 %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance))

Filter Outliers

p1 <- ggplot(samp2, aes(depdelay)) + 
  geom_density(fill="lightblue") + 
  geom_vline(xintercept=c(15, 240), col='tomato') + 
  xlim(-15, 240)

p2 <- ggplot(samp2, aes(depdelay, arrdelay)) + 
  geom_hex() +
  geom_vline(xintercept = c(15, 240), col = 'tomato') +
  geom_hline(yintercept = c(-60, 360), col = 'tomato')

grid.arrange(p1, p2, ncol=2)
samp3 <- samp2 %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360)

Filter Years

samp3_by_year <- samp3 %>%
  group_by(year) %>%
  summarize(diff = mean(diff)) %>%
  mutate(year = as.Date(paste(year, '01-01', sep = '-')))

with(samp3_by_year, as.xts(diff, year)) %>% 
  dygraph(main = "Arrival Delay - Departure Delay") %>%
  dyShading("2003-01-01", "2007-01-01") %>%
  dyRangeSelector()

samp4 <- samp3 %>%
  filter(year >= 2003 & year <= 2007)

Add Carrier Description

samp5 <- samp4 %>% 
  left_join(carriers3, by = c('uniquecarrier' = 'code'), copy = TRUE) %>%
  mutate(description = substr(description, 1, 30))

samp5 %>% 
  group_by(description) %>% 
  summarise(diff = mean(diff)) %>% 
  ggplot(aes(diff,reorder(description,diff))) + 
  geom_point() +
  labs(title = 'ArrDelay - DepDelay', x = 'Minutes', y = '')

Training Data (all one query)

query2 <- airontime3 %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
  filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
  filter(depdelay > 15 & depdelay < 240) %>%
  filter(arrdelay > -60 & arrdelay < 360) %>%
  filter(year >= 2003 & year <= 2007) %>%
  mutate(x = random()) %>%
  collapse() %>%
  filter(x < 0.10) %>%
  select(-x) %>%
  left_join(carriers3, by = c("uniquecarrier" = "code")) %>%
  mutate(diff = arrdelay - depdelay) %>%
  rename(carriername = description)

show_query(query2)
## <SQL>
## SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", "code" AS "code", "description" AS "carriername", "arrdelay" - "depdelay" AS "diff"
## FROM (SELECT * FROM (SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier"
## FROM (SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", RANDOM() AS "x"
## FROM "airontime3"
## WHERE NOT("arrdelay"IS NULL) AND NOT("depdelay"IS NULL) AND NOT("distance"IS NULL) AND "depdelay" > 15.0 AND "depdelay" < 240.0 AND "arrdelay" >  - 60.0 AND "arrdelay" < 360.0 AND "year" >= 2003.0 AND "year" <= 2007.0) AS "zzz2"
## WHERE "x" < 0.1) AS "zzz3"
## 
## LEFT JOIN 
## 
## (SELECT "code", "description"
## FROM "carriers3") AS "zzz4"
## 
## ON ("uniquecarrier" = "code")) AS "zzz5"
train1 <- collect(query2)
dim(train1)
## [1] 586292      9

Linear Model

Model Prep: Create Factors

train2 <- train1 %>%
  mutate(year = factor(year)) %>%
  mutate(month = factor(month)) %>%
  mutate(uniquecarrier = factor(uniquecarrier))

Model

lm1 <- lm(diff ~ distance + depdelay + uniquecarrier, train2)
anova(lm1)
## Analysis of Variance Table
## 
## Response: diff
##                   Df    Sum Sq Mean Sq F value    Pr(>F)    
## distance           1    988769  988769 3115.48 < 2.2e-16 ***
## depdelay           1    341662  341662 1076.53 < 2.2e-16 ***
## uniquecarrier     22   3279128  149051  469.64 < 2.2e-16 ***
## Residuals     586267 186065396     317                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm1)
## 
## Call:
## lm(formula = diff ~ distance + depdelay + uniquecarrier, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -217.13   -9.78   -2.72    5.56  317.57 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.876e-01  2.876e-01   3.434 0.000594 ***
## distance        -3.208e-03  4.735e-05 -67.763  < 2e-16 ***
## depdelay         1.553e-02  5.771e-04  26.905  < 2e-16 ***
## uniquecarrierAA  2.548e+00  2.955e-01   8.623  < 2e-16 ***
## uniquecarrierAQ -3.129e+00  8.082e-01  -3.872 0.000108 ***
## uniquecarrierAS -9.306e-01  3.177e-01  -2.929 0.003404 ** 
## uniquecarrierB6  2.322e+00  3.297e-01   7.043 1.89e-12 ***
## uniquecarrierCO  2.700e+00  3.095e-01   8.722  < 2e-16 ***
## uniquecarrierDH -2.305e+00  3.265e-01  -7.059 1.67e-12 ***
## uniquecarrierDL  2.896e+00  2.984e-01   9.708  < 2e-16 ***
## uniquecarrierEV -1.729e+00  3.013e-01  -5.739 9.53e-09 ***
## uniquecarrierF9  1.284e+00  4.040e-01   3.178 0.001483 ** 
## uniquecarrierFL  1.698e+00  3.116e-01   5.448 5.09e-08 ***
## uniquecarrierHA  2.083e-02  7.105e-01   0.029 0.976615    
## uniquecarrierHP -2.115e+00  3.468e-01  -6.098 1.07e-09 ***
## uniquecarrierMQ  2.199e+00  2.963e-01   7.421 1.16e-13 ***
## uniquecarrierNW  2.938e+00  3.021e-01   9.725  < 2e-16 ***
## uniquecarrierOH -6.219e-01  3.102e-01  -2.005 0.044961 *  
## uniquecarrierOO -4.215e-01  3.000e-01  -1.405 0.160040    
## uniquecarrierTZ  5.803e+00  4.467e-01  12.990  < 2e-16 ***
## uniquecarrierUA  9.514e-01  2.988e-01   3.184 0.001453 ** 
## uniquecarrierUS  8.507e-01  2.997e-01   2.838 0.004539 ** 
## uniquecarrierWN -3.667e+00  2.909e-01 -12.606  < 2e-16 ***
## uniquecarrierXE  2.965e+00  3.027e-01   9.794  < 2e-16 ***
## uniquecarrierYV -2.984e+00  3.285e-01  -9.083  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.81 on 586267 degrees of freedom
## Multiple R-squared:  0.02417,    Adjusted R-squared:  0.02414 
## F-statistic: 605.2 on 24 and 586267 DF,  p-value: < 2.2e-16

Model fit

ind <- sample.int(nrow(lm1$model), 10000)
par(mfrow=c(2,2))
p3 <- qplot(resid(lm1)[ind], fitted(lm1)[ind], geom = 'hex')
p4 <- qplot(resid(lm1)[ind], geom='density')
p5 <-qplot(sample = resid(lm1)[ind], stat = 'qq')
grid.arrange(p3, p4, p5, ncol=3, nrow=1)

Score and summarize by year

Coefficient Lookup Table

coefs <- dummy.coef(lm1)
k <- length(coefs$uniquecarrier)
coefs_lkp <- data.frame(
  uniquecarrier = names(coefs$uniquecarrier),
  carrier_score = coefs$uniquecarrier,
  int_score = rep(coefs$`(Intercept)`, k),
  dist_score = rep(coefs$distance, k),
  row.names=NULL, stringsAsFactors = FALSE
)
datatable(coefs_lkp)

Score and compute RMSE by year

query3 <- airontime3 %>%
  select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
  filter(!is.na(arrdelay)) %>%
  filter(!is.na(depdelay)) %>%
  filter(year >= 2002) %>%
  filter(depdelay > 15) %>%
  filter(depdelay < 240) %>%
  mutate(diff = arrdelay - depdelay) %>%
  left_join(coefs_lkp, copy = TRUE) %>%
  mutate(pred = int_score + carrier_score + dist_score * distance) %>%
  group_by(year) %>%
  summarize(rmse = sqrt(mean((diff - pred)^2))) %>%
  arrange(year)

datatable(collect(query3), op)