Is there evidence to suggest that some airline carriers make up time once their flight is delayed? An official flight delay is one that departs more than 15 minutes after the scheduled departure time. This analysis compares the difference between scheduled flight time and actual flight time for delayed flights against various predictors.
The data consist of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. There are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed. The data source can be found here.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is built on top of technology from the massive parallel processing (MPP) data warehouse ParAccel by Actian. Redshift differs from Amazon’s other hosted database offering, Amazon RDS, by being able to handle analytics workloads on large scale datasets stored by a column-oriented DBMS principle.
An Amazon Redshift cluster is a set of nodes, which consists of a leader node and one or more compute nodes. The type and number of compute nodes that you need depends on the size of your data, the number of queries you will execute, and the query execution performance that you need.
Amazon Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Because it addresses very different requirements, the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. For example, where online transaction processing (OLTP) applications typically store data in rows, Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance.
Markus Schmidberger wrote a nice guide for uploading the Air On Time data into a Redshift database. For more information see his post on the AWS Big Data Blog titled Connecting R With Amazon Redshift.
As well as working with local in-memory data like data frames and data tables, dplyr also works with remote on-disk data stored in databases. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle. The reason you’d want to use dplyr with a database is because either your data is already in a database (and you don’t want to work with static csv files that someone else has dumped out for you), or you have so much data that it does not fit in memory and you have to use a database. Currently dplyr supports the three most popular open source databases (sqlite, mysql and postgresql), and google’s bigquery. See more information here.
We will use dplyr to connect to Amazon Redshift. Tools used in this exercise:
library(dplyr)
library(ggplot2)
library(dygraphs)
library(xts)
library(knitr)
library(DT)
library(gridExtra)
rm(list=ls())
gc()
op <- list(paging = F, info = F, ordering = F, searching = F, bLengthChange = F)
air <- src_postgres(dbname = 'mydb',
host = 'strata1.cjku7otn8uia.us-west-2.redshift.amazonaws.com',
port = 5439,
user = 'guest',
password = 'ABCd4321')
airontime3 <- tbl(air, "airontime3")
dim(airontime3)
## [1] 123534969 29
colnames(airontime3)
## [1] "year" "month" "dayofmonth"
## [4] "dayofweek" "deptime" "crsdeptime"
## [7] "arrtime" "crsarrtime" "uniquecarrier"
## [10] "flightnum" "tailnum" "actualelapsedtime"
## [13] "crselapsedtime" "airtime" "arrdelay"
## [16] "depdelay" "origin" "dest"
## [19] "distance" "taxiin" "taxiout"
## [22] "cancelled" "cancellationcode" "diverted"
## [25] "carrierdelay" "weatherdelay" "nasdelay"
## [28] "securitydelay" "lateaircraftdelay"
carriers3 <- tbl(air, "carriers3")
head(carriers3)
## code description
## 1 02Q Titan Airways
## 2 07Q Flair Airlines Ltd.
## 3 0FQ Maine Aviation Aircraft Charter, LLC
## 4 0JQ Vision Airlines
## 5 0Q Flying Service N.V.
## 6 2E Smokey Bay Air Inc.
There are 123,534,969 records in the database. We desire to extract a 1% sample so that we can explore patterns in the data prior to building a predictive model. The columns we are most intersted in are:
We draw the sample by creating random numbers [0, 1] for all records in the database, then taking any random number that is less than 0.01.
query1 <- airontime3 %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
mutate(diff = arrdelay - depdelay) %>%
mutate(x = random()) %>%
collapse() %>%
filter(x < 0.01)
show_query(query1)
## <SQL>
## SELECT "year", "month", "arrdelay", "depdelay", "distance", "uniquecarrier", "diff", "x"
## FROM (SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", "arrdelay" - "depdelay" AS "diff", RANDOM() AS "x"
## FROM "airontime3") AS "zzz1"
## WHERE "x" < 0.01
samp1 <- collect(query1)
dim(samp1)
## [1] 1237077 8
apply(is.na(samp1), 2, sum)
## year month arrdelay depdelay distance
## 0 0 25905 23053 2050
## uniquecarrier diff x
## 0 25905 0
samp2 <- samp1 %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance))
p1 <- ggplot(samp2, aes(depdelay)) +
geom_density(fill="lightblue") +
geom_vline(xintercept=c(15, 240), col='tomato') +
xlim(-15, 240)
p2 <- ggplot(samp2, aes(depdelay, arrdelay)) +
geom_hex() +
geom_vline(xintercept = c(15, 240), col = 'tomato') +
geom_hline(yintercept = c(-60, 360), col = 'tomato')
grid.arrange(p1, p2, ncol=2)
samp3 <- samp2 %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360)
samp3_by_year <- samp3 %>%
group_by(year) %>%
summarize(diff = mean(diff)) %>%
mutate(year = as.Date(paste(year, '01-01', sep = '-')))
with(samp3_by_year, as.xts(diff, year)) %>%
dygraph(main = "Arrival Delay - Departure Delay") %>%
dyShading("2003-01-01", "2007-01-01") %>%
dyRangeSelector()
samp4 <- samp3 %>%
filter(year >= 2003 & year <= 2007)
samp5 <- samp4 %>%
left_join(carriers3, by = c('uniquecarrier' = 'code'), copy = TRUE) %>%
mutate(description = substr(description, 1, 30))
samp5 %>%
group_by(description) %>%
summarise(diff = mean(diff)) %>%
ggplot(aes(diff,reorder(description,diff))) +
geom_point() +
labs(title = 'ArrDelay - DepDelay', x = 'Minutes', y = '')
query2 <- airontime3 %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
filter(!is.na(arrdelay) & !is.na(depdelay) & !is.na(distance)) %>%
filter(depdelay > 15 & depdelay < 240) %>%
filter(arrdelay > -60 & arrdelay < 360) %>%
filter(year >= 2003 & year <= 2007) %>%
mutate(x = random()) %>%
collapse() %>%
filter(x < 0.10) %>%
select(-x) %>%
left_join(carriers3, by = c("uniquecarrier" = "code")) %>%
mutate(diff = arrdelay - depdelay) %>%
rename(carriername = description)
show_query(query2)
## <SQL>
## SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", "code" AS "code", "description" AS "carriername", "arrdelay" - "depdelay" AS "diff"
## FROM (SELECT * FROM (SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier"
## FROM (SELECT "year" AS "year", "month" AS "month", "arrdelay" AS "arrdelay", "depdelay" AS "depdelay", "distance" AS "distance", "uniquecarrier" AS "uniquecarrier", RANDOM() AS "x"
## FROM "airontime3"
## WHERE NOT("arrdelay"IS NULL) AND NOT("depdelay"IS NULL) AND NOT("distance"IS NULL) AND "depdelay" > 15.0 AND "depdelay" < 240.0 AND "arrdelay" > - 60.0 AND "arrdelay" < 360.0 AND "year" >= 2003.0 AND "year" <= 2007.0) AS "zzz2"
## WHERE "x" < 0.1) AS "zzz3"
##
## LEFT JOIN
##
## (SELECT "code", "description"
## FROM "carriers3") AS "zzz4"
##
## ON ("uniquecarrier" = "code")) AS "zzz5"
train1 <- collect(query2)
dim(train1)
## [1] 586292 9
train2 <- train1 %>%
mutate(year = factor(year)) %>%
mutate(month = factor(month)) %>%
mutate(uniquecarrier = factor(uniquecarrier))
lm1 <- lm(diff ~ distance + depdelay + uniquecarrier, train2)
anova(lm1)
## Analysis of Variance Table
##
## Response: diff
## Df Sum Sq Mean Sq F value Pr(>F)
## distance 1 988769 988769 3115.48 < 2.2e-16 ***
## depdelay 1 341662 341662 1076.53 < 2.2e-16 ***
## uniquecarrier 22 3279128 149051 469.64 < 2.2e-16 ***
## Residuals 586267 186065396 317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm1)
##
## Call:
## lm(formula = diff ~ distance + depdelay + uniquecarrier, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -217.13 -9.78 -2.72 5.56 317.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.876e-01 2.876e-01 3.434 0.000594 ***
## distance -3.208e-03 4.735e-05 -67.763 < 2e-16 ***
## depdelay 1.553e-02 5.771e-04 26.905 < 2e-16 ***
## uniquecarrierAA 2.548e+00 2.955e-01 8.623 < 2e-16 ***
## uniquecarrierAQ -3.129e+00 8.082e-01 -3.872 0.000108 ***
## uniquecarrierAS -9.306e-01 3.177e-01 -2.929 0.003404 **
## uniquecarrierB6 2.322e+00 3.297e-01 7.043 1.89e-12 ***
## uniquecarrierCO 2.700e+00 3.095e-01 8.722 < 2e-16 ***
## uniquecarrierDH -2.305e+00 3.265e-01 -7.059 1.67e-12 ***
## uniquecarrierDL 2.896e+00 2.984e-01 9.708 < 2e-16 ***
## uniquecarrierEV -1.729e+00 3.013e-01 -5.739 9.53e-09 ***
## uniquecarrierF9 1.284e+00 4.040e-01 3.178 0.001483 **
## uniquecarrierFL 1.698e+00 3.116e-01 5.448 5.09e-08 ***
## uniquecarrierHA 2.083e-02 7.105e-01 0.029 0.976615
## uniquecarrierHP -2.115e+00 3.468e-01 -6.098 1.07e-09 ***
## uniquecarrierMQ 2.199e+00 2.963e-01 7.421 1.16e-13 ***
## uniquecarrierNW 2.938e+00 3.021e-01 9.725 < 2e-16 ***
## uniquecarrierOH -6.219e-01 3.102e-01 -2.005 0.044961 *
## uniquecarrierOO -4.215e-01 3.000e-01 -1.405 0.160040
## uniquecarrierTZ 5.803e+00 4.467e-01 12.990 < 2e-16 ***
## uniquecarrierUA 9.514e-01 2.988e-01 3.184 0.001453 **
## uniquecarrierUS 8.507e-01 2.997e-01 2.838 0.004539 **
## uniquecarrierWN -3.667e+00 2.909e-01 -12.606 < 2e-16 ***
## uniquecarrierXE 2.965e+00 3.027e-01 9.794 < 2e-16 ***
## uniquecarrierYV -2.984e+00 3.285e-01 -9.083 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.81 on 586267 degrees of freedom
## Multiple R-squared: 0.02417, Adjusted R-squared: 0.02414
## F-statistic: 605.2 on 24 and 586267 DF, p-value: < 2.2e-16
ind <- sample.int(nrow(lm1$model), 10000)
par(mfrow=c(2,2))
p3 <- qplot(resid(lm1)[ind], fitted(lm1)[ind], geom = 'hex')
p4 <- qplot(resid(lm1)[ind], geom='density')
p5 <-qplot(sample = resid(lm1)[ind], stat = 'qq')
grid.arrange(p3, p4, p5, ncol=3, nrow=1)
coefs <- dummy.coef(lm1)
k <- length(coefs$uniquecarrier)
coefs_lkp <- data.frame(
uniquecarrier = names(coefs$uniquecarrier),
carrier_score = coefs$uniquecarrier,
int_score = rep(coefs$`(Intercept)`, k),
dist_score = rep(coefs$distance, k),
row.names=NULL, stringsAsFactors = FALSE
)
datatable(coefs_lkp)
query3 <- airontime3 %>%
select(year, month, arrdelay, depdelay, distance, uniquecarrier) %>%
filter(!is.na(arrdelay)) %>%
filter(!is.na(depdelay)) %>%
filter(year >= 2002) %>%
filter(depdelay > 15) %>%
filter(depdelay < 240) %>%
mutate(diff = arrdelay - depdelay) %>%
left_join(coefs_lkp, copy = TRUE) %>%
mutate(pred = int_score + carrier_score + dist_score * distance) %>%
group_by(year) %>%
summarize(rmse = sqrt(mean((diff - pred)^2))) %>%
arrange(year)
datatable(collect(query3), op)