© Copyright 2021. All Rights Reserved.

Executive Summary

We seek to develop a useful Shiny Web app to compare both delays and total elapsed times with different airlines and to estimate likely delay at the destination, given only the delay at the origin and the scheduled elapsed time for nonstop U.S. flights. Length of queues for both taxi-in and taxi-out will impact delays. Moreover, there are other factors such as weather. Rather than study the causal relationships for such delays as in (Clewlow, Simaiakis, and Balakrishnan 2010), we focus on grouping data using suitable variables to create a good predictive model.

(Clewlow, Simaiakis, and Balakrishnan 2010) emphasize the importance of considering the lengths of taxi-in and taxi-out queues. Surely, such matters are considered by airlines and airports. Moreover, it is straightforward that for network optimization, to increase or decrease departures, there are good reasons for a corresponding change in arrivals. Furthermore, there are airline hubs to minimize network inefficiencies, which add additional restraints. There are also other important considerations such as consumer demand.

While it is important to study causal relationships, it is also useful to create good predictive models. Why is a predictive model needed? Because we do not expect simply adding the known delay at the origin to the scheduled arrival time will be a good model. The reason is the scheduled arrival time has padding to allow the airlines to achieve better performance. So our goal is simply to produce a better model than just using the delay and the scheduled elapsed time. We propose the following model:

\[\text{delay}_{arrival} = \text{delay}_{departure}+\overline{\text{airtime}}+\overline{\text{taxi-in}}+\overline{\text{taxi-out}}-\text{elapsed-time}\]

where

The scheduled elapsed time is typically fairly constant, but does change. For example the time of departure might change slightly but sufficiently to affect taxi queues, or perhaps a larger aircraft is used on certain days to allow for increased demand, or the airline “bakes in” additional time to improve on-time performance, which is known as schedule padding. Indeed, the delay at arrival is a biased measure of performance since the airlines can modify the elapsed time to improve their measure of on-time performance. A better measure for the passenger would be the average elapsed time. If the elapsed time is not constant, the user is provided an option to modify this time within the observed range. We avoid using scheduled departure and arrival times to simplify the interface and implementation.

The trick is choosing the variables and importantly how the grouping is done in order to compute estimates to form a good predictive model. We expect our model will benefit from “error cancellation” due to the fact that we are using multiple variables.

We suspect that the taxi-out time will largely depend on the time of departure, which is supported by (Clewlow, Simaiakis, and Balakrishnan 2010). Similarly, we expect the taxi-in time will depend on the time of arrival; however, we expect a good estimation is possible using only the departure time.

In particular, we also consider that the arrival time will have some dependency on the departure time. We guess, however, that changing the flight time is not done easily as there are many considerations such as fuel costs, weather, and so on. For example, is it possible that decreasing the flight time is more significant when the departure delay is within a certain range so that there is an expectation to reduce or even eliminate the arrival delay, and less significant otherwise. So we will take departure delay into consideration when computing averages.

We know there are significant variances in the variables, especially in the airtimes. Obviously, there are limitations on the accuracy of such models. Notwithstanding, we expect our model will be more accurate than simply using the departure delay and the scheduled elapsed time.

Download Data

We look only recent data as there are changes in the industry over time and so prior data may not be comparable to recent data. Unfortunately, the data was largely collected during the pandemic. Hence, the app should be updated during the next year approximately monthly since data is collected on a monthly basis. Subsequently, it would probably be sufficient to update quarterly or biannually.

Data was downloaded from the Bureau of Transportation Statistics website. In particular, suitable fields were selected for each month for the year 2020 and the first two months of 2021, which is the most recent data available. Performance data was downloaded using the link

https://www.bts.gov/

Data is not all contained in one table. Additional tables that are useful include the following lookup tables:

Unzip Downloaded Files

unzip_fn <- function(year,month,filename) {
  folder <- paste0("~/R/Flights/Data/",year,"/",month)
  setwd(folder)
  name <- paste0(filename,".zip")
  if (file.exists(name)) {
     print(paste0("Unzipping ",name,"..."))
     unzip(name)
     file.remove(name)
  }
}
 
filename <- "874451450_T_ONTIME_REPORTING"
year <- "2020"
for(month in 1:12) {
  unzip_fn(year=year,month=month,filename=filename)
}

year = "2021"
for(month in 1:2) {
  unzip_fn(year=year,month=month,filename=filename)
}

Read CSV Files

read_csv_fn <- function(year,month,filename) {
  folder <- paste0("~/R/Flights/Data/",year,"/",month)
  setwd(folder)
  name <- paste0(filename,".csv")
  if (file.exists(name)) {
     print(paste0("Reading ",name,"..."))
     dt <- read.csv(name)
     file.remove(name)
  }
  dt
}
 
filename <- "874451450_T_ONTIME_REPORTING"  
dtm <- list()
m <- 0L
year <- "2020"

for(month in 1:12) {
  m <- m + 1L
  dtm[[m]] <- read_csv_fn(year=year,month=month,filename=filename)
}

year = "2021"
for(month in 1:2) {
  m <- m + 1L
  dtm[[m]] <- read_csv_fn(year=year,month=month,filename=filename)
}

Save List of Data Frames

save(dtm, file="dtm.rda")

Merge Airport Names to Get Codes

folder <- "~/R/Flights/Data/"
setwd(folder)

name_airport_id <- "L_AIRPORT_ID.csv"
if (file.exists(name_airport_id)) {
  print(paste0("Reading ",name_airport_id,"..."))
  dtf_airport_id <- read.csv(name_airport_id)
}

name_airport <- "L_AIRPORT.csv"
if (file.exists(name_airport)) {
  print(paste0("Reading ",name_airport,"..."))
  dtf_airport <- read.csv(name_airport)
}

dtf_airport_code <- merge(dtf_airport,dtf_airport_id,by="Description")

dtf_airport_code$Description <- dtf_airport_code$Code.y
dtf_airport_code$Code.y <- NULL
colnames(dtf_airport_code) <- c("ID","code")
dtf_airport_code <- dtf_airport_code[ order( dtf_airport_code[1]) ,]

setwd("C:/Users/BiqueS/Documents/R/Flights")

library(dplyr)
dtf_airport_code <- dtf_airport_code  %>% distinct(ID, .keep_all= TRUE)
rownames(dtf_airport_code) <- 1:nrow(dtf_airport_code)

save(dtf_airport_code, file="dtf_airport_code.rda")

Form Data.Frame with Chosen Variables

It makes sense to compute statistics for each > origin and destination pair, > airline, and > time frame of departure.

For example, length of queues for both taxi-in and taxi-out will impact delays. We expect such delays will largely depending on the time of departure. We use to time frames as shown in Figure 4 of (Deshpande2012 and Arıkan 2012, 435).

get_dfa <- function(i,j) {
  
  setwd("~/R/Flights")
  source("get_index.R")
  load("dtm.rda")
  dtm <- dtm[i:j]
  load("dtf_airport_code.rda")
  

  len <- 0L
  for(li in 1:length(dtm)) {
    len <- len + nrow(dtm[[li]])
  }
  
  index <- vector(mode = "integer",length = len)
  code_o <- vector(mode = "character",length = len)
  city_o <- vector(mode = "character",length = len)
  code_d <- vector(mode = "character",length = len)
  city_d <- vector(mode = "character",length = len)
  airline <- vector(mode = "character",length = len)
  delay_o <- vector(mode = "integer",length = len)
  delay_d <- vector(mode = "integer",length = len)
  air_time <- vector(mode = "integer",length = len)
  taxi_o <- vector(mode = "integer",length = len)
  taxi_d <- vector(mode = "integer",length = len)
  schedule <- vector(mode = "integer",length = len)
  
  n <- 0L  
  for(df in dtm) {
    for(obs in 1:nrow(df)) {
      dep_delay <- df$DEP_DELAY[obs]
      
      if( !is.na(dep_delay) & !is.na(df$AIR_TIME[obs]) & !is.na(df$ARR_DELAY[obs]) ) {
        n <- n + 1L
        index[n]=get_index(df$DEP_TIME[obs])
        code_o[n]=dtf_airport_code$code[ which(dtf_airport_code$ID == df$ORIGIN_AIRPORT_ID[obs], arr.ind = TRUE) ]
        city_o[n]=df$ORIGIN_CITY_NAME[obs]
        code_d[n]=dtf_airport_code$code[ which(dtf_airport_code$ID == df$DEST_AIRPORT_ID[obs], arr.ind = TRUE) ]
        city_d[n]=df$DEST_CITY_NAME[obs]
        airline[n]=df$OP_UNIQUE_CARRIER[obs]
        delay_o[n]=as.integer(dep_delay)
        delay_d[n]=as.integer(df$ARR_DELAY[obs])
        air_time[n]=as.integer(df$AIR_TIME[obs])
        taxi_o[n]=as.integer(df$TAXI_OUT[obs])
        taxi_d[n]=as.integer(df$TAXI_IN[obs])
        schedule[n]=as.integer(df$CRS_ELAPSED_TIME[obs])
      }
    }
  }
  
  data.frame( index=index[1:n],    # time index 
              code_o=code_o[1:n],   # code of origin
              city_o=city_o[1:n],   # city and state abbreviation of origin
              code_d=code_d[1:n],   # code of destination
              city_d=city_d[1:n],   # city and state abbreviation of destination
              airline=airline[1:n],  # code of airline
              delay_o=delay_o[1:n],  # delay at origin until pushing out from gate
              delay_d=delay_d[1:n],  # delay at destination
              air_time=air_time[1:n], # time above ground
              taxi_o=taxi_o[1:n],   # taxi out
              taxi_d=taxi_d[1:n],   # taxi in
              schedule=schedule[1:n], # scheduled time
        row.names = 1:n )
}

load("dtm.rda")

# Use all data except for most recent month
len <- length(dtm) - 1

library(doParallel)  
no_cores <- detectCores() - 1 
cl <- makeCluster(no_cores)  
registerDoParallel(cl) 

chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)

# library(varhandle)
# varhandle::rm.all.but(keep = c("cl","chunk","first","last","no_cores"))

result <- c()
result <- foreach(i=1:no_cores) %dopar% get_dfa(first[i],last[i])

dfa <-  data.frame(matrix(ncol = 12, nrow = 0))
colnames(dfa) = c("index",    # time index 
                  "code_o",   # code of origin
                  "city_o",   # city and state abbreviation of origin
                  "code_d",   # code of destination
                  "city_d",   # city and state abbreviation of destination
                  "airline",  # code of airline
                  "delay_o",  # delay at origin until pushing out from gate
                  "delay_d",  # delay at destination
                  "air_time", # time above ground
                  "taxi_o",   # taxi out
                  "taxi_d",   # taxi in
                  "schedule") # scheduled time

for(i in 1:no_cores) {
    dfa <- rbind(dfa, result[[i]])
}

dfa <- dfa[ order(dfa[2], dfa[4], dfa[6], dfa[1]) ,]

# Split Data.Frame Using Factors

split.by <- c('code_o','code_d','airline','index')

df1 <- split(dfa, dfa[, split.by], drop=TRUE)

keys1 <- names(df1)

# Find data for last month

dfb <- get_dfa(length(dtm),length(dtm))

dfb <- dfb[ order(dfb[2], dfb[4], dfb[6], dfb[1]) ,]

# Split Data.Frame Using Factors

dfl <- split(dfb, dfb[, split.by], drop=TRUE)

# Merge data from prior months, provided there is data for last month

library(fastmatch)

keys <- names(dfl)

for( i in 1:length(dfl) ) {
  j <- which( keys1 %fin% keys[i], arr.ind = TRUE)
  if( length(j) ) {
    dfl[[i]] = rbind(df1[[j]],dfl[[i]])
  }
}
save(dfl, file = "dfl.rda")

Remove outliers

setwd("~/R/Flights")
load("dfl.rda")

remove_outliers <- function(i,j,s) {
  
  setwd("~/R/Flights")
  load("dfl.rda")
  current_dfl <- dfl[i:j]

  sum <- 0L
  dfl <- list()
  for( n in 1:length(current_dfl) ) {
      df <- current_dfl[[n]]
      rows <- nrow(df)
      if( rows >= 10 ) {
        Q    <- quantile(df$delay_d, probs=c(.25, .75), na.rm = TRUE)
        siqr <- IQR(df$delay_d, na.rm = TRUE)*s
        lb   <- Q[1] - siqr # Lower Range
        ub   <- Q[2] + siqr # Upper Range  
        df   <- df[ df$delay_d >= lb & df$delay_d <= ub ,]
        sum  <- rows - nrow(df) + sum
      }
      dfl[[n]] = df
  } 
  rs <- list()
  rs[[1]] = dfl
  rs[[2]] = sum
  rs
}

len <- length(dfl)

library(doParallel)  
no_cores <- detectCores() - 1 
cl <- makeCluster(no_cores)  
registerDoParallel(cl) 

chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)

x <- NULL
y <- NULL

delta = 0.015
s <- 1.5
for( loop in 1:100 ) {
  
  result <- c()
  result <- foreach(i=1:no_cores) %dopar% remove_outliers(first[i],last[i],s)
  
  curr_dfl <-  list()
  sum <- 0L
  for(i in 1:no_cores) {
    curr_dfl <- c(curr_dfl, result[[i]][[1]])
    sum <- sum + result[[i]][[2]]
  }
  
  ave_removed <- sum/length(curr_dfl)
  print(paste0("average removed: ",ave_removed))
  
  x <- c(x,s)
  y <- c(y,ave_removed)
  
  s <- s + delta
}

plot(y ~ x)
save(x,y, file="xy-outliers.rda")

obs <- which( y == min( y[y >= 2.5] ) )

result <- c()
result <- foreach(i=1:no_cores) %dopar% remove_outliers(first[i],last[i],x[obs])

curr_dfl <-  list()
sum <- 0L
for(i in 1:no_cores) {
  curr_dfl <- c(curr_dfl, result[[i]][[1]])
  sum <- sum + result[[i]][[2]]
}

ave_removed <- sum/length(curr_dfl)
print(paste0("average removed: ",ave_removed))

# [1] "average removed: 2.56098839189791"

for( i in 1:length(curr_dfl) ) {
   curr_dfl[[i]] <- na.omit(curr_dfl[[i]])
}  
save(curr_dfl, file="curr_dfl.rda")  

Partition into Test and Validate Lists

setwd("~/R/Flights")
load("curr_dfl.rda")

partition <- function() {
  
  setwd("~/R/Flights")
  load("curr_dfl.rda")

  set.seed(1960961)

  dfl <- list()
  valid_dfl <- list()
  k <- 0L
  for( n in 1:length(curr_dfl) ) {
      df <- curr_dfl[[n]]
      rows = nrow(df) 
      if( rows >= 30 ) {
        size <- floor(rows * 0.05)
        df <- df[ order( df[8] ) ,]
        indices  <- sort(sample(1:rows, size=size))
        remaining <- setdiff(1:rows, indices)
        df <- df[ remaining ,]
        k <- k + 1L
        valid_dfl[[k]] <- df[ indices ,]
      }
      dfl[[n]] = df
  } 
  rs <- list()
  rs[[1]] = dfl
  rs[[2]] = valid_dfl
  rs
}

result <- partition()

test_dfl  <- result[[1]]
valid_dfl <-  result[[2]]

ns <- NULL
for(i in 1:length(test_dfl)) {
  ns[i] = paste0(test_dfl[[i]]$code_o[1],".",test_dfl[[i]]$code_d[1],".",test_dfl[[i]]$airline[1],".",test_dfl[[i]]$index[1]) 
}
names(test_dfl) <- ns

ns <- NULL
for(i in 1:length(valid_dfl)) {
  ns[i] = paste0(valid_dfl[[i]]$code_o[1],".",valid_dfl[[i]]$code_d[1],".",valid_dfl[[i]]$airline[1],".",valid_dfl[[i]]$index[1]) 
}
names(valid_dfl) <- ns

save(test_dfl, file="test_dfl.rda")  
save(valid_dfl, file="valid_dfl.rda")  

Define Suitable Functions for Web App GUI

setwd("~/R/Flights")
load("test_dfl.rda")

pair <- NULL
airline <- NULL
frame <- NULL
for( i in 1:length(test_dfl) ) {
   s <- strsplit(names(test_dfl)[[i]],".",fixed=TRUE)
   pair[i] <- paste(s[[1]][1],s[[1]][2],sep = "-")
   airline[i] <- s[[1]][3]
   frame[i] <- s[[1]][4]
}
ndf <- data.frame(pair=pair,airline=airline,frame=frame)
ndf <- ndf[ order( ndf[1], ndf[2], ndf[3] ), ]
                  
city_pairs <- unique(ndf$pair)

# Given a city pair, provide a list of available airlines.
carriers <- NULL
for( i in 1:length(city_pairs)) {
  carriers[[i]] <- unique(ndf[ ndf$pair == city_pairs[i] ,]$airline)
}

# Fetch airline name in case user is not familiar with code
dfc <- read.csv("~/R/Flights/Data/L_UNIQUE_CARRIERS.csv")
dfc[!complete.cases(dfc),]
dfc <- na.omit(dfc) 

# Append descriptions to codes
for( i in 1:length(carriers) ) {
  for( j in 1:length(carriers[[i]])) {
    carriers[[i]][j] = paste(carriers[[i]][j],dfc$Description[ dfc$Code == carriers[[i]][j] ])
  }
}

save(carriers, file="carriers.rda")
save(city_pairs, file="city_pairs.rda")

Compare Departure Delay and Flight Time

By running various tests to study the relationship between air time and delays, e.g., by picking a random index using

set.seed(7410253) index <- sample(1:length(test_dfl),1)

and then plotting points, it is somewhat surprising that the departure delay does not seem to be well correlated to flight time. A delay of about 15 minutes is widely considered acceptable. It is reasonable to expect many factors affect travel time including queues and headwinds or tailwinds. While there does not appear to be a simple linear dependency on the departure delay, the behavior of the air times seems to depend on the magnitude of the delay. We are forced to recognize the limitations of our prediction model.

(Astaraky 2012) explains how to calculate the least squares estimation of regression lines that pass through the means of data:

  slope = cor(y,x)*sd(y)/sd(x)
  intercept = mean(y) - mean(x)*slope

where cor denotes the correlation and sd the standard deviation. To plot the points and the regression line we may use:

  plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Flight Time vs Delay at Origin", xlab = "Delay (min)", ylab = "Flight Time (minutes)")
  
  abline(coef = c(intercept,slope),col="gold")

Based on our work below, we partition the delays into four groups:

  delay <= -3 
  delay >  -3  &  delay <= 9
  delay >   9  &  delay <= 45
  delay >  45
  

and then choose (say) the “average air time” as determined by the regression line through the known points for the specified delay.

setwd("~/R/Flights")
load("test_dfl.rda")

# Compute MSE for partition of two
airtime_MSE <- function(s) {
  library(fastmatch)
  
  setwd("~/R/Flights")
  load("test_dfl.rda")
  
  mse_fun <- function(n,y,z) {
    residuals <- y - z 
    sum(residuals^2)/n
  }
  
  mse <- 0
  for(index in 1:length(test_dfl) ) { 

    n <- nrow(test_dfl[[index]])
    if( n > 1 ) {
      
      y <- test_dfl[[index]]$air_time  # dependent variable
      
      if( sd(y) > 0 ) {
        
        x <- test_dfl[[index]]$delay_o  # independent variable
        z <- vector(mode = "integer",length = n)
        
        if( sd(x) == 0 ) {
          for(i in 1:n) z[i] = mean(y)
        } else {
          I <- which( x <= s ) 
          remaining <- setdiff(1:n, I)
          
          constant <- rep(FALSE,n)
          
          if(length(I)==1) {
            constant[I] = TRUE
          } else if( length(I) ) {
            yi <- y[I]
            sdyi = sd(yi)
            if ( is.na(sdyi) ) {
              constant[I] = TRUE
            } else if( sdyi == 0) {
              constant[I] = TRUE
            } else {
              xi <- x[I]
              sdxi <- sd(xi)
              if( is.na(sdxi) ) {
                m_i <- 0
                y_i <- mean(yi)
              } else if( sdxi == 0 ) {
                m_i <- 0
                y_i <- mean(yi)
              } else {
                m_i <- cor(yi,xi)*sdyi/sdxi
                y_i <- mean(yi) - mean(xi)*m_i
              }
            }
          }

          if( length(remaining)==1 ) {
            constant[remaining] = TRUE
          } else if( length(remaining) ) {
            yr <- y[remaining]
            sdyr = sd(yr)
            if ( is.na(sdyr) ) {
              constant[remaining] = TRUE
            } else if ( sdyr == 0 ) {
              constant[remaining] = TRUE
            } else {
              xr <- x[remaining]
              sdxr <- sd(xr)
              if( is.na(sdxr) ) {
                m_r <- 0
                y_r <- mean(yr)
              } else if( sdxr == 0 ) {
                m_r <- 0
                y_r <- mean(yr)
              } else {
                m_r <- cor(yr,xr, use = "pairwise.complete.obs")*sdyr/sdxr
                y_r <- mean(yr) - mean(xr)*m_r
              }
            }
          }
          
          for( i in 1:n ) {
            if( constant[i] ) {
              z[i] = y[i]
            } else if( i %fin% I) {
              z[i] = x[i]*m_i + y_i
            } else {
              z[i] = x[i]*m_r + y_r
            }
          } 
          
        }
        
        mse <- mse_fun(n,y,z) + mse
      }
    }
  }
  mse
}

split <- c(10,13,15,17,20,25,30)
split <- c(2,3,5,6,7,8,9)
split <- c(-5,-4,-3,-2,-1,0,1)

# Results for airtime_MSE
# -5 1832269
# -4 1829355
# -3 1828241 **
# -2 1832354
# -1 1836346
#  0 1837128
#  1 1836505 *
#  2 1837126
#  3 1838657
#  5 1841121
#  6 1841846
#  7 1841874
#  8 1843245
#  9 1843812
# 10 1845601
# 13 1849871
# 15 1853149
# 17 1854660
# 20 1857350
# 25 1861798
# 30 1865797

# Compute MSE for partition of three
airtime_MSE_3 <- function(s) {
  library(fastmatch)
  
  setwd("~/R/Flights")
  load("test_dfl.rda")
  
  mse_fun <- function(n,y,z) {
    residuals <- y - z 
    sum(residuals^2)/n
  }
  
  mse <- 0
  for(index in 1:length(test_dfl)) {
    
    n <- nrow(test_dfl[[index]])
    if( n > 1 ) {
      
      y <- test_dfl[[index]]$air_time  # dependent variable
      
      if( sd(y) > 0 ) {
        
        x <- test_dfl[[index]]$delay_o  # independent variable
        z <- vector(mode = "integer",length = n)
        
        if( sd(x) == 0 ) {
          for(i in 1:n) z[i] = mean(y)
        } else {
          i1 <- which( x <= -3 ) 
          remaining <- setdiff(1:n, i1)
          
          constant <- rep(FALSE,n)
          
          if(length(i1)==1) {
            constant[i1] = TRUE
          } else if( length(i1) ) {
            yi <- y[i1]
            sdyi = sd(yi)
            if ( is.na(sdyi) ) {
              constant[i1] = TRUE
            } else if( sdyi == 0) {
              constant[i1] = TRUE
            } else {
              xi <- x[i1]
              sdxi <- sd(xi)
              if( is.na(sdxi) ) {
                m_i <- 0
                y_i <- mean(yi)
              } else if( sdxi == 0 ) {
                m_i <- 0
                y_i <- mean(yi)
              } else {
                m_i <- cor(yi,xi)*sdyi/sdxi
                y_i <- mean(yi) - mean(xi)*m_i
              }
            }
          }

          i2 <- NULL
          i3 <- NULL

          if( length(remaining)==1 ) {
            constant[remaining] = TRUE
          } else if( length(remaining) ) {
            
            i2 <- which( (x > -3) & (x <= s) ) 
            i3 <- setdiff(remaining, i2)
            
            ym <- y[i2]
            sdym = sd(ym)
            if ( is.na(sdym) ) {
              constant[i2] = TRUE
            } else if ( sdym == 0 ) {
              constant[i2] = TRUE
            } else {
              xm <- x[i2]
              sdxm <- sd(xm)
              if( is.na(sdxm) ) {
                m_m <- 0
                y_m <- mean(ym)
              } else if( sdxm == 0 ) {
                m_m <- 0
                y_m <- mean(ym)
              } else {
                m_m <- cor(ym,xm)*sdym/sdxm
                y_m <- mean(ym) - mean(xm)*m_m
              }
            }
          }

          if( length(i3)==1 ) {
            constant[i3] = TRUE
          } else if( length(i3) ) {
            
            yr <- y[i3]
            sdyr = sd(yr)
            if ( is.na(sdyr) ) {
              constant[i3] = TRUE
            } else if ( sdyr == 0 ) {
              constant[i3] = TRUE
            } else {
              xr <- x[i3]
              sdxr <- sd(xr)
              if( is.na(sdxr) ) {
                m_r <- 0
                y_r <- mean(yr)
              } else if( sdxr == 0 ) {
                m_r <- 0
                y_r <- mean(yr)
              } else {
                m_r <- cor(yr,xr)*sdyr/sdxr
                y_r <- mean(yr) - mean(xr)*m_r
              }
            }
          }
          
                    
          for( i in 1:n ) {
            if( constant[i] ) {
              z[i] = y[i]
            } else if( i %fin% i1) {
              z[i] = x[i]*m_i + y_i
            } else if( i %fin% i2) {
              z[i] = x[i]*m_m + y_m
            } else {
              z[i] = x[i]*m_r + y_r
            }
          } 
          
        }
        
        mse <- mse_fun(n,y,z) + mse
      }
    }
  }
  mse
}

split <- c(0,1,2,3,4,5,6)
split <- c(7,8,9,10,11,12,13)
split <- c(14,15,17,20,30,40,50)
#  0 1761049
#  1 1757005
#  2 1754582
#  3 1753424
#  4 1752308
#  5 1750663
#  6 1749384
#  7 1747244 *
#  8 1747603
#  9 1746866 **
# 10 1747005
# 14 1747717
# 15 1748537
# 17 1749051
# 20 1749932
# 30 1753722
# 40 1754047
# 50 1758126

# Compute MSE for partition of four
airtime_MSE_4 <- function(s) {
  library(fastmatch)
  
  setwd("~/R/Flights")
  load("test_dfl.rda")
  
  mse_fun <- function(n,y,z) {
    residuals <- y - z 
    sum(residuals^2)/n
  }
  
  mse <- 0
  for(index in 1:length(test_dfl)) {
    
    n <- nrow(test_dfl[[index]])
    if( n > 1 ) {
      
      y <- test_dfl[[index]]$air_time  # dependent variable
      
      if( sd(y) > 0 ) {
        
        x <- test_dfl[[index]]$delay_o  # independent variable
        z <- vector(mode = "integer",length = n)
        
        if( sd(x) == 0 ) {
          for(i in 1:n) z[i] = mean(y)
        } else {
          i1 <- which( x <= -3 ) 
          remaining <- setdiff(1:n, i1)
          
          constant <- rep(FALSE,n)
          
          if(length(i1)==1) {
            constant[i1] = TRUE
          } else if( length(i1) ) {
            yi <- y[i1]
            sdyi = sd(yi)
            if ( is.na(sdyi) ) {
              constant[i1] = TRUE
            } else if( sdyi == 0) {
              constant[i1] = TRUE
            } else {
              xi <- x[i1]
              sdxi <- sd(xi)
              if( is.na(sdxi) ) {
                m_i <- 0
                y_i <- mean(yi)
              } else if( sdxi == 0 ) {
                m_i <- 0
                y_i <- mean(yi)
              } else {
                m_i <- cor(yi,xi)*sdyi/sdxi
                y_i <- mean(yi) - mean(xi)*m_i
              }
            }
          }

          i2 <- NULL
          
          if( length(remaining)==1 ) {
            constant[remaining] = TRUE
          } else if( length(remaining) ) {
            
            i2 <- which( (x > -3) & (x <= 9) ) 
            remaining <- setdiff(remaining, i2)
            
            ym <- y[i2]
            sdym = sd(ym)
            if ( is.na(sdym) ) {
              constant[i2] = TRUE
            } else if ( sdym == 0 ) {
              constant[i2] = TRUE
            } else {
              xm <- x[i2]
              sdxm <- sd(xm)
              if( is.na(sdxm) ) {
                m_m <- 0
                y_m <- mean(ym)
              } else if( sdxm == 0 ) {
                m_m <- 0
                y_m <- mean(ym)
              } else {
                m_m <- cor(ym,xm)*sdym/sdxm
                y_m <- mean(ym) - mean(xm)*m_m
              }
            }
          }

          i3 <- NULL
          i4 <- NULL
          
          if( length(remaining)==1 ) {
            constant[remaining] = TRUE
          } else if( length(remaining) ) {
            
            i3 <- which( (x > 9) & (x <= s) ) 
            i4 <- setdiff(remaining, i3)
            
            yn <- y[i3]
            sdyn = sd(yn)
            if ( is.na(sdyn) ) {
              constant[i3] = TRUE
            } else if ( sdyn == 0 ) {
              constant[i3] = TRUE
            } else {
              xn <- x[i3]
              sdxn <- sd(xn)
              if( is.na(sdxn) ) {
                m_n <- 0
                y_n <- mean(yn)
              } else if( sdxn == 0 ) {
                m_n <- 0
                y_n <- mean(yn)
              } else {
                m_n <- cor(yn,xn)*sdyn/sdxn
                y_n <- mean(yn) - mean(xn)*m_n
              }
            }
          }

          if( length(i4)==1 ) {
            constant[i4] = TRUE
          } else if( length(i4) ) {
            
            yr <- y[i4]
            sdyr = sd(yr)
            if ( is.na(sdyr) ) {
              constant[i4] = TRUE
            } else if ( sdyr == 0 ) {
              constant[i4] = TRUE
            } else {
              xr <- x[i4]
              sdxr <- sd(xr)
              if( is.na(sdxr) ) {
                m_r <- 0
                y_r <- mean(yr)
              } else if( sdxr == 0 ) {
                m_r <- 0
                y_r <- mean(yr)
              } else {
                m_r <- cor(yr,xr)*sdyr/sdxr
                y_r <- mean(yr) - mean(xr)*m_r
              }
            }
          }
          
                    
          for( i in 1:n ) {
            if( constant[i] ) {
              z[i] = y[i]
            } else if( i %fin% i1) {
              z[i] = x[i]*m_i + y_i
            } else if( i %fin% i2) {
              z[i] = x[i]*m_m + y_m
            } else if( i %fin% i3) {
              z[i] = x[i]*m_n + y_n
            } else {
              z[i] = x[i]*m_r + y_r
            }
          } 
          
        }
        
        mse <- mse_fun(n,y,z) + mse
      }
    }
  }
  mse
}

split <- c(15,16,17,18,19,20,21)
split <- c(22,23,24,25,26,27,28)
split <- c(29,30,31,32,33,34,35)
split <- c(36,37,38,39,40,41,42)
split <- c(43,44,45,46,47,48,49)

# 15 1711090
# 16 1707864
# 17 1705538
# 18 1703373
# 19 1701842
# 20 1699819
# 21 1698438
# 22 1697610
# 23 1696122
# 24 1695470
# 25 1694692
# 26 1694029
# 27 1693694
# 28 1693407 *
# 29 1693529
# 30 1691840
# 31 1691322
# 32 1690059
# 33 1689852
# 34 1688384 *
# 35 1688408
# 36 1687753 *
# 37 1688155
# 38 1687681
# 39 1687097
# 40 1686973 *
# 41 1687046
# 42 1687295
# 43 1687130
# 44 1686404
# 45 1686116 **
# 46 1686341
# 47 1686705
# 48 1686996
# 49 1687258

library(doParallel)  
no_cores <- detectCores() - 1 
cl <- makeCluster(no_cores)  
registerDoParallel(cl) 

result <- c()
#result <- foreach(i=1:no_cores) %dopar% airtime_MSE(split[i])
#result <- foreach(i=1:no_cores) %dopar% airtime_MSE_3(split[i])
result <- foreach(i=1:no_cores) %dopar% airtime_MSE_4(split[i])

mse <- list()
for(i in 1:no_cores) {
    mse[i] <- result[[i]]
}
mse

Define functions to estimate air times given delay

airtimes <- function(i,j) {
  setwd("~/R/Flights")
  # load("test_dfl.rda")
  # test_dfl <- test_dfl[i:j]
  load("curr_dfl.rda")
  test_dfl <- curr_dfl[i:j]
  
  result <- list()
  zero <- rep(0,4)
  
  mat <- matrix (
    nrow =4,                   # determined by delay
    ncol =2,                   # 1=slope, 2=intercept
    byrow = TRUE,
    dimnames = list(1:4, c("slope","intercept")))
  
  for(index in 1:length(test_dfl)) {
     n <- nrow(test_dfl[[index]])
    
    if( n == 1 ) {
      mat[,1] = zero
      mat[,2] = rep(test_dfl[[index]]$air_time[1],4)
    } else {
      
      y <- test_dfl[[index]]$air_time
      
      if( sd(y) == 0 ) {
        mat[,1] = zero
        mat[,2] = rep(y[1],4)
      } else {
        x <- test_dfl[[index]]$delay_o
        
        if( sd(x) == 0 ) {
          mat[,1] = zero
          mat[,2] = rep(mean(y),4)
        } else {
          
          i1 <- which( x <= -3 )
          remaining <- setdiff(1:n, i1)
          
          if(length(i1)==0) {
            mat[1,] = c(0,mean(y))
          } else if(length(i1)==1) {
            mat[1,] = c(0,y[i1])
          } else {
            yi <- y[i1]
            sdyi = sd(yi)
            if ( sdyi == 0) {
              mat[1,] = c(0,yi[1])
            } else {
              xi <- x[i1]
              sdxi <- sd(xi)
              if( sdxi == 0 ) {
                mat[1,] = c(0,mean(yi))
              } else {
                slope <- cor(yi,xi)*sdyi/sdxi
                mat[1,] = c(slope, mean(yi) - mean(xi)*slope)
              }
            }
          }
          
          if( length(remaining)==0 ) {
            mat[2,] = c(0,mean(y))
            mat[3,] = mat[2,]
            mat[4,] = mat[2,]
          } else if( length(remaining)==1 ) {
            mat[2,] = c(0,y[remaining])
            mat[3,] = mat[2,]
            mat[4,] = mat[2,]
          } else {
            
            i2 <- which( (x > -3) & (x <= 9) )
            remaining <- setdiff(remaining, i2)
            
            if(length(i2)==0) {
              mat[2,] = mat[1,]
            } else if(length(i2)==1) {
              mat[2,] = c(0,y[i2])
            } else {
              
              ym <- y[i2]
              sdym = sd(ym)
              if ( sdym == 0 ) {
                mat[2,] = c(0,ym[1])
              } else {
                xm <- x[i2]
                sdxm <- sd(xm)
                if( sdxm == 0 ) {
                  mat[2,] = c(0, mean(ym))
                } else {
                  slope = cor(ym,xm)*sdym/sdxm
                  mat[2,] = c(slope, mean(ym) - mean(xm)*slope)
                }
              }
            }
            
            if( length(remaining)==0 ) {
              mat[3,] = mat[2,]
              mat[4,] = mat[2,]
            } else if( length(remaining)==1 ) {
              mat[3,] = c(0,y[remaining])
              mat[4,] = mat[3,]
            } else {
              
              i3 <- which( (x > 9) & (x <= 45) )
              i4 <- setdiff(remaining, i3)
              
              if(length(i3)==0) {
                mat[3,] =  mat[2,]
              } else if(length(i3)==1) {
                mat[3,] = c(0,y[i3])
              } else {
                yn <- y[i3]
                sdyn = sd(yn)
                if ( sdyn == 0 ) {
                  mat[3,] = c(0,yn[1])
                } else {
                  xn <- x[i3]
                  sdxn <- sd(xn)
                  if( sdxn == 0 ) {
                    mat[3,] = c(0, mean(yn))
                  } else {
                    slope = cor(yn,xn)*sdyn/sdxn
                    mat[3,] = c(slope, mean(yn) - mean(xn)*slope)
                  }
                }
              }
              
              if(length(i4)==0) {
                mat[4,] =  mat[3,]
              } else if( length(i4)==1 ) {
                mat[4,] = c(0,y[i4])
              } else {
                
                yr <- y[i4]
                sdyr = sd(yr)
                if ( sdyr == 0 ) {
                  mat[4,] = c(0,yr[1])
                } else {
                  xr <- x[i4]
                  sdxr <- sd(xr)
                  if( sdxr == 0 ) {
                    mat[4,] = c(0, mean(yr))
                  } else {
                    slope <- cor(yr,xr)*sdyr/sdxr
                    mat[4,] = c(slope, mean(yr) - mean(xr)*slope)
                  }
                }
              }
              
            }
          }
        }
      }
    }
    result[[index]] <- mat
  }
  result
}

len <- length(test_dfl)

library(doParallel)  
no_cores <- detectCores() - 1 
cl <- makeCluster(no_cores)  
registerDoParallel(cl) 

chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)

result <- c()
result <- foreach(i=1:no_cores) %dopar% airtimes(first[i],last[i])
  
airtimes <-  list()
for(i in 1:no_cores) {
    airtimes <- c(airtimes, result[[i]])
}

save(airtimes,file="airtimes.rda")

Compare Departure Delay and Arrival Delay

As expected, the arrival delay seems to be linearly dependent on departure delay.

load("test_dfl.rda")

index <- sample(1:length(test_dfl),1)

x <- test_dfl[[index]]$delay_o  # independent variable
y <- test_dfl[[index]]$delay_d  # dependent variable

obs <- which( x == min(x) )[1]

xx = x - x[obs]
yy = y - y[obs]
require(stats)
model <- # lm(xx ~ yy)  
         lm(x ~ y)


slope = cor(y,x)*sd(y)/sd(x)

plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Delay at Destination vs Delay at Origin", xlab = "Delay at Origin (min)", ylab = "Delay at Destination (minutes)")

abline(coefficients(model),col="gold")
abline(c(mean(y) - mean(x)*slope, slope ),col="red")
abline(c(mean(y), 0 ),col="purple")

Define functions to estimate arrival delay given departure delay, which we call “typical delay” for given departure delay

delays <- function(i,j) {
  setwd("~/R/Flights")
  # load("test_dfl.rda")
  # test_dfl <- test_dfl[i:j]
  load("curr_dfl.rda")
  test_dfl <- curr_dfl[i:j]
  
  result <- list()
  
  fn <- vector(mode = "numeric", length = 2)
  
  for(index in 1:length(test_dfl)) {
    
    n <- nrow(test_dfl[[index]])
    
    if( n == 1 ) {
      fn = c(0, test_dfl[[index]]$delay_d[1])
    } else {
      
      x <- test_dfl[[index]]$delay_o  # independent variable
      y <- test_dfl[[index]]$delay_d  # dependent variable
      
      sdy = sd(y)
      if( sdy == 0 ) {
        fn = c(0, y[1])  
      } else {
        
        sdx = sd(x)
        if( sdx == 0 ) {
          fn = c(0, mean(y))
        } else {
          slope <- cor(y,x)*sdy/sdx
          if( slope <= 0 ) {
            fn = c(0, mean(y))
          } else {
            fn = c(slope, mean(y) - mean(x)*slope )
          }
        }
      }
    }
    result[[index]] <- fn
  }
  result
}

len <- length(test_dfl)

library(doParallel)  
no_cores <- detectCores() - 1 
cl <- makeCluster(no_cores)  
registerDoParallel(cl) 

chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)

result <- c()
result <- foreach(i=1:no_cores) %dopar% delays(first[i],last[i])
  
delays <-  list()
for(i in 1:no_cores) {
    delays <- c(delays, result[[i]])
}

save(delays,file="delays.rda")

Compare Departure Delay and Taxi Out

As expected, the taxi out time seems to be linearly dependent on departure delay, although taxi times appear roughly constant over differences in delay at origin. The reason is that we expect queues depend on departure time, which depends on delay at origin. Leaving early may have a higher cost in terms of taxi out. Similarly, leaving late may have a beneficial effect in terms of taxi out.

load("test_dfl.rda")

index <- sample(1:length(test_dfl),1)

x <- test_dfl[[index]]$delay_o  # independent variable
y <- test_dfl[[index]]$taxi_o   # dependent variable

require(stats)
model <- lm(y ~ x)

slope = cor(y,x)*sd(y)/sd(x)

plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Taxi Out vs Delay at Origin", xlab = "Delay at Origin (min)", ylab = "Taxi Out (minutes)")

abline(coefficients(model),col="gold",lwd = 2 )
abline(c(mean(y) - mean(x)*slope, slope ),col="red",)
abline(c(mean(y), 0 ),col="purple")

Define Functions to Estimate Taxi Out Given Departure Delay

taxi <- function(i,j) {
  setwd("~/R/Flights")
  # load("test_dfl.rda")
  # test_dfl <- test_dfl[i:j]
  load("curr_dfl.rda")
  test_dfl <- curr_dfl[i:j]
  
  result <- list()
  
  fn <- vector(mode = "numeric", length = 3)
  
  for(index in 1:length(test_dfl)) {
    
    n <- nrow(test_dfl[[index]])
    
    if( n == 1 ) {
      v = test_dfl[[index]]$taxi_o[1]
      fn = c(0, v, v)
    } else {
      
      x <- test_dfl[[index]]$delay_o  # independent variable
      y <- test_dfl[[index]]$taxi_o   # dependent variable
      
      sdy = sd(y)
      if( sdy == 0 ) {
        fn = c(0, y[1], y[1])  
      } else {
        
        best = min(y)
        sdx = sd(x)
        if( sdx == 0 ) {
          fn = c(0, mean(y), best)
        } else {
          slope <- cor(y,x)*sdy/sdx
          fn = c(slope, mean(y) - mean(x)*slope, best)
        }
      }
    }
    result[[index]] <- fn
  }
  result
}

len <- length(test_dfl)

library(doParallel)  
no_cores <- detectCores() - 1 
cl <- makeCluster(no_cores)  
registerDoParallel(cl) 

chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)

result <- c()
result <- foreach(i=1:no_cores) %dopar% taxi(first[i],last[i])
  
taxi_out <-  list()
for(i in 1:no_cores) {
    taxi_out <- c(taxi_out, result[[i]])
}

save(taxi_out,file="taxi_out.rda")

Compare Departure Delay and Taxi In

As expected, the taxi in time seems to depend on departure delay, although taxi times appear roughly constant over differences in delay at origin. The reason is that we expect queues depend on arrival time, which depends on arrival delay, and this in turn depends on delay at origin.

index <- sample(1:length(test_dfl),1)

x <- test_dfl[[index]]$delay_o  # independent variable
y <- test_dfl[[index]]$taxi_d   # dependent variable

require(stats)
model <- lm(y ~ x)

slope = cor(y,x)*sd(y)/sd(x)

plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Taxi In vs Delay at Origin", xlab = "Delay at Origin (min)", ylab = "Taxi In (minutes)")

abline(coefficients(model),col="gold",lwd = 2 )
abline(c(mean(y) - mean(x)*slope, slope ),col="red",)
abline(c(mean(y), 0 ),col="purple")

Define Functions to Estimate Taxi In Given Departure Delay

taxi <- function(i,j) {
  setwd("~/R/Flights")
  # load("test_dfl.rda")
  # test_dfl <- test_dfl[i:j]
  load("curr_dfl.rda")
  test_dfl <- curr_dfl[i:j]
  
  result <- list()
  
  fn <- vector(mode = "numeric", length = 3)
  
  for(index in 1:length(test_dfl)) {
    
    n <- nrow(test_dfl[[index]])
    
    if( n == 1 ) {
      v = test_dfl[[index]]$taxi_d[1]
      fn = c(0, v, v)
    } else {
      
      x <- test_dfl[[index]]$delay_o
      y <- test_dfl[[index]]$taxi_d 
      
      sdy = sd(y)
      if( sdy == 0 ) {
        fn = c(0, y[1], y[1])  
      } else {
        
        best = min(y)
        sdx = sd(x)
        if( sdx == 0 ) {
          fn = c(0, mean(y), best)
        } else {
          slope <- cor(y,x)*sdy/sdx
          fn = c(slope, mean(y) - mean(x)*slope, best)
        }
      }
    }
    result[[index]] = fn
  }
  result
}

#len <- length(test_dfl)
len <- length(curr_dfl)

library(doParallel)  
no_cores <- detectCores() - 1 
cl <- makeCluster(no_cores)  
registerDoParallel(cl) 

chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)

result <- c()
result <- foreach(i=1:no_cores) %dopar% taxi(first[i],last[i])
  
taxi_in <-  list()
for(i in 1:no_cores) {
    taxi_in <- c(taxi_in, result[[i]])
}

save(taxi_in,file="taxi_in.rda")

Test using Test Data

We compute the average relative error for each of the following estimates:

We find our model objectively is slightly best

load("test_dfl.rda")
load("airtimes.rda")
load("taxi_in.rda")
load("taxi_out.rda")
load("delays.rda")

rel_err_model <- 0
rel_err_typical <- 0
rel_err_departure <- 0
N <- 0L
for(index in 1:length(test_dfl)) {
    df <- test_dfl[[index]]
    for(obs in 1:nrow(df)) {
      N <- N + 1L
      
      rel_err_departure = abs(df$delay_o[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_departure
      
      rel_err_typical = abs(delays[[index]][1]*df$delay_o[obs] + delays[[index]][2] - df$delay_d[obs])/df$schedule[obs] + rel_err_typical
      
      
      if( df$delay_o[obs] <= -3 ) {
        airtime = airtimes[[index]][1,1]*df$delay_o[obs] + airtimes[[index]][1,2]
      } else if( df$delay_o[obs] > -3  &  df$delay_o[obs] <= 9 ) {
        airtime = airtimes[[index]][2,1]*df$delay_o[obs] + airtimes[[index]][2,2]
      } else if( df$delay_o[obs] >  9  &  df$delay_o[obs] <= 45 ) {
        airtime = airtimes[[index]][3,1]*df$delay_o[obs] + airtimes[[index]][3,2]
      } else {
        airtime = airtimes[[index]][4,1]*df$delay_o[obs] + airtimes[[index]][4,2]
      }

      rel_err_model = abs(df$delay_o[obs] + airtime +
          max(taxi_out[[index]][1]*df$delay_o[obs] + taxi_out[[index]][2], taxi_out[[index]][3]) +
          max(taxi_in[[index]][1]*df$delay_o[obs] + taxi_in[[index]][2], taxi_in[[index]][3]) -
          df$schedule[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_model
      }
}

rel_err_model = rel_err_model/N*100
rel_err_typical = rel_err_typical/N*100
rel_err_departure = rel_err_departure/N*100

save(rel_err_model,rel_err_typical,rel_err_departure, file="rel_error_test.rda")
load("rel_error_test.rda")
print(paste0("rel_err_departure: ",rel_err_departure))
print(paste0("rel_err_typical:   ",rel_err_typical))
print(paste0("rel_err_model:     ",rel_err_model))
## [1] "rel_err_departure: 8.97785117738058"
## [1] "rel_err_typical:   6.18624834797005"
## [1] "rel_err_model:     6.05472806912474"

Test using Validate Data

We find the valid data yields the same objective ranking.

load("test_dfl.rda")
load("valid_dfl.rda")
load("airtimes.rda")
load("taxi_in.rda")
load("taxi_out.rda")
load("delays.rda")

rel_err_model <- 0
rel_err_typical <- 0
rel_err_departure <- 0
N <- 0L
for(index in 1:length(valid_dfl)) {
    df <-  valid_dfl[[index]]
    
    i <- which(names(test_dfl) == names(valid_dfl)[[index]], arr.ind = TRUE)
    
    for(obs in 1:nrow(df)) if( ! is.na(df$delay_d[obs]) ) {
      N <- N + 1L
      
      rel_err_departure = abs(df$delay_o[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_departure
      
      rel_err_typical = abs(delays[[i]][1]*df$delay_o[obs] + delays[[i]][2] - df$delay_d[obs])/df$schedule[obs] + rel_err_typical
      
      
      if( df$delay_o[obs] <= -3 ) {
        airtime = airtimes[[i]][1,1]*df$delay_o[obs] + airtimes[[i]][1,2]
      } else if( df$delay_o[obs] > -3  &  df$delay_o[obs] <= 9 ) {
        airtime = airtimes[[i]][2,1]*df$delay_o[obs] + airtimes[[i]][2,2]
      } else if( df$delay_o[obs] >  9  &  df$delay_o[obs] <= 45 ) {
        airtime = airtimes[[i]][3,1]*df$delay_o[obs] + airtimes[[i]][3,2]
      } else {
        airtime = airtimes[[i]][4,1]*df$delay_o[obs] + airtimes[[i]][4,2]
      }

      rel_err_model = abs(df$delay_o[obs] + airtime +
          max(taxi_out[[i]][1]*df$delay_o[obs] + taxi_out[[i]][2], taxi_out[[i]][3]) +
          max(taxi_in[[i]][1]*df$delay_o[obs] + taxi_in[[i]][2], taxi_in[[i]][3]) -
          df$schedule[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_model
      }
}

rel_err_model = rel_err_model/N*100
rel_err_typical = rel_err_typical/N*100
rel_err_departure = rel_err_departure/N*100

save(rel_err_model,rel_err_typical,rel_err_departure, file="rel_error_valid.rda")
load("rel_error_valid.rda")
print(paste0("rel_err_departure (valid): ",rel_err_departure))
print(paste0("rel_err_typical (valid):   ",rel_err_typical))
print(paste0("rel_err_model (valid):     ",rel_err_model))
## [1] "rel_err_departure (valid): 8.89015103059496"
## [1] "rel_err_typical (valid):   6.18376113880922"
## [1] "rel_err_model (valid):     6.09057138500245"

Test using full data set

load("curr_dfl.rda")
load("airtimes.rda")
load("taxi_in.rda")
load("taxi_out.rda")
load("delays.rda")

test_dfl <- curr_dfl

rel_err_model <- 0
rel_err_typical <- 0
rel_err_departure <- 0
N <- 0L
for(index in 1:length(test_dfl)) {
    df <- test_dfl[[index]]
    for(obs in 1:nrow(df)) {
      N <- N + 1L
      
      rel_err_departure = abs(df$delay_o[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_departure
      
      rel_err_typical = abs(delays[[index]][1]*df$delay_o[obs] + delays[[index]][2] - df$delay_d[obs])/df$schedule[obs] + rel_err_typical
      
      
      if( df$delay_o[obs] <= -3 ) {
        airtime = airtimes[[index]][1,1]*df$delay_o[obs] + airtimes[[index]][1,2]
      } else if( df$delay_o[obs] > -3  &  df$delay_o[obs] <= 9 ) {
        airtime = airtimes[[index]][2,1]*df$delay_o[obs] + airtimes[[index]][2,2]
      } else if( df$delay_o[obs] >  9  &  df$delay_o[obs] <= 45 ) {
        airtime = airtimes[[index]][3,1]*df$delay_o[obs] + airtimes[[index]][3,2]
      } else {
        airtime = airtimes[[index]][4,1]*df$delay_o[obs] + airtimes[[index]][4,2]
      }

      rel_err_model = abs(df$delay_o[obs] + airtime +
          max(taxi_out[[index]][1]*df$delay_o[obs] + taxi_out[[index]][2], taxi_out[[index]][3]) +
          max(taxi_in[[index]][1]*df$delay_o[obs] + taxi_in[[index]][2], taxi_in[[index]][3]) -
          df$schedule[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_model
      }
}

rel_err_model = rel_err_model/N*100
rel_err_typical = rel_err_typical/N*100
rel_err_departure = rel_err_departure/N*100

save(rel_err_model,rel_err_typical,rel_err_departure, file="rel_error_full.rda")
load("rel_error_full.rda")
print(paste0("rel_err_departure (full): ",rel_err_departure))
print(paste0("rel_err_typical (full):   ",rel_err_typical))
print(paste0("rel_err_model (full):     ",rel_err_model))
## [1] "rel_err_departure (full): 8.97658214040976"
## [1] "rel_err_typical (full):   6.18652700244265"
## [1] "rel_err_model (full):     6.05640167465032"

Define Average Delays and Range of Flight Times

load("curr_dfl.rda")
len <- length(curr_dfl)

ave_delay_o <- vector(mode = "integer",length = len)
ave_delay_d <- vector(mode = "integer",length = len)
max_air_time <- vector(mode = "integer",length = len)
min_air_time <- vector(mode = "integer",length = len)

for(n in 1:length(curr_dfl)) {
  ave_delay_o[n] = sum(curr_dfl[[n]]$delay_o)/length(curr_dfl[[n]]$delay_o)
  ave_delay_d[n] = sum(curr_dfl[[n]]$delay_d)/length(curr_dfl[[n]]$delay_d)
  max_air_time[n] = max(curr_dfl[[n]]$air_time)
  min_air_time[n] = min(curr_dfl[[n]]$air_time)
}
save(ave_delay_d,ave_delay_o,max_air_time,min_air_time, file="statsByindex.rda")

Conclusion

We introduced a new model and shown that it yields the best estimate of the delay at arrival based on relative error. RStudio provides a fairly easy and fast interface to perform all needed database operations. The chosen BTS database is suitable to demonstrate the power of R because the data is in multiple tables and it is necessary to clean the data to handle missing data.

References

Astaraky, and Davood. 2012. “Least Squares and Linear Regression.” June 2012. https://rstudio-pubs-static.s3.amazonaws.com/84980_e31e48c2b510473997425ac43021e830.html.
Clewlow, Regina R., Ioannis Simaiakis, and Hamsa Balakrishnan. 2010. “Impact of Arrivals on Departure Taxi Operations at Airports.” https://doi.org/10.2514/6.2010-7698.
Deshpande2012, Vinayak, and Mazhar Arıkan. 2012. “The Impact of Airline Flight Schedules on Flight Delays.” INFORMS Manufacturing & Service Operations Management. https://doi.org/10.1287/msom.1120.0379.