© Copyright 2021. All Rights Reserved.
We seek to develop a useful Shiny Web app to compare both delays and total elapsed times with different airlines and to estimate likely delay at the destination, given only the delay at the origin and the scheduled elapsed time for nonstop U.S. flights. Length of queues for both taxi-in and taxi-out will impact delays. Moreover, there are other factors such as weather. Rather than study the causal relationships for such delays as in (Clewlow, Simaiakis, and Balakrishnan 2010), we focus on grouping data using suitable variables to create a good predictive model.
(Clewlow, Simaiakis, and Balakrishnan 2010) emphasize the importance of considering the lengths of taxi-in and taxi-out queues. Surely, such matters are considered by airlines and airports. Moreover, it is straightforward that for network optimization, to increase or decrease departures, there are good reasons for a corresponding change in arrivals. Furthermore, there are airline hubs to minimize network inefficiencies, which add additional restraints. There are also other important considerations such as consumer demand.
While it is important to study causal relationships, it is also useful to create good predictive models. Why is a predictive model needed? Because we do not expect simply adding the known delay at the origin to the scheduled arrival time will be a good model. The reason is the scheduled arrival time has padding to allow the airlines to achieve better performance. So our goal is simply to produce a better model than just using the delay and the scheduled elapsed time. We propose the following model:
\[\text{delay}_{arrival} = \text{delay}_{departure}+\overline{\text{airtime}}+\overline{\text{taxi-in}}+\overline{\text{taxi-out}}-\text{elapsed-time}\]
where
The scheduled elapsed time is typically fairly constant, but does change. For example the time of departure might change slightly but sufficiently to affect taxi queues, or perhaps a larger aircraft is used on certain days to allow for increased demand, or the airline “bakes in” additional time to improve on-time performance, which is known as schedule padding. Indeed, the delay at arrival is a biased measure of performance since the airlines can modify the elapsed time to improve their measure of on-time performance. A better measure for the passenger would be the average elapsed time. If the elapsed time is not constant, the user is provided an option to modify this time within the observed range. We avoid using scheduled departure and arrival times to simplify the interface and implementation.
The trick is choosing the variables and importantly how the grouping is done in order to compute estimates to form a good predictive model. We expect our model will benefit from “error cancellation” due to the fact that we are using multiple variables.
We suspect that the taxi-out time will largely depend on the time of departure, which is supported by (Clewlow, Simaiakis, and Balakrishnan 2010). Similarly, we expect the taxi-in time will depend on the time of arrival; however, we expect a good estimation is possible using only the departure time.
In particular, we also consider that the arrival time will have some dependency on the departure time. We guess, however, that changing the flight time is not done easily as there are many considerations such as fuel costs, weather, and so on. For example, is it possible that decreasing the flight time is more significant when the departure delay is within a certain range so that there is an expectation to reduce or even eliminate the arrival delay, and less significant otherwise. So we will take departure delay into consideration when computing averages.
We know there are significant variances in the variables, especially in the airtimes. Obviously, there are limitations on the accuracy of such models. Notwithstanding, we expect our model will be more accurate than simply using the departure delay and the scheduled elapsed time.
We look only recent data as there are changes in the industry over time and so prior data may not be comparable to recent data. Unfortunately, the data was largely collected during the pandemic. Hence, the app should be updated during the next year approximately monthly since data is collected on a monthly basis. Subsequently, it would probably be sufficient to update quarterly or biannually.
Data was downloaded from the Bureau of Transportation Statistics website. In particular, suitable fields were selected for each month for the year 2020 and the first two months of 2021, which is the most recent data available. Performance data was downloaded using the link
Data is not all contained in one table. Additional tables that are useful include the following lookup tables:
unzip_fn <- function(year,month,filename) {
folder <- paste0("~/R/Flights/Data/",year,"/",month)
setwd(folder)
name <- paste0(filename,".zip")
if (file.exists(name)) {
print(paste0("Unzipping ",name,"..."))
unzip(name)
file.remove(name)
}
}
filename <- "874451450_T_ONTIME_REPORTING"
year <- "2020"
for(month in 1:12) {
unzip_fn(year=year,month=month,filename=filename)
}
year = "2021"
for(month in 1:2) {
unzip_fn(year=year,month=month,filename=filename)
}
read_csv_fn <- function(year,month,filename) {
folder <- paste0("~/R/Flights/Data/",year,"/",month)
setwd(folder)
name <- paste0(filename,".csv")
if (file.exists(name)) {
print(paste0("Reading ",name,"..."))
dt <- read.csv(name)
file.remove(name)
}
dt
}
filename <- "874451450_T_ONTIME_REPORTING"
dtm <- list()
m <- 0L
year <- "2020"
for(month in 1:12) {
m <- m + 1L
dtm[[m]] <- read_csv_fn(year=year,month=month,filename=filename)
}
year = "2021"
for(month in 1:2) {
m <- m + 1L
dtm[[m]] <- read_csv_fn(year=year,month=month,filename=filename)
}
save(dtm, file="dtm.rda")
folder <- "~/R/Flights/Data/"
setwd(folder)
name_airport_id <- "L_AIRPORT_ID.csv"
if (file.exists(name_airport_id)) {
print(paste0("Reading ",name_airport_id,"..."))
dtf_airport_id <- read.csv(name_airport_id)
}
name_airport <- "L_AIRPORT.csv"
if (file.exists(name_airport)) {
print(paste0("Reading ",name_airport,"..."))
dtf_airport <- read.csv(name_airport)
}
dtf_airport_code <- merge(dtf_airport,dtf_airport_id,by="Description")
dtf_airport_code$Description <- dtf_airport_code$Code.y
dtf_airport_code$Code.y <- NULL
colnames(dtf_airport_code) <- c("ID","code")
dtf_airport_code <- dtf_airport_code[ order( dtf_airport_code[1]) ,]
setwd("C:/Users/BiqueS/Documents/R/Flights")
library(dplyr)
dtf_airport_code <- dtf_airport_code %>% distinct(ID, .keep_all= TRUE)
rownames(dtf_airport_code) <- 1:nrow(dtf_airport_code)
save(dtf_airport_code, file="dtf_airport_code.rda")
It makes sense to compute statistics for each > origin and destination pair, > airline, and > time frame of departure.
For example, length of queues for both taxi-in and taxi-out will impact delays. We expect such delays will largely depending on the time of departure. We use to time frames as shown in Figure 4 of (Deshpande2012 and Arıkan 2012, 435).
get_dfa <- function(i,j) {
setwd("~/R/Flights")
source("get_index.R")
load("dtm.rda")
dtm <- dtm[i:j]
load("dtf_airport_code.rda")
len <- 0L
for(li in 1:length(dtm)) {
len <- len + nrow(dtm[[li]])
}
index <- vector(mode = "integer",length = len)
code_o <- vector(mode = "character",length = len)
city_o <- vector(mode = "character",length = len)
code_d <- vector(mode = "character",length = len)
city_d <- vector(mode = "character",length = len)
airline <- vector(mode = "character",length = len)
delay_o <- vector(mode = "integer",length = len)
delay_d <- vector(mode = "integer",length = len)
air_time <- vector(mode = "integer",length = len)
taxi_o <- vector(mode = "integer",length = len)
taxi_d <- vector(mode = "integer",length = len)
schedule <- vector(mode = "integer",length = len)
n <- 0L
for(df in dtm) {
for(obs in 1:nrow(df)) {
dep_delay <- df$DEP_DELAY[obs]
if( !is.na(dep_delay) & !is.na(df$AIR_TIME[obs]) & !is.na(df$ARR_DELAY[obs]) ) {
n <- n + 1L
index[n]=get_index(df$DEP_TIME[obs])
code_o[n]=dtf_airport_code$code[ which(dtf_airport_code$ID == df$ORIGIN_AIRPORT_ID[obs], arr.ind = TRUE) ]
city_o[n]=df$ORIGIN_CITY_NAME[obs]
code_d[n]=dtf_airport_code$code[ which(dtf_airport_code$ID == df$DEST_AIRPORT_ID[obs], arr.ind = TRUE) ]
city_d[n]=df$DEST_CITY_NAME[obs]
airline[n]=df$OP_UNIQUE_CARRIER[obs]
delay_o[n]=as.integer(dep_delay)
delay_d[n]=as.integer(df$ARR_DELAY[obs])
air_time[n]=as.integer(df$AIR_TIME[obs])
taxi_o[n]=as.integer(df$TAXI_OUT[obs])
taxi_d[n]=as.integer(df$TAXI_IN[obs])
schedule[n]=as.integer(df$CRS_ELAPSED_TIME[obs])
}
}
}
data.frame( index=index[1:n], # time index
code_o=code_o[1:n], # code of origin
city_o=city_o[1:n], # city and state abbreviation of origin
code_d=code_d[1:n], # code of destination
city_d=city_d[1:n], # city and state abbreviation of destination
airline=airline[1:n], # code of airline
delay_o=delay_o[1:n], # delay at origin until pushing out from gate
delay_d=delay_d[1:n], # delay at destination
air_time=air_time[1:n], # time above ground
taxi_o=taxi_o[1:n], # taxi out
taxi_d=taxi_d[1:n], # taxi in
schedule=schedule[1:n], # scheduled time
row.names = 1:n )
}
load("dtm.rda")
# Use all data except for most recent month
len <- length(dtm) - 1
library(doParallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)
# library(varhandle)
# varhandle::rm.all.but(keep = c("cl","chunk","first","last","no_cores"))
result <- c()
result <- foreach(i=1:no_cores) %dopar% get_dfa(first[i],last[i])
dfa <- data.frame(matrix(ncol = 12, nrow = 0))
colnames(dfa) = c("index", # time index
"code_o", # code of origin
"city_o", # city and state abbreviation of origin
"code_d", # code of destination
"city_d", # city and state abbreviation of destination
"airline", # code of airline
"delay_o", # delay at origin until pushing out from gate
"delay_d", # delay at destination
"air_time", # time above ground
"taxi_o", # taxi out
"taxi_d", # taxi in
"schedule") # scheduled time
for(i in 1:no_cores) {
dfa <- rbind(dfa, result[[i]])
}
dfa <- dfa[ order(dfa[2], dfa[4], dfa[6], dfa[1]) ,]
# Split Data.Frame Using Factors
split.by <- c('code_o','code_d','airline','index')
df1 <- split(dfa, dfa[, split.by], drop=TRUE)
keys1 <- names(df1)
# Find data for last month
dfb <- get_dfa(length(dtm),length(dtm))
dfb <- dfb[ order(dfb[2], dfb[4], dfb[6], dfb[1]) ,]
# Split Data.Frame Using Factors
dfl <- split(dfb, dfb[, split.by], drop=TRUE)
# Merge data from prior months, provided there is data for last month
library(fastmatch)
keys <- names(dfl)
for( i in 1:length(dfl) ) {
j <- which( keys1 %fin% keys[i], arr.ind = TRUE)
if( length(j) ) {
dfl[[i]] = rbind(df1[[j]],dfl[[i]])
}
}
save(dfl, file = "dfl.rda")
setwd("~/R/Flights")
load("dfl.rda")
remove_outliers <- function(i,j,s) {
setwd("~/R/Flights")
load("dfl.rda")
current_dfl <- dfl[i:j]
sum <- 0L
dfl <- list()
for( n in 1:length(current_dfl) ) {
df <- current_dfl[[n]]
rows <- nrow(df)
if( rows >= 10 ) {
Q <- quantile(df$delay_d, probs=c(.25, .75), na.rm = TRUE)
siqr <- IQR(df$delay_d, na.rm = TRUE)*s
lb <- Q[1] - siqr # Lower Range
ub <- Q[2] + siqr # Upper Range
df <- df[ df$delay_d >= lb & df$delay_d <= ub ,]
sum <- rows - nrow(df) + sum
}
dfl[[n]] = df
}
rs <- list()
rs[[1]] = dfl
rs[[2]] = sum
rs
}
len <- length(dfl)
library(doParallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)
x <- NULL
y <- NULL
delta = 0.015
s <- 1.5
for( loop in 1:100 ) {
result <- c()
result <- foreach(i=1:no_cores) %dopar% remove_outliers(first[i],last[i],s)
curr_dfl <- list()
sum <- 0L
for(i in 1:no_cores) {
curr_dfl <- c(curr_dfl, result[[i]][[1]])
sum <- sum + result[[i]][[2]]
}
ave_removed <- sum/length(curr_dfl)
print(paste0("average removed: ",ave_removed))
x <- c(x,s)
y <- c(y,ave_removed)
s <- s + delta
}
plot(y ~ x)
save(x,y, file="xy-outliers.rda")
obs <- which( y == min( y[y >= 2.5] ) )
result <- c()
result <- foreach(i=1:no_cores) %dopar% remove_outliers(first[i],last[i],x[obs])
curr_dfl <- list()
sum <- 0L
for(i in 1:no_cores) {
curr_dfl <- c(curr_dfl, result[[i]][[1]])
sum <- sum + result[[i]][[2]]
}
ave_removed <- sum/length(curr_dfl)
print(paste0("average removed: ",ave_removed))
# [1] "average removed: 2.56098839189791"
for( i in 1:length(curr_dfl) ) {
curr_dfl[[i]] <- na.omit(curr_dfl[[i]])
}
save(curr_dfl, file="curr_dfl.rda")
setwd("~/R/Flights")
load("curr_dfl.rda")
partition <- function() {
setwd("~/R/Flights")
load("curr_dfl.rda")
set.seed(1960961)
dfl <- list()
valid_dfl <- list()
k <- 0L
for( n in 1:length(curr_dfl) ) {
df <- curr_dfl[[n]]
rows = nrow(df)
if( rows >= 30 ) {
size <- floor(rows * 0.05)
df <- df[ order( df[8] ) ,]
indices <- sort(sample(1:rows, size=size))
remaining <- setdiff(1:rows, indices)
df <- df[ remaining ,]
k <- k + 1L
valid_dfl[[k]] <- df[ indices ,]
}
dfl[[n]] = df
}
rs <- list()
rs[[1]] = dfl
rs[[2]] = valid_dfl
rs
}
result <- partition()
test_dfl <- result[[1]]
valid_dfl <- result[[2]]
ns <- NULL
for(i in 1:length(test_dfl)) {
ns[i] = paste0(test_dfl[[i]]$code_o[1],".",test_dfl[[i]]$code_d[1],".",test_dfl[[i]]$airline[1],".",test_dfl[[i]]$index[1])
}
names(test_dfl) <- ns
ns <- NULL
for(i in 1:length(valid_dfl)) {
ns[i] = paste0(valid_dfl[[i]]$code_o[1],".",valid_dfl[[i]]$code_d[1],".",valid_dfl[[i]]$airline[1],".",valid_dfl[[i]]$index[1])
}
names(valid_dfl) <- ns
save(test_dfl, file="test_dfl.rda")
save(valid_dfl, file="valid_dfl.rda")
setwd("~/R/Flights")
load("test_dfl.rda")
pair <- NULL
airline <- NULL
frame <- NULL
for( i in 1:length(test_dfl) ) {
s <- strsplit(names(test_dfl)[[i]],".",fixed=TRUE)
pair[i] <- paste(s[[1]][1],s[[1]][2],sep = "-")
airline[i] <- s[[1]][3]
frame[i] <- s[[1]][4]
}
ndf <- data.frame(pair=pair,airline=airline,frame=frame)
ndf <- ndf[ order( ndf[1], ndf[2], ndf[3] ), ]
city_pairs <- unique(ndf$pair)
# Given a city pair, provide a list of available airlines.
carriers <- NULL
for( i in 1:length(city_pairs)) {
carriers[[i]] <- unique(ndf[ ndf$pair == city_pairs[i] ,]$airline)
}
# Fetch airline name in case user is not familiar with code
dfc <- read.csv("~/R/Flights/Data/L_UNIQUE_CARRIERS.csv")
dfc[!complete.cases(dfc),]
dfc <- na.omit(dfc)
# Append descriptions to codes
for( i in 1:length(carriers) ) {
for( j in 1:length(carriers[[i]])) {
carriers[[i]][j] = paste(carriers[[i]][j],dfc$Description[ dfc$Code == carriers[[i]][j] ])
}
}
save(carriers, file="carriers.rda")
save(city_pairs, file="city_pairs.rda")
By running various tests to study the relationship between air time and delays, e.g., by picking a random index using
set.seed(7410253) index <- sample(1:length(test_dfl),1)
and then plotting points, it is somewhat surprising that the departure delay does not seem to be well correlated to flight time. A delay of about 15 minutes is widely considered acceptable. It is reasonable to expect many factors affect travel time including queues and headwinds or tailwinds. While there does not appear to be a simple linear dependency on the departure delay, the behavior of the air times seems to depend on the magnitude of the delay. We are forced to recognize the limitations of our prediction model.
(Astaraky 2012) explains how to calculate the least squares estimation of regression lines that pass through the means of data:
slope = cor(y,x)*sd(y)/sd(x)
intercept = mean(y) - mean(x)*slope
where cor denotes the correlation and sd the standard deviation. To plot the points and the regression line we may use:
plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Flight Time vs Delay at Origin", xlab = "Delay (min)", ylab = "Flight Time (minutes)")
abline(coef = c(intercept,slope),col="gold")
Based on our work below, we partition the delays into four groups:
delay <= -3
delay > -3 & delay <= 9
delay > 9 & delay <= 45
delay > 45
and then choose (say) the “average air time” as determined by the regression line through the known points for the specified delay.
setwd("~/R/Flights")
load("test_dfl.rda")
# Compute MSE for partition of two
airtime_MSE <- function(s) {
library(fastmatch)
setwd("~/R/Flights")
load("test_dfl.rda")
mse_fun <- function(n,y,z) {
residuals <- y - z
sum(residuals^2)/n
}
mse <- 0
for(index in 1:length(test_dfl) ) {
n <- nrow(test_dfl[[index]])
if( n > 1 ) {
y <- test_dfl[[index]]$air_time # dependent variable
if( sd(y) > 0 ) {
x <- test_dfl[[index]]$delay_o # independent variable
z <- vector(mode = "integer",length = n)
if( sd(x) == 0 ) {
for(i in 1:n) z[i] = mean(y)
} else {
I <- which( x <= s )
remaining <- setdiff(1:n, I)
constant <- rep(FALSE,n)
if(length(I)==1) {
constant[I] = TRUE
} else if( length(I) ) {
yi <- y[I]
sdyi = sd(yi)
if ( is.na(sdyi) ) {
constant[I] = TRUE
} else if( sdyi == 0) {
constant[I] = TRUE
} else {
xi <- x[I]
sdxi <- sd(xi)
if( is.na(sdxi) ) {
m_i <- 0
y_i <- mean(yi)
} else if( sdxi == 0 ) {
m_i <- 0
y_i <- mean(yi)
} else {
m_i <- cor(yi,xi)*sdyi/sdxi
y_i <- mean(yi) - mean(xi)*m_i
}
}
}
if( length(remaining)==1 ) {
constant[remaining] = TRUE
} else if( length(remaining) ) {
yr <- y[remaining]
sdyr = sd(yr)
if ( is.na(sdyr) ) {
constant[remaining] = TRUE
} else if ( sdyr == 0 ) {
constant[remaining] = TRUE
} else {
xr <- x[remaining]
sdxr <- sd(xr)
if( is.na(sdxr) ) {
m_r <- 0
y_r <- mean(yr)
} else if( sdxr == 0 ) {
m_r <- 0
y_r <- mean(yr)
} else {
m_r <- cor(yr,xr, use = "pairwise.complete.obs")*sdyr/sdxr
y_r <- mean(yr) - mean(xr)*m_r
}
}
}
for( i in 1:n ) {
if( constant[i] ) {
z[i] = y[i]
} else if( i %fin% I) {
z[i] = x[i]*m_i + y_i
} else {
z[i] = x[i]*m_r + y_r
}
}
}
mse <- mse_fun(n,y,z) + mse
}
}
}
mse
}
split <- c(10,13,15,17,20,25,30)
split <- c(2,3,5,6,7,8,9)
split <- c(-5,-4,-3,-2,-1,0,1)
# Results for airtime_MSE
# -5 1832269
# -4 1829355
# -3 1828241 **
# -2 1832354
# -1 1836346
# 0 1837128
# 1 1836505 *
# 2 1837126
# 3 1838657
# 5 1841121
# 6 1841846
# 7 1841874
# 8 1843245
# 9 1843812
# 10 1845601
# 13 1849871
# 15 1853149
# 17 1854660
# 20 1857350
# 25 1861798
# 30 1865797
# Compute MSE for partition of three
airtime_MSE_3 <- function(s) {
library(fastmatch)
setwd("~/R/Flights")
load("test_dfl.rda")
mse_fun <- function(n,y,z) {
residuals <- y - z
sum(residuals^2)/n
}
mse <- 0
for(index in 1:length(test_dfl)) {
n <- nrow(test_dfl[[index]])
if( n > 1 ) {
y <- test_dfl[[index]]$air_time # dependent variable
if( sd(y) > 0 ) {
x <- test_dfl[[index]]$delay_o # independent variable
z <- vector(mode = "integer",length = n)
if( sd(x) == 0 ) {
for(i in 1:n) z[i] = mean(y)
} else {
i1 <- which( x <= -3 )
remaining <- setdiff(1:n, i1)
constant <- rep(FALSE,n)
if(length(i1)==1) {
constant[i1] = TRUE
} else if( length(i1) ) {
yi <- y[i1]
sdyi = sd(yi)
if ( is.na(sdyi) ) {
constant[i1] = TRUE
} else if( sdyi == 0) {
constant[i1] = TRUE
} else {
xi <- x[i1]
sdxi <- sd(xi)
if( is.na(sdxi) ) {
m_i <- 0
y_i <- mean(yi)
} else if( sdxi == 0 ) {
m_i <- 0
y_i <- mean(yi)
} else {
m_i <- cor(yi,xi)*sdyi/sdxi
y_i <- mean(yi) - mean(xi)*m_i
}
}
}
i2 <- NULL
i3 <- NULL
if( length(remaining)==1 ) {
constant[remaining] = TRUE
} else if( length(remaining) ) {
i2 <- which( (x > -3) & (x <= s) )
i3 <- setdiff(remaining, i2)
ym <- y[i2]
sdym = sd(ym)
if ( is.na(sdym) ) {
constant[i2] = TRUE
} else if ( sdym == 0 ) {
constant[i2] = TRUE
} else {
xm <- x[i2]
sdxm <- sd(xm)
if( is.na(sdxm) ) {
m_m <- 0
y_m <- mean(ym)
} else if( sdxm == 0 ) {
m_m <- 0
y_m <- mean(ym)
} else {
m_m <- cor(ym,xm)*sdym/sdxm
y_m <- mean(ym) - mean(xm)*m_m
}
}
}
if( length(i3)==1 ) {
constant[i3] = TRUE
} else if( length(i3) ) {
yr <- y[i3]
sdyr = sd(yr)
if ( is.na(sdyr) ) {
constant[i3] = TRUE
} else if ( sdyr == 0 ) {
constant[i3] = TRUE
} else {
xr <- x[i3]
sdxr <- sd(xr)
if( is.na(sdxr) ) {
m_r <- 0
y_r <- mean(yr)
} else if( sdxr == 0 ) {
m_r <- 0
y_r <- mean(yr)
} else {
m_r <- cor(yr,xr)*sdyr/sdxr
y_r <- mean(yr) - mean(xr)*m_r
}
}
}
for( i in 1:n ) {
if( constant[i] ) {
z[i] = y[i]
} else if( i %fin% i1) {
z[i] = x[i]*m_i + y_i
} else if( i %fin% i2) {
z[i] = x[i]*m_m + y_m
} else {
z[i] = x[i]*m_r + y_r
}
}
}
mse <- mse_fun(n,y,z) + mse
}
}
}
mse
}
split <- c(0,1,2,3,4,5,6)
split <- c(7,8,9,10,11,12,13)
split <- c(14,15,17,20,30,40,50)
# 0 1761049
# 1 1757005
# 2 1754582
# 3 1753424
# 4 1752308
# 5 1750663
# 6 1749384
# 7 1747244 *
# 8 1747603
# 9 1746866 **
# 10 1747005
# 14 1747717
# 15 1748537
# 17 1749051
# 20 1749932
# 30 1753722
# 40 1754047
# 50 1758126
# Compute MSE for partition of four
airtime_MSE_4 <- function(s) {
library(fastmatch)
setwd("~/R/Flights")
load("test_dfl.rda")
mse_fun <- function(n,y,z) {
residuals <- y - z
sum(residuals^2)/n
}
mse <- 0
for(index in 1:length(test_dfl)) {
n <- nrow(test_dfl[[index]])
if( n > 1 ) {
y <- test_dfl[[index]]$air_time # dependent variable
if( sd(y) > 0 ) {
x <- test_dfl[[index]]$delay_o # independent variable
z <- vector(mode = "integer",length = n)
if( sd(x) == 0 ) {
for(i in 1:n) z[i] = mean(y)
} else {
i1 <- which( x <= -3 )
remaining <- setdiff(1:n, i1)
constant <- rep(FALSE,n)
if(length(i1)==1) {
constant[i1] = TRUE
} else if( length(i1) ) {
yi <- y[i1]
sdyi = sd(yi)
if ( is.na(sdyi) ) {
constant[i1] = TRUE
} else if( sdyi == 0) {
constant[i1] = TRUE
} else {
xi <- x[i1]
sdxi <- sd(xi)
if( is.na(sdxi) ) {
m_i <- 0
y_i <- mean(yi)
} else if( sdxi == 0 ) {
m_i <- 0
y_i <- mean(yi)
} else {
m_i <- cor(yi,xi)*sdyi/sdxi
y_i <- mean(yi) - mean(xi)*m_i
}
}
}
i2 <- NULL
if( length(remaining)==1 ) {
constant[remaining] = TRUE
} else if( length(remaining) ) {
i2 <- which( (x > -3) & (x <= 9) )
remaining <- setdiff(remaining, i2)
ym <- y[i2]
sdym = sd(ym)
if ( is.na(sdym) ) {
constant[i2] = TRUE
} else if ( sdym == 0 ) {
constant[i2] = TRUE
} else {
xm <- x[i2]
sdxm <- sd(xm)
if( is.na(sdxm) ) {
m_m <- 0
y_m <- mean(ym)
} else if( sdxm == 0 ) {
m_m <- 0
y_m <- mean(ym)
} else {
m_m <- cor(ym,xm)*sdym/sdxm
y_m <- mean(ym) - mean(xm)*m_m
}
}
}
i3 <- NULL
i4 <- NULL
if( length(remaining)==1 ) {
constant[remaining] = TRUE
} else if( length(remaining) ) {
i3 <- which( (x > 9) & (x <= s) )
i4 <- setdiff(remaining, i3)
yn <- y[i3]
sdyn = sd(yn)
if ( is.na(sdyn) ) {
constant[i3] = TRUE
} else if ( sdyn == 0 ) {
constant[i3] = TRUE
} else {
xn <- x[i3]
sdxn <- sd(xn)
if( is.na(sdxn) ) {
m_n <- 0
y_n <- mean(yn)
} else if( sdxn == 0 ) {
m_n <- 0
y_n <- mean(yn)
} else {
m_n <- cor(yn,xn)*sdyn/sdxn
y_n <- mean(yn) - mean(xn)*m_n
}
}
}
if( length(i4)==1 ) {
constant[i4] = TRUE
} else if( length(i4) ) {
yr <- y[i4]
sdyr = sd(yr)
if ( is.na(sdyr) ) {
constant[i4] = TRUE
} else if ( sdyr == 0 ) {
constant[i4] = TRUE
} else {
xr <- x[i4]
sdxr <- sd(xr)
if( is.na(sdxr) ) {
m_r <- 0
y_r <- mean(yr)
} else if( sdxr == 0 ) {
m_r <- 0
y_r <- mean(yr)
} else {
m_r <- cor(yr,xr)*sdyr/sdxr
y_r <- mean(yr) - mean(xr)*m_r
}
}
}
for( i in 1:n ) {
if( constant[i] ) {
z[i] = y[i]
} else if( i %fin% i1) {
z[i] = x[i]*m_i + y_i
} else if( i %fin% i2) {
z[i] = x[i]*m_m + y_m
} else if( i %fin% i3) {
z[i] = x[i]*m_n + y_n
} else {
z[i] = x[i]*m_r + y_r
}
}
}
mse <- mse_fun(n,y,z) + mse
}
}
}
mse
}
split <- c(15,16,17,18,19,20,21)
split <- c(22,23,24,25,26,27,28)
split <- c(29,30,31,32,33,34,35)
split <- c(36,37,38,39,40,41,42)
split <- c(43,44,45,46,47,48,49)
# 15 1711090
# 16 1707864
# 17 1705538
# 18 1703373
# 19 1701842
# 20 1699819
# 21 1698438
# 22 1697610
# 23 1696122
# 24 1695470
# 25 1694692
# 26 1694029
# 27 1693694
# 28 1693407 *
# 29 1693529
# 30 1691840
# 31 1691322
# 32 1690059
# 33 1689852
# 34 1688384 *
# 35 1688408
# 36 1687753 *
# 37 1688155
# 38 1687681
# 39 1687097
# 40 1686973 *
# 41 1687046
# 42 1687295
# 43 1687130
# 44 1686404
# 45 1686116 **
# 46 1686341
# 47 1686705
# 48 1686996
# 49 1687258
library(doParallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
result <- c()
#result <- foreach(i=1:no_cores) %dopar% airtime_MSE(split[i])
#result <- foreach(i=1:no_cores) %dopar% airtime_MSE_3(split[i])
result <- foreach(i=1:no_cores) %dopar% airtime_MSE_4(split[i])
mse <- list()
for(i in 1:no_cores) {
mse[i] <- result[[i]]
}
mse
airtimes <- function(i,j) {
setwd("~/R/Flights")
# load("test_dfl.rda")
# test_dfl <- test_dfl[i:j]
load("curr_dfl.rda")
test_dfl <- curr_dfl[i:j]
result <- list()
zero <- rep(0,4)
mat <- matrix (
nrow =4, # determined by delay
ncol =2, # 1=slope, 2=intercept
byrow = TRUE,
dimnames = list(1:4, c("slope","intercept")))
for(index in 1:length(test_dfl)) {
n <- nrow(test_dfl[[index]])
if( n == 1 ) {
mat[,1] = zero
mat[,2] = rep(test_dfl[[index]]$air_time[1],4)
} else {
y <- test_dfl[[index]]$air_time
if( sd(y) == 0 ) {
mat[,1] = zero
mat[,2] = rep(y[1],4)
} else {
x <- test_dfl[[index]]$delay_o
if( sd(x) == 0 ) {
mat[,1] = zero
mat[,2] = rep(mean(y),4)
} else {
i1 <- which( x <= -3 )
remaining <- setdiff(1:n, i1)
if(length(i1)==0) {
mat[1,] = c(0,mean(y))
} else if(length(i1)==1) {
mat[1,] = c(0,y[i1])
} else {
yi <- y[i1]
sdyi = sd(yi)
if ( sdyi == 0) {
mat[1,] = c(0,yi[1])
} else {
xi <- x[i1]
sdxi <- sd(xi)
if( sdxi == 0 ) {
mat[1,] = c(0,mean(yi))
} else {
slope <- cor(yi,xi)*sdyi/sdxi
mat[1,] = c(slope, mean(yi) - mean(xi)*slope)
}
}
}
if( length(remaining)==0 ) {
mat[2,] = c(0,mean(y))
mat[3,] = mat[2,]
mat[4,] = mat[2,]
} else if( length(remaining)==1 ) {
mat[2,] = c(0,y[remaining])
mat[3,] = mat[2,]
mat[4,] = mat[2,]
} else {
i2 <- which( (x > -3) & (x <= 9) )
remaining <- setdiff(remaining, i2)
if(length(i2)==0) {
mat[2,] = mat[1,]
} else if(length(i2)==1) {
mat[2,] = c(0,y[i2])
} else {
ym <- y[i2]
sdym = sd(ym)
if ( sdym == 0 ) {
mat[2,] = c(0,ym[1])
} else {
xm <- x[i2]
sdxm <- sd(xm)
if( sdxm == 0 ) {
mat[2,] = c(0, mean(ym))
} else {
slope = cor(ym,xm)*sdym/sdxm
mat[2,] = c(slope, mean(ym) - mean(xm)*slope)
}
}
}
if( length(remaining)==0 ) {
mat[3,] = mat[2,]
mat[4,] = mat[2,]
} else if( length(remaining)==1 ) {
mat[3,] = c(0,y[remaining])
mat[4,] = mat[3,]
} else {
i3 <- which( (x > 9) & (x <= 45) )
i4 <- setdiff(remaining, i3)
if(length(i3)==0) {
mat[3,] = mat[2,]
} else if(length(i3)==1) {
mat[3,] = c(0,y[i3])
} else {
yn <- y[i3]
sdyn = sd(yn)
if ( sdyn == 0 ) {
mat[3,] = c(0,yn[1])
} else {
xn <- x[i3]
sdxn <- sd(xn)
if( sdxn == 0 ) {
mat[3,] = c(0, mean(yn))
} else {
slope = cor(yn,xn)*sdyn/sdxn
mat[3,] = c(slope, mean(yn) - mean(xn)*slope)
}
}
}
if(length(i4)==0) {
mat[4,] = mat[3,]
} else if( length(i4)==1 ) {
mat[4,] = c(0,y[i4])
} else {
yr <- y[i4]
sdyr = sd(yr)
if ( sdyr == 0 ) {
mat[4,] = c(0,yr[1])
} else {
xr <- x[i4]
sdxr <- sd(xr)
if( sdxr == 0 ) {
mat[4,] = c(0, mean(yr))
} else {
slope <- cor(yr,xr)*sdyr/sdxr
mat[4,] = c(slope, mean(yr) - mean(xr)*slope)
}
}
}
}
}
}
}
}
result[[index]] <- mat
}
result
}
len <- length(test_dfl)
library(doParallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)
result <- c()
result <- foreach(i=1:no_cores) %dopar% airtimes(first[i],last[i])
airtimes <- list()
for(i in 1:no_cores) {
airtimes <- c(airtimes, result[[i]])
}
save(airtimes,file="airtimes.rda")
As expected, the arrival delay seems to be linearly dependent on departure delay.
load("test_dfl.rda")
index <- sample(1:length(test_dfl),1)
x <- test_dfl[[index]]$delay_o # independent variable
y <- test_dfl[[index]]$delay_d # dependent variable
obs <- which( x == min(x) )[1]
xx = x - x[obs]
yy = y - y[obs]
require(stats)
model <- # lm(xx ~ yy)
lm(x ~ y)
slope = cor(y,x)*sd(y)/sd(x)
plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Delay at Destination vs Delay at Origin", xlab = "Delay at Origin (min)", ylab = "Delay at Destination (minutes)")
abline(coefficients(model),col="gold")
abline(c(mean(y) - mean(x)*slope, slope ),col="red")
abline(c(mean(y), 0 ),col="purple")
delays <- function(i,j) {
setwd("~/R/Flights")
# load("test_dfl.rda")
# test_dfl <- test_dfl[i:j]
load("curr_dfl.rda")
test_dfl <- curr_dfl[i:j]
result <- list()
fn <- vector(mode = "numeric", length = 2)
for(index in 1:length(test_dfl)) {
n <- nrow(test_dfl[[index]])
if( n == 1 ) {
fn = c(0, test_dfl[[index]]$delay_d[1])
} else {
x <- test_dfl[[index]]$delay_o # independent variable
y <- test_dfl[[index]]$delay_d # dependent variable
sdy = sd(y)
if( sdy == 0 ) {
fn = c(0, y[1])
} else {
sdx = sd(x)
if( sdx == 0 ) {
fn = c(0, mean(y))
} else {
slope <- cor(y,x)*sdy/sdx
if( slope <= 0 ) {
fn = c(0, mean(y))
} else {
fn = c(slope, mean(y) - mean(x)*slope )
}
}
}
}
result[[index]] <- fn
}
result
}
len <- length(test_dfl)
library(doParallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)
result <- c()
result <- foreach(i=1:no_cores) %dopar% delays(first[i],last[i])
delays <- list()
for(i in 1:no_cores) {
delays <- c(delays, result[[i]])
}
save(delays,file="delays.rda")
As expected, the taxi out time seems to be linearly dependent on departure delay, although taxi times appear roughly constant over differences in delay at origin. The reason is that we expect queues depend on departure time, which depends on delay at origin. Leaving early may have a higher cost in terms of taxi out. Similarly, leaving late may have a beneficial effect in terms of taxi out.
load("test_dfl.rda")
index <- sample(1:length(test_dfl),1)
x <- test_dfl[[index]]$delay_o # independent variable
y <- test_dfl[[index]]$taxi_o # dependent variable
require(stats)
model <- lm(y ~ x)
slope = cor(y,x)*sd(y)/sd(x)
plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Taxi Out vs Delay at Origin", xlab = "Delay at Origin (min)", ylab = "Taxi Out (minutes)")
abline(coefficients(model),col="gold",lwd = 2 )
abline(c(mean(y) - mean(x)*slope, slope ),col="red",)
abline(c(mean(y), 0 ),col="purple")
taxi <- function(i,j) {
setwd("~/R/Flights")
# load("test_dfl.rda")
# test_dfl <- test_dfl[i:j]
load("curr_dfl.rda")
test_dfl <- curr_dfl[i:j]
result <- list()
fn <- vector(mode = "numeric", length = 3)
for(index in 1:length(test_dfl)) {
n <- nrow(test_dfl[[index]])
if( n == 1 ) {
v = test_dfl[[index]]$taxi_o[1]
fn = c(0, v, v)
} else {
x <- test_dfl[[index]]$delay_o # independent variable
y <- test_dfl[[index]]$taxi_o # dependent variable
sdy = sd(y)
if( sdy == 0 ) {
fn = c(0, y[1], y[1])
} else {
best = min(y)
sdx = sd(x)
if( sdx == 0 ) {
fn = c(0, mean(y), best)
} else {
slope <- cor(y,x)*sdy/sdx
fn = c(slope, mean(y) - mean(x)*slope, best)
}
}
}
result[[index]] <- fn
}
result
}
len <- length(test_dfl)
library(doParallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)
result <- c()
result <- foreach(i=1:no_cores) %dopar% taxi(first[i],last[i])
taxi_out <- list()
for(i in 1:no_cores) {
taxi_out <- c(taxi_out, result[[i]])
}
save(taxi_out,file="taxi_out.rda")
As expected, the taxi in time seems to depend on departure delay, although taxi times appear roughly constant over differences in delay at origin. The reason is that we expect queues depend on arrival time, which depends on arrival delay, and this in turn depends on delay at origin.
index <- sample(1:length(test_dfl),1)
x <- test_dfl[[index]]$delay_o # independent variable
y <- test_dfl[[index]]$taxi_d # dependent variable
require(stats)
model <- lm(y ~ x)
slope = cor(y,x)*sd(y)/sd(x)
plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Taxi In vs Delay at Origin", xlab = "Delay at Origin (min)", ylab = "Taxi In (minutes)")
abline(coefficients(model),col="gold",lwd = 2 )
abline(c(mean(y) - mean(x)*slope, slope ),col="red",)
abline(c(mean(y), 0 ),col="purple")
taxi <- function(i,j) {
setwd("~/R/Flights")
# load("test_dfl.rda")
# test_dfl <- test_dfl[i:j]
load("curr_dfl.rda")
test_dfl <- curr_dfl[i:j]
result <- list()
fn <- vector(mode = "numeric", length = 3)
for(index in 1:length(test_dfl)) {
n <- nrow(test_dfl[[index]])
if( n == 1 ) {
v = test_dfl[[index]]$taxi_d[1]
fn = c(0, v, v)
} else {
x <- test_dfl[[index]]$delay_o
y <- test_dfl[[index]]$taxi_d
sdy = sd(y)
if( sdy == 0 ) {
fn = c(0, y[1], y[1])
} else {
best = min(y)
sdx = sd(x)
if( sdx == 0 ) {
fn = c(0, mean(y), best)
} else {
slope <- cor(y,x)*sdy/sdx
fn = c(slope, mean(y) - mean(x)*slope, best)
}
}
}
result[[index]] = fn
}
result
}
#len <- length(test_dfl)
len <- length(curr_dfl)
library(doParallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
chunk <- floor(len/no_cores)
first <- sapply(1:no_cores, function(i) (i-1)*chunk + 1L)
last <- sapply(1:no_cores, function(i) if(i < no_cores) i*chunk else len)
result <- c()
result <- foreach(i=1:no_cores) %dopar% taxi(first[i],last[i])
taxi_in <- list()
for(i in 1:no_cores) {
taxi_in <- c(taxi_in, result[[i]])
}
save(taxi_in,file="taxi_in.rda")
We compute the average relative error for each of the following estimates:
We find our model objectively is slightly best
load("test_dfl.rda")
load("airtimes.rda")
load("taxi_in.rda")
load("taxi_out.rda")
load("delays.rda")
rel_err_model <- 0
rel_err_typical <- 0
rel_err_departure <- 0
N <- 0L
for(index in 1:length(test_dfl)) {
df <- test_dfl[[index]]
for(obs in 1:nrow(df)) {
N <- N + 1L
rel_err_departure = abs(df$delay_o[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_departure
rel_err_typical = abs(delays[[index]][1]*df$delay_o[obs] + delays[[index]][2] - df$delay_d[obs])/df$schedule[obs] + rel_err_typical
if( df$delay_o[obs] <= -3 ) {
airtime = airtimes[[index]][1,1]*df$delay_o[obs] + airtimes[[index]][1,2]
} else if( df$delay_o[obs] > -3 & df$delay_o[obs] <= 9 ) {
airtime = airtimes[[index]][2,1]*df$delay_o[obs] + airtimes[[index]][2,2]
} else if( df$delay_o[obs] > 9 & df$delay_o[obs] <= 45 ) {
airtime = airtimes[[index]][3,1]*df$delay_o[obs] + airtimes[[index]][3,2]
} else {
airtime = airtimes[[index]][4,1]*df$delay_o[obs] + airtimes[[index]][4,2]
}
rel_err_model = abs(df$delay_o[obs] + airtime +
max(taxi_out[[index]][1]*df$delay_o[obs] + taxi_out[[index]][2], taxi_out[[index]][3]) +
max(taxi_in[[index]][1]*df$delay_o[obs] + taxi_in[[index]][2], taxi_in[[index]][3]) -
df$schedule[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_model
}
}
rel_err_model = rel_err_model/N*100
rel_err_typical = rel_err_typical/N*100
rel_err_departure = rel_err_departure/N*100
save(rel_err_model,rel_err_typical,rel_err_departure, file="rel_error_test.rda")
load("rel_error_test.rda")
print(paste0("rel_err_departure: ",rel_err_departure))
print(paste0("rel_err_typical: ",rel_err_typical))
print(paste0("rel_err_model: ",rel_err_model))
## [1] "rel_err_departure: 8.97785117738058"
## [1] "rel_err_typical: 6.18624834797005"
## [1] "rel_err_model: 6.05472806912474"
We find the valid data yields the same objective ranking.
load("test_dfl.rda")
load("valid_dfl.rda")
load("airtimes.rda")
load("taxi_in.rda")
load("taxi_out.rda")
load("delays.rda")
rel_err_model <- 0
rel_err_typical <- 0
rel_err_departure <- 0
N <- 0L
for(index in 1:length(valid_dfl)) {
df <- valid_dfl[[index]]
i <- which(names(test_dfl) == names(valid_dfl)[[index]], arr.ind = TRUE)
for(obs in 1:nrow(df)) if( ! is.na(df$delay_d[obs]) ) {
N <- N + 1L
rel_err_departure = abs(df$delay_o[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_departure
rel_err_typical = abs(delays[[i]][1]*df$delay_o[obs] + delays[[i]][2] - df$delay_d[obs])/df$schedule[obs] + rel_err_typical
if( df$delay_o[obs] <= -3 ) {
airtime = airtimes[[i]][1,1]*df$delay_o[obs] + airtimes[[i]][1,2]
} else if( df$delay_o[obs] > -3 & df$delay_o[obs] <= 9 ) {
airtime = airtimes[[i]][2,1]*df$delay_o[obs] + airtimes[[i]][2,2]
} else if( df$delay_o[obs] > 9 & df$delay_o[obs] <= 45 ) {
airtime = airtimes[[i]][3,1]*df$delay_o[obs] + airtimes[[i]][3,2]
} else {
airtime = airtimes[[i]][4,1]*df$delay_o[obs] + airtimes[[i]][4,2]
}
rel_err_model = abs(df$delay_o[obs] + airtime +
max(taxi_out[[i]][1]*df$delay_o[obs] + taxi_out[[i]][2], taxi_out[[i]][3]) +
max(taxi_in[[i]][1]*df$delay_o[obs] + taxi_in[[i]][2], taxi_in[[i]][3]) -
df$schedule[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_model
}
}
rel_err_model = rel_err_model/N*100
rel_err_typical = rel_err_typical/N*100
rel_err_departure = rel_err_departure/N*100
save(rel_err_model,rel_err_typical,rel_err_departure, file="rel_error_valid.rda")
load("rel_error_valid.rda")
print(paste0("rel_err_departure (valid): ",rel_err_departure))
print(paste0("rel_err_typical (valid): ",rel_err_typical))
print(paste0("rel_err_model (valid): ",rel_err_model))
## [1] "rel_err_departure (valid): 8.89015103059496"
## [1] "rel_err_typical (valid): 6.18376113880922"
## [1] "rel_err_model (valid): 6.09057138500245"
Test using full data set
load("curr_dfl.rda")
load("airtimes.rda")
load("taxi_in.rda")
load("taxi_out.rda")
load("delays.rda")
test_dfl <- curr_dfl
rel_err_model <- 0
rel_err_typical <- 0
rel_err_departure <- 0
N <- 0L
for(index in 1:length(test_dfl)) {
df <- test_dfl[[index]]
for(obs in 1:nrow(df)) {
N <- N + 1L
rel_err_departure = abs(df$delay_o[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_departure
rel_err_typical = abs(delays[[index]][1]*df$delay_o[obs] + delays[[index]][2] - df$delay_d[obs])/df$schedule[obs] + rel_err_typical
if( df$delay_o[obs] <= -3 ) {
airtime = airtimes[[index]][1,1]*df$delay_o[obs] + airtimes[[index]][1,2]
} else if( df$delay_o[obs] > -3 & df$delay_o[obs] <= 9 ) {
airtime = airtimes[[index]][2,1]*df$delay_o[obs] + airtimes[[index]][2,2]
} else if( df$delay_o[obs] > 9 & df$delay_o[obs] <= 45 ) {
airtime = airtimes[[index]][3,1]*df$delay_o[obs] + airtimes[[index]][3,2]
} else {
airtime = airtimes[[index]][4,1]*df$delay_o[obs] + airtimes[[index]][4,2]
}
rel_err_model = abs(df$delay_o[obs] + airtime +
max(taxi_out[[index]][1]*df$delay_o[obs] + taxi_out[[index]][2], taxi_out[[index]][3]) +
max(taxi_in[[index]][1]*df$delay_o[obs] + taxi_in[[index]][2], taxi_in[[index]][3]) -
df$schedule[obs] - df$delay_d[obs])/df$schedule[obs] + rel_err_model
}
}
rel_err_model = rel_err_model/N*100
rel_err_typical = rel_err_typical/N*100
rel_err_departure = rel_err_departure/N*100
save(rel_err_model,rel_err_typical,rel_err_departure, file="rel_error_full.rda")
load("rel_error_full.rda")
print(paste0("rel_err_departure (full): ",rel_err_departure))
print(paste0("rel_err_typical (full): ",rel_err_typical))
print(paste0("rel_err_model (full): ",rel_err_model))
## [1] "rel_err_departure (full): 8.97658214040976"
## [1] "rel_err_typical (full): 6.18652700244265"
## [1] "rel_err_model (full): 6.05640167465032"
load("curr_dfl.rda")
len <- length(curr_dfl)
ave_delay_o <- vector(mode = "integer",length = len)
ave_delay_d <- vector(mode = "integer",length = len)
max_air_time <- vector(mode = "integer",length = len)
min_air_time <- vector(mode = "integer",length = len)
for(n in 1:length(curr_dfl)) {
ave_delay_o[n] = sum(curr_dfl[[n]]$delay_o)/length(curr_dfl[[n]]$delay_o)
ave_delay_d[n] = sum(curr_dfl[[n]]$delay_d)/length(curr_dfl[[n]]$delay_d)
max_air_time[n] = max(curr_dfl[[n]]$air_time)
min_air_time[n] = min(curr_dfl[[n]]$air_time)
}
save(ave_delay_d,ave_delay_o,max_air_time,min_air_time, file="statsByindex.rda")
We introduced a new model and shown that it yields the best estimate of the delay at arrival based on relative error. RStudio provides a fairly easy and fast interface to perform all needed database operations. The chosen BTS database is suitable to demonstrate the power of R because the data is in multiple tables and it is necessary to clean the data to handle missing data.