Introduction

The Problem

Flight delay is one of the most important indicators of airport capacities, airline services, and air transportation system. According to U.S. Bureau of Transportation statistics, approximately 20% of airline flights are delayed or cancelled every year in US, costing travellers over 20 billion dollars in lost time and money. Meanwhile, airlines also suffer penalties, fines and additional operation costs from flight delay. Delays also jeopardize airlines marketing strategies, since carriers rely on customers’ loyalty to support their frequent-flyer programs and the consumer’s choice is also affected by reliable performance.

Furthermore, many factors can affect flight delay. Unsafe weather condition may be the most obivious and commonly known cause of flight delay. In addition, flight departure and arrival schedules are also subject to the air traffic control at airports and operations of airlines. Moreover, when the previous flight is delayed or arriving late, the following flight may experience higher risk of departure delay.

Thus, it is important to analyze the spatio-temporal patterns of flight delays in United States and understand which factors can cause flight delay and how significantly they can affect. Visulization of flight delay spatio-temporal patterns can help travellers make choices between different airlines and airports as well as different time periods. Meanwhile, the prediction of flight delay also reveals some interesting facts of flight delay causes and can help improve airports operations and airline performance.

The Objective

Project Goal 1: Visualizing and analyzing the spatio-temporal patterns of flight delays in United States

Project Goal 2: Using supervised machine learning techniques and builing logistic regression models to predict flight departure and arrival delays

Dataset

Data Source

This project uses the historical airline on-time performance data provided by the Bureau of Transportation Statistics (BTS) in the US Department of Transportation (DOT). BTS provides very detailed airline on-time performance datasets from 1987 to January 2018 with various fields including airline information, flight origin and destination information, departure and arrival performance, and causes of delay. The dataset can be access through the link.

Dataset Overview

Considering the scope and effectiveness of the data, this project uses the airline on-time performance data from July 2017 to December 2017 with selected attributes of flights for analysis. There are 2,876,412 entries of flights in total and each entry includes details of flights from different airlines.

The attributes for each entry(flight) are:

  • Time Period
    • Year
    • Month
    • Day of Month
    • Day of Week
  • Airline Information
    • Unique Air Carrier
  • Origin & Destination Information
    • Origin/Destination Airport
    • Origin/Destination City
    • Origin/Destination State
    • Origin/Destination State Fips Code
  • Departure & Arrival Performance
    • Departure/Arrival Delay Indicator (If actual departure/arrival time is 15 minutes or more later than scheduled time, classified as 1; else, 0)
    • Departure/Arrival Delay Minutes (Difference in minutes between scheduled and actual departure/arrival time
    • CRS Departure/Arrival Time Block (Hourly Interval)
  • Delay Minutes of Five Causes
    • Carrier Delay
    • Weather Delay
    • National Air System Delay
    • Security Delay
    • Late Aircraft Delay
  • Cancelled and Diverted

To better understand the information presented in the dataset, here are detailed explainations of some attributes above.

Which airlines report on on-time performance

Carriers that have 0.5 percent of total domestic scheduled-service passenger revenue report on-time data and the causes of delay. Specifically, there are 12 air carriers in total in the on-time performance dataset from July 2017 to December 2017.

The airlines required to report are

  • Alaska Airlines (AS)

  • American Airlines (AA)

  • Delta Air Lines (DL)

  • ExpressJet Airlines (EV)

  • Frontier Airlines (F9)

  • Hawaiian Airlines (HA)

  • JetBlue Airways (B6)

  • SkyWest Airlines (OO)

  • Southwest Airlines (WN)

  • Spirit Airlines (NK)

  • United Airlines (UA)

  • Virgin America (VX)

How is flight delay defined?

Given by Federal Aviation Administration (FAA), a flight is defined as “delayed” if it is 15 minutes later than its scheduled time shown in the carriers’ Computerized Reservations Systems (CRS). Arrival performance is based on arrival at the gate. Departure performance is based on departure from the gate.

What types of flight delays are reported to BTS by the airlines?

The airlines report the causes of delays in five broad categories:

  • Air Carrier: The cause of the cancellation or delay was due to circumstances within the airline’s control (e.g. maintenance or crew problems, aircraft cleaning, baggage loading, fueling, etc.).

  • Extreme Weather: Significant meteorological conditions (actual or forecasted) that, in the judgment of the carrier, delays or prevents the operation of a flight such as tornado, blizzard or hurricane.

  • National Aviation System (NAS): Delays and cancellations attributable to the national aviation system that refer to a broad set of conditions, such as non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control.

  • Late-arriving aircraft: A previous flight with same aircraft arrived late, causing the present flight to depart late.

  • Security: Delays or cancellations caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.

Methodology

Spatio-Temporal Analysis of Flight Departure Delay

The first part of this project focuses on visualizing and analyzing the spatio-temporal patterns of flight departure delay in US. Some interesting facts revealed from the analysis may provides insights into airline services, airport capacities, and the aviation system.

Topics will cover:

  • Total Flight Departure Rate from July 2017 to December 2017

  • Flight Departure Delay Rate by Flight Origin Airports

    • Origin Airports with Top 10 Flight Departure delay Rate

    • Flight Departure Rate at Top 10 Popular Origin Airports

    • What causes some airports to have high flight departure delay rate?

  • Flight Departure Delay Rate by Flight Origin State

    • Aggregate Flight Departure Delay Rate by Flight Origin State from July 2017 to December 2017

    • Monthly Change of Flight Departure Delay Rate by Flight Origin State

  • How Flight Departure Delay Rate Changes by Time?

    • Average & Maximum Flight Departure Delay Rate by Month

    • What causes some months to have high flight departure delay rate?

  • Will Holiday Affect Flight Departure Delay?

  • Flight Departure Delay Rate by Air Carriers

Flight Departure & Arrival Delay Prediction

The second part of this project includes the implementaion and analysis of three logistic regression models for predicting flight departure and arrival delay. This project conducts one logistic regression model to forecast the flight departure delay and two models to generate flight arrival delay prediction. The main idea of these predictions is to generate a binary prediction of whether a flight will experience departure or arrival delay by using limited historical information related to flight departure/arrival delay. The accuracy and efficiency of all models will be assessed by confusion matrix and ROC curve.

For the flight departure delay prediction, this projects evaluates the performance of the preditive model by applying the model to one training set and one test set. The training set contains flight on-time performance data from August 1st to August 7th (the first week of August 2017), whereas the test set is the flight on-time performance data on August 8th (the next day after the first week of August 2017). Hence, this project uses one-week historical data to generate one-day flight departure delay prediction.

Moreover, this projects performs two different logistic regression models for flight arrival delay prediction. The major difference of these two models is that one includes the indicator of flight departure delay as a predictor in the model whereas the other excludes this factor. The projects will use both confusion matrix and ROC curve to compare the performance of these two regression models. Thus, the main purpose of flight arrival prediction is to better understand how significantly the late aircrafts can affect flight delay.

Outcomes

Since the datasets downloaded from the Bureau of Transportation Statistics (BTS) are monthly airline on-time performance data from July 2017 to December 2017, we first import the 6 datasets into R and combine them into one big dataset.

#import original datasets
July_2017 <- read.csv("July_2017.csv")
Aug_2017 <- read.csv("August_2017.csv")
Sep_2017 <- read.csv("Sep_2017.csv")
Oct_2017 <- read.csv("Oct_2017.csv")
Nov_2017 <- read.csv("Nov_2017.csv")
Dec_2017 <- read.csv("Dec_2017.csv")

#bind all flight data of six monthes into one data frame
total_2017 <- rbind(July_2017, Aug_2017)
total_2017 <- rbind(total_2017, Sep_2017)
total_2017 <- rbind(total_2017, Oct_2017)
total_2017 <- rbind(total_2017, Nov_2017)
total_2017 <- rbind(total_2017, Dec_2017)

#delete the extra column
total_2017 <- select(total_2017, -X)

Part 1: Spatio-Temporal Analysis of Flight Departure Delay

Section 1: Total Flight Departure Delay Rate from July 2017 to December 2017

First, it is good to get a basic idea of how many flights delayed in total from July 2017 to December 2017. In our sample dataset, if the value of the variable “DEP_DEL15” equals to 1, then the flight is classified as departure delayed; otherwise, 0 suggests on-time departure. The following codes conduct data wrangling and transformation of the original sample dataset and generates a new dataset ready for visualizing the total flight departure delay rate.

#count the number of on-time and delayed flight departures for all entries in the original dataset
dep_15 <- total_2017 %>% 
  group_by(DEP_DEL15) %>% 
  summarise(count = n())

#delete NA  
dep_15 <- na.omit(dep_15)

#add a column with the percentages of flights departure on time and delayed
dep_15_ready <- dep_15 %>% mutate(percentage = dep_15$count /sum(dep_15[1,2],dep_15[2,2]))

#edit the cell values for easily plotting the chart later
dep_15_ready[1, 1] <- "On Time"
dep_15_ready[2, 1] <- "Delayed"
dep_15_ready[1, 3] <- "83.1%"
dep_15_ready[2, 3] <- "16.9%"

We use a pie chart to visualize the percentage of flights departure on-time vs. delayed. The codes below shows how to create a visually appealing pie chart in R.

#first plot a barplot
bar_dep <- ggplot(dep_15_ready, aes(x="", y=percentage, fill=DEP_DEL15))+
  geom_bar(width = 1, stat = "identity")
bar_dep

#then a simple pie plot
pie_dep <- bar_dep + coord_polar("y", start=0)
pie_dep

#create a blank theme for the aesthetics of the pie chart
blank_theme <- theme_minimal()+
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),
    legend.text = element_text(size = 12, face="bold", color = "#000000", hjust = 0, vjust = 0),
    panel.border = element_blank(),
    panel.grid=element_blank(),
    axis.ticks = element_blank(),
    plot.title=element_text(size=16, face="bold"),
    plot.subtitle=element_text(size=14, hjust = 0.8)
  )

#create the final pie chart
pie_dep + scale_fill_manual(values=c("#FF8C00", "#66CDAA"))+ blank_theme +
  theme(axis.text.x=element_blank())+
  geom_text(aes(y = c(1, 2) , 
                label = percentage), size=5) +
  labs(title = "Percentage of Flight Departure On-Time VS Delayed",
       subtitle = "July 2017 - December 2017, USA")

Here is the pie chart showing the percentage of flight departure on-time VS delayed.

We have a sense that the total departure delay rate is 16.9% for all flights in out sample dataset. Now we want to explore how the flight departure delay rate is spatially distributed.

Section 2: Flight Departure Delay Rate by Flight Origin Airports

There are three topics the project explores in this section:

  • Which airports have top 10 departure delay rates among 310 airports in total?

  • What are the departure delay rates at top 10 popular airports in US?

  • What causes those airports to have high departure delay rate?

First, we conducts some data manipulation of the original dataset to create a dataset ready for visualizing the top 10 origin airports with high departure delay rate.

#count the number of on-time and delayed flight departures for all origin airports
dep_origin_airport <- total_2017 %>% 
  group_by(ORIGIN, DEP_DEL15) %>% 
  summarise(count= n())

#delete NA
dep_origin_airport <- na.omit(dep_origin_airport )

#transform the dataset to wide format for calculating the delay percentages for each airport
wide_origin_airport <- dep_origin_airport %>% spread(key = DEP_DEL15, value = count)

#modify the column names to make them more understandable
colnames(wide_origin_airport)[colnames(wide_origin_airport)== 0 ] <- "On_Time"
colnames(wide_origin_airport)[colnames(wide_origin_airport)== 1 ] <- "Delayed"

#add a column with the percentages of flight departure on-time and delayed
wide_origin_airport <- wide_origin_airport %>% 
  mutate(percent_delay = Delayed/(On_Time+Delayed))

#delete NA
wide_origin_airport <- na.omit(wide_origin_airport )

#order the airports by % of flight departure delayed
wide_origin_airport <- arrange(wide_origin_airport,desc(percent_delay))

With the dataset ready, we select the top 10 origin airports with high flight departure delay rate and create a bar chart to visualize the airports and their departure delay rates.

#top 10 origin airports 
top10_origin_airport <- wide_origin_airport[1:10,]
as.numeric(top10_origin_airport$percentage_format)

#create a bar plot  
ggplot(top10_origin_airport, aes(x= reorder(ORIGIN, percent_delay), 
                                 y = percent_delay, fill = percent_delay,
                                 group = 1)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = percent)+
  geom_text(data= top10_origin_airport,
            aes(label=percent(percent_delay)),
            hjust= 0.5, vjust = 1, color = "#424242") +
  ylab("Percentage of Flights Delayed") +
  xlab("Origin Airports") +
  labs(title = "Top 10 Origin Airports by Percentage of Flights Departure Delay",
       subtitle = "July 2017 - December 2017, USA")+
  scale_fill_gradient(low = "#E6EE9C", high= "#80CBC4", name="Percent", 
                      breaks = 0.025*12:16, labels = percent(0.025*12:16)) +
  theme(axis.text.x = element_text(size=12))+ 
  theme(axis.text.y = element_text(size = 12))+
  theme(axis.title.x = element_text(size=12,color = "#424242"))+ 
  theme(axis.title.y = element_text(size = 12,color = "#424242"))+
  theme(plot.title = element_text(lineheight=3, color="black", face = "bold", size=16, hjust =0.5),
        plot.subtitle=element_text(size=14, hjust = 0.5))

The chart below shows the top 10 origin airports with high flight departure delay rate.

The abbreviation of these airport names may look unfamiliar with many people, so here are all abbreviations stand for.

  • MVY: Martha’s Vineyard Airport (Massachusetts, United States)

  • UST: Northeast Florida Regional Airport (Florida, United States)

  • BQN: Rafael Hernández Airport (Aguadilla, Puerto Rico)

  • OTH: Southwest Oregon Regional Airport (Oregon, United States)

  • MMH: Mammoth Yosemite Airport (California, United States)

  • HYA: Barnstable Municipal Airport (Massachusetts, United States)

  • ACK: Nantucket Memorial Airport (Massachusetts, United States)

  • PSE: Mercedita Airport (Ponce, Puerto Rico)

  • UIN: Quincy Regional Airport (Illinois, United States)

  • MEI: Meridian Regional Airport (Mississippi, United States)

As we can see from the list, the top 10 airports with high departure delay rate are not popular airports in US. In fact, most airports on the list are very small airports and only operate a few flights from July 2017 to December 2017. Specifically, the statistics shows that airports like UST and MMH operate flights less than 30 durind the half year period. Hence, the departure delay rates at those airports may be biased due to the small size of the airports.

We provide the data frame in R as the table below to show the number of flights operated at the top 10 airports with high departure delay rate.

However, it would be more interesting to know and understand the departure delay rate at some popular airports in US, such as JFK in New York City, ORD in Chicago or SFO in San Francisco. Hence, we select the top 10 airports with most flights among the 310 airports in US and see how they performed as origin airports for the flight departure.

#select top 10 popular airports
pop_origin_airport <- total_2017 %>% 
  group_by(ORIGIN) %>% 
  summarise(flights_count= n()) %>%
  arrange(desc(flights_count))
pop_origin_airport <- pop_origin_airport[1:10,]

The top 10 popular airports with most flights usually have more than 50k flights from July 2017 to December 2017. The number of flights operated at these airports is shown in the table below.

Then we perform data wrangling as shown below to get the flight departure delay rate at the top 10 popular airports in U and use “ggplot” to create a bar chart showing the airports and their departure delay rates.

#match top 10 popular origin airports with their flights delay information
pop_airport_match <- total_2017
pop_airport <- (pop_airport_match$ORIGIN %in% pop_origin_airport$ORIGIN)
pop_airport_match <- pop_airport_match[pop_airport,]

#select the flight delay information of the top 10 popular airport
pop_origin_airport <- pop_airport_match %>% 
  group_by(ORIGIN, DEP_DEL15) %>% 
  summarise(count= n()) %>%
  arrange(desc(count))

#omit NA data
pop_origin_airport <- na.omit(pop_origin_airport )

#transform the dataset to wide format for calculating the percentages
wide_pop_airport <- pop_origin_airport %>% spread(key = DEP_DEL15, value = count)

#change the column name
colnames(wide_pop_airport)[colnames(wide_pop_airport)== 0 ] <- "On_Time"
colnames(wide_pop_airport)[colnames(wide_pop_airport)== 1 ] <- "Delayed"

#add a column with the percentages of flights on time and delayed
wide_pop_airport <- wide_pop_airport %>% 
  mutate(percent_delay = Delayed/(On_Time+Delayed))

#order the airports by % of flights delayed
wide_pop_airport <- arrange(wide_pop_airport,desc(percent_delay))

#plot the bar chart 
ggplot(wide_pop_airport, aes(x= reorder(ORIGIN, percent_delay), 
                                 y = percent_delay, fill = percent_delay,
                                 group = 1)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = percent)+
  geom_text(data= wide_pop_airport,
            aes(label=percent(percent_delay)),
            hjust= 0.5, vjust = 1, color = "#424242") +
  ylab("Flight Delay Rate") +
  xlab("Origin Airports") +
  labs(title = "Flight Departure Delay Rate at Top 10 Popular Airports in US",
       subtitle = "July 2017 - December 2017")+
  scale_fill_gradient(low = "#FFCDD2", high= "#FFEE58", name="Departure Delay %",
                      breaks = 0.025*5:10, labels = percent(0.025*5:10)) +
  theme(axis.text.x = element_text(size=12))+ 
  theme(axis.text.y = element_text(size = 12))+
  theme(axis.title.x = element_text(size=12,color = "#424242"))+ 
  theme(axis.title.y = element_text(size = 12,color = "#424242"))+
  theme(plot.title = element_text(lineheight=3, color="black", face = "bold", size=20, hjust =0.5),
        plot.subtitle=element_text(size=14, hjust = 0.5))

The outcome is shown as below:

Now the abbreviations of airport name look more familiar with us, such as ATL, LAS and SFO. In order to visualize the the flight departure delay rate at different airports spatially, we also create a map showing the top 10 popular origin airports.

Since the original dataset does not contain any geospatial information of the airports such as geometry, or longitude and latitude, we need to first geocode the 10 airports for ploting their locations on the map.

We use the “googleway” package in R to geocode the airports. This package accesses Google Maps APIs to retrieve data and get locations of searching variables. The R documentation of “googleway” package can be accessed here.

The codes below show how to geocode the airports using “googleway” package.

#import the library
library(googleway)

#create a new data frame to store the geocoded address of the top 10 popular Airports
geoCode_pop.airport <- data.frame() 

#loop through the 10 airports
for (i in 1:nrow(geoCode_pop.airport)){
  
  thiscode <- data.frame()
  
  airports <- geoCode_pop.airport[i, "ORIGIN"]
  
  #give a correct format of address for geocoding
  codedAddr <- paste0(airports, " Airport, USA")
  
  #geocode using google_geocode
  geoAddr <- google_geocode(address = codedAddr,
                            key = "AIzaSyDpkn5sgr-4KXC0x4Pq2m5FMK1CmCUmyV8",
                            simplify = TRUE)
  
  geoAddr$results$geometry$location
  
  #add geocoded information to the data frame
  thiscode <- data.frame(airports, geoAddr$results$geometry$location)
  geoCode_pop.airport <- rbind(geoCode_pop.airport,thiscode)
}

#join the geocoded airport data frame to the percent error data frame for each airport
pop_airports_join <- left_join(wide_pop_airport, geoCode_pop.airport, by = "ORIGIN")

Now we are ready to create a map that shows flight departure delay rate at the top 10 popular airport.

For the effectiveness and aesthetics of the map visualization, we use the “Leaflet” library in R to generate an interactive Leaflet map showing the departure delay rates at different airports in US. The package description can be assecced here.

Here are codes for creating a Leaflet in R.

#import the Leaflet library
library(leaflet)

#set the color palette for the points on the map
pop_paletteContinuous <- colorNumeric(palette = "viridis", domain = pop_airports_join$percent_delay)

#set the scale and color for the legend
pop_palbin = colorBin('viridis', tall_airports_join$percent_error, bins = c(0, .05, .1, .15, .2, .25))

#Plot the percentage of departure delay on a Leaflet map
leaflet(pop_airports_join) %>%
  addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
  addCircleMarkers(data = pop_airports_join,
                   lng = ~lng,
                   lat = ~lat,
                   radius = ~(percent_delay)*120,
                   fillOpacity = 0.7,
                   fillColor = ~pop_paletteContinuous(percent_delay),
                   color = "white",
                   #opacity = 1,
                   stroke=T,
                   #stroke=T,
                   weight = 1,
                   popup= ~paste0("Percentage of Flight Departure Delay: ",percent(percent_delay)),
                   label = ~ORIGIN) %>%
  addLegend(pal = pop_palbin, 
            values = ~percent_delay, 
            position = "topright", 
            title = "% of Flight Departure Delay",
            labFormat = labelFormat(prefix = '', suffix = '%', between = ' - ',
                                    transform = function(percent_delay) 100 * percent_delay),
            opacity = 1) 

The interactive online Leaflet map can be accessed here.

Some pages and functions of the map are also shown as below for easily get a sense of the interactive map.

Here is the initial page of the map. Each circle represents one of the 10 popular airports. The bigger the circle, the greater departure delay rate the airport has from July 2017 to December 2017. Meanwhile, the color also indicates the value of departure delay rates, which can be refered in the legend of the map.

When the mouse move to the circle, it shows the name of the airport. Below is Denver Internation Airport.

When clicking the circle, it shows the flight departure delay rate at the airport selected.

You can also zoom in and zoom out…

As we can see from the result, SFO, LAS and ORD have top 3 departure delay rates among the 10 popular airports. Hence, we want to know what causes these 3 airport to have relatively high departure delay rate.

Since we have delay minutes of five delay causes, we can see the percentage of delay minutes by each cause over the total delay minutes to get a idea of which causes are more influential to flight delays at these 3 airports.

We will create a stacked bar chart to show the percentage of delay minutes by each cause over the total delay minutes at SFO, LAS and ORD airports. We first conduct data manipulation as below to get the dataset ready to plot and use “ggplot” to create the stacked bar chart.

pop_top_3 <- subset(total_2017, ORIGIN == "SFO"|
                                ORIGIN == "LAS"|
                                ORIGIN == "ORD")

#delete NA
pop_top_3 <- na.omit(pop_top_3)

#get the total minutes by causes
pop_top_3 <- pop_top_3 %>% 
  group_by(ORIGIN) %>%
  summarise(sum_carrier_dep= sum(CARRIER_DELAY),
            sum_weather_dep= sum(WEATHER_DELAY),
            sum_nas_dep= sum(NAS_DELAY),
            sum_security_dep= sum(SECURITY_DELAY),
            sum_late_aircraft= sum(LATE_AIRCRAFT_DELAY)) 

#add a column of total delay minutes of all causes by month
pop_top_3 <- pop_top_3 %>% 
  mutate(sum_mins = sum_carrier_dep+
           sum_weather_dep+
           sum_nas_dep+
           sum_security_dep+
           sum_late_aircraft)

#calculate the percentage of total delay minutes for each cause by month
pop_top3_percent <- pop_top_3 %>% 
  mutate(percent_carrier = sum_carrier_dep/sum_mins) %>%
  mutate(percent_weather = sum_weather_dep/sum_mins) %>%
  mutate(percent_nas = sum_nas_dep/sum_mins) %>%
  mutate(percent_security = sum_security_dep/sum_mins) %>%
  mutate(percent_late = sum_late_aircraft/sum_mins)

#select the percentage of delay causes only
pop_top3_percent <- select(pop_top3_percent,ORIGIN, percent_carrier,percent_weather,percent_nas,
                           percent_security,percent_late)
                           
#transform wide format to tall format
tall_top3_percent <- gather(pop_top3_percent, del_causes, percent_causes, percent_carrier:percent_late, factor_key=TRUE)

The stacked bar chart shows the percentage of delay minutes by each cause over the total delay minutes at SFO, LAS and ORD airports.

We see that the late aircraft delay has the largest amount of time out of the total flight delay time at all 3 airports, especially at SFO where late aircraft delay causes half of the total delay time. In addition, national aviation delay and carrier delay are also two major causes of flight delay. Moreover, it is interesting to note that ORD has much higher percentage of extreme weather delay than the other two airports. That is, extreme weather conditions have more influence on flight delay at ORD than SFO or LAS. This can be explained by the fact that Chicago experiences bad weather condiction more oftenly than San Francisco or Las Vegas due to the geospatial location.

Section 3: Flight Departure Delay Rate by Flight Origin State

Now let’s explore the flight departure delay rate on a larger geographic scale - by US states.

Two maps will be created in this section:

  • Aggregate Flight Departure Delay Rate by Flight Origin State from July 2017 to December 2017

  • Monthly Change of Flight Departure Delay Rate by Flight Origin State

First, we create a map that shows flight departure delay rates of all US states during the period from July 2017 to December 2017. We will use Albers USA map provided in the “albersusa” library in R to visualize the spatial distribution of flight departure delay by origin states.

Some data manipulation is performed as below in order to get a dataset ready for map plot.

#get flight on-time and delay information by states
State_dep_del <- total_2017 %>% 
  group_by(ORIGIN_STATE_FIPS, ORIGIN_STATE_NM, DEP_DEL15) %>%
  summarise( count= n()) 

#delete NA
State_dep_del <- na.omit(State_dep_del)

#transform the dataset to wide format
wide_state_dep <- State_dep_del %>% spread(key= DEP_DEL15, value= count)

#change the column name
colnames(wide_state_dep)[colnames(wide_state_dep)== 0 ] <- "On_Time"
colnames(wide_state_dep)[colnames(wide_state_dep)== 1 ] <- "Delayed"

#add a column with the percentages of flights on time and delayed
wide_state_dep <- wide_state_dep %>% 
  mutate(percent_delay = Delayed/(On_Time+Delayed))

#order the states by % of flights delayed
wide_state_dep <- arrange(wide_state_dep,desc(percent_delay))

Now we import the shapefile of Albers USA map in “albersUSA” library. Since we include the field of state fips when we download the original dataset from BTS, we can join the dataset we create before with the Albers USA map by state fips. Then we can create a map showing flight departure delay rate at all states in US.

# **** US state map - Albers USA Composite projection ****
states <- usa_sf()
states <- mutate(states,fips=as.character(fips_state))

# make "fips" a string
wide_state_dep <- mutate(wide_state_dep,fips=as.character(ORIGIN_STATE_FIPS)) 
# add a leading "0" if it has a length of 4
wide_state_dep <- mutate(wide_state_dep,fips = ifelse(nchar(fips)==1,paste0("0",fips),fips)) 

#join the flight delay data with the state shapefile and transform the joined dataset to shapefile
originState_join <- left_join(states,wide_state_dep,c("fips","fips")) %>%
  st_as_sf()

#add the map theme for aesthetics of the US state map
myTheme <- function() {
  theme_void() + 
    theme(
      text = element_text(size = 7),
      plot.title = element_text(size = 16, color = "#111111", hjust = 0.5, vjust = 0, face = "bold"), 
      plot.subtitle = element_text(size = 14, color = "#333333", hjust = 0.5, vjust = 0),
      plot.caption = element_text(size = 10, color = "#333333", hjust = 0.5, vjust = 0),
      axis.ticks = element_blank(),
      panel.grid.major = element_line(colour = "white"),
      legend.direction = "vertical", 
      legend.position = "right",
      #plot.margin = margin(0, 0, 0, 0, 'cm'),
      legend.key.height = unit(1, "cm"), legend.key.width = unit(0.4, "cm"),
      legend.title = element_text(size = 12, color = "#111111", hjust = 0, vjust = 0, face = "bold"),
      legend.text = element_text(size = 10, color = "#111111", hjust = 0, vjust = 0)
    ) 
}

#map the flight delay rate at US states
ggplot() +
  geom_sf(data = originState_join, aes(fill = percent_delay),
          color = alpha("white",0.2), size=0.1) + 
  scale_fill_gradient(low = "#FFE082", high= "#FF8F00", name="% of Flight \nDeparture Delay",
                      breaks = 0.025*4:8, labels = percent(0.025*4:8)) +
  coord_sf(crs = st_crs(102003)) +
  labs(
    title = 'Flight Departure Delay Rate in US States',
    subtitle = "July 2017 - December 2017",
    caption = "Source: United States Department of Transportation") +
  myTheme()

Here is the map showing flight departure delay rate at all states in US from July 2107 to December 2017.

We see that states in the north part along the east coast have relatively high percentage of flight departure delay. Illinois also has high departure delay rate. High flight departure delay rate in these two area can be explained by the weather conditions in the area, especially during winter time. However, Florida, California and Nevada also experience high departure delay rate from July to December 2017. Hence, other delay causes like carrier delay or late aircraft delay are also important factors influencing flight delays.

Since aggregate flight delay rates during a whole period of the time may lose some specificities and charateristics of spatio-temporal patterns of flight delays. Hence, the next part of this section will create an animated map that shows the monthly change of departure delay rate in all US states. The codes below show how to create the map for one month.

#July: Flight Departure Delay Rate by Origin States 

#get flight on time and delay information by states
State_July <- July_2017 %>% 
  group_by(ORIGIN_STATE_FIPS, ORIGIN_STATE_NM, DEP_DEL15) %>%
  summarise( count= n()) 

#delete NA
State_July <- na.omit(State_July)

#transform the dataset to wide format
wide_state_July <- State_July %>% spread(key= DEP_DEL15, value= count)

#change the column name
colnames(wide_state_July)[colnames(wide_state_July)== 0 ] <- "On_Time"
colnames(wide_state_July)[colnames(wide_state_July)== 1 ] <- "Delayed"

#add a column with the percentages of flights on time and delayed
wide_state_July <- wide_state_July %>% 
  mutate(percent_delay = Delayed/(On_Time+Delayed))

# make "fips" a string
wide_state_July <- mutate(wide_state_July,fips=as.character(ORIGIN_STATE_FIPS)) 
# add a leading "0" if it has a length of 4
wide_state_July <- mutate(wide_state_July,fips = ifelse(nchar(fips)==1,paste0("0",fips),fips)) 

#join the flight delay data with the state shapefile
wide_state_July <- left_join(states,wide_state_July,c("fips","fips")) %>%
  st_as_sf()

#order the states by % of flights delayed
wide_state_July <- arrange(wide_state_July,desc(percent_delay))
#delete NA
wide_state_July <- na.omit(wide_state_July)

#create a map theme for the monthly change map
myTheme2 <- function() {
  theme_void() + 
    theme(
      text = element_text(size = 7),
      plot.title = element_text(size = 16, color = "#111111", hjust = 0.5, vjust = 0), 
      plot.subtitle = element_text(size = 26, color = "#111111", hjust = 0.5,vjust= -1, face = "bold"),
      plot.caption = element_text(size = 10, color = "#333333", hjust = 0.5, vjust = 0),
      axis.ticks = element_blank(),
      panel.grid.major = element_line(colour = "white"),
      legend.direction = "vertical", 
      legend.position = "right",
      #plot.margin = margin(0, 0, 0, 0, 'cm'),
      legend.key.height = unit(1, "cm"), legend.key.width = unit(0.4, "cm"),
      legend.title = element_text(size = 12, color = "#111111", hjust = 0, vjust = 0),
      legend.text = element_text(size = 10, color = "#111111", hjust = 0, vjust = 0)
    ) 
}

#create the map
ggplot() +
  geom_sf(data = wide_state_July, aes(fill = percent_delay),
          color = alpha("white",0.2), size=0.1) +   
  scale_fill_gradient(low = "#FFDAB9", high= "#FF8C00", name = "% of Flight \nDeparture Delay",
                      breaks = 0.05*2:5, labels = percent(0.05*2:5)) +
  coord_sf(crs = st_crs(102003)) +
  labs(
    title = 'Flight Departure Delay Rate in US States',
    subtitle = "July 2017") +
  myTheme2()

After combining the maps of six months from July to December 2017, we get this animated map shown below.

Spatially, we see that the west and east coasts generally have higher flight departure delay rates than states in the middle of US. However, it is noticeable that the legend scale changes by months. While July, August and December have a legend scale ranging from 10% to 25%, September, October and November have a legend ranging from 10% to 15%. This indicates that the general scale of flight departure delay rate are different for different months. Some months may have higher average departure delay rates than other months. Thus, in the next section we look at the temporal patterns in more details.

Section 4: How Flight Departure Delay Rate Changes by Time?

In this section, we want to see:

  • Average & Maximum Flight Departure Delay Rate by Month

  • What causes some months to have high flight departure delay rate?

We first create a line chart to show how departure delay rate varies by month from July to December 2017. Then another line chart of delays causes for each month will help us better understand why some months have higher flight departure delay rate.

To see how flight departure rate changes over time, we first create a data frame that contains the average and maximum departure delay rate for each month and use “ggplot” to visualize the changes.

#create a data frame of average/max percentage of delay in each month
month <- c("July", "August", "September", "October", "November", "December")

average_delay_percent <-c(mean(wide_state_July$percent_delay),
                         mean(wide_state_Aug$percent_delay),
                         mean(wide_state_Sep$percent_delay),
                         mean(wide_state_Oct$percent_delay),
                         mean(wide_state_Nov$percent_delay),
                         mean(wide_state_Dec$percent_delay))

max_delay_percent <- c(max(wide_state_July$percent_delay),
                       max(wide_state_Aug$percent_delay),
                       max(wide_state_Sep$percent_delay),
                       max(wide_state_Oct$percent_delay),
                       max(wide_state_Nov$percent_delay),
                       max(wide_state_Dec$percent_delay))
percent_month <- data.frame(month, average_delay_percent, max_delay_percent)

#plot the line chart
ggplot(data = percent_month , aes(month),colour =variable) + 
  geom_line(aes(y = average_delay_percent, colour = "Average Delay %",group = 1), size=1.5) + 
  geom_line(aes(y = max_delay_percent, colour = " Maximum Delay %",  group = 1), size=1.5) +
  geom_point(aes(y = average_delay_percent,group = 1),color = "#48D1CC", size = 3) +
  geom_point(aes(y = max_delay_percent,group = 1),color = "#FF6347",size = 3) +
  geom_text(aes(y = average_delay_percent, label=percent(average_delay_percent)),hjust=0.5, vjust=-0.5, group = 1)+
  geom_text(aes(y = max_delay_percent, label=percent(max_delay_percent)),hjust=0.5, vjust=-0.5,group = 1)+
  scale_x_discrete(limit=c("July", "August", "September", "October", "November","December")) +
  scale_y_continuous(labels = percent) +
  labs(colour='', 
       title="Average & Maximum Flight Departure Delay Percentage by Month",
       x="Month",
       y="Percentage of Departure Delay",
       subtitle = "July 2017 - December 2017, USA") + 
  theme(axis.title=element_text(size = 12), 
        axis.text=element_text(size = 12),
        legend.text=element_text(size = 12),
        plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 14))

The chart below shows how average and maximum departure delay rate changes from July to December 2017.

The observation from the line chart above corresponds to the changes of legend scale shown on the animated map. That is, July, August and December have higher average and maximum departure delay rate than the other 3 months. Bad weather conditions may cause higher delay rate in December, but it is interesting that July and August have relatively high departure delay rate. We specualte that high volumes of flights during summer holiday may contribute to the high delay rate in July and August, as increasing airport operations and airline services can lead to more air carrier delay, national avaition system delay and also late aircraft delay.

Therefore, we also create a line chart to visualize how the percentage of minutes by each delay cause over total delay minutes changes from July 2017 to December 2017. The codes below how to do the data wrangling and chart plotting.

month_cause <- subset(total_2017, ARR_DEL15==1)

#delete NA
month_cause <- na.omit(month_cause)

#get the total minutes by causes
month_cause <- month_cause %>% 
  group_by(MONTH) %>%
  summarise(sum_carrier_dep= sum(CARRIER_DELAY),
            sum_weather_dep= sum(WEATHER_DELAY),
            sum_nas_dep= sum(NAS_DELAY),
            sum_security_dep= sum(SECURITY_DELAY),
            sum_late_aircraft= sum(LATE_AIRCRAFT_DELAY)) 

#add a column of total delay minutes of all causes by month
month_cause <- month_cause %>% 
               mutate(sum_mins = sum_carrier_dep+
                                     sum_weather_dep+
                                     sum_nas_dep+
                                     sum_security_dep+
                                     sum_late_aircraft)

#calculate the percentage of total delay minutes for each cause by month
month_cause <- month_cause %>% 
               mutate(percent_carrier = sum_carrier_dep/sum_mins) %>%
               mutate(percent_weather = sum_weather_dep/sum_mins) %>%
               mutate(percent_nas = sum_nas_dep/sum_mins) %>%
               mutate(percent_security = sum_security_dep/sum_mins) %>%
               mutate(percent_late = sum_late_aircraft/sum_mins)

#plot the line chart
ggplot(data = month_cause , aes(MONTH),colour =variable) + 
  geom_line(aes(y = percent_carrier, colour = "% of Carrier Delay",group = 1), size=1.5) + 
  geom_line(aes(y = percent_weather, colour = "% of Extreme Weather Delay",  group = 1), size=1.5) +
  geom_line(aes(y = percent_nas, colour = "% of National Aviation System Delay",group = 1), size=1.5) + 
  geom_line(aes(y = percent_security, colour = "% of Security Delay",  group = 1), size=1.5) +
  geom_line(aes(y = percent_late, colour = "% of Late Aircraft Delay",group = 1), size=1.5) + 
  scale_y_continuous(labels = percent) +
  labs(colour='', 
       title="Delay Causes by Month: as a Percent of Total Delay Minutes",
       x="Month",
       y="Percent of Total Delay Minutes",
       subtitle = "July 2017 - December 2017, USA") + 
  theme(axis.title=element_text(size = 12), 
        axis.text=element_text(size = 12),
        legend.text=element_text(size = 12),
        plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 14))

Here is the line chart showing how the percentage of minutes by each delay cause over total delay minutes changes from July 2017 to December 2017

We see that late aircraft delay has the highest percentage of total delay minutes in all months. This indicates that aircrafts arriving late may be one of the major causes of flight delays. Carrier delay and national aviation system delay also causes a lot of flight delays. However, it is noticeable that some delay time caused by weather conditions is actually classified as national aviation system delay instead of extreme weather delay. Hence, the result may not reflect some facts due to the method of classification of the original dataset. Nevertheless, we see higher percentage of late aircraft delay in July and August and higher carrier delay and extreme weather in December.

Section 5: Will Holiday Affect Flight Departure Delay?

We see from the last section that July, August and December have higher flight departure delay rates than other months. Then we have the assumption that increasing volumes of flights in summer holiday may contribute to the high delay rates in July and August. Hence, in this section, we want to explore whether holiday will affect flight departure flight. Instead of choosing time period in July or August as our sample test dataset(since the dates of summer holiday varies from schools and companies), we choose flight data from December 1st to December 30th as the sample set as Christmas is a national-wide holiday in US.

We first conduct some data wrangling work to get the Christmas dataset in correct format. Then we plot the flight departure delay rate by the dates in December to see if there is a trend over time.

Christmas_time <- subset(Dec_2017, DAY_OF_MONTH == 1:30)

#delete the X column
Christmas_time <- select(Christmas_time, -X)

#get flight delay information for each data in December
Christmas_del <- Christmas_time %>%
                 group_by(DEP_DEL15,DAY_OF_MONTH) %>%
                 summarise(count = n())

#delete NA
Christmas_del <- na.omit(Christmas_del)

#transform the dataset to wide format for calculating the percentages
wide_Christmas_del <- Christmas_del %>% spread(key = DEP_DEL15, value = count)

colnames(wide_Christmas_del)[colnames(wide_Christmas_del)== 0 ] <- "On_Time"
colnames(wide_Christmas_del)[colnames(wide_Christmas_del)== 1 ] <- "Delayed"

#add a column with the percentages of flights on time and delayed
wide_Christmas_del <- wide_Christmas_del %>% 
  mutate(percent_delay = Delayed/(On_Time+Delayed))

ggplot(data = wide_Christmas_del , aes(DAY_OF_MONTH)) + 
  geom_line(aes(y = percent_delay,group = 1), size=1.2, color = "#FF8C00") + 

  geom_point(aes(y = percent_delay,group = 1),color = "#FF8C00", size = 3) +

  geom_text(aes(y = percent_delay, label=percent(percent_delay)),
            hjust=0, vjust=-0.5, color = "#424242", group = 1)+

  scale_y_continuous(labels = percent) +
  labs(colour='', 
       title="Percentage of Flight Departure Delay by Dates in December",
       x="Date in December",
       y="Percentage of Departure Delay",
       subtitle = "December 2017, USA") + 
  theme(axis.title=element_text(size = 12), 
        axis.text=element_text(size = 12),
        legend.text=element_text(size = 12),
        plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 14))

The chart below shows how the flight departure delay rate changes by the dates in December.

In general, we see that the flight departure delay rates increase as the dates are approaching Christmas, though there are some spikes on several specific dates. The delay rate reaches the highest value on December 30, 2017 in this dataset. Dates between December 20th and 30th have relatively high percentages of flight departure delay. Meanwhile, January 1st is also the New Year Holiday. Hence, it is reasonable to conclude that holidays have influence on flight departure delay according to the sample dataset.

Section 6: Flight Departure Delay Rate by Air Carriers

The last section of Part 1 will explore how different airlines perform on flights on-time vs. delayed. We will create a bar chart to compare the departure delay rate between the 12 airlines in the dataset. The codes below shows the data manipulation and chart plotting.

#get flight delay information for all air carriers
del_airlines <- total_2017 %>%
  group_by(DEP_DEL15,UNIQUE_CARRIER) %>%
  summarise(count = n())

#delete NA
del_airlines <- na.omit(del_airlines)

#transform the dataset to wide format for calculating the percentages
wide_del_airlines <- del_airlines %>% spread(key = DEP_DEL15, value = count)

colnames(wide_del_airlines)[colnames(wide_del_airlines)== 0 ] <- "On_Time"
colnames(wide_del_airlines)[colnames(wide_del_airlines)== 1 ] <- "Delayed"

#add a column with the percentages of flights on time and delayed
wide_del_airlines <- wide_del_airlines %>% 
  mutate(percent_delay = Delayed/(On_Time+Delayed)) %>%
  arrange(desc(percent_delay))

#plot bar chart
ggplot(wide_del_airlines, aes(x= reorder(UNIQUE_CARRIER, percent_delay), y = percent_delay, fill = percent_delay)) +
  geom_bar(stat = "identity") +
  geom_text(data = wide_del_airlines,
            aes(label=percent(percent_delay)),
            hjust= 0.5, vjust =1, 
            color = "#424242", size=4) +
  scale_y_continuous(labels = percent)+
  ylab("Percentage Departure Delay") +
  xlab("Air Carriers") +
  #coord_flip() +
  labs(title = "Percentage of Flight Departure Delay by Air Carriers",
       subtitle = "July 2017 - December 2017, USA")+
  scale_fill_gradient(low = "#FFE082", high= "#FFAB91", name="% of Flight \nDeparture Delay",
                      breaks = 0.05*1:5, labels = percent(0.05*1:5)) +
  theme(axis.text.x = element_text(size=12,color="black"),
        axis.text.y = element_text(size = 12,color="black"),
        axis.title.x = element_text(size = 12, color="black"),
        axis.title.y = element_text(size = 12, color="black"),
        plot.title = element_text(color="black", size=18, hjust =0.5, face ="bold"),
        plot.subtitle = element_text(color="black", size=14, hjust =0.5))

Here is the bar chart showing the percentage of flight departure delay by different air carriers from July 2017 to December 2017.

For the reference of the abbreviations of the 12 air carriers:

  • Alaska Airlines (AS)

  • American Airlines (AA)

  • Delta Air Lines (DL)

  • ExpressJet Airlines (EV)

  • Frontier Airlines (F9)

  • Hawaiian Airlines (HA)

  • JetBlue Airways (B6)

  • SkyWest Airlines (OO)

  • Southwest Airlines (WN)

  • Spirit Airlines (NK)

  • United Airlines (UA)

  • Virgin America (VX)

Now we see that while Jetblue has the highest departure delay rate from July 2017 to December 2017, Hawaiian Airlines has the lowest. Generally, low-cost airlines have relatively higher departure delay rate than other airlines. United Airlines and American Airlines have very similar delay rate, whereas Delta has the departure delay rate 3% less than the other two major airlines in US.

Part 2: Flight Departure & Arrival Delay Prediction

In this part, we conduct three logistic regression models for predicting flight departure and arrival delay. One logistic regression model is to forecast the flight departure delay and the other two models will generate flight arrival delay prediction. The main idea of these predictions is to generate a binary prediction of whether a flight will experience departure or arrival delay by using limited historical information related to flight departure/arrival delay. The accuracy and efficiency of all models will be assessed by confusion matrix and ROC curve.

Moreover, in order to imporve the accuracy and efficiency of the regression models, we select top 20 popular flight routes in US and use their flight inforamtion to generate both departure and arrival delay predictions. Since we will predict the flight delay of the test data based on the model built by the training set, the levels of variables in both datasets have to be the same to ensure the test set prediction. That is, if we use flights’ origin airport as one of the predictor for flight departure flight, origin airports in the test set have to be matched with those in the the training set. The same reasoning applies to the destination airports for flight arrival delay. Thus, we select top 20 popular flight routes in US to ensure both of the flights’ origin and destination will be the same in the training set and the test set for both departure and arrival delay prediction.

Section 1: Flight Departure Delay Prediction

In this section, we will perform the flight departure delay prediction. We will evaluate the performance of the preditive model by applying the model to one training set and one test set. The training set contains flight on-time performance data from August 1st to August 7th (the first week of August 2017), whereas the test set is the flight on-time performance data on August 8th (the next day after the first week of August 2017). Hence, this project uses one-week historical data to generate one-day flight departure delay prediction.

We first select the top 20 popular flight routes in the original dataset.

#find the popular flight routes
pop_route <- total_2017 %>%
  group_by(ORIGIN,DEST) %>%
  summarise(count= n()) %>%
  arrange(desc(count))

#select the top 20 popular flight routes
pop_route_20 <- pop_route[1:20,]

#match top 20 popular flight routes with their flights delay information
pop_route_match <- total_2017
pop_route <- (pop_route_match$ORIGIN %in% pop_route_20$ORIGIN)
pop_route_match <- pop_route_match[pop_route,]

Then we prepare the training set and test set for both flight departure and arrival prediction. Different variables for different predictions(departure/arrival) will be selected later from the training and test set.

##### Prepare the training dataset
#select the first week of August
pop_route_Aug <- filter(pop_route_match, MONTH == 8)
pop_route_Aug <- filter(pop_route_Aug,  DAY_OF_MONTH == 1 |
                                        DAY_OF_MONTH == 2 |
                                        DAY_OF_MONTH == 3 |
                                        DAY_OF_MONTH == 4 |
                                        DAY_OF_MONTH == 5 |
                                        DAY_OF_MONTH == 6 |
                                        DAY_OF_MONTH == 7 )
levels(pop_route_Aug$ORIGIN)
#training set: the dataset of top 20 popular flight routes in the first week of August
training_set <- pop_route_Aug

#select variables needed for the regression
training_set <- select(training_set, DAY_OF_WEEK, UNIQUE_CARRIER, ORIGIN, DEST, 
                       DEP_TIME_BLK,ARR_TIME_BLK ,DEP_DEL15,ARR_DEL15)

training_set$DEP_TIME_BLK <- substr(training_set$DEP_TIME_BLK, 0, 2)
training_set$ARR_TIME_BLK <- substr(training_set$ARR_TIME_BLK, 0, 2)

#see the class of all variables in the training dataset
unlist(lapply(training_set, class))

#change the class of all virables to factor
training_set$UNIQUE_CARRIER <- as.factor(training_set$UNIQUE_CARRIER)
training_set$DAY_OF_WEEK <- as.factor(training_set$DAY_OF_WEEK )
training_set$ORIGIN <- as.factor(training_set$ORIGIN)
training_set$DEST <- as.factor(training_set$DEST)
training_set$DEP_TIME_BLK <- as.factor(training_set$DEP_TIME_BLK)
training_set$ARR_TIME_BLK <- as.factor(training_set$ARR_TIME_BLK)
training_set$DEP_DEL15 <-as.factor(training_set$DEP_DEL15)
training_set$ARR_DEL15 <-as.factor(training_set$ARR_DEL15)

#delete NA data 
training_set <- na.omit(training_set)

###### Prepare the test data set
#select the day after the first week of August
pop_route_Aug2 <- filter(pop_route_match, MONTH == 8, DAY_OF_MONTH == 8)

#training set: the dataset of top 20 popular flight routes in the first week of August
test_set <- pop_route_Aug2

#select variables needed for the regression
test_set <- select(test_set, DAY_OF_WEEK, UNIQUE_CARRIER, ORIGIN, DEST, 
                       DEP_TIME_BLK,ARR_TIME_BLK ,DEP_DEL15,ARR_DEL15)

test_set$DEP_TIME_BLK <- substr(test_set$DEP_TIME_BLK, 0, 2)
test_set$ARR_TIME_BLK <- substr(test_set$ARR_TIME_BLK, 0, 2)

#see the class of all variables in the training dataset
unlist(lapply(test_set, class))

#change the class of all virables to factor
test_set$UNIQUE_CARRIER <- as.factor(test_set$UNIQUE_CARRIER)
test_set$DAY_OF_WEEK <- as.factor(test_set$DAY_OF_WEEK )
test_set$ORIGIN <- as.factor(test_set$ORIGIN)
test_set$DEST <- as.factor(test_set$DEST)
test_set$DEP_TIME_BLK <- as.factor(test_set$DEP_TIME_BLK)
test_set$ARR_TIME_BLK <- as.factor(test_set$ARR_TIME_BLK)
test_set$DEP_DEL15 <-as.factor(test_set$DEP_DEL15)
test_set$ARR_DEL15 <-as.factor(test_set$ARR_DEL15)

#delete NA data 
test_set <- na.omit(test_set)

Now we are ready to perform the flight departure delay prediction. The predictors we choose for this predition are Day of Week, Air Carrier, Origin Airport, and Departure Time Block(Hourly Interval).

reg_dep <- glm(formula = DEP_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER + ORIGIN + DEP_TIME_BLK, data = training_set, family = "binomial")

Here is the summary of the regression result.

In-Sample Performance of the Departure Delay Prediction

We see that most of the predictors are significant, as marked by “*“. However, the summary of the regression does not provide any information of how accurate the model predicts for the flight departure delay using the training dataset. Hence, we evaluate the in-sample performance by confusion matrix and ROC curve.

Furthermore, we also need to determine the cut-off value for classifying the predicted values of flight departure delay, since the predicted values generated by the regression model are not binary. That is, the predicted values are not classified as 0 or 1; they are all real numbers between 0 to 1, such as “0.1356” or “0.8572”. A cut-off value can be a threshold to determine which values are classified as 0 and which are 1.

More importantly, the cut-off value needs to maximize both true postive rates(sensitivity) and true negative rate(specificity) of the model. Therefore, we create a function to determine the optimal cut-off value for the regression model.

#create a variable of the predicted values 
fit_dep <- reg_dep$fitted
hist(fit_dep)
#create a data frame of in-sample reference value and predicted value
in_sample_dep <- data.frame(labels = training_set$DEP_DEL15, predictions = fit_dep)

#get the performance of the in-sample prediction
pred.in_sample_dep  <- prediction(in_sample_dep$predictions, in_sample_dep$labels)
perf.in_sample_dep = performance(pred.in_sample_dep, measure = "tpr", x.measure="fpr")

#This function generates the optimal cutoff value that maximizes the sensitivity and specificity simutaneously
opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y-1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(perf.in_sample_dep, pred.in_sample_dep))

The optimal cut-off value generate by this function is 0.501263, so we classify the predicted value of flight departure delay to 0 or 1 according to the optimal cut-off value. Then we get the result from the confusion matrix.

#classify the predicted values below the cutoff values to 0, above to 1
in_sample_dep$predClass  = ifelse(in_sample_dep$predictions >  0.501263,1,0)

# Confusion matrix 
confusionMatrix(in_sample_dep$labels ,in_sample_dep$predClass)

Here is the result of the confusion matrix.

We see that the model of flight departure delay prediction has a general accuracy of 74.51%, with a sensitivity of 75.97% and specificity of 64.24%. That is, the model predicts 75.95% correct for flight delayed(true postives) and 64.24% correct for flight on-time(true negatives). We also create a ROC curve plot to visualize the accuracy of the model.

#add a column of original delay values with numeric class
in_sample_dep <- in_sample_dep %>%
  mutate(labels_numeric = ifelse(in_sample_dep$labels == "1" ,1,0) )

#plot the ROC curve
ggplot(in_sample_dep, aes(d = labels_numeric, m = predictions)) + 
  geom_roc(n.cuts = 60, labels = FALSE, color = "#FF8C00") + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve of In-Sample Depature Delay Prediction")
  theme(plot.title = element_text(color="black", size=16, hjust =0.5))

The area under the ROC curve represents the accuracy of the model prediction.

We also create a Leaflet map that shows the spatial distribution of model residuals. Since there is no geospatial information of the origin airports, we also use “googleway” package in R to geocode the airports’ locations. The R documentation of “googleway” package can be accessed here.

We first conduct some data wrangling to get the percentage of prediction errors for each origin airport. The percentage of error equals to the correct predictions over total predictions.

#create a column of predicted values in the training dataset 
res_training <- training_set %>%
  mutate(reg_dep_predict = reg_dep$fitted)

#classify the predicted values below the cutoff values to 0, above to 1
res_training$predClass  = ifelse(res_training$reg_dep_predict > 0.5,1,0)

#add a column of original delay values with numeric class
res_training <- res_training %>%
  mutate(dep_del_numeric = ifelse(res_training$DEP_DEL15 == "1" ,1,0) )

#create a variable of errors for each flight prediction
res_training$error =  res_training$dep_del_numeric - res_training$predClass

#calculate the percentage of error for each origin airport
geo_airports <- res_training %>%
  group_by(ORIGIN,error) %>%
  summarise(count = n())

#transform the dataset to tall format
tall_airports <- spread(geo_airports, key = error, value= count)

#set NA to 0
tall_airports[is.na(tall_airports)] <- 0

#create a column of total error(both false positive and false negative)
tall_airports$total_error = tall_airports$`1`+tall_airports$`-1`

#create a column of percent error for each airport
tall_airports$percent_error = tall_airports$total_error / tall_airports$`0`

Then we use “googleway” to get the longitude and latitude of the origin airports and then join the geospatial information of the airports to the percentage of prediction errors of the airports.

#create a new data frame to store the geocoded address and the error of each flight
geoCode_training <- data.frame() 

geoAddr <- google_geocode(address = "JFK Airport, USA",
                          key = "AIzaSyDpkn5sgr-4KXC0x4Pq2m5FMK1CmCUmyV8",
                          simplify = TRUE)
geoAddr$results$geometry$location

#select the column of airports name only
geocode_airport <- tall_airports[,c(1)]

#loop through all buildings
for (i in 1:nrow(geocode_airport)){
  
  thiscode <- data.frame()
  
  airports <- geocode_airport[i, "ORIGIN"]
  
  #give a correct format of address for geocoding
  codedAddr <- paste0(airports, " Airport, USA")
  
  #geocode using google_geocode
  geoAddr <- google_geocode(address = codedAddr,
                            key = "AIzaSyDpkn5sgr-4KXC0x4Pq2m5FMK1CmCUmyV8",
                            simplify = TRUE)
  
  geoAddr$results$geometry$location
  
  
  thiscode <- data.frame(airports, geoAddr$results$geometry$location)
  geoCode_training <- rbind(geoCode_training,thiscode)
}

#change the column name in geocoded airport data frame for joining the percent error data frame
colnames(geoCode_training)[colnames(geoCode_training)=="Airports"] <- "ORIGIN"

#join the geocoded airport data frame to the percent error data frame for each airport
tall_airports_join <- left_join(tall_airports, geoCode_training, by = "ORIGIN")

We also create an interactive Leaflet map to visualize the percent error of the flight departure delay prediction. The description of “Leaflet” package can be found in the section 2 of Part 1. The interateve map can be accessed here.

The codes below shows how to create the Leaflet map.

#Plot the percent error on a Leaflet map

#set the color palette for the points on the map
paletteContinuous <- colorNumeric(palette = "viridis", domain = tall_airports_join$percent_error)

#set the scale and color for the legend
palbin = colorBin('viridis', tall_airports_join$percent_error, bins = c(0, .1, .2, .3, .4, .5))

library(leaflet)
#Leaflet
leaflet(tall_airports_join) %>%
  #addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
  addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
  addCircleMarkers(data=tall_airports_join,
                   lng = ~lng,
                   lat = ~lat,
                   radius = ~(percent_error)*50,
                   fillOpacity = 0.8,
                   fillColor = ~paletteContinuous(percent_error),
                   color = "white",
                   #opacity = 1,
                   stroke=T,
                   #stroke=T,
                   weight = 1,
                   popup= ~paste0("Percent Error: ",percent(percent_error)),
                   label = ~ORIGIN) %>%
  addLegend(pal = palbin, 
            values = ~percent_error, 
            position = "bottomright", 
            title = "Percent Error",
            labFormat = labelFormat(prefix = '', suffix = '%', between = ' - ',
                                    transform = function(percent_error) 100 * percent_error),
            opacity = 1)

There are a few pages showing the interface of the map. Below is the initial view of the map and there are 12 airports in total for plotting the percent error of flight departure delay prediction.

Honolulu International Airport(HNL) and Kahului Airport (OGG) in Hawaii have the least percent error. This indicates the model performs well with flight departures at these airports.

Most airports have percent error between 20% to 35%, such as LAX, DEN and ATL.

Airports in New York City like LGA and JFK have high percent errors, suggesting the model predicts badly of the flights departed at these airports.

In-Sample Performance of the Departure Delay Prediction

Now let’s look at how the model performs for the test set departure delay prediction. We first use the “predict” function to get the predicted values of the test set departure on-time performance using the regression model we build before. Then we apply the same methodology of the in-sample performance evaluation to assess the test set prediction performance.

#Use the departure delay prediction model to forcast the departure delay of the test set
test_dep_predict <- predict(reg_dep, test_set, type="response")

#create a data frame of out-sample(test set) reference value and predicted value
out_sample_dep <- data.frame(labels = test_set$DEP_DEL15, predictions = test_dep_predict)

#get the performance of the in-sample prediction
pred.out_sample_dep  <- prediction(out_sample_dep$predictions, out_sample_dep$labels)
perf.out_sample_dep = performance(pred.out_sample_dep, measure = "tpr", x.measure="fpr")

Function to generate the optimal cut-off value.

#This function generates the optimal cutoff value that maximizes the sensitivity and specificity simutaneously
opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y-1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(perf.out_sample_dep, pred.out_sample_dep))

The optimal cut-off value for the test set prediction is 0.1691754. That is, if the predicted value of the test test is greater than 0.1691754, then we classify it to 1(delayed). If the predicted value is smaller than 0.1691754, then it is classified as 0 (on-time). After classifying all predicted values, we apply the confusion matrix to the test set prediction.

#classify the predicted values below the cutoff values to 0, above to 1
out_sample_dep$predClass  = ifelse(out_sample_dep$predictions > 0.1691754,1,0)

# Confusion matrix 
confusionMatrix(out_sample_dep$labels ,out_sample_dep$predClass)

Here is the confusion matrix of the departure delay prediction on the test set.

While the model has 74.51% accuracy of prediction with the training set, it only performs 62.15% accuracy on the test set overall. However, the (sensitivity) true positive rate of the test set performance is higher than that of the training set. This suggests that the model predicts more actual flight delayed correct in the test set than in the training set. However, the model loses more accuracy when predicting the flight on-time for the test set.

ROC curve below visulaizes the overall accuracy of the model predicting flight departure delay.

Section 2: Flight Arrival Delay Prediction

In this section, we perform two different logistic regression models for flight arrival delay prediction. The major difference of these two models is that one includes the indicator of flight departure delay as a predictor in the model whereas the other excludes this factor. Both models of flight arrival predictions will be evaluated by their performace of predicting the test set arrival delays. We will also use both confusion matrix and ROC curve to compare the performance of these two regression models. The main purpose of flight arrival prediction is to better understand how significantly the late aircrafts can affect flight delay.

Regression Model Excluding the Indicator of Departure Delay

The predictors of this model are Day of Week, Air Carrier, Origin Airport, Destination Airport and Flight Departure Time Block(hourly interval).

#Regression: Arrival Delay Prediction including the factor of Departure Delay 
reg_arr_noDEP <- glm(formula = ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER + ORIGIN + DEST + DEP_TIME_BLK, 
                     data = training_set, family = "binomial")

Then we apply the same methodology shown in last section to see how this model performs for flight arrival delay.

#Use the departure delay prediction model to forcast the departure delay of the test set
test_arr_predict <- predict(reg_arr_noDEP, test_set, type="response")


#create a data frame of out-sample(test set) reference value and predicted value
out_sample_arr <- data.frame(labels = test_set$ARR_DEL15, predictions = test_arr_predict)

#get the performance of the in-sample prediction
pred.out_sample_arr  <- prediction(out_sample_arr$predictions, out_sample_arr$labels)
perf.out_sample_arr = performance(pred.out_sample_arr, measure = "tpr", x.measure="fpr")

#This function generates the optimal cutoff value that maximizes the sensitivity and specificity simutaneously
opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y-1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(perf.out_sample_arr, pred.out_sample_arr))

#classify the predicted values below the cutoff values to 0, above to 1
out_sample_arr$predClass  = ifelse(out_sample_arr$predictions >  0.1540963,1,0)

# Confusion matrix 
confusionMatrix(out_sample_arr$labels ,out_sample_arr$predClass)

This function suggests the optimal cut-off value for this model is 0.1540963 and the corresponding confusion matrix is shown as below.

The model without the indicator of flight departure delay only performs 59.92% accuracy of the flight arrival predcition. While the model has 86.78% accuracy of predicting the true positive rates(flights delayed), the model performs poorly for predicting the on-time flights as it only predicts 28.43% correct of the true negatives.

Regression Model Including the Indicator of Departure Delay

Now we build a model including the indicator of flight departure delay. The predictors of this model are Day of Week, Air Carrier, Origin Airport, Destination Airport, Flight Departure Time Block(hourly interval), and the indicator of Flight Departure Delay.

reg_arr_DEP <- glm(formula = ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER + ORIGIN + DEST + DEP_TIME_BLK + DEP_DEL15, data = training_set, family = "binomial")

Then we use the function we create before to generate the cut-off value for getting the confusion matrix.

#Use the departure delay prediction model to forcast the departure delay of the test set
test_arr_predict2 <- predict(reg_arr_DEP, test_set, type="response")

#create a data frame of out-sample(test set) reference value and predicted value
out_sample_arr2 <- data.frame(labels = test_set$ARR_DEL15, predictions = test_arr_predict2)

#get the performance of the in-sample prediction
pred.out_sample_arr2  <- prediction(out_sample_arr2$predictions, out_sample_arr$labels)
perf.out_sample_arr2 = performance(pred.out_sample_arr2,measure = "tpr", x.measure="fpr")

#This function generates the optimal cutoff value that maximizes the sensitivity and specificity simutaneously
opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y-1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(perf.out_sample_arr, pred.out_sample_arr))

#classify the predicted values below the cutoff values to 0, above to 1
out_sample_arr2$predClass  = ifelse(out_sample_arr2$predictions > 0.1351520,1,0)

# Confusion matrix 
confusionMatrix(out_sample_arr2$labels ,out_sample_arr2$predClass)

The optimal cut-off value for this model is 0.1351520 and the corresponding confusion matrix is shown as below.

We see that the model has an overall accurary of 88.53% when it includes the indicator of flight departure delay. In addition, the true positive rate of this model is quite high, which is 94.73%, but the model still loses some accuracy of predicting the on-time flights.

Finally, we create this ROC plot to visualize the difference in prediction efficiency between the two models predicting flight arrival delays.

#create a data frame containing the refence arrival delay information and predicted values of the two models
bothModels <- rbind(
  data.frame(
    Reference = test_set$ARR_DEL15,
    Probs = test_arr_predict,
    Model = "Model 1"
  ),
  data.frame(           
    Reference = test_set$ARR_DEL15,
    Probs = test_arr_predict2,
    Model = "Model 2"
  )
)          

#add a column of original delay values with numeric class
bothModels <- bothModels %>%
  mutate(Reference_numeric = ifelse(bothModels$Reference == "1" ,1,0) )

#plot the ROC curve for the two models
ggplot(bothModels, aes(d = Reference_numeric, m = Probs, color=Model)) + 
  geom_roc(n.cuts = 50, labels = FALSE) + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') 

Here is the plot of the ROC curves of the two models. Model 1 is the regression model that excludes the indicator of departure delay as a predictor, whereas model 2 is the regression model that includes this indicator. The area under the ROC curve represents the prediction accuracy of the models.

Discussion

The scope of this project tends to be holistic in analyzing the spatio-temporal patterns of departure flight delays. Although this project covers many topics, there are still a lot of aspects and details of flight departure delay can be analyzed and researched further. In addition, since this project mainly focuses on the country scale for flight delay analysis, some regional variations and insights may not be revealed through the project analysis. Nevertheless, the data analytics and visualization techniques used in this project can be applied to a wide range of topics and scales. More detailed regional flight delay analysis will be expected in future work.

Furthermore, the model of flight departure/arrival delay prediction can also be improved in future study. For example, adding the weather factor as a predictor in the model may have some influence on the model results. Moreover, considering the data science aspect of the project, other model algorithms such as random forest or weighted logistic regression can also be performed in an attempt to compare different model predictions. Future work and study will help enhance the project analysis.