The following analysis assesses the data in flight delays during 2015 for flights in the USA. This data is publicly available here: https://www.kaggle.com/usdot/flight-delays#airports.csv. The information was published by the U.S. Department of Transportation’s Bureau of Transportation Statistics to track the on-time performance of domestic flights operated by large carriers. I took this opportunity to understand the data and figure out what could be the culprits of flight delays in 2015.
After reviewing the intial items in the data, I decided on a couple of variables to focus on to include in this analysis.
##Preprocessing Data Below is code to load libraries and merge data
#Set libraries
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
library(scales)
library(ggplot2)
library(RColorBrewer)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ tidyr 1.0.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ dplyr::between() masks data.table::between()
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ lubridate::date() masks base::date()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ lubridate::hour() masks data.table::hour()
## ✖ lubridate::intersect() masks base::intersect()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ lubridate::mday() masks data.table::mday()
## ✖ lubridate::minute() masks data.table::minute()
## ✖ lubridate::month() masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second() masks data.table::second()
## ✖ lubridate::setdiff() masks base::setdiff()
## ✖ purrr::transpose() masks data.table::transpose()
## ✖ lubridate::union() masks base::union()
## ✖ lubridate::wday() masks data.table::wday()
## ✖ lubridate::week() masks data.table::week()
## ✖ lubridate::yday() masks data.table::yday()
## ✖ lubridate::year() masks data.table::year()
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
## dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
## Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/modules/R_X11.so
## Reason: image not found
## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/
## Resources/library/tcltk/libs//tcltk.so'' had status 1
## Could not load tcltk. Will use slower R code instead.
## Loading required package: RSQLite
#Set files
getwd()
## [1] "/Users/erhoades"
filename <- "flight delays.csv"
airlines <- "airlines.csv"
airports <- "airports.csv"
#Read in files
df <- fread(filename, na.strings=c("",NA))
df_airlines <- fread(airlines)
df_airports <- fread(airports)
#Rename variables so that merging can take place between files
df2 <- df %>%
rename(
AIRLINE_CODE = AIRLINE,
AIRPORT_ORIGIN_CODE = ORIGIN_AIRPORT,
AIRPORT_DEST_CODE = DESTINATION_AIRPORT
)
#Rename variables so that merging can take place between files
df3 <- df_airlines %>%
rename(
AIRLINE_CODE = IATA_CODE,
AIRLINE_NAME = AIRLINE
)
#Check to make sure columns properly renamed
colnames(df3)
## [1] "AIRLINE_CODE" "AIRLINE_NAME"
#merge df2 and df3
df4 <- merge(x = df2, y = df3, by = "AIRLINE_CODE", all.x = TRUE)
#Rename variables so that merging can take place between files
df5 <- df_airports %>%
rename(
AIRPORT_ORIGIN_CODE = IATA_CODE,
AIRPORT_ORIGIN_NAME = AIRPORT,
AIRPORT_ORIGIN_CITY = CITY,
AIRPORT_ORIGIN_STATE = STATE,
AIRPORT_ORIGIN_COUNTRY = COUNTRY,
AIRPORT_ORIGIN_LAT = LATITUDE,
AIRPORT_ORIGIN_LONG = LONGITUDE
)
#merge df4 and df5
df6 <- merge(x = df4, y = df5, by = "AIRPORT_ORIGIN_CODE", all.x = TRUE)
#Rename variables so that merging can take place between files
df7 <- df_airports %>%
rename(
AIRPORT_DEST_CODE = IATA_CODE,
AIRPORT_DEST_NAME = AIRPORT,
AIRPORT_DEST_CITY = CITY,
AIRPORT_DEST_STATE = STATE,
AIRPORT_DEST_COUNTRY = COUNTRY,
AIRPORT_DEST_LAT = LATITUDE,
AIRPORT_DEST_LONG = LONGITUDE
)
#merge df6 and df7
df8 <- merge(x = df6, y = df7, by = "AIRPORT_DEST_CODE", all.x = TRUE)
#Include only relevant columns for analysis
df9 <- df8 %>% select(MONTH, DEPARTURE_DELAY, AIRLINE_NAME, AIRPORT_ORIGIN_NAME)
Below is a snapshot of the different variables included in the data that will be used for the analysis below.
summary(df9)
## MONTH DEPARTURE_DELAY AIRLINE_NAME AIRPORT_ORIGIN_NAME
## Min. : 1.000 Min. : -82.00 Length:5819079 Length:5819079
## 1st Qu.: 4.000 1st Qu.: -5.00 Class :character Class :character
## Median : 7.000 Median : -2.00 Mode :character Mode :character
## Mean : 6.524 Mean : 9.37
## 3rd Qu.: 9.000 3rd Qu.: 7.00
## Max. :12.000 Max. :1988.00
## NA's :86153
Below is code to create a summary dataframe by Departure Delay and Airline Name
#Create summary dataframe of Departure Delay, Airline Name
df_delay <- df9 %>% select(DEPARTURE_DELAY, AIRLINE_NAME) %>%
filter(!is.na(DEPARTURE_DELAY) & !is.na(AIRLINE_NAME)) %>%
group_by(AIRLINE_NAME) %>%
summarise(n = length(AIRLINE_NAME), total = sum(DEPARTURE_DELAY), avrdelay = mean(DEPARTURE_DELAY)) %>%
arrange(desc(n)) %>%
data.frame() %>%
top_n(20, wt=n)
The histogram below shows the airlines with the highest total of delays. This is a summary of Delay Time for the year in total by descending order from the top airline down. As you can see Southwest Airlines has the most delays by a considerable amount. Although they have the highest total of delays, the do not have the highest average, meaning they are running a lot of flights with delays but on average they do not have the largest delay times. That would go to Spirit Air Lines, who has a much lower total amount of delays, but the average a higher delay per flight.
We do not know if this is a total of small delays or large delays so we will look further into the data to get more detail. In the previous graphic of the summary of the data, Departure Delays can have negative values and go from a range of -82 all the way to 1,988.
ggplot(df_delay, aes(x=reorder(AIRLINE_NAME, n), y=n, label=round(avrdelay, digits = 0))) +
geom_bar(stat="identity", fill='red') +
coord_flip() +
geom_text(position=position_dodge(width=0.5), hjust=-0.25) +
labs(title = "Total Departure Delays by Airline",
x = "Airline",
y = "Total Delay",
subtitle = "Average Delay Value shown to the right of bar",
caption = "(based on data from https://www.kaggle.com/usdot/flight-delays#airports.csv)")
#Only include delays that are over 100 to get a better view of the data
df10 <- df9 %>%
filter(DEPARTURE_DELAY>100)
Now we are going to use a box plot to show these same airlines and get a sense for the lengths of delays they are generating. Even though Southwest might have the most in total, if they are relatively small delays, it might not be a significant issue. From our overall population of data, I filtered the data to only use Departure Delays that were greater than 100, as there were some negative values as well as smaller values that don’t provide much insight into which airline carriers are having significant delay issues.
From the boxplots below, we can see that most airlines have generally the same quartile ranges and means. The outliers, which are the dots, are what is interesting, as you can see that American Airlines has the most concentrated outliers especially above 1,000, which is concerning.
ggplot(df10, aes(x=AIRLINE_NAME, y=DEPARTURE_DELAY)) +
geom_boxplot() +
coord_flip() +
labs(title = "Boxplot of Delays by Airline for Delays Greater than 100",
x = "Airline",
y = "Delay Value",
caption = "(based on data from https://www.kaggle.com/usdot/flight-delays#airports.csv)")
Now we will take this information and look at it by month during 2015 to see if there is any seasonilty to the delays by airlines. From the graphs below which run January to December left to right by airline, you can see that again, American Airlines has the most delays by far, but they are really spread all over the year.
I would have expected delays around the winter months when weather can be the worst and cause significant delays but there does not seem to be any explicit trends in the graphs below.
ggplot(df10) +
geom_point(mapping = aes(x=MONTH, y = DEPARTURE_DELAY, color = AIRLINE_NAME)) +
facet_wrap(~AIRLINE_NAME, nrow=2) +
labs(title = "Total Departure Delays by Airline by Month",
x = "Month",
y = "Total Delay",
colour = "Airline Name",
caption = "(based on data from https://www.kaggle.com/usdot/flight-delays#airports.csv)")
## Create a Dataframe with just American Airlines data greater than 100 delay value
#Modify dataframe to only include American Airlines flights without NAs
df11 <- df10 %>%
filter(AIRLINE_NAME=="American Airlines Inc." & !is.na(AIRPORT_ORIGIN_NAME)& !is.na(AIRPORT_ORIGIN_NAME)) %>%
group_by(AIRPORT_ORIGIN_NAME) %>%
summarise(n = length(AIRPORT_ORIGIN_NAME), total = sum(DEPARTURE_DELAY), avrdelay = mean(DEPARTURE_DELAY)) %>%
arrange(desc(n)) %>%
data.frame() %>%
top_n(20, wt=n)
Now, focusing just on American Airlines, let’s see if the Origin Airport might be the cause of issues.
You can see the Dallas/Fort Worth is by far the highest amount. This should be expected since this is a hub for American Airlines and it flies a significant amount of their airplanes from this airport.
The average values were included to show that just because in total it might seem like an issue, if the average delay is not as significant, there might not be an issue. As we look throuh the average values, the airports that do not have a high total of delays are actually showing a higher average. There could be some outliers here that are increasing the average values higher, but when looking at correct average delay times, a solution for American Airlines might be to not fly to some of the airports that it services but does not have a significant amount of flights to and from.
ggplot(df11, aes(x=reorder(AIRPORT_ORIGIN_NAME, n), y=n, label=round(avrdelay, digits = 0))) +
geom_bar(stat="identity", fill="blue") +
coord_flip() +
geom_text(position=position_dodge(width=0.5), hjust=-0.25) +
labs(title = "Total Departure Delays by Origin Airport for American Airlines",
x = "Origin Airport",
y = "Total Delay",
subtitle = "Average Delay Value shown to the right of bar",
caption = "(based on data from https://www.kaggle.com/usdot/flight-delays#airports.csv)")
This exercise was interesting to see the impact that the different variables have on the analysis. Depending on how you look at the data, both visually and based on filtered values, different answers can be achieved to your analysis. I would spend more time digging into the other variables not used in this analysis to see if there are other reasons for issues with American Airlines and what has been done since 2015 to correct these.