This is my first analysis of a data set using RStudio to produce informative visualizations. It was done as part of my “Data Visualization” class at Loyola University Maryland. I hope you enjoy the graphs I put together and find them to be a useful analysis of flight delays.
The data set that I chose to work with for this assignment is a Flight Delays data set. This data set covers about the first quarter (first 2 and a half months) of the year 2018. Even though this is a small fraction of the year, there were still over one million entries in this data set that I was able to work with.
The main columns that I worked with in this data set were “DAY_OF_WEEK”, “AIRLINE”, “FLIGHT_NUMBER”, “ORIGIN_AIRPORT”, “DESTINATION_AIRPORT”, “DEPARTURE_TIME”, “DEPARTURE_DELAY”, and “ARRIVAL_DELAY”. Even though the data set is called “Flight Delays”, not every flight in it could be considered “delayed”, depending on how you interpret it. For my analysis, I only looked at flights that arrived at their destination airport after their scheduled arrival time. In terms of coding, this means that I only looked at flights that had an ARRIVAL_DELAY > 0. If a flight arrived late, its ARRIVAL_DELAY is a positive number equal to the number of minutes it was delayed (if it arrived early, the number is negative). This data set includes airlines strictly based in the United States, but the airports involved are international.
For my visualizations, I first wanted to see which airlines had the most delays. The airlines with the most delays are not necessarily the worst airlines: the more flights an airline operates, the more delays it is bound to have. Seeing which airlines have the most delays therefore also gives insight into which airlines operate the most flights. I also wanted to see whether there was a trend in which day of the week and which hour, or hours, of the day contained the most delays.
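To illustrate why a raw delay count mostly tracks flight volume, here is a minimal sketch that compares the count of delayed flights with the share of flights delayed. It is not part of the original analysis: the toy data frame `toy_flights` and its values are made up, and only the column names AIRLINE and ARRIVAL_DELAY come from the data set.

```r
library(dplyr)

# Hypothetical toy data, not taken from the real file.
toy_flights <- data.frame(
  AIRLINE       = c("WN", "WN", "WN", "WN", "F9", "F9"),
  ARRIVAL_DELAY = c(12, -5, 30, -2, 8, 15)
)

# For each airline: total flights, delayed flights (ARRIVAL_DELAY > 0),
# and the share of flights that were delayed.
delay_summary <- toy_flights %>%
  group_by(AIRLINE) %>%
  summarise(
    total_flights = n(),
    delays        = sum(ARRIVAL_DELAY > 0, na.rm = TRUE),
    delay_rate    = delays / total_flights,
    .groups       = "drop"
  )

print(delay_summary)
```

In this toy example both airlines have the same number of delays, but the airline with fewer flights has the higher delay rate, which is the distinction a count-only chart cannot show.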
For your reference, the airline abbreviations used in my visualizations translate to:
UA: United Airlines
AA: American Airlines
US: US Airways
F9: Frontier Airlines
B6: JetBlue Airways
OO: Skywest Airlines
AS: Alaska Airlines
NK: Spirit Airlines
WN: Southwest Airlines
DL: Delta Airlines
EV: Atlantic Southeast Airlines
HA: Hawaiian Airlines
MQ: American Eagle Airlines
VX: Virgin America
Here is some general information regarding my findings before you view the individual charts in their respective tabs:
Some of the findings displayed in my graphs could be inferred without a chart (e.g., most delays don’t occur early in the morning), but it is better to see the evidence in a visualization, which can reveal details that cannot be easily inferred.
The findings for each graph will be explained in further detail in their respective tabs.
For the first visualization, I created a bar graph to show the total number of delays for each airline. The airline is visible on the y axis, while the number of delays is visible on the x axis. This is not a complex visualization; its purpose is simply to give an idea of which airlines have their flights delayed the most. The four most delayed airlines according to this graph are Southwest Airlines, Delta Airlines, Atlantic Southeast Airlines, and Skywest Airlines. Airlines that are cheaper and not as popular, such as Frontier and Spirit Airlines, had fewer total delays. Again, a flight counts as delayed for its airline if it arrived after the scheduled arrival time.
# Load the data; adjust the path to wherever Flight_Delays.csv is stored.
setwd("C:/Users/Cal/Documents/R/R_datafiles")
df <- read.csv("Flight_Delays.csv", header = TRUE, sep = ",")
library(dplyr)
library(scales)
library(ggplot2)
library(lubridate)
library(ggthemes)
library(RColorBrewer)
library(plotly)
# Keep only flights that arrived after their scheduled arrival time.
delayed_df <- filter(df, ARRIVAL_DELAY > 0)
# Count delayed flights per airline, largest first.
airlinecount <- data.frame(dplyr::count(delayed_df, AIRLINE))
airlinecount <- airlinecount[order(airlinecount$n, decreasing = TRUE), ]
# Horizontal bar chart: coord_flip() puts the airlines on the vertical axis.
ggplot(airlinecount, aes(x = reorder(AIRLINE, n), y = n)) +
  geom_bar(colour = "white", fill = "red", stat = "identity") +
  geom_text(aes(label = scales::comma(n)), vjust = -0.1, hjust = 0.45, color = 'black') +
  labs(title = "Number of Flight Delays by Airline (Based on Arrival Time)", x = "Airline", y = "Number of Delays") +
  scale_y_continuous(labels = comma) +
  coord_flip() +
  theme(plot.title = element_text(hjust = 0.5))
The second visualization in this analysis is a line plot that looks again at the flight delays for each airline, this time split by day of the week. The days of the week are listed on the x axis, while the number of delays is on the y axis. Each airline has its own colored line. You can once again tell which airlines have the most delays by how high their line sits above the others. What is fascinating about this visualization is that nearly every airline follows the same overall trend across the week. As I expected, there are not many delays on Monday and Tuesday, since it is the beginning of the week. The delays pick up in the middle of the week on Wednesday and Thursday, but they dip significantly on Friday, which I was not expecting. This could mean that Friday is a less chaotic travel day, and a day you might want to plan your flights on in the future. Also as I expected, the delays pick up again on Saturday and Sunday, as the weekend is a popular time for any type of travelling.
# Count delayed flights for each airline on each day of the week.
weekdays_df <- delayed_df %>%
  select(AIRLINE, DAY_OF_WEEK) %>%
  group_by(AIRLINE, DAY_OF_WEEK) %>%
  dplyr::summarise(n = length(DAY_OF_WEEK), .groups = 'keep') %>%
  data.frame()
# Recode the numeric day codes as labels (1 = Sun through 7 = Sat in this analysis).
weekdays_df$DAY_OF_WEEK <- sub("1", "Sun", weekdays_df$DAY_OF_WEEK)
weekdays_df$DAY_OF_WEEK <- sub("2", "Mon", weekdays_df$DAY_OF_WEEK)
weekdays_df$DAY_OF_WEEK <- sub("3", "Tue", weekdays_df$DAY_OF_WEEK)
weekdays_df$DAY_OF_WEEK <- sub("4", "Wed", weekdays_df$DAY_OF_WEEK)
weekdays_df$DAY_OF_WEEK <- sub("5", "Thu", weekdays_df$DAY_OF_WEEK)
weekdays_df$DAY_OF_WEEK <- sub("6", "Fri", weekdays_df$DAY_OF_WEEK)
weekdays_df$DAY_OF_WEEK <- sub("7", "Sat", weekdays_df$DAY_OF_WEEK)
weekdays_df$DAY_OF_WEEK <- as.factor(weekdays_df$DAY_OF_WEEK)
# Order the days Mon to Sun on the x axis instead of alphabetically.
days_order <- factor(weekdays_df$DAY_OF_WEEK, levels = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))
# One line per airline, with a point marking each day's count.
ggplot(weekdays_df, aes(x = days_order, y = n, group = AIRLINE)) +
  geom_line(aes(color = AIRLINE), size = 2) +
  labs(title = "Flight Delays by Airline (For Each Day of the Week)", x = "Day of the Week", y = "Number of Delays") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "white") +
  scale_y_continuous(labels = comma)
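The day-of-week recoding and ordering above could also be sketched in a single step with base R's `factor()`. This is only an illustrative alternative on made-up codes (`day_codes` is hypothetical); the code-to-day mapping is copied from the analysis above and has not been checked against the data set's own documentation.

```r
# Hypothetical day codes; the real values live in weekdays_df$DAY_OF_WEEK.
day_codes <- c(1, 2, 3, 4, 5, 6, 7, 2)

# Mapping copied from the analysis: 1 = Sun, 2 = Mon, ..., 7 = Sat.
# Listing the levels as 2..7 and then 1 makes the axis run Mon to Sun.
day_labels <- factor(
  day_codes,
  levels = c(2, 3, 4, 5, 6, 7, 1),
  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)

print(day_labels)
print(levels(day_labels))
```

Because the result is already an ordered set of factor levels, it could be used directly as the x aesthetic without a separate ordering vector.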
Since we have looked at which weekdays and airlines are responsible for the most delays, the next visualization is a heatmap that looks at the number of delays for each airline, this time broken down by hour of the day. I wanted to see which hours contained the most delays, which could indicate the times when travelling will be the most hectic and chaotic. In this heatmap, the airlines are listed on the x axis, while the hours of the day are listed on the y axis. The heatmap is “filled” with the total delays for each airline for each hour of the day. A useful feature of heatmaps is that the darker boxes represent a higher number of delays, so it is easier for the person viewing the graph to see where the hot zones are. This graph can be read as a whole or one airline at a time, but it is easier to read it as a whole to determine which hours contain the darker boxes, and therefore more flight delays. The next graph will provide an easier way to view the delays by hour for each airline individually. As shown in the graph, the hot zones for a majority of the airlines are in the middle of the day. It is probably more beneficial to plan a flight earlier in the morning, when fewer people are travelling.
# Derive the departure hour by dropping the last two digits of DEPARTURE_TIME.
delayed_df$hour <- trunc(delayed_df$DEPARTURE_TIME/100)
# Total delay (in hours) and number of delayed flights per hour and airline.
flighthour_df <- delayed_df %>%
  select(hour, ARRIVAL_DELAY, AIRLINE) %>%
  group_by(hour, AIRLINE) %>%
  dplyr::summarise(tot_arrival_delay = sum(ARRIVAL_DELAY)/60,
                   delaycount = length(hour), .groups = 'keep') %>%
  data.frame()
# Legend breaks every 1,000 delays; one y-axis tick per hour.
breaks <- c(seq(0, max(flighthour_df$delaycount), by = 1000))
yticks <- c(seq(0, max(flighthour_df$hour), by = 1))
ggplot(flighthour_df, aes(x = AIRLINE, y = hour, fill = delaycount)) +
  geom_tile(color = "black") +
  geom_text(aes(label = (round(delaycount, 0))), size = 3) +
  labs(title = "Heatmap: Flight Delays by Airline by Hour of the Day", x = "Airline", y = "Hour of the Day", fill = "Delay Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_continuous(low = "white", high = "red", breaks = breaks) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black"))) +
  scale_y_continuous(breaks = yticks)
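The `hour` column used in the heatmap code comes from dividing DEPARTURE_TIME by 100 and truncating. The small sketch below shows what that step does, assuming the times are stored as HHMM numbers (an assumption inferred from the division by 100, not confirmed by the data set's documentation); the sample values are made up.

```r
# Hypothetical departure times in HHMM form (5 = 00:05, 2359 = 23:59).
departure_time <- c(5, 759, 1430, 2359, 2400)

# Dividing by 100 moves the minutes behind the decimal point,
# and trunc() then drops them, leaving only the hour.
hour <- trunc(departure_time / 100)

print(hour)
```

Under this assumption, a recorded time of 2400 would land in hour 24 rather than hour 0, which is worth keeping in mind when reading the top row of the heatmap.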
The fourth visualization that I created is known as a “trellis chart”. In this trellis chart I have created a bar graph for each airline that represents the total number of delays for each hour of the day. This visualization shows essentially the same information as the heatmap, but it gives each airline its own graph, so it is more useful if you want to view the hourly delays of an airline individually. The hours of the day are on the x axis this time, while the delay count is on the y axis. Once again, the airlines with the higher bars have the most delays, and the majority of delays fall between the late morning and the evening for each airline. The bar graphs make it easy to see at which hour of the day the delays pick up, and they display the peak hour of delays for each airline. It is almost like looking online at busy times for a restaurant or gym.
# Count delayed flights per departure hour and airline.
airline_df <- delayed_df %>%
  select(hour, AIRLINE) %>%
  group_by(hour, AIRLINE) %>%
  dplyr::summarise(delaycount = length(hour), .groups = 'keep') %>%
  data.frame()
# One x-axis tick per hour.
xticks <- c(seq(0, max(airline_df$hour), by = 1))
# facet_wrap() draws one bar chart per airline in a 2-column grid.
ggplot(airline_df, aes(x = hour, y = delaycount, fill = AIRLINE)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma) +
  labs(title = "Multiple Bar Charts - Total Delays by Hour by Airline",
       x = "Hour of the Day",
       y = "Delay Count",
       fill = "Airline") +
  scale_x_continuous(breaks = xticks) +
  facet_wrap(~AIRLINE, ncol = 2, nrow = 7)
The final visualization in this analysis of the flight delays data set is a donut chart. For this donut chart, my goal was to figure out which times of the day were responsible for the highest number of total flight delays. In order to construct this chart, I grouped the hours in which the delayed flights departed into four times of the day: Early Morning, Late Morning, Afternoon, and Evening. Early Morning covers hours 0 through 6, Late Morning covers hours 7 through 12, Afternoon covers hours 13 through 18, and Evening covers hours 19 through 24. I started out by creating a pie chart with each time of day representing a section of the pie. The pie chart was an effective analysis and provided the percentage of the total flight delays that each time of day represented. I took it one step further and turned the pie chart into a donut chart, so that the total number of flight delays, of which each time of day is a fraction, is visible in the middle of the visualization. This donut chart demonstrates that the Late Morning and Afternoon are responsible for almost 75% of the total delays in this analysis. This confirms the conclusions drawn from the previous two graphs.
This graph was created with the “plotly” library in RStudio, which makes the graph highly interactive for the user. If you hover over a section of the chart, it shows which time of day that slice represents, the total number of delays for that slice, and the percentage of the whole that the slice represents. This visualization is also downloadable as a png file.
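The four time-of-day groups described above can also be sketched with base R's `cut()`, which makes the bin edges explicit. This is an illustrative alternative on made-up hours (`hour` here is a hypothetical vector), not the code used for the chart; the edges mirror the "hour <= 6", "hour <= 12", "hour <= 18" tests used in the analysis.

```r
# Hypothetical departure hours, two per time-of-day group.
hour <- c(0, 6, 7, 12, 13, 18, 19, 23)

# cut() uses right-closed intervals by default, so the bins are
# (-Inf, 6], (6, 12], (12, 18], (18, Inf].
time_of_day <- cut(
  hour,
  breaks = c(-Inf, 6, 12, 18, Inf),
  labels = c("Early Morning", "Late Morning", "Afternoon", "Evening")
)

# Count per group and share of the overall total, in percent.
counts <- table(time_of_day)
percent_of_total <- round(100 * counts / sum(counts), 1)

print(counts)
print(percent_of_total)
```

Note that the percentages are computed against the sum over all groups, which is what a pie or donut chart displays for each slice.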
# Bin each delayed flight's departure hour into a time of day and count the flights per bin.
flighttypes_df <- delayed_df %>%
  select(hour) %>%
  dplyr::mutate(flighttype = ifelse(hour <= 6, "Early Morning",
                             ifelse(hour <= 12, "Late Morning",
                             ifelse(hour <= 18, "Afternoon", "Evening")))) %>%
  group_by(flighttype) %>%
  dplyr::summarise(n = length(flighttype), .groups = 'drop') %>%
  # The grouping is dropped above, so sum(n) is the overall total and not each group's own count.
  mutate(percent_of_total = round(n*100/sum(n), 1)) %>%
  data.frame()
# Donut chart: a pie with a hole, with the overall total annotated in the center.
plot_ly(flighttypes_df, labels = ~flighttype, values = ~n) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Total Delays of Flights by Time of Day") %>%
  layout(annotations = list(text = paste0("Total Flight Delay Count: \n",
                                          scales::comma(sum(flighttypes_df$n))),
                            showarrow = FALSE))
We are finished looking at the charts. Here are some general takeaways from my output: