library(tidyverse)
library(knitr)
Flight Delays
Introduction
Let’s analyze some flight data!
Our initial input data is unfortunately not in tidy shape, so we’ll start by parsing it into a proper format. Then we’ll perform some exploratory data analysis to get an idea of delays for flights. Our data contains on-time & delayed flight counts across two airlines and five cities on the West Coast.
Reading the initial file
# Read in the data
<- read.csv('assign_5_input_data.csv', na.strings = c(""))
df
kable(df)
X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
NA | delayed | 62 | 12 | 20 | 102 | 305 |
NA | NA | NA | NA | NA | NA | NA |
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
NA | delayed | 117 | 415 | 65 | 129 | 61 |
Our initial data isn’t too bad, but it is certainly not tidy. We’ll address this by dropping the empty row, renaming our columns, and pivoting the dataset longer to “flatten” it.
# Drop empty row, rename columns, fill empty cells, and pivot into long format
<- df |>
df slice(-3) |>
rename("airline" = "X", "flight_status" = "X.1") |>
fill(airline) |>
rename_with(~ str_to_lower(str_replace(.x, "\\.", "_")), .cols = 3:7) |>
pivot_longer(3:7, names_to = "city", values_to = "flight_count") |>
arrange(desc(flight_count))
kable(df)
airline | flight_status | city | flight_count |
---|---|---|---|
AM WEST | on time | phoenix | 4840 |
ALASKA | on time | seattle | 1841 |
AM WEST | on time | los_angeles | 694 |
ALASKA | on time | san_francisco | 503 |
ALASKA | on time | los_angeles | 497 |
AM WEST | delayed | phoenix | 415 |
AM WEST | on time | san_diego | 383 |
AM WEST | on time | san_francisco | 320 |
ALASKA | delayed | seattle | 305 |
ALASKA | on time | phoenix | 221 |
ALASKA | on time | san_diego | 212 |
AM WEST | on time | seattle | 201 |
AM WEST | delayed | san_francisco | 129 |
AM WEST | delayed | los_angeles | 117 |
ALASKA | delayed | san_francisco | 102 |
AM WEST | delayed | san_diego | 65 |
ALASKA | delayed | los_angeles | 62 |
AM WEST | delayed | seattle | 61 |
ALASKA | delayed | san_diego | 20 |
ALASKA | delayed | phoenix | 12 |
We now have a nice, flat, long formatted and tidy data set. We can now do some analysis!
Analysis
Let’s answer some basic questions.
We’ll start with a general overview of our data. Since it is a simple data set, we can visualize most of the features in one plot.
# Set this up so we can factor airline for our factor wrap
|>
df mutate() |>
# Plot data across two facets
ggplot(
aes(
x = flight_count,
y = reorder(
str_to_title( # Make city names pretty
str_replace_all(city, "_", " ")),
flight_count),fill = flight_status)) +
geom_bar(stat = 'identity') +
facet_wrap(~factor(airline, c('AM WEST', 'ALASKA'))) +
labs(
title = "Flight data overview",
y = "Flights",
x = "City",
fill = "Flight Status"
+
) scale_fill_manual(values = c('#E06565', '#0088CC')) +
theme_minimal()
What city has the highest amount of delayed flights?
# Plot amount of delayed flights by city
|>
df filter(flight_status == "delayed") |>
group_by(city) |>
summarize(total_flights = sum(flight_count)) |>
ggplot(aes
x = reorder(
(str_to_title(
str_replace_all(city, "_", " ")),
-total_flights),
y = total_flights)) +
geom_bar(stat = "identity", fill = '#E06565') +
labs(
title = "Delayed flights by city",
y = "City",
x = "Flights",
fill = "Flight Status"
+
) geom_text(aes(label = total_flights),vjust = 1.5, color = "#3b3e40" )+
theme_minimal()
We can see that Phoenix has the highest amount of delayed flights.
What airline has the higher amount of on-time flights?
# Plot proportion of flight status by airline
|>
df group_by(airline, flight_status) |>
summarize(total_flights = sum(flight_count)) |>
ggplot(aes
x = reorder(
(str_to_title(
str_replace_all(airline, "_", " ")),
-total_flights),
fill = flight_status,
y = total_flights)) +
geom_col(position = "fill") +
labs(
title = "Flight status by airline",
y = "Airline",
x = "Flights (Proportion)",
fill = "Flight Status"
+
) scale_fill_manual(values = c('#E06565', '#0088CC')) +
theme_minimal()
Interestingly, the rate of flights on time versus delayed is actually quite similar between the airlines, despite the significant differences in airports visited and overall number of flights we observed earlier.
Conclusion
In this article we took a simple dataset, tidied it into a proper long format for data analysis, then created visualizations to explore and understand the data better.
In summary, if you’re flying to Phoenix, fly Alaska!