This report demonstrates how to clean messy data on airline flight delays, and perform analysis to compare the number of delayed flights for two airlines across different cities. The analysis also includes visualizations that highlight delays by city and airline.
Below, I provide two different ways of importing the data into your R studio (MySQL or CSV). Once the data has been imported, we’ll clean the data using **tidyverse** functions, and perform transformations to reshape the dataset for analysis. The goal is to summarize and compare the delays for each airline across multiple cities and produce visualizations that aid in understanding trends.
You can download the MySQL script <https://raw.githubusercontent.com/awrubes/Assignment_5/main/flights_database.sql> in order to create the necessary table in your database, which I’ve named “tidy”. If you wish to name the database something else, make sure to change the r code accordingly. This method uses env variables for your MySQL user credentials.
user_name <- user_name <- Sys.getenv("MYSQL_USER", "default_user")
pass_word <- Sys.getenv("MYSQL_PASS", "default_password")
conn <- dbConnect(RMySQL::MySQL(), dbname = "tidy", host = "localhost", user = user_name, password = pass_word)
dbListTables(conn)
## [1] "flights" "messy"
df <- dbReadTable(conn, "messy")
head(df)
## unnamed_col status Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
If you don’t care to create a MySQL database, you can go ahead and import the following CSV file.
#alternatively read in the CSV file
df_csv <- read_csv("https://raw.githubusercontent.com/awrubes/Assignment_5/main/mess_data.csv")
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los_Angeles, Phoenix, San_Diego, San_Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
head(df_csv)
## # A tibble: 5 × 7
## ...1 ...2 Los_Angeles Phoenix San_Diego San_Francisco Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
For the rest of the code, we will be using the df_csv variable. If you decided to create a MySQL database, you will need to switch out the database names accordingly.
The data imported contains some messy rows that need to be removed.
Specifically, there are rows with missing values that we should filter
out. Additionally, the first column contains the airline names in some
rows but is NA
in others. We can clean this by filling the
missing airline names using the mutate()
and
ifelse()
functions.
#get rid of empty row
df_csv_clean <- df_csv %>%
filter(rowSums(is.na(.)) != ncol(.))
head(df_csv_clean)
## # A tibble: 4 × 7
## ...1 ...2 Los_Angeles Phoenix San_Diego San_Francisco Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
colnames(df_csv_clean)
## [1] "...1" "...2" "Los_Angeles" "Phoenix"
## [5] "San_Diego" "San_Francisco" "Seattle"
#access the first col name and store in variable
first_col <- colnames(df_csv_clean)[1]
#add airline name and remove the first column
df_csv_clean <- df_csv_clean %>%
mutate(
airline = ifelse(is.na(!!sym(first_col)), lag(!!sym(first_col)), !!sym(first_col))
)%>%
select(-all_of(first_col))
head(df_csv_clean)
## # A tibble: 4 × 7
## ...2 Los_Angeles Phoenix San_Diego San_Francisco Seattle airline
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 on time 497 221 212 503 1841 ALASKA
## 2 delayed 62 12 20 102 305 ALASKA
## 3 on time 694 4840 383 320 201 AM WEST
## 4 delayed 117 415 65 129 61 AM WEST
With this cleaner representation of the data, we can pivot it from wide to a long format where each row represents a flight status and its corresponding city and airline. We also clean up the column names so that the data is easy to read and understand.
#pivot to long format
df_csv_clean <-df_csv_clean %>%
pivot_longer(
cols = "Los_Angeles":"Seattle",
names_to = "city",
values_to = "flight_count"
)
new_first_col <- colnames(df_csv_clean)[1]
df_csv_clean <- df_csv_clean %>%
rename(status = all_of(new_first_col))
head(df_csv_clean)
## # A tibble: 6 × 4
## status airline city flight_count
## <chr> <chr> <chr> <dbl>
## 1 on time ALASKA Los_Angeles 497
## 2 on time ALASKA Phoenix 221
## 3 on time ALASKA San_Diego 212
## 4 on time ALASKA San_Francisco 503
## 5 on time ALASKA Seattle 1841
## 6 delayed ALASKA Los_Angeles 62
Once the data is formatted properly, we can easily summarize the data to look at the total number of flights for each airline, both delayed and on time. However, these numbers alone don’t tell us much about how the airlines are performing. So, in order to actually get more insightful data on airline performance we’ll want to calculate percentages for on_time and delayed flights.
# Group data by airline and status (on time vs delayed)
summary_by_status <- df_csv_clean %>%
group_by(airline, status) %>%
summarize(total_flights = sum(flight_count, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
print(summary_by_status)
## # A tibble: 4 × 3
## airline status total_flights
## <chr> <chr> <dbl>
## 1 ALASKA delayed 501
## 2 ALASKA on time 3274
## 3 AM WEST delayed 787
## 4 AM WEST on time 6438
To understand the delay patterns, we summarize the total number of flights (both on time and delayed) by airline and look at the respective percentages for each, by airline and by city.
This percentage calculation is important because it normalizes the data, making it easier to compare the performance of the two airlines, even though they operate different numbers of flights.
For example:
ALASKA had 13.3% of its flights delayed and 86.7% of its flights on time.
AM WEST had 10.9% of its flights delayed and 89.1% on time.
These percentages give us a clear understanding of each airline’s overall performance in terms of delays. Even though AM WEST operates more flights overall, calculating percentages allows us to make meaningful comparisons between the two airlines.
#store total ALASKA flights
total_alaska <- summary_by_status %>%
filter(airline == "ALASKA") %>%
summarize(total_flights = sum(total_flights, na.rm = TRUE)) %>%
pull(total_flights) # `pull()` extracts the value as a scalar
#store total AM WEST flights
total_amwest <- summary_by_status %>%
filter(airline == "AM WEST") %>%
summarize(total_flights = sum(total_flights, na.rm = TRUE)) %>%
pull(total_flights) # `pull()` extracts the value as a scalar
#calculate the percent of delays and on time by total number of flights per airline
summary_by_status_percent_total <- summary_by_status %>%
mutate(
percent_of_total = case_when(
airline == "ALASKA" ~ (total_flights/total_alaska) * 100,
airline == "AM WEST" ~ (total_flights/total_amwest) * 100
)
)
print(summary_by_status_percent_total)
## # A tibble: 4 × 4
## airline status total_flights percent_of_total
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA delayed 501 13.3
## 2 ALASKA on time 3274 86.7
## 3 AM WEST delayed 787 10.9
## 4 AM WEST on time 6438 89.1
#create a wide table with on-time and delayed columns for added analysis
df_csv_clean_wide <- df_csv_clean %>%
pivot_wider(names_from = status, values_from = flight_count, names_prefix = "total_")
head(df_csv_clean_wide)
## # A tibble: 6 × 4
## airline city `total_on time` total_delayed
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los_Angeles 497 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA San_Diego 212 20
## 4 ALASKA San_Francisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM WEST Los_Angeles 694 117
#compare percent of delays by city
summary_by_percent <- df_csv_clean_wide %>%
mutate(
percent_delayed = (total_delayed / (`total_on time` + total_delayed))*100
)
print(summary_by_percent)
## # A tibble: 10 × 5
## airline city `total_on time` total_delayed percent_delayed
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA Los_Angeles 497 62 11.1
## 2 ALASKA Phoenix 221 12 5.15
## 3 ALASKA San_Diego 212 20 8.62
## 4 ALASKA San_Francisco 503 102 16.9
## 5 ALASKA Seattle 1841 305 14.2
## 6 AM WEST Los_Angeles 694 117 14.4
## 7 AM WEST Phoenix 4840 415 7.90
## 8 AM WEST San_Diego 383 65 14.5
## 9 AM WEST San_Francisco 320 129 28.7
## 10 AM WEST Seattle 201 61 23.3
#compare total percent delay overall for both airlines
overall_summary <- summary_by_percent %>%
group_by(airline) %>%
summarize(
total_delayed = sum(total_delayed, na.rm=TRUE),
total_on_time = sum(`total_on time`, na.rm = TRUE),
overall_percent_delayed = (total_delayed / (total_delayed + total_on_time)) * 100
)
print(overall_summary)
## # A tibble: 2 × 4
## airline total_delayed total_on_time overall_percent_delayed
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3274 13.3
## 2 AM WEST 787 6438 10.9
The bar charts generated provide a clear, visual comparison of the percentage of delayed flights for each airline across multiple cities and in total. By presenting the data side-by-side, we can easily see which cities have the highest delay rates.
ggplot(overall_summary, aes(x = airline, y = overall_percent_delayed, fill = airline)) +
geom_col() +
labs(title = "Overall Percentage of Delayed Flights by Airline",
y = "Percentage of Delayed Flights", x = "Airline") +
theme_minimal()
ggplot(summary_by_percent, aes(x = city, y = percent_delayed, fill = airline)) +
geom_col(position = "dodge") +
labs(title = "Percentage of Delayed Flights by City and Airline",
y = "Percentage of Delayed Flights", x = "City") +
theme_minimal()
The charts illustrate that Phoenix for AM WEST and Seattle for ALASKA account for the majority of delayed flights across the five cities. With more data, it would be worthwhile to further investigate the underlying causes of the delays in these specific cities.
One possible hypothesis could be weather-related issues, but the fact that delays are predominantly experienced by a single airline in each city suggests that weather alone may not be a sufficient explanation. If weather were the primary cause, we would likely expect delays to affect both airlines in each city more equitably.
Given the data, a more plausible explanation may be airline-specific factors:
Seattle is a known hub for ALASKA, so the high volume of flights may contribute to a higher number of delays.
Phoenix has a significant number of delayed flights for AM WEST, which could indicate operational inefficiencies or challenges specific to that airport or airline.