Overview

This report demonstrates how to clean messy data on airline flight delays, and perform analysis to compare the number of delayed flights for two airlines across different cities. The analysis also includes visualizations that highlight delays by city and airline.

Below, I provide two different ways of importing the data into your R studio (MySQL or CSV). Once the data has been imported, we’ll clean the data using **tidyverse** functions, and perform transformations to reshape the dataset for analysis. The goal is to summarize and compare the delays for each airline across multiple cities and produce visualizations that aid in understanding trends.

MySQL Database

You can download the MySQL script <https://raw.githubusercontent.com/awrubes/Assignment_5/main/flights_database.sql> in order to create the necessary table in your database, which I’ve named “tidy”. If you wish to name the database something else, make sure to change the r code accordingly. This method uses env variables for your MySQL user credentials.

user_name <- user_name <- Sys.getenv("MYSQL_USER", "default_user")
pass_word <- Sys.getenv("MYSQL_PASS", "default_password")

conn <- dbConnect(RMySQL::MySQL(), dbname = "tidy", host = "localhost", user = user_name, password = pass_word)

dbListTables(conn)
## [1] "flights" "messy"
df <- dbReadTable(conn, "messy")

head(df)
##   unnamed_col  status Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1      ALASKA on time         497     221       212           503    1841
## 2        <NA> delayed          62      12        20           102     305
## 3        <NA>    <NA>          NA      NA        NA            NA      NA
## 4     AM WEST on time         694    4840       383           320     201
## 5        <NA> delayed         117     415        65           129      61

CSV Import

If you don’t care to create a MySQL database, you can go ahead and import the following CSV file.

#alternatively read in the CSV file
df_csv <- read_csv("https://raw.githubusercontent.com/awrubes/Assignment_5/main/mess_data.csv")
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los_Angeles, Phoenix, San_Diego, San_Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
head(df_csv)
## # A tibble: 5 × 7
##   ...1    ...2    Los_Angeles Phoenix San_Diego San_Francisco Seattle
##   <chr>   <chr>         <dbl>   <dbl>     <dbl>         <dbl>   <dbl>
## 1 ALASKA  on time         497     221       212           503    1841
## 2 <NA>    delayed          62      12        20           102     305
## 3 <NA>    <NA>             NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5 <NA>    delayed         117     415        65           129      61

Data Cleaning and Missing Values

For the rest of the code, we will be using the df_csv variable. If you decided to create a MySQL database, you will need to switch out the database names accordingly.

The data imported contains some messy rows that need to be removed. Specifically, there are rows with missing values that we should filter out. Additionally, the first column contains the airline names in some rows but is NA in others. We can clean this by filling the missing airline names using the mutate() and ifelse() functions.

#get rid of empty row
df_csv_clean <- df_csv %>%
  filter(rowSums(is.na(.)) != ncol(.))

head(df_csv_clean)
## # A tibble: 4 × 7
##   ...1    ...2    Los_Angeles Phoenix San_Diego San_Francisco Seattle
##   <chr>   <chr>         <dbl>   <dbl>     <dbl>         <dbl>   <dbl>
## 1 ALASKA  on time         497     221       212           503    1841
## 2 <NA>    delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 <NA>    delayed         117     415        65           129      61
colnames(df_csv_clean)
## [1] "...1"          "...2"          "Los_Angeles"   "Phoenix"      
## [5] "San_Diego"     "San_Francisco" "Seattle"
#access the first col name and store in variable
first_col <- colnames(df_csv_clean)[1]

#add airline name and remove the first column
df_csv_clean <- df_csv_clean %>%
  mutate(
    airline = ifelse(is.na(!!sym(first_col)), lag(!!sym(first_col)), !!sym(first_col))
  )%>%
  select(-all_of(first_col))

head(df_csv_clean)
## # A tibble: 4 × 7
##   ...2    Los_Angeles Phoenix San_Diego San_Francisco Seattle airline
##   <chr>         <dbl>   <dbl>     <dbl>         <dbl>   <dbl> <chr>  
## 1 on time         497     221       212           503    1841 ALASKA 
## 2 delayed          62      12        20           102     305 ALASKA 
## 3 on time         694    4840       383           320     201 AM WEST
## 4 delayed         117     415        65           129      61 AM WEST

Pivoting Data

With this cleaner representation of the data, we can pivot it from wide to a long format where each row represents a flight status and its corresponding city and airline. We also clean up the column names so that the data is easy to read and understand.

#pivot to long format
df_csv_clean <-df_csv_clean %>%
  pivot_longer(
    cols = "Los_Angeles":"Seattle",
    names_to = "city",
    values_to = "flight_count"
  )

new_first_col <- colnames(df_csv_clean)[1]

df_csv_clean <- df_csv_clean %>%
  rename(status = all_of(new_first_col))

head(df_csv_clean)
## # A tibble: 6 × 4
##   status  airline city          flight_count
##   <chr>   <chr>   <chr>                <dbl>
## 1 on time ALASKA  Los_Angeles            497
## 2 on time ALASKA  Phoenix                221
## 3 on time ALASKA  San_Diego              212
## 4 on time ALASKA  San_Francisco          503
## 5 on time ALASKA  Seattle               1841
## 6 delayed ALASKA  Los_Angeles             62

Further Analysis

Once the data is formatted properly, we can easily summarize the data to look at the total number of flights for each airline, both delayed and on time. However, these numbers alone don’t tell us much about how the airlines are performing. So, in order to actually get more insightful data on airline performance we’ll want to calculate percentages for on_time and delayed flights.

# Group data by airline and status (on time vs delayed)
summary_by_status <- df_csv_clean %>%
  group_by(airline, status) %>%
  summarize(total_flights = sum(flight_count, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
print(summary_by_status)
## # A tibble: 4 × 3
##   airline status  total_flights
##   <chr>   <chr>           <dbl>
## 1 ALASKA  delayed           501
## 2 ALASKA  on time          3274
## 3 AM WEST delayed           787
## 4 AM WEST on time          6438

Summarizing the Data

To understand the delay patterns, we summarize the total number of flights (both on time and delayed) by airline and look at the respective percentages for each, by airline and by city.

This percentage calculation is important because it normalizes the data, making it easier to compare the performance of the two airlines, even though they operate different numbers of flights.

For example:

These percentages give us a clear understanding of each airline’s overall performance in terms of delays. Even though AM WEST operates more flights overall, calculating percentages allows us to make meaningful comparisons between the two airlines.

#store total ALASKA flights
total_alaska <- summary_by_status %>%
  filter(airline == "ALASKA") %>%
  summarize(total_flights = sum(total_flights, na.rm = TRUE)) %>%
  pull(total_flights)  # `pull()` extracts the value as a scalar

#store total AM WEST flights
total_amwest <- summary_by_status %>%
  filter(airline == "AM WEST") %>%
  summarize(total_flights = sum(total_flights, na.rm = TRUE)) %>%
  pull(total_flights)  # `pull()` extracts the value as a scalar

#calculate the percent of delays and on time by total number of flights per airline
summary_by_status_percent_total <- summary_by_status %>%
  mutate(
    percent_of_total = case_when(
      airline == "ALASKA" ~ (total_flights/total_alaska) * 100,
      airline == "AM WEST" ~ (total_flights/total_amwest) * 100
    )
  )

print(summary_by_status_percent_total)
## # A tibble: 4 × 4
##   airline status  total_flights percent_of_total
##   <chr>   <chr>           <dbl>            <dbl>
## 1 ALASKA  delayed           501             13.3
## 2 ALASKA  on time          3274             86.7
## 3 AM WEST delayed           787             10.9
## 4 AM WEST on time          6438             89.1
#create a wide table with on-time and delayed columns for added analysis
df_csv_clean_wide <- df_csv_clean %>%
  pivot_wider(names_from = status, values_from = flight_count, names_prefix = "total_")

head(df_csv_clean_wide)
## # A tibble: 6 × 4
##   airline city          `total_on time` total_delayed
##   <chr>   <chr>                   <dbl>         <dbl>
## 1 ALASKA  Los_Angeles               497            62
## 2 ALASKA  Phoenix                   221            12
## 3 ALASKA  San_Diego                 212            20
## 4 ALASKA  San_Francisco             503           102
## 5 ALASKA  Seattle                  1841           305
## 6 AM WEST Los_Angeles               694           117
#compare percent of delays by city
summary_by_percent <- df_csv_clean_wide %>%
  mutate(
    percent_delayed = (total_delayed / (`total_on time` + total_delayed))*100
  )

print(summary_by_percent)
## # A tibble: 10 × 5
##    airline city          `total_on time` total_delayed percent_delayed
##    <chr>   <chr>                   <dbl>         <dbl>           <dbl>
##  1 ALASKA  Los_Angeles               497            62           11.1 
##  2 ALASKA  Phoenix                   221            12            5.15
##  3 ALASKA  San_Diego                 212            20            8.62
##  4 ALASKA  San_Francisco             503           102           16.9 
##  5 ALASKA  Seattle                  1841           305           14.2 
##  6 AM WEST Los_Angeles               694           117           14.4 
##  7 AM WEST Phoenix                  4840           415            7.90
##  8 AM WEST San_Diego                 383            65           14.5 
##  9 AM WEST San_Francisco             320           129           28.7 
## 10 AM WEST Seattle                   201            61           23.3
#compare total percent delay overall for both airlines
overall_summary <- summary_by_percent %>%
  group_by(airline) %>%
  summarize(
    total_delayed = sum(total_delayed, na.rm=TRUE),
    total_on_time = sum(`total_on time`, na.rm = TRUE),
    overall_percent_delayed = (total_delayed / (total_delayed + total_on_time)) * 100
)

print(overall_summary)
## # A tibble: 2 × 4
##   airline total_delayed total_on_time overall_percent_delayed
##   <chr>           <dbl>         <dbl>                   <dbl>
## 1 ALASKA            501          3274                    13.3
## 2 AM WEST           787          6438                    10.9

Visualizing Delays by City and Airline

The bar charts generated provide a clear, visual comparison of the percentage of delayed flights for each airline across multiple cities and in total. By presenting the data side-by-side, we can easily see which cities have the highest delay rates.

ggplot(overall_summary, aes(x = airline, y = overall_percent_delayed, fill = airline)) +
  geom_col() +
  labs(title = "Overall Percentage of Delayed Flights by Airline", 
       y = "Percentage of Delayed Flights", x = "Airline") +
  theme_minimal()

ggplot(summary_by_percent, aes(x = city, y = percent_delayed, fill = airline)) +
  geom_col(position = "dodge") +
  labs(title = "Percentage of Delayed Flights by City and Airline", 
       y = "Percentage of Delayed Flights", x = "City") +
  theme_minimal()

Conclusion

The charts illustrate that Phoenix for AM WEST and Seattle for ALASKA account for the majority of delayed flights across the five cities. With more data, it would be worthwhile to further investigate the underlying causes of the delays in these specific cities.

One possible hypothesis could be weather-related issues, but the fact that delays are predominantly experienced by a single airline in each city suggests that weather alone may not be a sufficient explanation. If weather were the primary cause, we would likely expect delays to affect both airlines in each city more equitably.

Given the data, a more plausible explanation may be airline-specific factors: