a5a

Author

XiaoFei

Introduction

Task is to create a csv file that contain flight info by two airline company. we will intentionally create an untidy data set first, and do some data cleaning on data set structure, handle some missing data, etc. Save the tidy_data as a new csv file and perform data analysis.

For analysis, a bar chart will be used to show percentage and flight counts of both delays or arrival rates for two airlines overall. A table will be used to show result of percentage and flight counts of both delays or arrival rates for two airlines across five cities.

Loading Data

A copy of csv was created with the same format from the assignment, uploaded to the github repository. Later a tidy version will also be uploaded as well.

library(tidyverse)
library(dplyr)
library(scales)
library(janitor)
library(knitr)
library(kableExtra)


untidy_data = read.csv('https://raw.githubusercontent.com/xiaofeimei1/DATA607A5AirlineDelays/refs/heads/main/flightrecords.csv')

untidy_data

        X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2         delayed          62      12        20           102     305
3                          NA      NA        NA            NA      NA
4 AM WEST on time         694    4840       383           320     201
5         delayed         117     415        65           129      61

Tidy Data

Original data used “wide” structure, multiple destinations were spread out in multiple columns, some rows missing values, or some columns missing name. Here are procedures to be applied to tidy the data: add names to the first two columns: airline company and flight status; Remove any rows with missing values, convert all numerical strings to be of numeric type by removing comma in some numbers; Use the destination names to convert the data from wide to long, for consistency fill in the airline names on rows that have missing values and do the same for flight status.

colnames(untidy_data)[1] = "airline"
colnames(untidy_data)[2] = "flight Status"

# Missing data, remove row of blanks
destination_names <- colnames(untidy_data[3:length(untidy_data)])
untidy_data <- untidy_data |> na.omit()

# Remove the comma from some large numbers 
untidy_data <- untidy_data |>
  mutate(across(all_of(destination_names), 
                ~ as.numeric(gsub(",", "", .))))

print(untidy_data)

  airline flight Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
1  ALASKA       on time         497     221       212           503    1841
2               delayed          62      12        20           102     305
4 AM WEST       on time         694    4840       383           320     201
5               delayed         117     415        65           129      61

Fill in missing airline values by fill downwards using airlines from above rows.

# For consistency, combine all destination under new column perspective flight into "Num Flights" column
long_data <- untidy_data |> 
  pivot_longer(cols = all_of(destination_names), 
               names_to = "destination", 
               values_to = "num flights")

# Clean column names using janitor 
if (require(janitor, quietly = TRUE)) {
  long_data <- long_data |> janitor::clean_names()
} else {
  colnames(long_data) <- tolower(gsub("[^[:alnum:]]", "_", colnames(long_data)))
}

# Fill missing airline values, Convert empty airline strings to NA then fill downwards.  
tidy_data <- long_data |> 
  mutate(
    airline = ifelse(airline == "" | is.na(airline), NA, airline)
  ) |>
  fill(airline) |>
  mutate(
    destination = str_replace(destination, "\\.", " ")
  )

print(tidy_data)

# A tibble: 20 × 4
   airline flight_status destination   num_flights
   <chr>   <chr>         <chr>               <dbl>
 1 ALASKA  on time       Los Angeles           497
 2 ALASKA  on time       Phoenix               221
 3 ALASKA  on time       San Diego             212
 4 ALASKA  on time       San Francisco         503
 5 ALASKA  on time       Seattle              1841
 6 ALASKA  delayed       Los Angeles            62
 7 ALASKA  delayed       Phoenix                12
 8 ALASKA  delayed       San Diego              20
 9 ALASKA  delayed       San Francisco         102
10 ALASKA  delayed       Seattle               305
11 AM WEST on time       Los Angeles           694
12 AM WEST on time       Phoenix              4840
13 AM WEST on time       San Diego             383
14 AM WEST on time       San Francisco         320
15 AM WEST on time       Seattle               201
16 AM WEST delayed       Los Angeles           117
17 AM WEST delayed       Phoenix               415
18 AM WEST delayed       San Diego              65
19 AM WEST delayed       San Francisco         129
20 AM WEST delayed       Seattle                61

After transform data from wide to long format, save a copy of tidy_data in the same folder and github repository.

write.csv(tidy_data, "tidy_data.csv", row.names = FALSE)
print("File saved as 'tidy_data.csv' in the current folder")

[1] "File saved as 'tidy_data.csv' in the current folder"

Data Analysis

Stacked bar chart to compare percentage of delays and on-time rate for tow airlines overall. As showing in the result below, Alaska airline has less flight counts overall and seems to have higher delay rate, 13%.

library(ggplot2)

plot1_data <- tidy_data |>
  group_by(airline, flight_status) |>
  summarise(
    flights = sum(num_flights),
    .groups = "drop"
  ) |>
  group_by(airline) |>
  mutate(
    total_airline = sum(flights),
    delayed_count = sum(flights[flight_status == "delayed"]),
    ontime_count = sum(flights[flight_status == "on time"]),
    delayed_pct = round(delayed_count / total_airline * 100),
    ontime_pct = round(ontime_count / total_airline * 100),
    color_status = ifelse(flight_status == "delayed", "Delayed", "On Time")
  ) |>
  ungroup() |>
  # Add a column for the top label  
  group_by(airline) |>
  mutate(
    top_label = paste0("Total: ", total_airline, 
                       "\nDelayed: ", delayed_count, " (", delayed_pct, "%)", 
                       "\nOn Time: ", ontime_count, " (", ontime_pct, "%)")
  ) |>
  ungroup()

# Create the plot 
plot1 <- ggplot(plot1_data, aes(x = airline, y = flights, fill = color_status)) +
  geom_bar(stat = "identity", width = 0.7) +

  # Add label on top
  geom_text(data = plot1_data |> group_by(airline) |> slice(1),
            aes(x = airline, 
                y = total_airline + max(plot1_data$total_airline) * 0.08,
                label = top_label),
            inherit.aes = FALSE,
            color = "black",
            size = 3.5,
            fontface = "bold",
            lineheight = 0.9,
            vjust = 0) +
  scale_fill_manual(
    values = c("Delayed" = "#FF4D4D", "On Time" = "#666666"),
    breaks = c("Delayed", "On Time")
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.25))) +   
  labs(
    title = "Airlines Comparison with percentages",
    y = "Number of Flights",
    fill = "Flight Status",
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    plot.subtitle = element_text(hjust = 0.5, color = "gray50", size = 12),
    axis.title = element_text(face = "bold"),
    axis.text.x = element_text(face = "bold", size = 11),
    axis.text.y = element_text(size = 10),
    legend.position = "top",
    legend.title = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.caption = element_text(color = "gray50", hjust = 0, size = 9)
  )

print(plot1)

List statistics in a table to compare percentage of delays and arrival rates for tow airlines across five cities. Across all destination, Alaska airline actually has lower delay percentages than AM West airline. This seems to be a different conclusion than the bar chart above.

# Calculate statistics by airline and destination
airline_comparison <- tidy_data |>
  group_by(airline, destination, flight_status) |>
  summarise(
    flights = sum(num_flights),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = flight_status,
    values_from = flights,
    values_fill = 0
  ) |>
  mutate(
    total_flights = delayed + `on time`,
    delay_pct = round(delayed / total_flights * 100),
    ontime_pct = round(`on time` / total_flights * 100)
  ) |>
  select(airline, destination, total_flights, delayed, `on time`, delay_pct, ontime_pct) |>
  arrange(destination, airline)

# Display  
airline_comparison |>
  kable(
    caption = "Flight Performance Comparison by Airline and Destination",
    col.names = c("Airline", "Destination", "Total Flights", "Delayed", "On Time", 
                  "Delay %", "On Time %"),
    format = "html"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "left",
    font_size = 12
  ) |>
  column_spec(1, bold = TRUE) |>
  column_spec(2, bold = TRUE) |>
  column_spec(6, color = "red", bold = TRUE) |>
  column_spec(7, color = "green", bold = TRUE) |>
  add_header_above(c(" " = 2, "Flight Counts" = 3, "Percentages" = 2))

Flight Performance Comparison by Airline and Destination
		Flight Counts			Percentages
Airline	Destination	Total Flights	Delayed	On Time	Delay %	On Time %
ALASKA	Los Angeles	559	62	497	11	89
AM WEST	Los Angeles	811	117	694	14	86
ALASKA	Phoenix	233	12	221	5	95
AM WEST	Phoenix	5255	415	4840	8	92
ALASKA	San Diego	232	20	212	9	91
AM WEST	San Diego	448	65	383	15	85
ALASKA	San Francisco	605	102	503	17	83
AM WEST	San Francisco	449	129	320	29	71
ALASKA	Seattle	2146	305	1841	14	86
AM WEST	Seattle	262	61	201	23	77

Summary

Two analysis appear to have two different conclusion. Bar chart shows overall AM West has lower delayed rate (11%) compare to Alaska (13%), however different conclusion can be seen from the table where AM had higher delayed rate across all destinations. This is an example of the topic we discussed during class - Simpson’s paradox.

After closer look, main reason of the discrepancy came caused by the result in destination Phoenix. In phoenix, AM West had vast majority (73%) of its flights operate on this route and only 8% delay, whereas Alaska’s flight volume is more evenly distributed across cities with lower delay rate at every individual destination. The result in Phoenix made AM West appear to have less delay rate overall.