Week 5A : Airlines Delays Analysis

Author

Pascal Hermann Kouogang Tafo

INTRODUCTION

The goal of this assignment is to clean and transform a  “wide” dataset representing arrival delays for ALASKA and AM WEST airlines across five different cities into a “long” format suitable for comparative analysis. I will be using functions in the Tidyverse R-package to accomplish this goal.

APPROACH

  • Create the data table and load it to a GitHub repo so that it could be easily accessible by everyone. 

  • Address the missing airline names using the fill() function, ensuring every “delayed” and “on time” row is correctly attributed to either Airline and also i will  remove the empty row separating the status of the two airlines.

  • Use the pivot_longer() function to transform my data to a long format because to perform our analysis on Airline Delays or On Time it will be better to have Variables as columns and Observation as rows.  

  • Calculate and Plot of delay or On Time percentages for both Airlines overall.

  • Calculate and Plot of delay or On Time percentages for both Airlines across the cities.

  • Analyze potential similarity or not between the two comparison above.

Install and Load packages

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Create the CSV Airlines flights Status CSV file and Save it in my GitHub

Airlines_flights_status <- data.frame(
  X = c("ALASKA", NA, NA, "AM WEST", NA),
  Y = c("on time", "delayed", NA, "on time", "delayed"),
  Los_Angeles   = c(497, 62, NA, 694, 117),
  Phoenix       = c(221, 12, NA, 4840, 415),
  San_Diego     = c(212, 20, NA, 383, 65),
  San_Francisco = c(503, 102, NA, 320, 129),
  Seattle       = c(1841, 305, NA, 210, 61)
)

Airlines_flights_status
        X       Y Los_Angeles Phoenix San_Diego San_Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2    <NA> delayed          62      12        20           102     305
3    <NA>    <NA>          NA      NA        NA            NA      NA
4 AM WEST on time         694    4840       383           320     210
5    <NA> delayed         117     415        65           129      61
write.csv(Airlines_flights_status, "Airlines_flights_status.csv", row.names = FALSE)

Read the file from my GitHub repo and populate missing data

# Read CSV file

url<- "https://raw.githubusercontent.com/Pascaltafo2025/Week-5A--DATA-607/refs/heads/main/Airlines_flights_status.csv"

Airlines_flights_status <- read.csv(url)

# Remove the blank row

Airlines_flights_status <- Airlines_flights_status[c(1,2,4,5), ]
Airlines_flights_status
        X       Y Los_Angeles Phoenix San_Diego San_Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2    <NA> delayed          62      12        20           102     305
4 AM WEST on time         694    4840       383           320     210
5    <NA> delayed         117     415        65           129      61
# Populate missing data

Airlines_flights_status[c(2,4),1] <- Airlines_flights_status[c(1,3),1]

colnames(Airlines_flights_status)[1:2] <- c('Airlines', 'Status')

Airlines_flights_status
  Airlines  Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1   ALASKA on time         497     221       212           503    1841
2   ALASKA delayed          62      12        20           102     305
4  AM WEST on time         694    4840       383           320     210
5  AM WEST delayed         117     415        65           129      61

Reshape data from wide to long format

We will need to transform the data format from wide to long in order to facilitate our analysis on Airlines status. That transformation will consist of having Variables as columns and Observation as rows.

flights_Data_long <- Airlines_flights_status %>%
  pivot_longer(
    cols      = Los_Angeles:Seattle,
    names_to  = "City",
    values_to = "Count"
  ) %>% # Clean up city names for display
  mutate(City = str_replace_all(City, "_", " "))

flights_Data_long
# A tibble: 20 × 4
   Airlines Status  City          Count
   <chr>    <chr>   <chr>         <int>
 1 ALASKA   on time Los Angeles     497
 2 ALASKA   on time Phoenix         221
 3 ALASKA   on time San Diego       212
 4 ALASKA   on time San Francisco   503
 5 ALASKA   on time Seattle        1841
 6 ALASKA   delayed Los Angeles      62
 7 ALASKA   delayed Phoenix          12
 8 ALASKA   delayed San Diego        20
 9 ALASKA   delayed San Francisco   102
10 ALASKA   delayed Seattle         305
11 AM WEST  on time Los Angeles     694
12 AM WEST  on time Phoenix        4840
13 AM WEST  on time San Diego       383
14 AM WEST  on time San Francisco   320
15 AM WEST  on time Seattle         210
16 AM WEST  delayed Los Angeles     117
17 AM WEST  delayed Phoenix         415
18 AM WEST  delayed San Diego        65
19 AM WEST  delayed San Francisco   129
20 AM WEST  delayed Seattle          61

Let’s compare the overall flights delays percentages for the two airlines and create a bar graph

Step 1:Overall flights status for both airlines

overall_Airlines_summary <- flights_Data_long %>%
  group_by(Airlines, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  group_by(Airlines) %>%
  mutate(
    Grand_Total  = sum(Total),
    Pct          = round(Total / Grand_Total * 100,2)
  ) %>%
  ungroup()

overall_Airlines_summary
# A tibble: 4 × 5
  Airlines Status  Total Grand_Total   Pct
  <chr>    <chr>   <int>       <int> <dbl>
1 ALASKA   delayed   501        3775  13.3
2 ALASKA   on time  3274        3775  86.7
3 AM WEST  delayed   787        7234  10.9
4 AM WEST  on time  6447        7234  89.1

Step 2: Overall delay flights status for the airlines.

I decided to focus on flights delayed because they are more significant in the choice of an airlines for a trip.

overall_flights_delay <- overall_Airlines_summary %>%
  filter(Status == "delayed") %>%
  select(Airlines, Delayed = Total, Grand_Total, Delay_Pct = Pct)

overall_flights_delay
# A tibble: 2 × 4
  Airlines Delayed Grand_Total Delay_Pct
  <chr>      <int>       <int>     <dbl>
1 ALASKA       501        3775      13.3
2 AM WEST      787        7234      10.9

Step 3: Let’s create the bar graph using ggplot2 package

To create the plot, i used the help of Gemini 3.0 with the prompt: “Plot the result of this code( I entered the overall_flights_delay code ) using ggplot2 package in R”.

library(ggplot2)

# Create the plot

ggplot(overall_flights_delay, aes(x = Airlines, y = Delay_Pct, fill = Airlines)) +
  # Using geom_col for pre-summarized data (or geom_bar(stat = "identity"))
  geom_col(width = 0.6) +
  # Adding numerical labels above each bar for clarity
  geom_text(aes(label = paste0(round(Delay_Pct, 2), "%")), 
            vjust = -0.5, 
            fontface = "bold",
            size = 5) +
  # Enhancing visual appeal with titles and clean axes
  labs(
    title = "Overall Percentage of Delayed Flights",
    subtitle = "Aggregated data across all 5 destination cities",
    x = "Airline",
    y = "Percentage of Flights Delayed (%)"
  ) +
  # Choosing color-blind friendly or distinctive colors
  scale_fill_manual(values = c("ALASKA" = "green", "AM WEST" = "orange")) +
  # Removing the legend as the x-axis labels are sufficient
  theme_minimal() +
  theme(legend.position = "none") +
  # Scaling the y-axis slightly higher than the max value to accommodate labels
  scale_y_continuous(limits = c(0, max(overall_flights_delay$Delay_Pct) + 3))

Interpretation:

Despite having almost twice the volume of flights, AM WEST has a lower flights delay percentage of 10.88% compared to 13.27 %. Hence, based on the overall comparison delays percentages for the two airlines, passenger will be tempted to conclude that AM WEST is more punctual than ALASKA.

Let’s compare the flights delays percentages for the two airlines across the five cities

city_flights_summary <- flights_Data_long %>%
  group_by(Airlines, City) %>%
  mutate(City_Total = sum(Count)) %>%
  ungroup() %>%
  filter(Status == "delayed") %>%
  mutate(Delay_Pct = round(Count / City_Total * 100,2)) %>%
  select(Airlines, City, Delayed = Count, City_Total, Delay_Pct)

city_flights_summary <- city_flights_summary %>% arrange(City, Airlines)
city_flights_summary
# A tibble: 10 × 5
   Airlines City          Delayed City_Total Delay_Pct
   <chr>    <chr>           <int>      <int>     <dbl>
 1 ALASKA   Los Angeles        62        559     11.1 
 2 AM WEST  Los Angeles       117        811     14.4 
 3 ALASKA   Phoenix            12        233      5.15
 4 AM WEST  Phoenix           415       5255      7.9 
 5 ALASKA   San Diego          20        232      8.62
 6 AM WEST  San Diego          65        448     14.5 
 7 ALASKA   San Francisco     102        605     16.9 
 8 AM WEST  San Francisco     129        449     28.7 
 9 ALASKA   Seattle           305       2146     14.2 
10 AM WEST  Seattle            61        271     22.5 

Create a bar graph that represents he flights delays percentages for the two airlines across the five cities

Here, I use the same prompt as in the previous question but instead across the five cities on Gemini 3.

ggplot(city_flights_summary, aes(x = City, y = Delay_Pct, fill = Airlines)) +
  # Use position = "dodge" to place bars side-by-side
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +
  # Add percentage labels above bars
  geom_text(aes(label = paste0(round(Delay_Pct, 1), "%")), 
            position = position_dodge(width = 0.8), 
            vjust = -0.5, 
            size = 3.5, 
            fontface = "bold") +
  # Customizing labels and theme
  labs(
    title = "Flight Delay Percentage by City and Airline",
    subtitle = "Alaska Airlines vs. AM West across 5 Western Cities",
    x = "Destination City",
    y = "Percentage of Flights Delayed (%)",
    fill = "Airline"
  ) +
  scale_fill_manual(values = c("ALASKA" = "green", "AM WEST" = "orange")) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major.x = element_blank()
  ) +
  # Adjust Y axis to make room for labels
  scale_y_continuous(limits = c(0, max(city_flights_summary$Delay_Pct) + 5))

Interpretation

The result of this analysis is surprising because ALASKA has the lowest rate of delay in every city despite the fact AM WEST was considered more punctual than ALASKA when performing the overall comparison.

ANALYSIS

The paradox when comparing Alaska and AM West airlines delayed flights performance city by city vs the Overall performance can be explained by the vastly different distribution of flight volumes across the cities for the two airlines.

For instance, ALASKA has less delayed flights than AM WEST in Phoenix but the total number of ALASKA flights to Phoenix is less than 5% compared to the total number of AM WEST flights to Phoenix (233 flights for ALASKA vs 5255 flights for AM WEST ).

CONCLUSION

This result brings us to believe that there are more difficult cities to fly to where we believe ALASKA flights were concentrated. Thankfully, we conducted an analysis across cities which helped us to have a “non” bias opinion between both airlines.