Airline Arrival Delays: Tidying and Analysis

Introduction

In this analysis, we examine arrival delays for two airlines across five destinations. The data comes from a CSV file, which is initially in a wide format, showing delays for each airline separately across the destinations. Our goal is to tidy this data, transform it into a format suitable for analysis, and then compare the arrival delays between the two airlines.

Loading the Data

First, we load the data set from the CSV file.

# Load the data from the CSV file
arrival_delays <- read.csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/week%204/airline_arrivals.csv")


# Display the first few rows of the data
head(arrival_delays)

##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694   4,840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

# View the structure of the data
str(arrival_delays)

## 'data.frame':    4 obs. of  7 variables:
##  $ Airline      : chr  "ALASKA" "ALASKA" "AM WEST" "AM WEST"
##  $ Status       : chr  "on time" "delayed" "on time" "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : chr  "221" "12" "4,840" "415"
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : chr  "1,841" "305" "201" "61"

# Summarize the data to get an idea of its distribution
summary(arrival_delays)

##    Airline             Status           Los.Angeles      Phoenix         
##  Length:4           Length:4           Min.   : 62.0   Length:4          
##  Class :character   Class :character   1st Qu.:103.2   Class :character  
##  Mode  :character   Mode  :character   Median :307.0   Mode  :character  
##                                        Mean   :342.5                     
##                                        3rd Qu.:546.2                     
##                                        Max.   :694.0                     
##    San.Diego      San.Francisco     Seattle         
##  Min.   : 20.00   Min.   :102.0   Length:4          
##  1st Qu.: 53.75   1st Qu.:122.2   Class :character  
##  Median :138.50   Median :224.5   Mode  :character  
##  Mean   :170.00   Mean   :263.5                     
##  3rd Qu.:254.75   3rd Qu.:365.8                     
##  Max.   :383.00   Max.   :503.0

Tidying the Data

The dataset is currently in a “wide” format, with each destination represented as a row, and separate columns for each airline’s delays. We’ll tidy this data by converting it into a “long” format, where each row contains a destination, airline, and the corresponding delay.

# First, inspect the structure of the data
str(arrival_delays)

## 'data.frame':    4 obs. of  7 variables:
##  $ Airline      : chr  "ALASKA" "ALASKA" "AM WEST" "AM WEST"
##  $ Status       : chr  "on time" "delayed" "on time" "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : chr  "221" "12" "4,840" "415"
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : chr  "1,841" "305" "201" "61"

# Convert all city columns to numeric, replacing non-numeric values with NA
arrival_delays <- arrival_delays %>%
  mutate_at(vars(Los.Angeles, Phoenix, San.Diego, San.Francisco, Seattle), as.numeric)

## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Phoenix = .Primitive("as.double")(Phoenix)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

# Now pivot the data into long format
tidy_delays <- arrival_delays %>%
  pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
               names_to = "City", values_to = "Flights")

# View the tidied dataset
head(tidy_delays)

## # A tibble: 6 × 4
##   Airline Status  City          Flights
##   <chr>   <chr>   <chr>           <dbl>
## 1 ALASKA  on time Los.Angeles       497
## 2 ALASKA  on time Phoenix           221
## 3 ALASKA  on time San.Diego         212
## 4 ALASKA  on time San.Francisco     503
## 5 ALASKA  on time Seattle            NA
## 6 ALASKA  delayed Los.Angeles        62

Analysis: Comparing Arrival Delays

With the data now tidied, we can begin comparing the average arrival delays for the two airlines across all destinations. This will give us insights into which airline tends to have longer or shorter delays.

Summary of On-Time vs Delayed Flights

# Plot the count of on-time and delayed flights for each airline
ggplot(tidy_delays, aes(x = Airline, y = Flights, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "On-Time vs Delayed Flights by Airline",
       x = "Airline", y = "Number of Flights") +
  theme_minimal()

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Delays by City

# Plot the delays across cities
ggplot(tidy_delays %>% filter(Status == "delayed"), 
       aes(x = City, y = Flights, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Flight Delays Across Cities",
       x = "City", y = "Number of Delayed Flights") +
  theme_minimal() +
  coord_flip()

Conclusion

The analysis shows clear differences in arrival delays between the two airlines. By tidying the data and transforming it into a long format, we were able to easily compare the performance of each airline across various destinations. These findings could help further studies or improve operational efficiency by identifying problem areas for certain airlines.