Introduction

In this analysis, we tidy and transform a dataset of on-time and delayed arrival counts for two airlines (Alaska and AM West) across five destinations, and then compare the two airlines.

Load libraries

library(readr)
## Warning: package 'readr' was built under R version 4.3.3
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Read CSV file

url <- "https://raw.githubusercontent.com/Emin-NYC/DATA607-week4/refs/heads/main/airline_delays.csv"

airline_delays <- read.csv(url)

View data

print(airline_delays)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM West on time         694    4840       383           320     201
## 4 AM West delayed         117     415        65           129      61

Tidying and transforming the data from wide to long format

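# Reshape from wide to long: one row per airline, status, and destination
tidy_delays <- airline_delays %>%
  pivot_longer(
    cols = c(Los.Angeles, Phoenix, San.Diego, San.Francisco, Seattle),
    names_to = "Destination",  # former column names become the Destination column
    values_to = "Delay"        # cell values: number of flights with that status
  )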

View tidy data

head(tidy_delays)
## # A tibble: 6 × 4
##   Airline Status  Destination   Delay
##   <chr>   <chr>   <chr>         <int>
## 1 Alaska  on time Los.Angeles     497
## 2 Alaska  on time Phoenix         221
## 3 Alaska  on time San.Diego       212
## 4 Alaska  on time San.Francisco   503
## 5 Alaska  on time Seattle        1841
## 6 Alaska  delayed Los.Angeles      62
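
The long table still keeps two kinds of count in one column, labelled by Status ("on time" or "delayed"). As a sketch of one possible next step, the counts can be spread into one column per status so that each airline and destination pair occupies a single row. The data frame is rebuilt inline from the printed table so the block runs without the download. The names flights_wide, Flights, and on_time are illustrative and not part of the original analysis.

```r
library(tidyr)
library(dplyr)

# Same values as the printed table, rebuilt inline (assumption: no download needed)
airline_delays <- data.frame(
  Airline = c("Alaska", "Alaska", "AM West", "AM West"),
  Status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497L, 62L, 694L, 117L),
  Phoenix = c(221L, 12L, 4840L, 415L),
  San.Diego = c(212L, 20L, 383L, 65L),
  San.Francisco = c(503L, 102L, 320L, 129L),
  Seattle = c(1841L, 305L, 201L, 61L)
)

flights_wide <- airline_delays %>%
  # Long format first: one row per airline, status, and destination
  pivot_longer(
    cols = -c(Airline, Status),
    names_to = "Destination",
    values_to = "Flights"
  ) %>%
  # Make the status labels usable as column names ("on time" -> "on_time")
  mutate(Status = gsub(" ", "_", Status)) %>%
  # One column per status: on_time and delayed
  pivot_wider(names_from = Status, values_from = Flights)

print(flights_wide)
```

In this shape the on-time and delayed counts for a route sit side by side, which makes per-route totals and proportions simple column arithmetic.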

Missing value check

sum(is.na(tidy_delays))
## [1] 0

Perform comparative analysis

# Summarize the Delay column per airline, across both statuses and all destinations
summary_delays <- tidy_delays %>%
  group_by(Airline) %>%
  summarize(
    Mean_Delay = mean(Delay, na.rm = TRUE),
    Median_Delay = median(Delay, na.rm = TRUE),
    SD_Delay = sd(Delay, na.rm = TRUE)
  )

Findings from comparative analysis

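The mean is higher for AM West (about 722) than for Alaska (about 378). Because the Delay column holds flight counts for both on-time and delayed flights, this difference reflects AM West's larger number of flights as well as its delays.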

For both airlines the median is lower than the mean (about 260 vs. 722 for AM West, and about 216 vs. 378 for Alaska), which indicates that the distribution is right-skewed.

The standard deviation is much larger for AM West (about 1,460) than for Alaska (about 544), which suggests greater variability for AM West. This could mean AM West flights are less consistent in their performance, although the data contain no day-level detail, and much of the spread comes from a single large value (4,840 on-time flights to Phoenix).

View summary statistics

print(summary_delays)
## # A tibble: 2 × 4
##   Airline Mean_Delay Median_Delay SD_Delay
##   <chr>        <dbl>        <dbl>    <dbl>
## 1 AM West       722.         260.    1460.
## 2 Alaska        378.         216.     544.
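
Because the Delay column holds flight counts for both statuses, its mean mixes on-time and delayed flights. The share of flights that were delayed may be an easier quantity to compare between airlines. The following is a minimal sketch under that assumption. The counts are copied from the printed table, and the names flights, overall_rates, and Delay_Rate are hypothetical.

```r
library(dplyr)

# Counts copied from the printed table, one row per airline and destination
flights <- data.frame(
  Airline = rep(c("Alaska", "AM West"), each = 5),
  Destination = rep(
    c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
    times = 2
  ),
  on_time = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201),
  delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61)
)

overall_rates <- flights %>%
  group_by(Airline) %>%
  summarize(
    Total_Flights = sum(on_time + delayed),       # all flights for the airline
    Delayed_Flights = sum(delayed),               # delayed flights only
    Delay_Rate = Delayed_Flights / Total_Flights  # share of flights delayed
  )

print(overall_rates)
```

With these counts, the delayed share is about 13.3% for Alaska (501 of 3,775 flights) and about 10.9% for AM West (787 of 7,225 flights).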

Visualizing the data

ggplot(tidy_delays, aes(x = Destination, y = Delay, color = Airline)) +
  geom_jitter(width = 0.2, height = 0) +
  theme_minimal() +
  labs(title = "Arrival Delays by Destination and Airline",
       x = "Destination",
       y = "Flights (count)") +
  scale_color_manual(values = c("skyblue", "orange"))

ggplot(summary_delays, aes(x = Airline, y = Mean_Delay, fill = Airline)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Delay by Airline",
       x = "Airline",
       y = "Mean of Delay Column (flight counts)") +
  geom_text(aes(label = round(Mean_Delay, 1)), vjust = -0.5) +
  scale_fill_brewer(palette = "Set3")
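
The plots above show raw counts, which depend on how many flights each airline operates on a route. As a complementary sketch, the share of flights delayed can be plotted per destination instead. The counts are copied from the printed table. The names flights, destination_rates, Delay_Rate, and rate_plot are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Counts copied from the printed table, one row per airline and destination
flights <- data.frame(
  Airline = rep(c("Alaska", "AM West"), each = 5),
  Destination = rep(
    c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
    times = 2
  ),
  on_time = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201),
  delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61)
)

# Share of flights delayed on each route
destination_rates <- flights %>%
  mutate(Delay_Rate = delayed / (on_time + delayed))

# Side-by-side bars so the two airlines can be compared within a destination
rate_plot <- ggplot(destination_rates,
                    aes(x = Destination, y = Delay_Rate, fill = Airline)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Share of Flights Delayed by Destination and Airline",
       x = "Destination",
       y = "Share of flights delayed")

print(rate_plot)
```

With these counts, Alaska's delayed share is lower than AM West's at each of the five destinations, so comparing within destinations can tell a different story than comparing totals.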

Conclusion

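AM West has a higher mean and greater variability in the Delay column than Alaska (mean of about 722 vs. 378; standard deviation of about 1,460 vs. 544).

Alaska's values are lower on average and more consistent. Because the column holds flight counts rather than delay durations, these differences also reflect how many flights each airline operates.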