In this analysis we tidy and transform a dataset that records on-time and delayed flight counts for two airlines (Alaska and AM West) across five destinations, then compare the two airlines.
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
url <- "https://raw.githubusercontent.com/Emin-NYC/DATA607-week4/refs/heads/main/airline_delays.csv"
airline_delays <- read.csv(url)
print(airline_delays)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AM West on time 694 4840 383 320 201
## 4 AM West delayed 117 415 65 129 61
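# Reshape from wide to long: one row per Airline, Status, and Destination.
# The values are flight counts; the column is named Delay to match the code that follows.
tidy_delays <- airline_delays %>%
  pivot_longer(
    cols = c(Los.Angeles, Phoenix, San.Diego, San.Francisco, Seattle),
    names_to = "Destination",
    values_to = "Delay"
  )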
head(tidy_delays)
## # A tibble: 6 × 4
## Airline Status Destination Delay
## <chr> <chr> <chr> <int>
## 1 Alaska on time Los.Angeles 497
## 2 Alaska on time Phoenix 221
## 3 Alaska on time San.Diego 212
## 4 Alaska on time San.Francisco 503
## 5 Alaska on time Seattle 1841
## 6 Alaska delayed Los.Angeles 62
sum(is.na(tidy_delays))
## [1] 0
summary_delays <- tidy_delays %>%
  group_by(Airline) %>%
  summarize(
    Mean_Delay = mean(Delay, na.rm = TRUE),
    Median_Delay = median(Delay, na.rm = TRUE),
    SD_Delay = sd(Delay, na.rm = TRUE)
  )
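The Delay column holds flight counts, not delay durations, so these statistics describe the size of each Airline, Status, and Destination cell rather than how late flights were. AM West's higher mean (722.5 versus 377.5 flights per cell) reflects that it operated more flights overall (7,225 versus 3,775).
For both airlines the median is below the mean, which indicates that the distribution of counts is right-skewed: a few very large cells (Phoenix for AM West, Seattle for Alaska) pull the mean upward.
The larger standard deviation for AM West reflects how heavily its traffic is concentrated in Phoenix (4,840 on-time flights), not less consistent performance. The data contain no per-day or per-flight delay times, so they cannot show whether some days had minimal delays and others extreme ones.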
print(summary_delays)
## # A tibble: 2 × 4
## Airline Mean_Delay Median_Delay SD_Delay
## <chr> <dbl> <dbl> <dbl>
## 1 AM West 722. 260. 1460.
## 2 Alaska 378. 216. 544.
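Because the table records counts of on-time and delayed flights, one way to compare the airlines' delay performance is the share of flights that were delayed. The following is a minimal sketch rather than part of the original analysis: the counts are retyped from the printed `airline_delays` table so the block runs without a network connection, and the names `airline_counts`, `delay_rates`, and `overall_rates` are illustrative.

```r
library(dplyr)
library(tidyr)

# Counts copied from the printed airline_delays table (assumed to match the CSV)
airline_counts <- data.frame(
  Airline       = c("Alaska", "Alaska", "AM West", "AM West"),
  Status        = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497, 62, 694, 117),
  Phoenix       = c(221, 12, 4840, 415),
  San.Diego     = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle       = c(1841, 305, 201, 61)
)

# One row per Airline and Destination, with on-time and delayed counts side by side
delay_rates <- airline_counts %>%
  pivot_longer(cols = -c(Airline, Status),
               names_to = "Destination", values_to = "Flights") %>%
  pivot_wider(names_from = Status, values_from = Flights) %>%
  rename(on_time = `on time`) %>%
  mutate(total = on_time + delayed,
         delay_rate = delayed / total)

# Overall delay rate per airline, pooled across destinations
overall_rates <- delay_rates %>%
  group_by(Airline) %>%
  summarize(delayed_total = sum(delayed),
            flights_total = sum(total),
            delay_rate = delayed_total / flights_total)

print(delay_rates)
print(overall_rates)
```

With these counts the pooled delay rate is about 13.3% for Alaska (501 of 3,775 flights) and about 10.9% for AM West (787 of 7,225), yet Alaska has the lower delay rate at each of the five destinations. The reversal arises because most AM West flights go to Phoenix, where delays are rare for both airlines.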
# Each point is a flight count for one Airline, Status, and Destination cell
ggplot(tidy_delays, aes(x = Destination, y = Delay, color = Airline)) +
  geom_jitter(width = 0.2, height = 0) +
  theme_minimal() +
  labs(title = "Flight Counts by Destination and Airline",
       x = "Destination",
       y = "Number of flights") +
  scale_color_manual(values = c("skyblue", "orange"))
# Mean_Delay is the mean flight count per cell, so the axis is labeled in flights
ggplot(summary_delays, aes(x = Airline, y = Mean_Delay, fill = Airline)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Average Flight Count per Cell by Airline",
       x = "Airline",
       y = "Mean flights per cell") +
  geom_text(aes(label = round(Mean_Delay, 1)), vjust = -0.5) +
  scale_fill_brewer(palette = "Set3")
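A plot of delay rates, rather than raw counts, makes the destinations directly comparable even though the airlines fly very different volumes to each. The sketch below is an illustration under stated assumptions: the counts are retyped from the printed `airline_delays` table, and the names `rates` and `rate_plot` are hypothetical.

```r
library(ggplot2)

# Delayed and on-time counts per airline and destination,
# copied from the printed airline_delays table (assumed to match the CSV)
rates <- data.frame(
  Airline     = rep(c("Alaska", "AM West"), each = 5),
  Destination = rep(c("Los Angeles", "Phoenix", "San Diego",
                      "San Francisco", "Seattle"), times = 2),
  delayed     = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  on_time     = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201)
)

# Share of flights that were delayed in each cell
rates$delay_rate <- rates$delayed / (rates$delayed + rates$on_time)

# Side-by-side bars: one pair of airlines per destination
rate_plot <- ggplot(rates, aes(x = Destination, y = delay_rate, fill = Airline)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = function(x) paste0(round(100 * x), "%")) +
  theme_minimal() +
  labs(title = "Delay Rate by Destination and Airline",
       x = "Destination",
       y = "Share of flights delayed")

print(rate_plot)
```

Dodged bars are used here so the two airlines sit next to each other within each destination; a faceted layout would work as well.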
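AM West has a higher average flight count per cell and greater variability in those counts than Alaska, driven largely by its heavy Phoenix traffic.
Alaska's counts are lower on average and less variable. Because these summaries describe flight volumes, comparing how often each airline is delayed requires delay rates (delayed flights divided by total flights) rather than the count statistics alone.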