In this analysis, we examine arrival delays for two airlines across five destinations. The data comes from a CSV file, which is initially in a wide format, showing delays for each airline separately across the destinations. Our goal is to tidy this data, transform it into a format suitable for analysis, and then compare the arrival delays between the two airlines.
First, we load the data set from the CSV file.
# Load the data from the CSV file
arrival_delays <- read.csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/week%204/airline_arrivals.csv")
# Display the first few rows of the data
head(arrival_delays)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4,840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
# View the structure of the data
str(arrival_delays)
## 'data.frame': 4 obs. of 7 variables:
## $ Airline : chr "ALASKA" "ALASKA" "AM WEST" "AM WEST"
## $ Status : chr "on time" "delayed" "on time" "delayed"
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : chr "221" "12" "4,840" "415"
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : chr "1,841" "305" "201" "61"
# Summarize the data to get an idea of its distribution
summary(arrival_delays)
## Airline Status Los.Angeles Phoenix
## Length:4 Length:4 Min. : 62.0 Length:4
## Class :character Class :character 1st Qu.:103.2 Class :character
## Mode :character Mode :character Median :307.0 Mode :character
## Mean :342.5
## 3rd Qu.:546.2
## Max. :694.0
## San.Diego San.Francisco Seattle
## Min. : 20.00 Min. :102.0 Length:4
## 1st Qu.: 53.75 1st Qu.:122.2 Class :character
## Median :138.50 Median :224.5 Mode :character
## Mean :170.00 Mean :263.5
## 3rd Qu.:254.75 3rd Qu.:365.8
## Max. :383.00 Max. :503.0
The dataset is currently in a “wide” format, with each destination represented as a row, and separate columns for each airline’s delays. We’ll tidy this data by converting it into a “long” format, where each row contains a destination, airline, and the corresponding delay.
# First, inspect the structure of the data
str(arrival_delays)
## 'data.frame': 4 obs. of 7 variables:
## $ Airline : chr "ALASKA" "ALASKA" "AM WEST" "AM WEST"
## $ Status : chr "on time" "delayed" "on time" "delayed"
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : chr "221" "12" "4,840" "415"
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : chr "1,841" "305" "201" "61"
# Convert all city columns to numeric, replacing non-numeric values with NA
arrival_delays <- arrival_delays %>%
mutate_at(vars(Los.Angeles, Phoenix, San.Diego, San.Francisco, Seattle), as.numeric)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Phoenix = .Primitive("as.double")(Phoenix)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
# Now pivot the data into long format
tidy_delays <- arrival_delays %>%
pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
names_to = "City", values_to = "Flights")
# View the tidied dataset
head(tidy_delays)
## # A tibble: 6 × 4
## Airline Status City Flights
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle NA
## 6 ALASKA delayed Los.Angeles 62
With the data now tidied, we can begin comparing the average arrival delays for the two airlines across all destinations. This will give us insights into which airline tends to have longer or shorter delays.
# Plot the count of on-time and delayed flights for each airline
ggplot(tidy_delays, aes(x = Airline, y = Flights, fill = Status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "On-Time vs Delayed Flights by Airline",
x = "Airline", y = "Number of Flights") +
theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
# Plot the delays across cities
ggplot(tidy_delays %>% filter(Status == "delayed"),
aes(x = City, y = Flights, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Flight Delays Across Cities",
x = "City", y = "Number of Delayed Flights") +
theme_minimal() +
coord_flip()
The analysis shows clear differences in arrival delays between the two airlines. By tidying the data and transforming it into a long format, we were able to easily compare the performance of each airline across various destinations. These findings could help further studies or improve operational efficiency by identifying problem areas for certain airlines.