This report examines the arrival delays of two airlines across five destinations. The tidyr and dplyr packages are used to tidy the dataset after it is created, and then to analyze the differences in delays between the airlines.
library(tibble)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readr)
library(stringr)
library(ggplot2)
data <- tibble(
  Airline = rep(c("ALASKA", "AM WEST"), each = 2),
  Status = rep(c("on time", "delayed"), times = 2),
  Los_Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San_Diego = c(212, 20, 383, 65),
  San_Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)
write_csv(data, "airline_delays.csv")
The data is then read back in so that its format can be tidied.
delays <- read_csv("airline_delays.csv", col_types = cols(Airline = col_character(), Status = col_character()))
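The dataset is converted to long format using “pivot_longer”, so that each row holds the flight count for one airline, status, and destination. Any missing Status values are first labeled "Unknown".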
delays$Status[is.na(delays$Status)] <- "Unknown"
delays_long <- delays |>
  pivot_longer(cols = -c(Airline, Status), names_to = "Destination", values_to = "Count")
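As a quick check on the reshape, the following minimal sketch applies the same “pivot_longer” call to a two-destination subset of the data (ALASKA only). The names `wide` and `long` are illustrative and do not appear in the report.

```r
library(tibble)
library(tidyr)

# Illustrative subset of the wide table: ALASKA, two destinations
wide <- tibble(
  Airline = c("ALASKA", "ALASKA"),
  Status = c("on time", "delayed"),
  Los_Angeles = c(497, 62),
  Phoenix = c(221, 12)
)

# Each destination column becomes a (Destination, Count) pair
long <- wide |>
  pivot_longer(cols = -c(Airline, Status), names_to = "Destination", values_to = "Count")

# 2 rows x 2 destination columns give 4 rows, and the flight total is unchanged
stopifnot(nrow(long) == 4)
stopifnot(sum(long$Count) == sum(wide$Los_Angeles) + sum(wide$Phoenix))
long
```

Checking the row count and the total after a reshape is a cheap way to confirm that no flights were dropped or duplicated.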
Next, the percentage of delayed flights per airline is computed, after any missing combination of airline, status, and destination is filled in with a count of zero.
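To illustrate how missing combinations are handled, the sketch below runs “complete” on a hypothetical long table in which one row is absent. The table `partial` is invented for illustration; in the report's dataset every combination is already present, so the same call should leave the counts unchanged.

```r
library(tibble)
library(tidyr)

# Hypothetical long table: ALASKA has no "delayed" row for Phoenix
partial <- tibble(
  Airline = c("ALASKA", "ALASKA", "ALASKA"),
  Status = c("on time", "delayed", "on time"),
  Destination = c("Los_Angeles", "Los_Angeles", "Phoenix"),
  Count = c(497, 62, 221)
)

# complete() adds the missing Airline/Status/Destination combination with Count = 0
filled <- partial |>
  complete(Airline, Status, Destination, fill = list(Count = 0))
filled
```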
# Fill any missing Airline/Status/Destination combination with a zero count
delays_long <- delays_long |>
  complete(Airline, Status, Destination, fill = list(Count = 0))
# Total flights per airline and status, then one column per status
delay_percentage <- delays_long |>
  group_by(Airline, Status) |>
  summarise(Total_Flights = sum(Count), .groups = "drop") |>
  pivot_wider(names_from = Status, values_from = Total_Flights, values_fill = 0) |>
  mutate(Percentage_Delayed = (`delayed` / (`delayed` + `on time`)) * 100)
delay_percentage
## # A tibble: 2 × 4
## Airline delayed `on time` Percentage_Delayed
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3274 13.3
## 2 AM WEST 787 6438 10.9
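The percentages in the table can be verified by hand from the two totals per airline. The sketch below repeats the calculation with plain vectors; `delayed`, `on_time`, and `pct_delayed` are illustrative names.

```r
# Totals copied from the summary table above
delayed <- c("ALASKA" = 501, "AM WEST" = 787)
on_time <- c("ALASKA" = 3274, "AM WEST" = 6438)

# Percentage delayed = delayed / (delayed + on time) * 100
pct_delayed <- delayed / (delayed + on_time) * 100
round(pct_delayed, 1)
```

For ALASKA this is 501 / 3775, about 13.3%, and for AM WEST 787 / 7225, about 10.9%, matching the table.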
Overall figures can hide differences between airports. Analyzing the percentage of delays per destination helps identify trends across locations and can point to airline-specific inefficiencies or to outside factors such as weather or airport congestion.
# Same calculation as above, but per destination and airline
destination_summary <- delays_long |>
  group_by(Destination, Airline, Status) |>
  summarise(Total = sum(Count), .groups = "drop") |>
  pivot_wider(names_from = Status, values_from = Total, values_fill = 0) |>
  mutate(Percentage_Delayed = (`delayed` / (`delayed` + `on time`)) * 100)
destination_summary
## # A tibble: 10 × 5
## Destination Airline delayed `on time` Percentage_Delayed
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Los_Angeles ALASKA 62 497 11.1
## 2 Los_Angeles AM WEST 117 694 14.4
## 3 Phoenix ALASKA 12 221 5.15
## 4 Phoenix AM WEST 415 4840 7.90
## 5 San_Diego ALASKA 20 212 8.62
## 6 San_Diego AM WEST 65 383 14.5
## 7 San_Francisco ALASKA 102 503 16.9
## 8 San_Francisco AM WEST 129 320 28.7
## 9 Seattle ALASKA 305 1841 14.2
## 10 Seattle AM WEST 61 201 23.3
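One way to make the per-destination comparison easier to read is to place the two airlines side by side and compute the gap in percentage points. This is a sketch that copies the rounded percentages from the table above; the names `dest`, `gap`, and `Gap` are illustrative.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Rounded percentages copied from the destination summary above
dest <- tibble(
  Destination = rep(c("Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle"), each = 2),
  Airline = rep(c("ALASKA", "AM WEST"), times = 5),
  Percentage_Delayed = c(11.1, 14.4, 5.15, 7.90, 8.62, 14.5, 16.9, 28.7, 14.2, 23.3)
)

# One row per destination, one column per airline, plus the gap (AM WEST minus ALASKA)
gap <- dest |>
  pivot_wider(names_from = Airline, values_from = Percentage_Delayed) |>
  mutate(Gap = `AM WEST` - ALASKA) |>
  arrange(desc(Gap))
gap
```

With these figures the gap is positive at every destination and largest for San Francisco, meaning AM WEST has the higher delay percentage in each city.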
This step converts the column names to syntactically valid R names (for example, "on time" becomes "on.time") so that they can be referenced without backticks.
colnames(delay_percentage) <- make.names(colnames(delay_percentage))
colnames(destination_summary) <- make.names(colnames(destination_summary))
print(delay_percentage)
## # A tibble: 2 × 4
## Airline delayed on.time Percentage_Delayed
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3274 13.3
## 2 AM WEST 787 6438 10.9
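The effect of “make.names” can be seen in isolation on the column names themselves. The sketch below also shows `rename()` from dplyr as one possible alternative when an underscore is preferred over a dot; `summary_tbl` and `renamed` are illustrative names, not objects from the report.

```r
library(tibble)
library(dplyr)

# make.names() replaces characters that are invalid in R names with "."
original <- c("Airline", "delayed", "on time", "Percentage_Delayed")
valid <- make.names(original)
valid

# Alternative sketch: rename a single column explicitly
summary_tbl <- tibble(Airline = "ALASKA", delayed = 501, `on time` = 3274)
renamed <- summary_tbl |>
  rename(on_time = `on time`)
names(renamed)
```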
ggplot(destination_summary, aes(x = Destination, y = Percentage_Delayed, fill = Airline)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Percentage of Delayed Flights by Destination and Airline", y = "Percentage Delayed (%)")
Finally, a histogram shows the distribution of flight counts for each destination.
ggplot(delays_long, aes(x = Count)) +
  geom_histogram(binwidth = 50, fill = "tomato", color = "gold", alpha = 0.7) +
  facet_wrap(~ Destination) +
  theme_minimal() +
  labs(title = "Histogram of Flight Counts per Destination", x = "Number of Flights", y = "Frequency")
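Overall, AM WEST has a lower percentage of delayed flights (10.9%) than ALASKA (13.3%).
Comparing cities reverses the picture: ALASKA has the lower delay percentage at all five destinations, with the largest gaps in San Francisco (16.9% versus 28.7%) and Seattle (14.2% versus 23.3%), which suggests that location-specific factors are at play.
The discrepancy arises because the aggregated percentages weight each destination by flight volume. Most AM WEST flights (5,255 of 7,225) go to Phoenix, where delays are lowest, while most ALASKA flights (2,146 of 3,775) go to Seattle, where delays are more common. Aggregated airline delay percentages therefore do not adequately reflect performance at specific locations, and a city-specific analysis offers a more thorough perspective.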
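To see why aggregated percentages can differ from the city-level picture, the sketch below recomputes each airline's overall delay rate as a weighted average of its per-destination rates, with each destination's share of the airline's flights as the weight. The counts are those of the report's dataset; the names `flights`, `weights`, `Share`, `Weighted_Rate`, `Largest_Share`, and `Busiest` are illustrative.

```r
library(tibble)
library(dplyr)

# Counts from the report's dataset, one row per airline and destination
flights <- tibble(
  Airline = rep(c("ALASKA", "AM WEST"), each = 5),
  Destination = rep(c("Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle"), times = 2),
  on_time = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201),
  delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61)
)

weights <- flights |>
  mutate(Total = on_time + delayed, Rate = delayed / Total * 100) |>
  group_by(Airline) |>
  # Share of the airline's flights that go to each destination
  mutate(Share = Total / sum(Total)) |>
  summarise(
    Weighted_Rate = sum(Share * Rate),      # equals the overall delay percentage
    Largest_Share = max(Share),             # weight of the busiest destination
    Busiest = Destination[which.max(Share)]
  )
weights
```

The weighted rate reproduces the overall percentages, and the busiest destination for each airline (Seattle for ALASKA, Phoenix for AM WEST) carries more than half of the weight, which is why the aggregate can point the opposite way from the city-level comparison.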