This report presents an analysis of airline arrival delays for two airlines using R. The original dataset is essentially a chart that later became a CSV file which I generated called data.csv.
The view includes flight arrival records for two airlines, Alaska Airlines and AM West, across five major U.S. airports. The objective is to transform, tidy, and analyze the data to compare delays between the two airlines.
The comma separated value representation of the above view is shown below.
,,Los Angeles,Phoenix,San Diego,San Francisco,Seattle
ALASKA,On Time,497,221,212,503,1841
,Delayed,62,12,20,102,305
AM WEST,On Time,694,4840,383,320,201
,Delayed,117,415,65,129,61
# Import tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tibble)
# Visualization libraries
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
# URL to the data
df <- read.csv("c:/Users/Ali/Documents/R-documents/data.csv", header = TRUE)
To clean the dataset:
# Clean missing airline names after replacing strings with NA values
df_clean <- df %>%
mutate(X = ifelse(X == "", NA, X)) %>%
fill(X, .direction = "down")
# Filter for delayed flights
df_filtered <- df_clean %>%
# Filtering the status column to solely retrieve the delayed rows
filter(X.1 == "Delayed") %>%
# Dropping the column pertaining to flight status
select(-X.1)
The dataset is transposed so that specific airports become the rows instead of columns. After transposition, column names are reassigned to easily compare the delays from both airlines.
# Transpose and rename columns
df_transposed <- df_filtered %>%
t() %>%
as.data.frame()
# Temporary names of transposed DataFrame columns
print(names(df_transposed)) # Leads to V1 and V2
## [1] "V1" "V2"
# Rename columns to reflect the airlines and the fact that the rows pertain to delays from each major airport.
df_renamed <- df_transposed[-1,] %>%
rename(Alaska_Delayed = "V1", AM_WEST_Delayed = "V2")
# Convert row names into a column called Airport for easy identification
## Index to Column Mapping
df_renamed <- rownames_to_column(df_renamed, var="Airport")
Visualizations - Bar Chart - A comparative visualization of the number of delays per airport.
# Bar Chart
ggplot(df_renamed, aes(x = Airport)) +
geom_bar(aes(y = AM_WEST_Delayed, fill = "AM West"), stat = "identity", position = "dodge") +
geom_bar(aes(y = Alaska_Delayed, fill = "Alaska Airlines"), stat = "identity", position = "dodge") +
scale_fill_manual(values = c("Alaska Airlines" = "gold", "AM West" = "orangered")) +
labs(title = "Comparison of Delayed Flights by Airport", x = "Airport", y = "Number of Delayed Flights") +
theme_minimal()
ggsave("bar_chart_delayed_flights.png", width = 8, height = 5, dpi = 300)
From the visualization, it is clear that AM West has significantly more delays at majority of the airports, while Alaska Airlines has a higher number of delays at Seattle, but is also very close to the number of delays at AM West which is indicated by the bar eclipsing any red colored bar from appearing at the location of Seattle.