Purpose of the Report

This report examines the arrival delays for two airlines across five destinations. The tidyr and dplyr packages will be used to tidy the dataset after it is created, and then to analyze the differences in delays between the airlines.

library(tibble)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readr)
library(stringr)
library(ggplot2)

The First Step Is to Create the Data for the CSV File

data <- tibble(
  Airline = rep(c("ALASKA", "AM WEST"), each = 2),
  Status = rep(c("on time", "delayed"), times = 2),
  Los_Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San_Diego = c(212, 20, 383, 65),
  San_Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61))

Using write_csv to Save the Dataset

write_csv(data, "airline_delays.csv")

Reading and Tidying the Data

The data will be read back in from the CSV file and then tidied.

delays <- read_csv("airline_delays.csv", col_types = cols(Airline = col_character(), Status = col_character()))

Data Tidying

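The dataset will be converted to a long format using “pivot_longer”, so that each row holds one airline, status, and destination combination. Any missing status is first labeled "Unknown" for clarity.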

delays$Status[is.na(delays$Status)] <- "Unknown"
delays_long <- delays |>
  pivot_longer(cols = -c(Airline, Status), names_to = "Destination", values_to = "Count")
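As a quick check on the reshaping step, the sketch below rebuilds the same wide table and confirms that pivoting changes only the layout, not the data. The names `wide` and `long` are illustrative and are not used elsewhere in the report.

```r
library(tibble)
library(tidyr)

# Rebuild the wide table from the report so this check is self-contained.
wide <- tibble(
  Airline = rep(c("ALASKA", "AM WEST"), each = 2),
  Status = rep(c("on time", "delayed"), times = 2),
  Los_Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San_Diego = c(212, 20, 383, 65),
  San_Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)

long <- wide |>
  pivot_longer(cols = -c(Airline, Status), names_to = "Destination", values_to = "Count")

# 4 wide rows x 5 destination columns should give 20 long rows,
# and reshaping must not change the total number of flights.
stopifnot(nrow(long) == 20)
stopifnot(sum(long$Count) == sum(unlist(wide[, -(1:2)])))
```

If either check fails, the column selection in `cols` is the first thing to inspect.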

Analyzing the Comparison

This section computes the percentage of delayed flights per airline, after making sure that any missing airline, status, and destination combinations are filled in with a count of zero.

delays_long <- delays_long |>
  complete(Airline, Status, Destination, fill = list(Count = 0))
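In this dataset every combination is already present, so the call above changes nothing. To illustrate what it guards against, here is a minimal sketch on a hypothetical toy table (`toy` is made up for this example) in which one combination is missing.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: the "delayed" row for AM WEST is missing.
toy <- tibble(
  Airline = c("ALASKA", "ALASKA", "AM WEST"),
  Status = c("on time", "delayed", "on time"),
  Destination = "Phoenix",
  Count = c(221, 12, 4840)
)

# complete() adds the missing Airline/Status/Destination combination
# and fills its Count with 0 instead of NA.
toy_complete <- toy |>
  complete(Airline, Status, Destination, fill = list(Count = 0))

toy_complete
```

Without the `fill` argument the new row would carry `NA`, which would then propagate through `sum()` in the summaries that follow.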

delay_percentage <- delays_long |>
  group_by(Airline, Status) |>
  summarise(Total_Flights = sum(Count), .groups = "drop") |>
  pivot_wider(names_from = Status, values_from = Total_Flights, values_fill = 0) |>
  mutate(Percentage_Delayed = (`delayed` / (`delayed` + `on time`)) * 100)
delay_percentage

## # A tibble: 2 × 4
##   Airline delayed `on time` Percentage_Delayed
##   <chr>     <dbl>     <dbl>              <dbl>
## 1 ALASKA      501      3274               13.3
## 2 AM WEST     787      6438               10.9
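The percentage column can be verified by hand from the two totals in the table. The sketch below repeats the same formula in base R; the vector names `delayed`, `on_time`, and `pct_delayed` are chosen only for this illustration.

```r
# Totals taken from the table above.
delayed <- c(ALASKA = 501, `AM WEST` = 787)
on_time <- c(ALASKA = 3274, `AM WEST` = 6438)

# Percentage delayed = delayed / (delayed + on time) * 100
pct_delayed <- delayed / (delayed + on_time) * 100
round(pct_delayed, 1)
```

For ALASKA this is 501 / 3775, and for AM WEST it is 787 / 7225.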

Screening Destination-wise Performance

A destination-wise analysis helps identify trends in delays across locations. Differences between airlines at the same destination point to airline-specific inefficiencies, while differences between destinations may reflect outside factors such as weather or airport congestion. Analyzing the percentage of delays per destination therefore tells us more than each airline's overall performance alone.

destination_summary <- delays_long |>
  group_by(Destination, Airline, Status) |>
  summarise(Total = sum(Count), .groups = "drop") |>
  pivot_wider(names_from = Status, values_from = Total, values_fill = 0) |>
  mutate(Percentage_Delayed = (`delayed` / (`delayed` + `on time`)) * 100)
destination_summary

## # A tibble: 10 × 5
##    Destination   Airline delayed `on time` Percentage_Delayed
##    <chr>         <chr>     <dbl>     <dbl>              <dbl>
##  1 Los_Angeles   ALASKA       62       497              11.1 
##  2 Los_Angeles   AM WEST     117       694              14.4 
##  3 Phoenix       ALASKA       12       221               5.15
##  4 Phoenix       AM WEST     415      4840               7.90
##  5 San_Diego     ALASKA       20       212               8.62
##  6 San_Diego     AM WEST      65       383              14.5 
##  7 San_Francisco ALASKA      102       503              16.9 
##  8 San_Francisco AM WEST     129       320              28.7 
##  9 Seattle       ALASKA      305      1841              14.2 
## 10 Seattle       AM WEST      61       201              23.3
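To make the per-destination comparison easier to read, one option is to put the two airlines side by side and compute the gap between them. This is a sketch using the rounded percentages from the table above; `pct`, `gap`, and the `Gap` column are names introduced only for this example.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Percentages copied (rounded) from the destination summary above.
pct <- tribble(
  ~Destination,    ~Airline,  ~Percentage_Delayed,
  "Los_Angeles",   "ALASKA",  11.1,
  "Los_Angeles",   "AM WEST", 14.4,
  "Phoenix",       "ALASKA",   5.15,
  "Phoenix",       "AM WEST",  7.90,
  "San_Diego",     "ALASKA",   8.62,
  "San_Diego",     "AM WEST", 14.5,
  "San_Francisco", "ALASKA",  16.9,
  "San_Francisco", "AM WEST", 28.7,
  "Seattle",       "ALASKA",  14.2,
  "Seattle",       "AM WEST", 23.3
)

# One row per destination, one column per airline, plus the difference
# in percentage points (AM WEST minus ALASKA), largest gap first.
gap <- pct |>
  pivot_wider(names_from = Airline, values_from = Percentage_Delayed) |>
  mutate(Gap = `AM WEST` - ALASKA) |>
  arrange(desc(Gap))

gap
```

A positive `Gap` means AM WEST had the higher share of delayed flights at that destination.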

Fixing Column Name Issues

This step ensures that the column names can be referenced correctly. The name `on time` contains a space, so make.names() converts it to the syntactic name on.time.

colnames(delay_percentage) <- make.names(colnames(delay_percentage))
colnames(destination_summary) <- make.names(colnames(destination_summary))

print(delay_percentage)

## # A tibble: 2 × 4
##   Airline delayed on.time Percentage_Delayed
##   <chr>     <dbl>   <dbl>              <dbl>
## 1 ALASKA      501    3274               13.3
## 2 AM WEST     787    6438               10.9
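The effect of the renaming can be seen on the column names alone. The following base R sketch applies `make.names()` to the same four names; `old_names` and `new_names` are illustrative variables.

```r
# Column names as they come out of pivot_wider().
old_names <- c("Airline", "delayed", "on time", "Percentage_Delayed")

# make.names() replaces characters that are invalid in R names with ".".
new_names <- make.names(old_names)
new_names
```

Only the name containing a space changes; the other three are already valid and pass through untouched.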

Visualizing the Data with ggplot2

ggplot(destination_summary, aes(x = Destination, y = Percentage_Delayed, fill = Airline)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Percentage of Delayed Flights by Destination and Airline", y = "Percentage Delayed (%)")

Finally, a Histogram of Flight Counts per Destination

ggplot(delays_long, aes(x = Count)) +
  geom_histogram(binwidth = 50, fill = "tomato", color = "gold", alpha = 0.7) +
  facet_wrap(~ Destination) +
  theme_minimal() +
  labs(title = "Histogram of Flight Counts per Destination", x = "Number of Flights", y = "Frequency")

Overall comparison:

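In aggregate, AM WEST has a lower percentage of delayed flights (10.9%) than ALASKA (13.3%).

Comparing cities shows the opposite: AM WEST has a higher percentage of delayed flights than ALASKA at every one of the five destinations, with the largest gaps in San Francisco (28.7% versus 16.9%) and Seattle (23.3% versus 14.2%). Delay rates also vary widely by city for both airlines, suggesting that location-specific factors are at play.

The discrepancy arises because each airline's aggregate percentage is weighted by where it flies. Most AM WEST flights (5,255 of 7,225) go to Phoenix, where delays are lowest, while most ALASKA flights (2,146 of 3,775) go to Seattle, where delays are more common. Aggregated airline delay percentages therefore do not adequately reflect performance at specific locations, and a city-specific analysis offers a more complete picture.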
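One way to see why the aggregate and city-level figures can point in different directions is to look at how each airline's flights are spread across destinations. The sketch below is an illustration rather than part of the original analysis. It uses the delayed and total counts from the destination summary, and the names `flights`, `mix`, `overall`, `Share_of_Flights`, and `Pct_Delayed` are introduced only here.

```r
library(tibble)
library(dplyr)

# Delayed flights and total flights (on time + delayed) per airline and
# destination, taken from the destination summary in this report.
flights <- tribble(
  ~Airline,  ~Destination,    ~Delayed, ~Total,
  "ALASKA",  "Los_Angeles",         62,    559,
  "ALASKA",  "Phoenix",             12,    233,
  "ALASKA",  "San_Diego",           20,    232,
  "ALASKA",  "San_Francisco",      102,    605,
  "ALASKA",  "Seattle",            305,   2146,
  "AM WEST", "Los_Angeles",        117,    811,
  "AM WEST", "Phoenix",            415,   5255,
  "AM WEST", "San_Diego",           65,    448,
  "AM WEST", "San_Francisco",      129,    449,
  "AM WEST", "Seattle",             61,    262
)

# Share of each airline's flights going to each destination,
# alongside the delay rate at that destination.
mix <- flights |>
  group_by(Airline) |>
  mutate(
    Share_of_Flights = Total / sum(Total) * 100,
    Pct_Delayed = Delayed / Total * 100
  ) |>
  ungroup()

# The overall delay rate is the share-weighted average of the
# destination-level delay rates.
overall <- mix |>
  group_by(Airline) |>
  summarise(Overall_Pct_Delayed = sum(Share_of_Flights * Pct_Delayed) / 100)

mix
overall
```

Because the overall rate is a weighted average, an airline that concentrates its flights at a low-delay destination can post a better aggregate figure even when its rate is worse at each individual destination.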

Tidying and Transforming Data

W. Durosier

2025-03-02