We were provided with a chart detailing arrivals and delays for two airlines across five destinations. Since the chart is in a wide format, the first step is to import the data. We obtain the data from GitHub and process it using the readr package. Because the values are not separated by commas, we utilize a specific delimiter during the import to ensure that each destination is allocated its own column, rather than being merged into a single column.
knitr::opts_chunk$set(echo = TRUE)
#install.packages(c("readr", "dplyr", "tidyr"))
library(readr); library(dplyr); library(tidyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#import dataset from Github, had to use delimiter because not comma seperated
wide_airport <- wide_airport <- read_delim("https://raw.githubusercontent.com/tiffhugh/DATA71200/refs/heads/main/wide_airport.csv", delim = "\t")
## New names:
## • `` -> `...1`
## • `` -> `...2`
## Rows: 4 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): ...1, ...2
## dbl (5): Los Angeles, Phoenix, San Diego, San Francisco, Seattle
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(wide_airport)
While the data has been successfully imported, several significant issues must be addressed before we can proceed with any analysis. The dataset is untidy, lacking column names for “airline” and “status,” and contains empty cells. By utilizing the tidy package, we can make necessary adjustments to enhance the data’s quality and structure.
knitr::opts_chunk$set(echo = TRUE)
# Fix columns include airline and status
colnames(wide_airport)[1] <- "airline"
colnames(wide_airport)[2] <- "status"
# Name empty cell to the correct airline
wide_airport <- wide_airport %>%
mutate(airline = ifelse(airline == "", NA, airline)) %>%
fill(airline, .direction = "down") %>%
mutate(airline = replace(airline, c(2, 4), c("ALASKA", "AM WEST")))
wide_airport
## # A tibble: 4 × 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Now that our table is complete, the next step is using pivot_longer() to transform our dataset from a wide format to a long format, allowing us to have a more organized structure where each destination is represented in a separate row along with its corresponding arrival status and frequency.
knitr::opts_chunk$set(echo = TRUE)
long_airport <- wide_airport %>%
pivot_longer(
cols = -c(airline, status),
names_to = 'destination',
values_to = 'Count'
)
head(long_airport)
## # A tibble: 6 × 4
## airline status destination Count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
The data is in a long format so analysis can be done.
knitr::opts_chunk$set(echo = TRUE)
# Calculate total counts for each airline and status
total_counts <- long_airport %>%
group_by(airline, status) %>%
summarize(TotalCount = sum(Count, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
percentage_arrival_rates <- total_counts %>%
group_by(airline) %>%
mutate(Percentage = (TotalCount / sum(TotalCount)) * 100) %>% # Calculate percentage
ungroup()
# create a comparsion between the airlines
comparison <- percentage_arrival_rates %>%
filter(airline %in% c("ALASKA", "AM WEST"))
comparison
## # A tibble: 4 × 4
## airline status TotalCount Percentage
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA delayed 501 13.3
## 2 ALASKA on time 3274 86.7
## 3 AM WEST delayed 787 10.9
## 4 AM WEST on time 6438 89.1
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(knitr)
# table using kable
table_comparison <- comparison %>%
select(airline, status, TotalCount, Percentage)
kable(table_comparison, caption = "Comparison of Arrival Rates for ALASKA and AM WEST Airlines")
airline | status | TotalCount | Percentage |
---|---|---|---|
ALASKA | delayed | 501 | 13.27152 |
ALASKA | on time | 3274 | 86.72848 |
AM WEST | delayed | 787 | 10.89273 |
AM WEST | on time | 6438 | 89.10727 |
# barplot
ggplot(comparison, aes(x = airline, y = Percentage, fill = status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Comparison of Arrival Rates by Airline",
x = "Airline",
y = "Percentage of Arrival Rates",
fill = "Arrival Status"
) +
theme_dark()
AM
WEST Airlines not only has a higher percentage of on-time arrivals
compared to ALASKA Airlines but also handles significantly more incoming
flights overall. With 7,225 flights in total, AM WEST manages a robust
operation, maintaining an 89.1% on-time arrival rate and only 10.9% of
flights being delayed. In contrast, ALASKA Airlines, with 3,775 flights,
has a slightly lower 86.7% on-time rate and a higher delay rate of
13.3%. Despite having fewer flights, ALASKA experiences a slightly
greater proportion of delays, while AM WEST’s larger volume of flights
shows their ability to maintain punctuality on a bigger scale,
reflecting more consistent performance under higher demand.
knitr::opts_chunk$set(echo = TRUE)
city_comparison_filtered <- long_airport %>%
filter(airline %in% c("ALASKA", "AM WEST")) %>%
filter(destination %in% c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")) %>%
group_by(airline, status, destination) %>%
summarize(TotalCount = sum(Count, na.rm = TRUE)) %>%
ungroup() %>%
group_by(airline, destination) %>%
mutate(Percentage = (TotalCount / sum(TotalCount)) * 100) %>% # calculate percentage of cities
ungroup()
## `summarise()` has grouped output by 'airline', 'status'. You can override using
## the `.groups` argument.
city_comparison_filtered
## # A tibble: 20 × 5
## airline status destination TotalCount Percentage
## <chr> <chr> <chr> <dbl> <dbl>
## 1 ALASKA delayed Los Angeles 62 11.1
## 2 ALASKA delayed Phoenix 12 5.15
## 3 ALASKA delayed San Diego 20 8.62
## 4 ALASKA delayed San Francisco 102 16.9
## 5 ALASKA delayed Seattle 305 14.2
## 6 ALASKA on time Los Angeles 497 88.9
## 7 ALASKA on time Phoenix 221 94.8
## 8 ALASKA on time San Diego 212 91.4
## 9 ALASKA on time San Francisco 503 83.1
## 10 ALASKA on time Seattle 1841 85.8
## 11 AM WEST delayed Los Angeles 117 14.4
## 12 AM WEST delayed Phoenix 415 7.90
## 13 AM WEST delayed San Diego 65 14.5
## 14 AM WEST delayed San Francisco 129 28.7
## 15 AM WEST delayed Seattle 61 23.3
## 16 AM WEST on time Los Angeles 694 85.6
## 17 AM WEST on time Phoenix 4840 92.1
## 18 AM WEST on time San Diego 383 85.5
## 19 AM WEST on time San Francisco 320 71.3
## 20 AM WEST on time Seattle 201 76.7
knitr::opts_chunk$set(echo = TRUE)
#table
kable(city_comparison_filtered, caption = "Comparison of Arrival Rates for ALASKA and AM WEST Airlines Across Cities")
airline | status | destination | TotalCount | Percentage |
---|---|---|---|---|
ALASKA | delayed | Los Angeles | 62 | 11.091234 |
ALASKA | delayed | Phoenix | 12 | 5.150215 |
ALASKA | delayed | San Diego | 20 | 8.620690 |
ALASKA | delayed | San Francisco | 102 | 16.859504 |
ALASKA | delayed | Seattle | 305 | 14.212488 |
ALASKA | on time | Los Angeles | 497 | 88.908766 |
ALASKA | on time | Phoenix | 221 | 94.849785 |
ALASKA | on time | San Diego | 212 | 91.379310 |
ALASKA | on time | San Francisco | 503 | 83.140496 |
ALASKA | on time | Seattle | 1841 | 85.787512 |
AM WEST | delayed | Los Angeles | 117 | 14.426634 |
AM WEST | delayed | Phoenix | 415 | 7.897241 |
AM WEST | delayed | San Diego | 65 | 14.508929 |
AM WEST | delayed | San Francisco | 129 | 28.730512 |
AM WEST | delayed | Seattle | 61 | 23.282443 |
AM WEST | on time | Los Angeles | 694 | 85.573366 |
AM WEST | on time | Phoenix | 4840 | 92.102759 |
AM WEST | on time | San Diego | 383 | 85.491071 |
AM WEST | on time | San Francisco | 320 | 71.269488 |
AM WEST | on time | Seattle | 201 | 76.717557 |
#barlpot
ggplot(city_comparison_filtered, aes(x = destination, y = Percentage, fill = status)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~airline) +
labs(
title = "Comparison of Arrival Rates Across Cities",
x = "City",
y = "Percentage of Arrival Rates",
fill = "Arrival Status"
) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)
)
The
comparison of arrival rates for ALASKA and AM WEST airlines across five
cities—Los Angeles, Phoenix, San Diego, San Francisco, and Seattle—shows
that AM WEST generally has a higher percentage of on-time arrivals in
most cities. For example, AM WEST had 92.1% of its flights arriving on
time in Phoenix, compared to ALASKA’s 94.8%. However, AM WEST
experienced more delays in some cities, such as San Francisco, where
28.7% of their flights were delayed compared to 16.9% for ALASKA.
Despite ALASKA’s smaller operations in some cities, it maintained
competitive on-time rates, with 88.9% on-time arrivals in Los Angeles,
and 85.8% in Seattle. Overall, while both airlines perform well, AM WEST
sees a higher proportion of delays in certain cities. The different
volumes of flights in various cities account for the discrepancies
seen.
Reference Grolemund, Garrett, and Hadley Wickham. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, 2017.