Untidy Airport

Import Wide Data From GitHub

We were provided with a chart detailing arrivals and delays for two airlines across five destinations. Since the chart is in a wide format, the first step is to import the data. We obtain the data from GitHub and process it using the readr package. Because the values are not separated by commas, we utilize a specific delimiter during the import to ensure that each destination is allocated its own column, rather than being merged into a single column.

knitr::opts_chunk$set(echo = TRUE)
#install.packages(c("readr", "dplyr", "tidyr"))
library(readr); library(dplyr); library(tidyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#import dataset from Github, had to use delimiter because not comma seperated 
wide_airport <- wide_airport <- read_delim("https://raw.githubusercontent.com/tiffhugh/DATA71200/refs/heads/main/wide_airport.csv", delim = "\t")

## New names:
## • `` -> `...1`
## • `` -> `...2`

## Rows: 4 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): ...1, ...2
## dbl (5): Los Angeles, Phoenix, San Diego, San Francisco, Seattle
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(wide_airport)

Populate Missing Data and Renaming

While the data has been successfully imported, several significant issues must be addressed before we can proceed with any analysis. The dataset is untidy, lacking column names for “airline” and “status,” and contains empty cells. By utilizing the tidy package, we can make necessary adjustments to enhance the data’s quality and structure.

knitr::opts_chunk$set(echo = TRUE)
# Fix columns include airline and status 
colnames(wide_airport)[1] <- "airline"
colnames(wide_airport)[2] <- "status"
# Name empty cell to the correct airline 
wide_airport <- wide_airport %>%
  mutate(airline = ifelse(airline == "", NA, airline)) %>%  
  fill(airline, .direction = "down") %>%  
  mutate(airline = replace(airline, c(2, 4), c("ALASKA", "AM WEST")))
wide_airport

## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 ALASKA  delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 AM WEST delayed           117     415          65             129      61

Transform Data Wide To Long

Now that our table is complete, the next step is using pivot_longer() to transform our dataset from a wide format to a long format, allowing us to have a more organized structure where each destination is represented in a separate row along with its corresponding arrival status and frequency.

knitr::opts_chunk$set(echo = TRUE)
long_airport <- wide_airport %>%
  pivot_longer(
    cols = -c(airline, status), 
    names_to = 'destination',     
    values_to = 'Count'       
  )
head(long_airport)

## # A tibble: 6 × 4
##   airline status  destination   Count
##   <chr>   <chr>   <chr>         <dbl>
## 1 ALASKA  on time Los Angeles     497
## 2 ALASKA  on time Phoenix         221
## 3 ALASKA  on time San Diego       212
## 4 ALASKA  on time San Francisco   503
## 5 ALASKA  on time Seattle        1841
## 6 ALASKA  delayed Los Angeles      62

Compared Percentage Of Arrivals By Airlines

The data is in a long format so analysis can be done.

knitr::opts_chunk$set(echo = TRUE)

# Calculate total counts for each airline and status
total_counts <- long_airport %>%
  group_by(airline, status) %>%
  summarize(TotalCount = sum(Count, na.rm = TRUE)) %>%
  ungroup()

## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.

percentage_arrival_rates <- total_counts %>%
  group_by(airline) %>%
  mutate(Percentage = (TotalCount / sum(TotalCount)) * 100) %>% # Calculate percentage 
  ungroup()
# create a comparsion between the airlines 
comparison <- percentage_arrival_rates %>%
  filter(airline %in% c("ALASKA", "AM WEST"))

comparison

## # A tibble: 4 × 4
##   airline status  TotalCount Percentage
##   <chr>   <chr>        <dbl>      <dbl>
## 1 ALASKA  delayed        501       13.3
## 2 ALASKA  on time       3274       86.7
## 3 AM WEST delayed        787       10.9
## 4 AM WEST on time       6438       89.1

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(knitr)

# table using kable
table_comparison <- comparison %>%
  select(airline, status, TotalCount, Percentage)
kable(table_comparison, caption = "Comparison of Arrival Rates for ALASKA and AM WEST Airlines")

Comparison of Arrival Rates for ALASKA and AM WEST Airlines
airline	status	TotalCount	Percentage
ALASKA	delayed	501	13.27152
ALASKA	on time	3274	86.72848
AM WEST	delayed	787	10.89273
AM WEST	on time	6438	89.10727

# barplot 
ggplot(comparison, aes(x = airline, y = Percentage, fill = status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Comparison of Arrival Rates by Airline",
    x = "Airline",
    y = "Percentage of Arrival Rates",
    fill = "Arrival Status"
  ) +
  theme_dark()

AM WEST Airlines not only has a higher percentage of on-time arrivals compared to ALASKA Airlines but also handles significantly more incoming flights overall. With 7,225 flights in total, AM WEST manages a robust operation, maintaining an 89.1% on-time arrival rate and only 10.9% of flights being delayed. In contrast, ALASKA Airlines, with 3,775 flights, has a slightly lower 86.7% on-time rate and a higher delay rate of 13.3%. Despite having fewer flights, ALASKA experiences a slightly greater proportion of delays, while AM WEST’s larger volume of flights shows their ability to maintain punctuality on a bigger scale, reflecting more consistent performance under higher demand.

Compared Percentage Of Arrivals By Cities Across Airlines

knitr::opts_chunk$set(echo = TRUE)

city_comparison_filtered <- long_airport %>%
  filter(airline %in% c("ALASKA", "AM WEST")) %>%
  filter(destination %in% c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")) %>%
  group_by(airline, status, destination) %>%
  summarize(TotalCount = sum(Count, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(airline, destination) %>%
  mutate(Percentage = (TotalCount / sum(TotalCount)) * 100) %>% # calculate percentage of cities 
  ungroup()

## `summarise()` has grouped output by 'airline', 'status'. You can override using
## the `.groups` argument.

city_comparison_filtered

## # A tibble: 20 × 5
##    airline status  destination   TotalCount Percentage
##    <chr>   <chr>   <chr>              <dbl>      <dbl>
##  1 ALASKA  delayed Los Angeles           62      11.1 
##  2 ALASKA  delayed Phoenix               12       5.15
##  3 ALASKA  delayed San Diego             20       8.62
##  4 ALASKA  delayed San Francisco        102      16.9 
##  5 ALASKA  delayed Seattle              305      14.2 
##  6 ALASKA  on time Los Angeles          497      88.9 
##  7 ALASKA  on time Phoenix              221      94.8 
##  8 ALASKA  on time San Diego            212      91.4 
##  9 ALASKA  on time San Francisco        503      83.1 
## 10 ALASKA  on time Seattle             1841      85.8 
## 11 AM WEST delayed Los Angeles          117      14.4 
## 12 AM WEST delayed Phoenix              415       7.90
## 13 AM WEST delayed San Diego             65      14.5 
## 14 AM WEST delayed San Francisco        129      28.7 
## 15 AM WEST delayed Seattle               61      23.3 
## 16 AM WEST on time Los Angeles          694      85.6 
## 17 AM WEST on time Phoenix             4840      92.1 
## 18 AM WEST on time San Diego            383      85.5 
## 19 AM WEST on time San Francisco        320      71.3 
## 20 AM WEST on time Seattle              201      76.7

knitr::opts_chunk$set(echo = TRUE)
#table 
kable(city_comparison_filtered, caption = "Comparison of Arrival Rates for ALASKA and AM WEST Airlines Across Cities")

Comparison of Arrival Rates for ALASKA and AM WEST Airlines Across Cities
airline	status	destination	TotalCount	Percentage
ALASKA	delayed	Los Angeles	62	11.091234
ALASKA	delayed	Phoenix	12	5.150215
ALASKA	delayed	San Diego	20	8.620690
ALASKA	delayed	San Francisco	102	16.859504
ALASKA	delayed	Seattle	305	14.212488
ALASKA	on time	Los Angeles	497	88.908766
ALASKA	on time	Phoenix	221	94.849785
ALASKA	on time	San Diego	212	91.379310
ALASKA	on time	San Francisco	503	83.140496
ALASKA	on time	Seattle	1841	85.787512
AM WEST	delayed	Los Angeles	117	14.426634
AM WEST	delayed	Phoenix	415	7.897241
AM WEST	delayed	San Diego	65	14.508929
AM WEST	delayed	San Francisco	129	28.730512
AM WEST	delayed	Seattle	61	23.282443
AM WEST	on time	Los Angeles	694	85.573366
AM WEST	on time	Phoenix	4840	92.102759
AM WEST	on time	San Diego	383	85.491071
AM WEST	on time	San Francisco	320	71.269488
AM WEST	on time	Seattle	201	76.717557

#barlpot 
ggplot(city_comparison_filtered, aes(x = destination, y = Percentage, fill = status)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~airline) + 
  labs(
    title = "Comparison of Arrival Rates Across Cities",
    x = "City",
    y = "Percentage of Arrival Rates",
    fill = "Arrival Status"
  ) +
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  
    plot.title = element_text(hjust = 0.5)
  )

The comparison of arrival rates for ALASKA and AM WEST airlines across five cities—Los Angeles, Phoenix, San Diego, San Francisco, and Seattle—shows that AM WEST generally has a higher percentage of on-time arrivals in most cities. For example, AM WEST had 92.1% of its flights arriving on time in Phoenix, compared to ALASKA’s 94.8%. However, AM WEST experienced more delays in some cities, such as San Francisco, where 28.7% of their flights were delayed compared to 16.9% for ALASKA. Despite ALASKA’s smaller operations in some cities, it maintained competitive on-time rates, with 88.9% on-time arrivals in Los Angeles, and 85.8% in Seattle. Overall, while both airlines perform well, AM WEST sees a higher proportion of delays in certain cities. The different volumes of flights in various cities account for the discrepancies seen.

Reference Grolemund, Garrett, and Hadley Wickham. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, 2017.