Import Wide Data From GitHub

We were provided with a chart detailing arrivals and delays for two airlines across five destinations. Since the chart is in a wide format, the first step is to import the data. We obtain the data from GitHub and process it using the readr package. Because the values are not separated by commas, we utilize a specific delimiter during the import to ensure that each destination is allocated its own column, rather than being merged into a single column.

knitr::opts_chunk$set(echo = TRUE)
#install.packages(c("readr", "dplyr", "tidyr"))
library(readr); library(dplyr); library(tidyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#import dataset from Github, had to use delimiter because not comma seperated 
wide_airport <- wide_airport <- read_delim("https://raw.githubusercontent.com/tiffhugh/DATA71200/refs/heads/main/wide_airport.csv", delim = "\t")
## New names:
## • `` -> `...1`
## • `` -> `...2`
## Rows: 4 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): ...1, ...2
## dbl (5): Los Angeles, Phoenix, San Diego, San Francisco, Seattle
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(wide_airport)

Populate Missing Data and Renaming

While the data has been successfully imported, several significant issues must be addressed before we can proceed with any analysis. The dataset is untidy, lacking column names for “airline” and “status,” and contains empty cells. By utilizing the tidy package, we can make necessary adjustments to enhance the data’s quality and structure.

knitr::opts_chunk$set(echo = TRUE)
# Fix columns include airline and status 
colnames(wide_airport)[1] <- "airline"
colnames(wide_airport)[2] <- "status"
# Name empty cell to the correct airline 
wide_airport <- wide_airport %>%
  mutate(airline = ifelse(airline == "", NA, airline)) %>%  
  fill(airline, .direction = "down") %>%  
  mutate(airline = replace(airline, c(2, 4), c("ALASKA", "AM WEST")))
wide_airport
## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 ALASKA  delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 AM WEST delayed           117     415          65             129      61

Transform Data Wide To Long

Now that our table is complete, the next step is using pivot_longer() to transform our dataset from a wide format to a long format, allowing us to have a more organized structure where each destination is represented in a separate row along with its corresponding arrival status and frequency.

knitr::opts_chunk$set(echo = TRUE)
long_airport <- wide_airport %>%
  pivot_longer(
    cols = -c(airline, status), 
    names_to = 'destination',     
    values_to = 'Count'       
  )
head(long_airport)
## # A tibble: 6 × 4
##   airline status  destination   Count
##   <chr>   <chr>   <chr>         <dbl>
## 1 ALASKA  on time Los Angeles     497
## 2 ALASKA  on time Phoenix         221
## 3 ALASKA  on time San Diego       212
## 4 ALASKA  on time San Francisco   503
## 5 ALASKA  on time Seattle        1841
## 6 ALASKA  delayed Los Angeles      62

Compared Percentage Of Arrivals By Airlines

The data is in a long format so analysis can be done.

knitr::opts_chunk$set(echo = TRUE)

# Calculate total counts for each airline and status
total_counts <- long_airport %>%
  group_by(airline, status) %>%
  summarize(TotalCount = sum(Count, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
percentage_arrival_rates <- total_counts %>%
  group_by(airline) %>%
  mutate(Percentage = (TotalCount / sum(TotalCount)) * 100) %>% # Calculate percentage 
  ungroup()
# create a comparsion between the airlines 
comparison <- percentage_arrival_rates %>%
  filter(airline %in% c("ALASKA", "AM WEST"))

comparison
## # A tibble: 4 × 4
##   airline status  TotalCount Percentage
##   <chr>   <chr>        <dbl>      <dbl>
## 1 ALASKA  delayed        501       13.3
## 2 ALASKA  on time       3274       86.7
## 3 AM WEST delayed        787       10.9
## 4 AM WEST on time       6438       89.1
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(knitr)

# table using kable
table_comparison <- comparison %>%
  select(airline, status, TotalCount, Percentage)
kable(table_comparison, caption = "Comparison of Arrival Rates for ALASKA and AM WEST Airlines")
Comparison of Arrival Rates for ALASKA and AM WEST Airlines
airline status TotalCount Percentage
ALASKA delayed 501 13.27152
ALASKA on time 3274 86.72848
AM WEST delayed 787 10.89273
AM WEST on time 6438 89.10727
# barplot 
ggplot(comparison, aes(x = airline, y = Percentage, fill = status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Comparison of Arrival Rates by Airline",
    x = "Airline",
    y = "Percentage of Arrival Rates",
    fill = "Arrival Status"
  ) +
  theme_dark()

AM WEST Airlines not only has a higher percentage of on-time arrivals compared to ALASKA Airlines but also handles significantly more incoming flights overall. With 7,225 flights in total, AM WEST manages a robust operation, maintaining an 89.1% on-time arrival rate and only 10.9% of flights being delayed. In contrast, ALASKA Airlines, with 3,775 flights, has a slightly lower 86.7% on-time rate and a higher delay rate of 13.3%. Despite having fewer flights, ALASKA experiences a slightly greater proportion of delays, while AM WEST’s larger volume of flights shows their ability to maintain punctuality on a bigger scale, reflecting more consistent performance under higher demand.

Compared Percentage Of Arrivals By Cities Across Airlines

knitr::opts_chunk$set(echo = TRUE)

city_comparison_filtered <- long_airport %>%
  filter(airline %in% c("ALASKA", "AM WEST")) %>%
  filter(destination %in% c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")) %>%
  group_by(airline, status, destination) %>%
  summarize(TotalCount = sum(Count, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(airline, destination) %>%
  mutate(Percentage = (TotalCount / sum(TotalCount)) * 100) %>% # calculate percentage of cities 
  ungroup()
## `summarise()` has grouped output by 'airline', 'status'. You can override using
## the `.groups` argument.
city_comparison_filtered
## # A tibble: 20 × 5
##    airline status  destination   TotalCount Percentage
##    <chr>   <chr>   <chr>              <dbl>      <dbl>
##  1 ALASKA  delayed Los Angeles           62      11.1 
##  2 ALASKA  delayed Phoenix               12       5.15
##  3 ALASKA  delayed San Diego             20       8.62
##  4 ALASKA  delayed San Francisco        102      16.9 
##  5 ALASKA  delayed Seattle              305      14.2 
##  6 ALASKA  on time Los Angeles          497      88.9 
##  7 ALASKA  on time Phoenix              221      94.8 
##  8 ALASKA  on time San Diego            212      91.4 
##  9 ALASKA  on time San Francisco        503      83.1 
## 10 ALASKA  on time Seattle             1841      85.8 
## 11 AM WEST delayed Los Angeles          117      14.4 
## 12 AM WEST delayed Phoenix              415       7.90
## 13 AM WEST delayed San Diego             65      14.5 
## 14 AM WEST delayed San Francisco        129      28.7 
## 15 AM WEST delayed Seattle               61      23.3 
## 16 AM WEST on time Los Angeles          694      85.6 
## 17 AM WEST on time Phoenix             4840      92.1 
## 18 AM WEST on time San Diego            383      85.5 
## 19 AM WEST on time San Francisco        320      71.3 
## 20 AM WEST on time Seattle              201      76.7
knitr::opts_chunk$set(echo = TRUE)
#table 
kable(city_comparison_filtered, caption = "Comparison of Arrival Rates for ALASKA and AM WEST Airlines Across Cities")
Comparison of Arrival Rates for ALASKA and AM WEST Airlines Across Cities
airline status destination TotalCount Percentage
ALASKA delayed Los Angeles 62 11.091234
ALASKA delayed Phoenix 12 5.150215
ALASKA delayed San Diego 20 8.620690
ALASKA delayed San Francisco 102 16.859504
ALASKA delayed Seattle 305 14.212488
ALASKA on time Los Angeles 497 88.908766
ALASKA on time Phoenix 221 94.849785
ALASKA on time San Diego 212 91.379310
ALASKA on time San Francisco 503 83.140496
ALASKA on time Seattle 1841 85.787512
AM WEST delayed Los Angeles 117 14.426634
AM WEST delayed Phoenix 415 7.897241
AM WEST delayed San Diego 65 14.508929
AM WEST delayed San Francisco 129 28.730512
AM WEST delayed Seattle 61 23.282443
AM WEST on time Los Angeles 694 85.573366
AM WEST on time Phoenix 4840 92.102759
AM WEST on time San Diego 383 85.491071
AM WEST on time San Francisco 320 71.269488
AM WEST on time Seattle 201 76.717557
#barlpot 
ggplot(city_comparison_filtered, aes(x = destination, y = Percentage, fill = status)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~airline) + 
  labs(
    title = "Comparison of Arrival Rates Across Cities",
    x = "City",
    y = "Percentage of Arrival Rates",
    fill = "Arrival Status"
  ) +
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  
    plot.title = element_text(hjust = 0.5)
  )

The comparison of arrival rates for ALASKA and AM WEST airlines across five cities—Los Angeles, Phoenix, San Diego, San Francisco, and Seattle—shows that AM WEST generally has a higher percentage of on-time arrivals in most cities. For example, AM WEST had 92.1% of its flights arriving on time in Phoenix, compared to ALASKA’s 94.8%. However, AM WEST experienced more delays in some cities, such as San Francisco, where 28.7% of their flights were delayed compared to 16.9% for ALASKA. Despite ALASKA’s smaller operations in some cities, it maintained competitive on-time rates, with 88.9% on-time arrivals in Los Angeles, and 85.8% in Seattle. Overall, while both airlines perform well, AM WEST sees a higher proportion of delays in certain cities. The different volumes of flights in various cities account for the discrepancies seen.

Reference Grolemund, Garrett, and Hadley Wickham. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, 2017.