Introduction

We are now in week 5, and for Assignment 5A, we will be analyzing airline arrival data. With the provided dataset about Alaska Airlines and AM WEST Airlines, the goal will be to practice principles of tidy data. I’ll transform the unstructured wide dataset into a long format to compare delays across the five destinations.

Planned Workflow

For my workflow, I plan to recreate the dataset in its original wide format as a csv file. This is to maintain its intended messy structure with missing information to practice cleaning the data. I’ll then feed the csv into R and utilize dplyr and tidyr to help reshape the data, handle the missing information, and properly perform the analysis as required.

Anticipated data challenges

Some challenges that I anticipate encountering will be reshaping the dataset from wide to long form and handling the missing values. I will also need to properly convey the delay information through percentages since the difference in the amount of flights both airlines completed can possibly skew the data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
raw_data <- read_csv("airline_delays.csv", show_col_types = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
raw_data
## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61
tidy_data <- raw_data %>%
  rename(Airline = 1, Status = 2) %>%
  fill(Airline) %>%
  filter(!is.na(Status)) %>%
  pivot_longer(
    cols = c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
    names_to = "City",
    values_to = "Flight_Count"
  )
tidy_data
## # A tibble: 20 × 4
##    Airline Status  City          Flight_Count
##    <chr>   <chr>   <chr>                <dbl>
##  1 ALASKA  on time Los Angeles            497
##  2 ALASKA  on time Phoenix                221
##  3 ALASKA  on time San Diego              212
##  4 ALASKA  on time San Francisco          503
##  5 ALASKA  on time Seattle               1841
##  6 ALASKA  delayed Los Angeles             62
##  7 ALASKA  delayed Phoenix                 12
##  8 ALASKA  delayed San Diego               20
##  9 ALASKA  delayed San Francisco          102
## 10 ALASKA  delayed Seattle                305
## 11 AM WEST on time Los Angeles            694
## 12 AM WEST on time Phoenix               4840
## 13 AM WEST on time San Diego              383
## 14 AM WEST on time San Francisco          320
## 15 AM WEST on time Seattle                201
## 16 AM WEST delayed Los Angeles            117
## 17 AM WEST delayed Phoenix                415
## 18 AM WEST delayed San Diego               65
## 19 AM WEST delayed San Francisco          129
## 20 AM WEST delayed Seattle                 61

Cleanup

To tidy the dataset, I used rename to give headers to unnamed columns. I then used fill to carry the Airline names down into the delayed rows that were empty. After I filtered out the blank separator row, I used pivot_longer to transform the city columns to city. This helped convert the data from wide to a long and orderly format.

Analysis

overall_analysis <- tidy_data %>%
  group_by(Airline, Status) %>%
  summarise(Total_Flights = sum(Flight_Count), .groups = 'drop') %>%
  group_by(Airline) %>%
  mutate(Percentage = round(Total_Flights / sum(Total_Flights) * 100, 2))
overall_analysis
## # A tibble: 4 × 4
## # Groups:   Airline [2]
##   Airline Status  Total_Flights Percentage
##   <chr>   <chr>           <dbl>      <dbl>
## 1 ALASKA  delayed           501       13.3
## 2 ALASKA  on time          3274       86.7
## 3 AM WEST delayed           787       10.9
## 4 AM WEST on time          6438       89.1
city_analysis <- tidy_data %>%
  group_by(Airline, City) %>%
  mutate(Total_In_City = sum(Flight_Count)) %>%
  filter(Status == "delayed") %>%
  mutate(Delay_Percentage = round(Flight_Count / Total_In_City * 100, 2)) %>%
  select(Airline, City, Delay_Percentage)
city_analysis
## # A tibble: 10 × 3
## # Groups:   Airline, City [10]
##    Airline City          Delay_Percentage
##    <chr>   <chr>                    <dbl>
##  1 ALASKA  Los Angeles              11.1 
##  2 ALASKA  Phoenix                   5.15
##  3 ALASKA  San Diego                 8.62
##  4 ALASKA  San Francisco            16.9 
##  5 ALASKA  Seattle                  14.2 
##  6 AM WEST Los Angeles              14.4 
##  7 AM WEST Phoenix                   7.9 
##  8 AM WEST San Diego                14.5 
##  9 AM WEST San Francisco            28.7 
## 10 AM WEST Seattle                  23.3
ggplot(city_analysis, aes(x = City, y = Delay_Percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Arrival Delay Percentage by City",
       y = "Percentage Delayed",
       x = "City") +
  theme_minimal()

Conclusion

From what I’ve noticed assessing the data, AM WEST has a lower overall delay rate. However, Alaska performs better in each city with a lower delay percentage. It appears that the difference maker that gives the illusion AM WEST has the lower delay as a whole is in phoenix. Phoenix it appears to be a low delay city in general when it comes to flights operating there.