We are now in week 5, and for Assignment 5A, we will be analyzing airline arrival data. With the provided dataset about Alaska Airlines and AM WEST Airlines, the goal will be to practice principles of tidy data. I’ll transform the unstructured wide dataset into a long format to compare delays across the five destinations.
For my workflow, I plan to recreate the dataset in its original wide format as a csv file. This is to maintain its intended messy structure with missing information to practice cleaning the data. I’ll then feed the csv into R and utilize dplyr and tidyr to help reshape the data, handle the missing information, and properly perform the analysis as required.
Some challenges that I anticipate encountering will be reshaping the dataset from wide to long form and handling the missing values. I will also need to properly convey the delay information through percentages since the difference in the amount of flights both airlines completed can possibly skew the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
raw_data <- read_csv("airline_delays.csv", show_col_types = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
raw_data
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
tidy_data <- raw_data %>%
rename(Airline = 1, Status = 2) %>%
fill(Airline) %>%
filter(!is.na(Status)) %>%
pivot_longer(
cols = c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
names_to = "City",
values_to = "Flight_Count"
)
tidy_data
## # A tibble: 20 × 4
## Airline Status City Flight_Count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
To tidy the dataset, I used rename to give headers to unnamed columns. I then used fill to carry the Airline names down into the delayed rows that were empty. After I filtered out the blank separator row, I used pivot_longer to transform the city columns to city. This helped convert the data from wide to a long and orderly format.
overall_analysis <- tidy_data %>%
group_by(Airline, Status) %>%
summarise(Total_Flights = sum(Flight_Count), .groups = 'drop') %>%
group_by(Airline) %>%
mutate(Percentage = round(Total_Flights / sum(Total_Flights) * 100, 2))
overall_analysis
## # A tibble: 4 × 4
## # Groups: Airline [2]
## Airline Status Total_Flights Percentage
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA delayed 501 13.3
## 2 ALASKA on time 3274 86.7
## 3 AM WEST delayed 787 10.9
## 4 AM WEST on time 6438 89.1
city_analysis <- tidy_data %>%
group_by(Airline, City) %>%
mutate(Total_In_City = sum(Flight_Count)) %>%
filter(Status == "delayed") %>%
mutate(Delay_Percentage = round(Flight_Count / Total_In_City * 100, 2)) %>%
select(Airline, City, Delay_Percentage)
city_analysis
## # A tibble: 10 × 3
## # Groups: Airline, City [10]
## Airline City Delay_Percentage
## <chr> <chr> <dbl>
## 1 ALASKA Los Angeles 11.1
## 2 ALASKA Phoenix 5.15
## 3 ALASKA San Diego 8.62
## 4 ALASKA San Francisco 16.9
## 5 ALASKA Seattle 14.2
## 6 AM WEST Los Angeles 14.4
## 7 AM WEST Phoenix 7.9
## 8 AM WEST San Diego 14.5
## 9 AM WEST San Francisco 28.7
## 10 AM WEST Seattle 23.3
ggplot(city_analysis, aes(x = City, y = Delay_Percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Arrival Delay Percentage by City",
y = "Percentage Delayed",
x = "City") +
theme_minimal()
From what I’ve noticed assessing the data, AM WEST has a lower overall delay rate. However, Alaska performs better in each city with a lower delay percentage. It appears that the difference maker that gives the illusion AM WEST has the lower delay as a whole is in phoenix. Phoenix it appears to be a low delay city in general when it comes to flights operating there.