library(dplyr)
library(tidyverse)
##File Recreation
The first thing we needed to do was create our own .csv of the data displayed in the assignment pdf. I did that in excel and uploaded it to GitHub where we can call it into this r markdown.
library(readr)
# Define the URL of the raw CSV file
data_url <- "https://raw.githubusercontent.com/tcgraham-data/data-607-week-5/main/regional_flight_data.csv?raw=true"
# Load the CSV data into a dataframe
flight_data <- read_csv(data_url)
# View the first few rows of the data
head(flight_data)
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
##File Cleaning
Now we need to begin manipulating this data and making something out of it. To start, we need to give columns 1 and 2 some meaningful titles.
# Check the current column names
names(flight_data)
## [1] "...1" "...2" "Los Angeles" "Phoenix"
## [5] "San Diego" "San Francisco" "Seattle"
# Rename the first two columns
names(flight_data)[1:2] <- c("Airline", "Status")
# Verify that the column names have been updated
names(flight_data)
## [1] "Airline" "Status" "Los Angeles" "Phoenix"
## [5] "San Diego" "San Francisco" "Seattle"
Now that we’ve fixed that, we need to resolve empty rows and remove them.
flight_data_clean <- flight_data %>%
filter(!if_all(everything(), is.na))
head(flight_data_clean)
## # A tibble: 4 × 7
## Airline Status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
With that task complete, we now need to add airline names to all the cells in the Airline column.
# Assign "Alaska" to row 2, column 1 and "AM WEST" to row 4, column 1
flight_data_clean[2, 1] <- "ALASKA"
flight_data_clean[4, 1] <- "AM WEST"
# Check changes to ensure correctness
head(flight_data_clean)
## # A tibble: 4 × 7
## Airline Status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
##Data Reshaping
I’m sure there’s a better way to do that step than just smash the names into the cells, but since I don’t have a ton of data to deal with this proved easiest. The next thing I need to do is pivot my data to long. I admit I spent like an hour trying to do this the hard way before I really figured out how to use pivot_longer and pivot_wider. Thanks Internet!
#Reshape flight_data_clean to long format for the new "Cities" columns.
# flight_data_clean has Airline, Status, and then city columns. We want to keep Airline and Status, but reshape the data around a new "Cities" column, collapsing 5 columns into 1 column using pivot_longer on all the data from columns 3 onward.
flight_long <- flight_data_clean %>%
pivot_longer(
cols = 3:ncol(flight_data_clean),
names_to = "Cities",
values_to = "Count"
)
#Now we want to do the opposite using pivot_wider so that "on time" and "delayed" from Status become separate columns, exploding one column out into two columns, bringing with it all the associated data.
flight_wide <- flight_long %>%
pivot_wider(
names_from = Status,
values_from = Count
) %>%
rename(`On Time` = `on time`, Delayed = delayed)
print(flight_wide)
## # A tibble: 10 × 4
## Airline Cities `On Time` Delayed
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los Angeles 497 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA San Diego 212 20
## 4 ALASKA San Francisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM WEST Los Angeles 694 117
## 7 AM WEST Phoenix 4840 415
## 8 AM WEST San Diego 383 65
## 9 AM WEST San Francisco 320 129
## 10 AM WEST Seattle 201 61
Now I think we have cleaned up and longer data to work with. This was really hard, but I’m glad I figured it out. Felt kind of neat when it all clicked for me. It’s also 8:53pm and I’m suddenly worried that while I’m in California 11:59pm in New York means my assignment will be late at 9pm my time and not 11:59 my time. Nothing to do but soldier on.
Also, thanks for the rubric! Here is the first bit of analysis. Let’s review the percentage of on time vs. delays for each airline in aggregate:
overall_summary <- flight_wide %>%
group_by(Airline) %>%
summarise(
Total_On_Time = sum(`On Time`, na.rm = TRUE),
Total_Delayed = sum(Delayed, na.rm = TRUE)
) %>%
mutate(
Total_Flights = Total_On_Time + Total_Delayed,
Delay_Percentage = Total_Delayed / Total_Flights * 100,
On_Time_Percentage = Total_On_Time / Total_Flights * 100
)
overall_summary
## # A tibble: 2 × 6
## Airline Total_On_Time Total_Delayed Total_Flights Delay_Percentage
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 3274 501 3775 13.3
## 2 AM WEST 6438 787 7225 10.9
## # ℹ 1 more variable: On_Time_Percentage <dbl>
ggplot(overall_summary, aes(x = Airline, y = Delay_Percentage, fill = Airline)) +
geom_bar(stat = "identity") +
labs(
title = "Overall Delay Percentage by Airline",
x = "Airline",
y = "Delay Percentage (%)"
) +
theme_minimal()
And we can see by straight percentage that Alaska has a slightly higher
delay percentage. But honestly, a 2% delta between the two airlines is
fairly small. I do wonder, seeing the two this close, if we could break
this out by city and see if there’s a standout city where one airline is
particularly delayed. For example, perhaps it’s better to fly into San
Diego on Alaskan over AM West?
city_summary <- flight_wide %>%
group_by(Airline, Cities) %>%
summarise(
On_Time = sum(`On Time`, na.rm = TRUE),
Delayed = sum(Delayed, na.rm = TRUE)
) %>%
mutate(
Total_Flights = On_Time + Delayed,
Delay_Percentage = Delayed / Total_Flights * 100,
On_Time_Percentage = On_Time / Total_Flights * 100
)
city_summary
## # A tibble: 10 × 7
## # Groups: Airline [2]
## Airline Cities On_Time Delayed Total_Flights Delay_Percentage
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Los Angeles 497 62 559 11.1
## 2 ALASKA Phoenix 221 12 233 5.15
## 3 ALASKA San Diego 212 20 232 8.62
## 4 ALASKA San Francisco 503 102 605 16.9
## 5 ALASKA Seattle 1841 305 2146 14.2
## 6 AM WEST Los Angeles 694 117 811 14.4
## 7 AM WEST Phoenix 4840 415 5255 7.90
## 8 AM WEST San Diego 383 65 448 14.5
## 9 AM WEST San Francisco 320 129 449 28.7
## 10 AM WEST Seattle 201 61 262 23.3
## # ℹ 1 more variable: On_Time_Percentage <dbl>
ggplot(city_summary, aes(x = Cities, y = Delay_Percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Delay Percentage by City and Airline",
x = "City",
y = "Delay Percentage (%)"
) +
theme_minimal()
This is NOT at all what I would have expected to see considering how close the two airlines ere in aggregate. It was 11% or 13% total delay. But once you look at the city chart, it was striking to see that Alaska has a 17% delay at SFO, compared to a whopping 29% on AM West. Morevoer….it looks like AM WEST very delayed everywhere but Phoenix. But the 7% delay rate in Phoenix represents 5,255 flights which is so substantial that it tones down the substantial delays basically everywhere else. TLDR; AM West is great for Phoenix and that’s about it. Alaskan seems to run a generally better operation when it comes to delays.
Ultimately, it was an interesting experiment to look at the aggregate data first which seems nominally different. And then to see the striking change on a city by city basis, especially when considering the total flights into those cities. It was neat to see how one really positive outlier had such a huge effect overall, skewing the data to make me think on initial glance that they were as competitive as Alaskan.