Assignment 5

library(dplyr)
library(tidyverse)

##File Recreation

The first thing we needed to do was create our own .csv of the data displayed in the assignment pdf. I did that in excel and uploaded it to GitHub where we can call it into this r markdown.

library(readr)

# Define the URL of the raw CSV file
data_url <- "https://raw.githubusercontent.com/tcgraham-data/data-607-week-5/main/regional_flight_data.csv?raw=true"

# Load the CSV data into a dataframe
flight_data <- read_csv(data_url)

# View the first few rows of the data
head(flight_data)
## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

##File Cleaning

Now we need to begin manipulating this data and making something out of it. To start, we need to give columns 1 and 2 some meaningful titles.

# Check the current column names
names(flight_data)
## [1] "...1"          "...2"          "Los Angeles"   "Phoenix"      
## [5] "San Diego"     "San Francisco" "Seattle"
# Rename the first two columns
names(flight_data)[1:2] <- c("Airline", "Status")

# Verify that the column names have been updated
names(flight_data)
## [1] "Airline"       "Status"        "Los Angeles"   "Phoenix"      
## [5] "San Diego"     "San Francisco" "Seattle"

Now that we’ve fixed that, we need to resolve empty rows and remove them.

flight_data_clean <- flight_data %>% 
  filter(!if_all(everything(), is.na))

head(flight_data_clean)
## # A tibble: 4 × 7
##   Airline Status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61

With that task complete, we now need to add airline names to all the cells in the Airline column.

# Assign "Alaska" to row 2, column 1 and "AM WEST" to row 4, column 1
flight_data_clean[2, 1] <- "ALASKA"
flight_data_clean[4, 1] <- "AM WEST"

# Check changes to ensure correctness
head(flight_data_clean)
## # A tibble: 4 × 7
##   Airline Status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 ALASKA  delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 AM WEST delayed           117     415          65             129      61

##Data Reshaping

I’m sure there’s a better way to do that step than just smash the names into the cells, but since I don’t have a ton of data to deal with this proved easiest. The next thing I need to do is pivot my data to long. I admit I spent like an hour trying to do this the hard way before I really figured out how to use pivot_longer and pivot_wider. Thanks Internet!

#Reshape flight_data_clean to long format for the new "Cities" columns.
# flight_data_clean has Airline, Status, and then city columns. We want to keep Airline and Status, but reshape the data around a new "Cities" column, collapsing 5 columns into 1 column using pivot_longer on all the data from columns 3 onward.
flight_long <- flight_data_clean %>%
  pivot_longer(
    cols = 3:ncol(flight_data_clean),
    names_to = "Cities",
    values_to = "Count"
  )
#Now we want to do the opposite using pivot_wider so that "on time" and "delayed" from Status become separate columns, exploding one column out into two columns, bringing with it all the associated data.
flight_wide <- flight_long %>%
  pivot_wider(
    names_from = Status,
    values_from = Count
  ) %>%
  rename(`On Time` = `on time`, Delayed = delayed)

print(flight_wide)
## # A tibble: 10 × 4
##    Airline Cities        `On Time` Delayed
##    <chr>   <chr>             <dbl>   <dbl>
##  1 ALASKA  Los Angeles         497      62
##  2 ALASKA  Phoenix             221      12
##  3 ALASKA  San Diego           212      20
##  4 ALASKA  San Francisco       503     102
##  5 ALASKA  Seattle            1841     305
##  6 AM WEST Los Angeles         694     117
##  7 AM WEST Phoenix            4840     415
##  8 AM WEST San Diego           383      65
##  9 AM WEST San Francisco       320     129
## 10 AM WEST Seattle             201      61

Now I think we have cleaned up and longer data to work with. This was really hard, but I’m glad I figured it out. Felt kind of neat when it all clicked for me. It’s also 8:53pm and I’m suddenly worried that while I’m in California 11:59pm in New York means my assignment will be late at 9pm my time and not 11:59 my time. Nothing to do but soldier on.

Percentages Analysis

Also, thanks for the rubric! Here is the first bit of analysis. Let’s review the percentage of on time vs. delays for each airline in aggregate:

overall_summary <- flight_wide %>%
  group_by(Airline) %>%
  summarise(
    Total_On_Time = sum(`On Time`, na.rm = TRUE),
    Total_Delayed = sum(Delayed, na.rm = TRUE)
  ) %>%
  mutate(
    Total_Flights = Total_On_Time + Total_Delayed,
    Delay_Percentage = Total_Delayed / Total_Flights * 100,
    On_Time_Percentage = Total_On_Time / Total_Flights * 100
  )

overall_summary
## # A tibble: 2 × 6
##   Airline Total_On_Time Total_Delayed Total_Flights Delay_Percentage
##   <chr>           <dbl>         <dbl>         <dbl>            <dbl>
## 1 ALASKA           3274           501          3775             13.3
## 2 AM WEST          6438           787          7225             10.9
## # ℹ 1 more variable: On_Time_Percentage <dbl>
ggplot(overall_summary, aes(x = Airline, y = Delay_Percentage, fill = Airline)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Overall Delay Percentage by Airline",
    x = "Airline",
    y = "Delay Percentage (%)"
  ) +
  theme_minimal()

And we can see by straight percentage that Alaska has a slightly higher delay percentage. But honestly, a 2% delta between the two airlines is fairly small. I do wonder, seeing the two this close, if we could break this out by city and see if there’s a standout city where one airline is particularly delayed. For example, perhaps it’s better to fly into San Diego on Alaskan over AM West?

city_summary <- flight_wide %>%
  group_by(Airline, Cities) %>%
  summarise(
    On_Time = sum(`On Time`, na.rm = TRUE),
    Delayed = sum(Delayed, na.rm = TRUE)
  ) %>%
  mutate(
    Total_Flights = On_Time + Delayed,
    Delay_Percentage = Delayed / Total_Flights * 100,
    On_Time_Percentage = On_Time / Total_Flights * 100
  )

city_summary
## # A tibble: 10 × 7
## # Groups:   Airline [2]
##    Airline Cities        On_Time Delayed Total_Flights Delay_Percentage
##    <chr>   <chr>           <dbl>   <dbl>         <dbl>            <dbl>
##  1 ALASKA  Los Angeles       497      62           559            11.1 
##  2 ALASKA  Phoenix           221      12           233             5.15
##  3 ALASKA  San Diego         212      20           232             8.62
##  4 ALASKA  San Francisco     503     102           605            16.9 
##  5 ALASKA  Seattle          1841     305          2146            14.2 
##  6 AM WEST Los Angeles       694     117           811            14.4 
##  7 AM WEST Phoenix          4840     415          5255             7.90
##  8 AM WEST San Diego         383      65           448            14.5 
##  9 AM WEST San Francisco     320     129           449            28.7 
## 10 AM WEST Seattle           201      61           262            23.3 
## # ℹ 1 more variable: On_Time_Percentage <dbl>
ggplot(city_summary, aes(x = Cities, y = Delay_Percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Delay Percentage by City and Airline",
    x = "City",
    y = "Delay Percentage (%)"
  ) +
  theme_minimal()

Conclusion

This is NOT at all what I would have expected to see considering how close the two airlines ere in aggregate. It was 11% or 13% total delay. But once you look at the city chart, it was striking to see that Alaska has a 17% delay at SFO, compared to a whopping 29% on AM West. Morevoer….it looks like AM WEST very delayed everywhere but Phoenix. But the 7% delay rate in Phoenix represents 5,255 flights which is so substantial that it tones down the substantial delays basically everywhere else. TLDR; AM West is great for Phoenix and that’s about it. Alaskan seems to run a generally better operation when it comes to delays.

Ultimately, it was an interesting experiment to look at the aggregate data first which seems nominally different. And then to see the striking change on a city by city basis, especially when considering the total flights into those cities. It was neat to see how one really positive outlier had such a huge effect overall, skewing the data to make me think on initial glance that they were as competitive as Alaskan.