##Assignment – Tidying and Transforming Data

For this assignment we are starting with a CSV file with a wide structure. The task is to tidy it and transform it so it’s easier to analyze.

The data used for this assignment is describes arrival delays for two airlines across five destinations.

First we’ll load needed libraries.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(knitr)

Second we’ll load data, in this case from github.

airlines <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/airlines.csv")

# View the first few lines to inspect the data
head(airlines)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM West On Time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61

Next we’ll rename some columns

colnames(airlines) <- c("Airline", "Status", "Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle")

# View the first few lines to inspect the data
head(airlines)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM West On Time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61

Remove the 4th empty row if it exists

airlines <- airlines %>% filter(!(Airline == "" & Status == ""))

# View the first few lines to inspect the data
head(airlines)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3 AM West On Time         694    4840       383           320     201
## 4         Delayed         117     415        65           129      61

Fill the missing airline names downwards, but first we have to replace missing values or currently empty strings, with NA so that fill() can work

# Replace empty strings with NA
airlines$Airline[airlines$Airline == ""] <- NA

airlines <- airlines %>%
  fill(Airline, .direction = "down")

# View the first few lines to inspect the data
head(airlines)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska On Time         497     221       212           503    1841
## 2  Alaska Delayed          62      12        20           102     305
## 3 AM West On Time         694    4840       383           320     201
## 4 AM West Delayed         117     415        65           129      61

Now we’ll reshape the data to long format to include city and count columns

tidy_airlines <- airlines %>%
  pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
               names_to = "City",
               values_to = "Count") %>%
 group_by(Airline, Status, City) %>%
  summarise(Total_Flights = sum(Count, na.rm = TRUE), .groups = 'drop')

# View the first few lines to inspect the data
print(tidy_airlines)
## # A tibble: 20 × 4
##    Airline Status  City          Total_Flights
##    <chr>   <chr>   <chr>                 <int>
##  1 AM West Delayed Los.Angeles             117
##  2 AM West Delayed Phoenix                 415
##  3 AM West Delayed San.Diego                65
##  4 AM West Delayed San.Francisco           129
##  5 AM West Delayed Seattle                  61
##  6 AM West On Time Los.Angeles             694
##  7 AM West On Time Phoenix                4840
##  8 AM West On Time San.Diego               383
##  9 AM West On Time San.Francisco           320
## 10 AM West On Time Seattle                 201
## 11 Alaska  Delayed Los.Angeles              62
## 12 Alaska  Delayed Phoenix                  12
## 13 Alaska  Delayed San.Diego                20
## 14 Alaska  Delayed San.Francisco           102
## 15 Alaska  Delayed Seattle                 305
## 16 Alaska  On Time Los.Angeles             497
## 17 Alaska  On Time Phoenix                 221
## 18 Alaska  On Time San.Diego               212
## 19 Alaska  On Time San.Francisco           503
## 20 Alaska  On Time Seattle                1841

Now that the table was transformed from a wide format to a long one, let’s see what the data can tell us about airlines’ delays.

# First calculate the percentage of on-time and delayed flights for each airline
flight_summary <- tidy_airlines %>% #create a new table with the summary
  group_by(Airline, Status) %>% #the new table will have totals for airlines and status
  summarise(Count = sum(c_across(where(is.numeric)), na.rm = TRUE)) %>%  #sums up the numeric columns for flight counts
  group_by(Airline) %>%
  mutate(Total = sum(Count),  #calculate total flights per airline
         Percentage = (Count / Total) * 100)  # Calculate percentage
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
flight_summary$Status <- factor(flight_summary$Status, levels = c("On Time", "Delayed"))

# With the summary table created we can now create a bar plot with ggplot
ggplot(flight_summary, aes(x = Airline, y = Percentage, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")),  # Add percentage labels to the bars
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3.5) +  # Adjust position of labels
  labs(title = "Percentage of On-Time vs Delayed Flights by Airline",
       x = "Airline", y = "Percentage (%)") +
  scale_fill_manual(values = c("On Time" = "springgreen3", "Delayed" = "tomato3")) +  # Customize colors
  theme_minimal()

As the graph shows us, both airlines have very similar percentages of on-time versus delayed flight totals, with AM West having slightly more on time flights overall.

But what if I we break down the data by cities?

#Calculate the percentage of on-time and delayed flights for each city
city_summary <- tidy_airlines %>%
  group_by(City, Status, Airline) %>%  # Include Airline in the grouping
  summarise(Total_Flights = sum(Total_Flights, na.rm = TRUE), .groups = 'drop') %>%
  group_by(City, Airline) %>%  
  mutate(Total_City_Flights = sum(Total_Flights),
         Percentage = (Total_Flights / Total_City_Flights) * 100)

#Create the bar plot with ggplot
ggplot(city_summary, aes(x = City, y = Percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
  geom_text(data = filter(city_summary, Status == "On Time"), 
            aes(label = paste0(round(Percentage, 1), "%")),
            position = position_dodge(width = 0.8), vjust = -0.4, size = 3.0) +  # Add percentage labels
  labs(title = "On-Time Flights by City and Airline",
       x = "City", y = "Percentage (%)") +
  scale_fill_manual(values = c("Alaska" = "royalblue4", "AM West" = "sienna3")) +  # Set airline colors
  theme_minimal()

Surprisingly, for each city, Alaska Airlines tends to have more on-time flights than America West Airlines. Why is this?

print (airlines)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska On Time         497     221       212           503    1841
## 2  Alaska Delayed          62      12        20           102     305
## 3 AM West On Time         694    4840       383           320     201
## 4 AM West Delayed         117     415        65           129      61

If we see the total number of flights to Phoenix for AM West, we can see that is by fart the biggest count in the entire table, adding a lot more weight to their average “on-time” status.

AM West on time flights average: 89.1% AM West on time flights Phoenix: 92.1%

Phoenix was the best performing route both for AM West and Alaska, but since AM West had significantly more flights there than Alaska, it skewed the data heavily in their favor.

At the same time, both San Francisco and Seattle were the worst performing routes for both airlines, but this time it was Alaska that had the most number of flights to those cities.

##Conclusion If we only look at total averages AM West seems to be more reliable than Alaska Airlines. But looking more carefully at each city’s data we can see that Alaska actually is the one performing better in each scenario. Since we don’t know the city(es) of origin or the time of year it’s difficult to make a definitive statement, but based on this data only Alaska seemed to be more reliable in terms having On Time flights