##Assignment – Tidying and Transforming Data
For this assignment we are starting with a CSV file with a wide structure. The task is to tidy it and transform it so it’s easier to analyze.
The data used for this assignment is describes arrival delays for two airlines across five destinations.
First we’ll load needed libraries.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(knitr)
Second we’ll load data, in this case from github.
airlines <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/airlines.csv")
# View the first few lines to inspect the data
head(airlines)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM West On Time 694 4840 383 320 201
## 5 Delayed 117 415 65 129 61
Next we’ll rename some columns
colnames(airlines) <- c("Airline", "Status", "Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle")
# View the first few lines to inspect the data
head(airlines)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM West On Time 694 4840 383 320 201
## 5 Delayed 117 415 65 129 61
Remove the 4th empty row if it exists
airlines <- airlines %>% filter(!(Airline == "" & Status == ""))
# View the first few lines to inspect the data
head(airlines)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 Delayed 117 415 65 129 61
Fill the missing airline names downwards, but first we have to replace missing values or currently empty strings, with NA so that fill() can work
# Replace empty strings with NA
airlines$Airline[airlines$Airline == ""] <- NA
airlines <- airlines %>%
fill(Airline, .direction = "down")
# View the first few lines to inspect the data
head(airlines)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Alaska Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 AM West Delayed 117 415 65 129 61
Now we’ll reshape the data to long format to include city and count columns
tidy_airlines <- airlines %>%
pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
names_to = "City",
values_to = "Count") %>%
group_by(Airline, Status, City) %>%
summarise(Total_Flights = sum(Count, na.rm = TRUE), .groups = 'drop')
# View the first few lines to inspect the data
print(tidy_airlines)
## # A tibble: 20 × 4
## Airline Status City Total_Flights
## <chr> <chr> <chr> <int>
## 1 AM West Delayed Los.Angeles 117
## 2 AM West Delayed Phoenix 415
## 3 AM West Delayed San.Diego 65
## 4 AM West Delayed San.Francisco 129
## 5 AM West Delayed Seattle 61
## 6 AM West On Time Los.Angeles 694
## 7 AM West On Time Phoenix 4840
## 8 AM West On Time San.Diego 383
## 9 AM West On Time San.Francisco 320
## 10 AM West On Time Seattle 201
## 11 Alaska Delayed Los.Angeles 62
## 12 Alaska Delayed Phoenix 12
## 13 Alaska Delayed San.Diego 20
## 14 Alaska Delayed San.Francisco 102
## 15 Alaska Delayed Seattle 305
## 16 Alaska On Time Los.Angeles 497
## 17 Alaska On Time Phoenix 221
## 18 Alaska On Time San.Diego 212
## 19 Alaska On Time San.Francisco 503
## 20 Alaska On Time Seattle 1841
Now that the table was transformed from a wide format to a long one, let’s see what the data can tell us about airlines’ delays.
# First calculate the percentage of on-time and delayed flights for each airline
flight_summary <- tidy_airlines %>% #create a new table with the summary
group_by(Airline, Status) %>% #the new table will have totals for airlines and status
summarise(Count = sum(c_across(where(is.numeric)), na.rm = TRUE)) %>% #sums up the numeric columns for flight counts
group_by(Airline) %>%
mutate(Total = sum(Count), #calculate total flights per airline
Percentage = (Count / Total) * 100) # Calculate percentage
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
flight_summary$Status <- factor(flight_summary$Status, levels = c("On Time", "Delayed"))
# With the summary table created we can now create a bar plot with ggplot
ggplot(flight_summary, aes(x = Airline, y = Percentage, fill = Status)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")), # Add percentage labels to the bars
position = position_dodge(width = 0.9), vjust = -0.5, size = 3.5) + # Adjust position of labels
labs(title = "Percentage of On-Time vs Delayed Flights by Airline",
x = "Airline", y = "Percentage (%)") +
scale_fill_manual(values = c("On Time" = "springgreen3", "Delayed" = "tomato3")) + # Customize colors
theme_minimal()
As the graph shows us, both airlines have very similar percentages of on-time versus delayed flight totals, with AM West having slightly more on time flights overall.
But what if I we break down the data by cities?
#Calculate the percentage of on-time and delayed flights for each city
city_summary <- tidy_airlines %>%
group_by(City, Status, Airline) %>% # Include Airline in the grouping
summarise(Total_Flights = sum(Total_Flights, na.rm = TRUE), .groups = 'drop') %>%
group_by(City, Airline) %>%
mutate(Total_City_Flights = sum(Total_Flights),
Percentage = (Total_Flights / Total_City_Flights) * 100)
#Create the bar plot with ggplot
ggplot(city_summary, aes(x = City, y = Percentage, fill = Airline)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
geom_text(data = filter(city_summary, Status == "On Time"),
aes(label = paste0(round(Percentage, 1), "%")),
position = position_dodge(width = 0.8), vjust = -0.4, size = 3.0) + # Add percentage labels
labs(title = "On-Time Flights by City and Airline",
x = "City", y = "Percentage (%)") +
scale_fill_manual(values = c("Alaska" = "royalblue4", "AM West" = "sienna3")) + # Set airline colors
theme_minimal()
Surprisingly, for each city, Alaska Airlines tends to have more on-time flights than America West Airlines. Why is this?
print (airlines)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Alaska Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 AM West Delayed 117 415 65 129 61
If we see the total number of flights to Phoenix for AM West, we can see that is by fart the biggest count in the entire table, adding a lot more weight to their average “on-time” status.
AM West on time flights average: 89.1% AM West on time flights Phoenix: 92.1%
Phoenix was the best performing route both for AM West and Alaska, but since AM West had significantly more flights there than Alaska, it skewed the data heavily in their favor.
At the same time, both San Francisco and Seattle were the worst performing routes for both airlines, but this time it was Alaska that had the most number of flights to those cities.
##Conclusion If we only look at total averages AM West seems to be more reliable than Alaska Airlines. But looking more carefully at each city’s data we can see that Alaska actually is the one performing better in each scenario. Since we don’t know the city(es) of origin or the time of year it’s difficult to make a definitive statement, but based on this data only Alaska seemed to be more reliable in terms having On Time flights