The goal of this assignment is to clean and transform a “wide” dataset representing arrival delays for ALASKA and AM WEST airlines across five different cities into a “long” format suitable for comparative analysis. I will be using functions in the Tidyverse R-package to accomplish this goal.
APPROACH
Create the data table and load it to a GitHub repo so that it could be easily accessible by everyone.
Address the missing airline names using the fill() function, ensuring every “delayed” and “on time” row is correctly attributed to either Airline and also i will remove the empty row separating the status of the two airlines.
Use the pivot_longer() function to transform my data to a long format because to perform our analysis on Airline Delays or On Time it will be better to have Variables as columns and Observation as rows.
Calculate and Plot of delay or On Time percentages for both Airlines overall.
Calculate and Plot of delay or On Time percentages for both Airlines across the cities.
Analyze potential similarity or not between the two comparison above.
Install and Load packages
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Create the CSV Airlines flights Status CSV file and Save it in my GitHub
X Y Los_Angeles Phoenix San_Diego San_Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 <NA> delayed 62 12 20 102 305
3 <NA> <NA> NA NA NA NA NA
4 AM WEST on time 694 4840 383 320 210
5 <NA> delayed 117 415 65 129 61
Airlines Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
4 AM WEST on time 694 4840 383 320 210
5 AM WEST delayed 117 415 65 129 61
Reshape data from wide to long format
We will need to transform the data format from wide to long in order to facilitate our analysis on Airlines status. That transformation will consist of having Variables as columns and Observation as rows.
flights_Data_long <- Airlines_flights_status %>%pivot_longer(cols = Los_Angeles:Seattle,names_to ="City",values_to ="Count" ) %>%# Clean up city names for displaymutate(City =str_replace_all(City, "_", " "))flights_Data_long
# A tibble: 20 × 4
Airlines Status City Count
<chr> <chr> <chr> <int>
1 ALASKA on time Los Angeles 497
2 ALASKA on time Phoenix 221
3 ALASKA on time San Diego 212
4 ALASKA on time San Francisco 503
5 ALASKA on time Seattle 1841
6 ALASKA delayed Los Angeles 62
7 ALASKA delayed Phoenix 12
8 ALASKA delayed San Diego 20
9 ALASKA delayed San Francisco 102
10 ALASKA delayed Seattle 305
11 AM WEST on time Los Angeles 694
12 AM WEST on time Phoenix 4840
13 AM WEST on time San Diego 383
14 AM WEST on time San Francisco 320
15 AM WEST on time Seattle 210
16 AM WEST delayed Los Angeles 117
17 AM WEST delayed Phoenix 415
18 AM WEST delayed San Diego 65
19 AM WEST delayed San Francisco 129
20 AM WEST delayed Seattle 61
Let’s compare the overall flights delays percentages for the two airlines and create a bar graph
# A tibble: 4 × 5
Airlines Status Total Grand_Total Pct
<chr> <chr> <int> <int> <dbl>
1 ALASKA delayed 501 3775 13.3
2 ALASKA on time 3274 3775 86.7
3 AM WEST delayed 787 7234 10.9
4 AM WEST on time 6447 7234 89.1
Step 2: Overall delay flights status for the airlines.
I decided to focus on flights delayed because they are more significant in the choice of an airlines for a trip.
# A tibble: 2 × 4
Airlines Delayed Grand_Total Delay_Pct
<chr> <int> <int> <dbl>
1 ALASKA 501 3775 13.3
2 AM WEST 787 7234 10.9
Step 3: Let’s create the bar graph using ggplot2 package
To create the plot, i used the help of Gemini 3.0 with the prompt: “Plot the result of this code( I entered the overall_flights_delay code ) using ggplot2 package in R”.
library(ggplot2)# Create the plotggplot(overall_flights_delay, aes(x = Airlines, y = Delay_Pct, fill = Airlines)) +# Using geom_col for pre-summarized data (or geom_bar(stat = "identity"))geom_col(width =0.6) +# Adding numerical labels above each bar for claritygeom_text(aes(label =paste0(round(Delay_Pct, 2), "%")), vjust =-0.5, fontface ="bold",size =5) +# Enhancing visual appeal with titles and clean axeslabs(title ="Overall Percentage of Delayed Flights",subtitle ="Aggregated data across all 5 destination cities",x ="Airline",y ="Percentage of Flights Delayed (%)" ) +# Choosing color-blind friendly or distinctive colorsscale_fill_manual(values =c("ALASKA"="green", "AM WEST"="orange")) +# Removing the legend as the x-axis labels are sufficienttheme_minimal() +theme(legend.position ="none") +# Scaling the y-axis slightly higher than the max value to accommodate labelsscale_y_continuous(limits =c(0, max(overall_flights_delay$Delay_Pct) +3))
Interpretation:
Despite having almost twice the volume of flights, AM WEST has a lower flights delay percentage of 10.88% compared to 13.27 %. Hence, based on the overall comparison delays percentages for the two airlines, passenger will be tempted to conclude that AM WEST is more punctual than ALASKA.
Let’s compare the flights delays percentages for the two airlines across the five cities
# A tibble: 10 × 5
Airlines City Delayed City_Total Delay_Pct
<chr> <chr> <int> <int> <dbl>
1 ALASKA Los Angeles 62 559 11.1
2 AM WEST Los Angeles 117 811 14.4
3 ALASKA Phoenix 12 233 5.15
4 AM WEST Phoenix 415 5255 7.9
5 ALASKA San Diego 20 232 8.62
6 AM WEST San Diego 65 448 14.5
7 ALASKA San Francisco 102 605 16.9
8 AM WEST San Francisco 129 449 28.7
9 ALASKA Seattle 305 2146 14.2
10 AM WEST Seattle 61 271 22.5
Create a bar graph that represents he flights delays percentages for the two airlines across the five cities
Here, I use the same prompt as in the previous question but instead across the five cities on Gemini 3.
ggplot(city_flights_summary, aes(x = City, y = Delay_Pct, fill = Airlines)) +# Use position = "dodge" to place bars side-by-sidegeom_bar(stat ="identity", position =position_dodge(width =0.8), width =0.7) +# Add percentage labels above barsgeom_text(aes(label =paste0(round(Delay_Pct, 1), "%")), position =position_dodge(width =0.8), vjust =-0.5, size =3.5, fontface ="bold") +# Customizing labels and themelabs(title ="Flight Delay Percentage by City and Airline",subtitle ="Alaska Airlines vs. AM West across 5 Western Cities",x ="Destination City",y ="Percentage of Flights Delayed (%)",fill ="Airline" ) +scale_fill_manual(values =c("ALASKA"="green", "AM WEST"="orange")) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1),panel.grid.major.x =element_blank() ) +# Adjust Y axis to make room for labelsscale_y_continuous(limits =c(0, max(city_flights_summary$Delay_Pct) +5))
Interpretation
The result of this analysis is surprising because ALASKA has the lowest rate of delay in every city despite the fact AM WEST was considered more punctual than ALASKA when performing the overall comparison.
ANALYSIS
The paradox when comparing Alaska and AM West airlines delayed flights performance city by city vs the Overall performance can be explained by the vastly different distribution of flight volumes across the cities for the two airlines.
For instance, ALASKA has less delayed flights than AM WEST in Phoenix but the total number of ALASKA flights to Phoenix is less than 5% compared to the total number of AM WEST flights to Phoenix (233 flights for ALASKA vs 5255 flights for AM WEST ).
CONCLUSION
This result brings us to believe that there are more difficult cities to fly to where we believe ALASKA flights were concentrated. Thankfully, we conducted an analysis across cities which helped us to have a “non” bias opinion between both airlines.