In this assignment, we will be comparing arrival delays of two airlines, Alaska and AM West, across five destinations. The main goal for this assignment is to tidy up and transform the data for easier analysis. Let us load the data file, and try to tidy/transform the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)
#Import .CSV file
arrival_delay <- read.csv('https://raw.githubusercontent.com/mirajpatel289/Data607/refs/heads/main/Week%205%20HW%20Table.csv')
#remove 3rd row as its all N/A
arrival_delay <- arrival_delay[-3,]
#add the airline
arrival_delay[2, "X"] <- "ALASKA"
arrival_delay[4, "X"] <- "AM WEST"
#pivot data
arrival_delay <- arrival_delay %>%
pivot_longer(
cols = Los.Angeles:Seattle,
names_to = "City",
values_to = "Count"
)
print(arrival_delay)
## # A tibble: 20 × 4
## X X.1 City Count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
Now, we will do individual analysis of each of the flights and their on time/delayed counts.
#new data frame for only Alaska and AM West
alaska <- arrival_delay %>%
filter(X == "ALASKA")
am_west <- arrival_delay %>%
filter(X == "AM WEST")
#new data frames for on time and delayed for each flight
alaska_ot <- alaska %>%
filter(X.1 == "on time")
alaska_delay <- alaska %>%
filter(X.1 == "delayed")
am_west_ot <- am_west %>%
filter(X.1 == "on time")
am_west_delay <- am_west %>%
filter(X.1 == "delayed")
#calculate on time and delayed flights counts and percentage
alaska_delay_count <- sum(alaska_delay$Count)
alaska_ot_count <- sum(alaska_ot$Count)
alaska_percent_ot <- round((alaska_ot_count/(alaska_delay_count + alaska_ot_count)) * 100)
am_west_delay_count <- sum(am_west_delay$Count)
am_west_ot_count <- sum(am_west_ot$Count)
am_west_percent_ot <- round((am_west_ot_count/(am_west_delay_count + am_west_ot_count)) * 100)
#put these finding into their respective data frame
alaska_summary <- data.frame(
Airline = "ALASKA",
TotalOnTime = alaska_ot_count,
TotalDelay = alaska_delay_count,
TotalFlights = alaska_ot_count + alaska_delay_count,
PercentOnTime = alaska_percent_ot
)
am_west_summary <- data.frame(
Airline = "AM WEST",
TotalOnTime = am_west_ot_count,
TotalDelay = am_west_delay_count,
TotalFlights = am_west_ot_count + am_west_delay_count,
PercentOnTime = am_west_percent_ot
)
#combine two data frames into one
airline_summary <- rbind(alaska_summary, am_west_summary)
print(airline_summary)
## Airline TotalOnTime TotalDelay TotalFlights PercentOnTime
## 1 ALASKA 3274 501 3775 87
## 2 AM WEST 6438 787 7225 89
Based on the table, AM West airline has almost double the amount of flights as Alaska airline. The delays for AM West is also higher than Alaska, which can be expected as there are more flights for AM West. The main point is the percentage of flights on time, which AM West is higher, 89% vs 87%. Having a larger amount of flights and a better on time percentage, AM West looks like an airline you would not expect to have delays compared to Alaska airline.
Now we know which airline would be on time more often, but how do they vary city to city. In this data, we have 5 cities that these airlines go to. The cities are Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. The first city we will compare from the list is Los Angeles, California.
#Alaska/LA data frame
la_alaska <- data.frame(
Airline = "Alaska",
City = "Los Angeles",
Delayed = alaska_delay[1, 4],
OnTime = alaska_ot[1, 4],
Total = alaska_delay[1, 4] + alaska_ot[1, 4],
PercentOnTime = round((alaska_ot[1, 4] / (alaska_delay[1, 4] + alaska_ot[1, 4])) * 100)
)
colnames(la_alaska) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#AM West/LA data frame
la_am_west <- data.frame(
Airline = "AM WEST",
City = "Los Angeles",
Delayed = am_west_delay[1, 4],
OnTime = am_west_ot[1, 4],
Total = am_west_delay[1, 4] + am_west_ot[1, 4],
PercentOnTime = round((am_west_ot[1, 4] / (am_west_delay[1, 4] + am_west_ot[1, 4])) * 100)
)
colnames(la_am_west) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#combine two data frames into one
la_summary <- rbind(la_alaska, la_am_west)
print(la_summary)
## Airline City Delayed Flights On Time Flights Total Flights
## 1 Alaska Los Angeles 62 497 559
## 2 AM WEST Los Angeles 117 694 811
## Percent On Time Flights
## 1 89
## 2 86
For the city of Los Angeles, Alaska airlines had 559 total flights and 89% of them were on time or 11% were delayed. AM West airlines had more total flights than Alaska airlines with a total of 811 flights. 86% of these AM West flights were on time. It would seem that you would expect more delayed flights from AM West solely based on the count (almost double the amount of delayed flights than Alaska), but this difference is due to amount of AM West flights.
The next city we will compare is Phoenix, Arizona.
#Alaska/Phoenix data frame
phx_alaska <- data.frame(
Airline = "Alaska",
City = "Phoenix",
Delayed = alaska_delay[2, 4],
OnTime = alaska_ot[2, 4],
Total = alaska_delay[2, 4] + alaska_ot[2, 4],
PercentOnTime = round((alaska_ot[2, 4] / (alaska_delay[2, 4] + alaska_ot[2, 4])) * 100)
)
colnames(phx_alaska) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#AM West/Phoenix data frame
phx_am_west <- data.frame(
Airline = "AM WEST",
City = "Phoenix",
Delayed = am_west_delay[2, 4],
OnTime = am_west_ot[2, 4],
Total = am_west_delay[2, 4] + am_west_ot[2, 4],
PercentOnTime = round((am_west_ot[2, 4] / (am_west_delay[2, 4] + am_west_ot[2, 4])) * 100)
)
colnames(phx_am_west) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#combine two data frames into one
phx_summary <- rbind(phx_alaska, phx_am_west)
print(phx_summary)
## Airline City Delayed Flights On Time Flights Total Flights
## 1 Alaska Phoenix 12 221 233
## 2 AM WEST Phoenix 415 4840 5255
## Percent On Time Flights
## 1 95
## 2 92
In Phoenix, Alaska airlines presence is low compared to AM West airlines. Alaska has 233 total flights compared to AM West airlines 5255 total flights. The percent on time flights in Phoenix is higher for both airlines compared to Los Angeles. Again, Alaska has a better percentage of on time flights than AM West. Again, this is probably due to the amount of flights per airlines. Let’s check if this trend continues to the next city, San Diego, California.
#Alaska/San Diego data frame
sd_alaska <- data.frame(
Airline = "Alaska",
City = "San Diego",
Delayed = alaska_delay[3, 4],
OnTime = alaska_ot[3, 4],
Total = alaska_delay[3, 4] + alaska_ot[3, 4],
PercentOnTime = round((alaska_ot[3, 4] / (alaska_delay[3, 4] + alaska_ot[3, 4])) * 100)
)
colnames(sd_alaska) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#AM West/San Diego data frame
sd_am_west <- data.frame(
Airline = "AM WEST",
City = "San Diego",
Delayed = am_west_delay[3, 4],
OnTime = am_west_ot[3, 4],
Total = am_west_delay[3, 4] + am_west_ot[3, 4],
PercentOnTime = round((am_west_ot[3, 4] / (am_west_delay[3, 4] + am_west_ot[3, 4])) * 100)
)
colnames(sd_am_west) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#combine two data frames into one
sd_summary <- rbind(sd_alaska, sd_am_west)
print(sd_summary)
## Airline City Delayed Flights On Time Flights Total Flights
## 1 Alaska San Diego 20 212 232
## 2 AM WEST San Diego 65 383 448
## Percent On Time Flights
## 1 91
## 2 85
From the table above, the amount of flights for AM West (448) is almost twice the amount of flights for Alaska (232). While this is the closest we have seen Alaska and AM West in terms of number of flights, AM West still has a lower percentage of on time flights in San Diego. The difference for total flights is smaller, but the difference for percent on time is larger. This is surprising to see as previously for AM West airlines the counts were vastly larger and the percentage on time was close to the percentage on time for Alaska. But for San Diego its the opposite, the counts are smaller and gap in percentage on time flights is wider.
Moving onto another city in California, San Francisco.
#Alaska/San Francisco data frame
sf_alaska <- data.frame(
Airline = "Alaska",
City = "San Francisco",
Delayed = alaska_delay[4, 4],
OnTime = alaska_ot[4, 4],
Total = alaska_delay[4, 4] + alaska_ot[4, 4],
PercentOnTime = round((alaska_ot[4, 4] / (alaska_delay[4, 4] + alaska_ot[4, 4])) * 100)
)
colnames(sf_alaska) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#AM West/San Francisco data frame
sf_am_west <- data.frame(
Airline = "AM WEST",
City = "San Francisco",
Delayed = am_west_delay[4, 4],
OnTime = am_west_ot[4, 4],
Total = am_west_delay[4, 4] + am_west_ot[4, 4],
PercentOnTime = round((am_west_ot[4, 4] / (am_west_delay[4, 4] + am_west_ot[4, 4])) * 100)
)
colnames(sf_am_west) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#combine two data frames into one
sf_summary <- rbind(sf_alaska, sf_am_west)
print(sf_summary)
## Airline City Delayed Flights On Time Flights Total Flights
## 1 Alaska San Francisco 102 503 605
## 2 AM WEST San Francisco 129 320 449
## Percent On Time Flights
## 1 83
## 2 71
This is the first city where there are more Alaska airline flights, 605, than AM West flights, 449. The number of delayed flights are almost the same but the number total flights are not. The trend of Alaska having a better percentage of on time flights remains true, 83% vs 71%. There is a higher percentage on delayed flights for both Alaska and AM West airlines. This could be due to variables such as weather, airport congestion, etc.
The last city we will do these comparisons for is Seattle, Washington.
#Alaska/Seattle data frame
sea_alaska <- data.frame(
Airline = "Alaska",
City = "Seattle",
Delayed = alaska_delay[5, 4],
OnTime = alaska_ot[5, 4],
Total = alaska_delay[5, 4] + alaska_ot[5, 4],
PercentOnTime = round((alaska_ot[5, 4] / (alaska_delay[5, 4] + alaska_ot[5, 4])) * 100)
)
colnames(sea_alaska) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#AM West/Seattle data frame
sea_am_west <- data.frame(
Airline = "AM WEST",
City = "Seattle",
Delayed = am_west_delay[5, 4],
OnTime = am_west_ot[5, 4],
Total = am_west_delay[5, 4] + am_west_ot[5, 4],
PercentOnTime = round((am_west_ot[5, 4] / (am_west_delay[5, 4] + am_west_ot[5, 4])) * 100)
)
colnames(sea_am_west) <- c("Airline", "City", "Delayed Flights", "On Time Flights", "Total Flights", "Percent On Time Flights" )
#combine two data frames into one
sea_summary <- rbind(sea_alaska, sea_am_west)
print(sea_summary)
## Airline City Delayed Flights On Time Flights Total Flights
## 1 Alaska Seattle 305 1841 2146
## 2 AM WEST Seattle 61 201 262
## Percent On Time Flights
## 1 86
## 2 77
The second city where Alaska airlines has more total flights (almost 10 times more) than AM West airlines. The number of delayed flights for Alaska airlines is 5 times more than the number of delayed flights for AM West. With a larger number of delayed flights and total flights, Alaska airline still has a better on time flight percentage than AM West airline.
To make the city by city analysis clearer, lets combine the tables into one.
city_summary <- rbind(la_summary, phx_summary, sd_summary, sf_summary, sea_summary)
print(city_summary)
## Airline City Delayed Flights On Time Flights Total Flights
## 1 Alaska Los Angeles 62 497 559
## 2 AM WEST Los Angeles 117 694 811
## 3 Alaska Phoenix 12 221 233
## 4 AM WEST Phoenix 415 4840 5255
## 5 Alaska San Diego 20 212 232
## 6 AM WEST San Diego 65 383 448
## 7 Alaska San Francisco 102 503 605
## 8 AM WEST San Francisco 129 320 449
## 9 Alaska Seattle 305 1841 2146
## 10 AM WEST Seattle 61 201 262
## Percent On Time Flights
## 1 89
## 2 86
## 3 95
## 4 92
## 5 91
## 6 85
## 7 83
## 8 71
## 9 86
## 10 77
Looking at the overall analysis earlier, I said that AM West looks like the better choice overall. After performing the city by city analysis, I had to rethink whether AM West is better choice. The analysis of the Alaska airlines and AM West airline across 5 different cities shows an example of Simpson’s Paradox. Simpson’s Paradox is when trends observed within individual groups reverse or disappear when the groups are combined. The best example is AM West overall vs city by city. AM West airline had a worse on time flight percentage than Alaska airline across all 5 cities. When the cities data combined for Alaska and AM West, AM West had a higher on time flight percentage than Alaska. Why? This discrepancy could result from different numbers of flights per city, variables such as weather or airport congestion, or a disproportionate impact from high-traffic cities.