For this assignment, I had to transform a wide table into a long table and analyze the data. Converting the wide dataset of airline information to a long dataset was simple with the pivot_longer function in tidyr. Initially, I plotted the data to see if anything stood out. Afterwards, I mutated the dataframe to look at the percentage of delayed flights from each airline. Then I plotted the percentage of delayed flights by city to see if there were any trends.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#Created a csv in excel and uploaded it to github. Saved it as a dataframe(df)
df <- read.csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/week%205%20flights.csv")
df
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Added some values and column names to df
df[2,1] <- "ALASKA"
df[5,1] <- "AM WEST"
colnames(df) <- c("Airline", "Status", "Los Angeles", "Pheonix", "San Diego", "San Francisco", "Seattle")
df
## Airline Status Los Angeles Pheonix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
#Used the pivot_longer function in tidyr to convert wide data into long data
df_long <- df %>%
pivot_longer(
cols = c("Los Angeles", "Pheonix", "San Diego", "San Francisco", "Seattle"),
names_to = "City",
values_to = "Value",
values_drop_na = TRUE) %>%
select("Airline", "City", "Status", "Value") %>%
arrange(City, Status)
df_long
## # A tibble: 20 × 4
## Airline City Status Value
## <chr> <chr> <chr> <int>
## 1 ALASKA Los Angeles delayed 62
## 2 AM WEST Los Angeles delayed 117
## 3 ALASKA Los Angeles on time 497
## 4 AM WEST Los Angeles on time 694
## 5 ALASKA Pheonix delayed 12
## 6 AM WEST Pheonix delayed 415
## 7 ALASKA Pheonix on time 221
## 8 AM WEST Pheonix on time 4840
## 9 ALASKA San Diego delayed 20
## 10 AM WEST San Diego delayed 65
## 11 ALASKA San Diego on time 212
## 12 AM WEST San Diego on time 383
## 13 ALASKA San Francisco delayed 102
## 14 AM WEST San Francisco delayed 129
## 15 ALASKA San Francisco on time 503
## 16 AM WEST San Francisco on time 320
## 17 ALASKA Seattle delayed 305
## 18 AM WEST Seattle delayed 61
## 19 ALASKA Seattle on time 1841
## 20 AM WEST Seattle on time 201
#Wanted to get a general sense of the data by plotting it. Looks like Phenoix has the largest amount of on time flights.
df_long %>%
ggplot(aes(x= City, y = Value, fill = Status))+
facet_wrap(~Airline)+
geom_bar(stat = "identity", position = "dodge")+
labs(title = "Airline Delays", x = "City", y = "Count")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
###Analysis to compare the arrival delays for the two airlines
#Calculating the percentage of delayed flights on ALASKA airline
df_delayed_alaska <- df_long %>%
filter(Airline == "ALASKA") %>%
mutate(total_flights = sum(Value)) %>%
filter(Status == "delayed") %>%
mutate(pct_flight_delay = sum(Value)/total_flights*100)
df_delayed_alaska
## # A tibble: 5 × 6
## Airline City Status Value total_flights pct_flight_delay
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 ALASKA Los Angeles delayed 62 3775 13.3
## 2 ALASKA Pheonix delayed 12 3775 13.3
## 3 ALASKA San Diego delayed 20 3775 13.3
## 4 ALASKA San Francisco delayed 102 3775 13.3
## 5 ALASKA Seattle delayed 305 3775 13.3
#Calculating the percentage of delayed flights on AM WEST airline
df_delay_am <- df_long %>%
filter(Airline == "AM WEST") %>%
mutate(total_flights = sum(Value)) %>%
filter(Status == "delayed") %>%
mutate(pct_flight_delay = sum(Value)/total_flights*100)
df_delay_am
## # A tibble: 5 × 6
## Airline City Status Value total_flights pct_flight_delay
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 AM WEST Los Angeles delayed 117 7225 10.9
## 2 AM WEST Pheonix delayed 415 7225 10.9
## 3 AM WEST San Diego delayed 65 7225 10.9
## 4 AM WEST San Francisco delayed 129 7225 10.9
## 5 AM WEST Seattle delayed 61 7225 10.9
#Finding the percentage of delayed flights by city and airline
df_delay_city <- df_long %>%
group_by(City,Airline) %>%
mutate(total_flights_city = sum(Value)) %>%
mutate(pct_delay_city = Value/total_flights_city*100)
df_delay_city
## # A tibble: 20 × 6
## # Groups: City, Airline [10]
## Airline City Status Value total_flights_city pct_delay_city
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 ALASKA Los Angeles delayed 62 559 11.1
## 2 AM WEST Los Angeles delayed 117 811 14.4
## 3 ALASKA Los Angeles on time 497 559 88.9
## 4 AM WEST Los Angeles on time 694 811 85.6
## 5 ALASKA Pheonix delayed 12 233 5.15
## 6 AM WEST Pheonix delayed 415 5255 7.90
## 7 ALASKA Pheonix on time 221 233 94.8
## 8 AM WEST Pheonix on time 4840 5255 92.1
## 9 ALASKA San Diego delayed 20 232 8.62
## 10 AM WEST San Diego delayed 65 448 14.5
## 11 ALASKA San Diego on time 212 232 91.4
## 12 AM WEST San Diego on time 383 448 85.5
## 13 ALASKA San Francisco delayed 102 605 16.9
## 14 AM WEST San Francisco delayed 129 449 28.7
## 15 ALASKA San Francisco on time 503 605 83.1
## 16 AM WEST San Francisco on time 320 449 71.3
## 17 ALASKA Seattle delayed 305 2146 14.2
## 18 AM WEST Seattle delayed 61 262 23.3
## 19 ALASKA Seattle on time 1841 2146 85.8
## 20 AM WEST Seattle on time 201 262 76.7
df_delay_city %>%
filter(Status == "delayed") %>%
ggplot(aes(x= City, y = pct_delay_city, fill = Airline))+
geom_bar(stat = "identity", position = "dodge")+
labs(title = "Airline Delays by percentage", x = "City", y = "Percent Delayed")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Conclusion: AM West airline had more flights at 7225 compared to Alaska Airlines at 3775. Even though they had almost double the number of flights as Alaska Airlines, AM West (10%) had a lower overall percentage of delayed flights than Alaska Airlines (13%).
However, if I were to compare the percentage of delays in each city, Alaska Airlines has a lower percentage of delayed flights in each city compared to AM West.
The reason AM West airline has a lower overall percentage of delayed flights is because Phoenix has almost 5,000 flights that were on time.