In this assignment, I will tidy and transform data using tidyr and dplyr as needed. Then, I will analyze the arrival delays between the two airlines.
# import flights data
df <- read.csv('https://raw.githubusercontent.com/yli1048/yli1048/refs/heads/607/Week%204.csv', header=TRUE)
names(df)[names(df) == "X"] <- "Airlines"
names(df)[names(df) == "X.1"] <- "Status"
glimpse(df)
## Rows: 4
## Columns: 7
## $ Airlines <chr> "ALASKA", "ALASKA", "AM WEST", "AM WEST"
## $ Status <chr> "on time", "delayed", "on time", "delayed"
## $ Los.Angeles <int> 497, 62, 694, 117
## $ Phoenix <int> 221, 12, 4840, 415
## $ San.Diego <int> 212, 20, 383, 65
## $ San.Francisco <int> 503, 102, 320, 129
## $ Seattle <int> 1841, 305, 201, 61
#Tidy data for visualization
long.df <- pivot_longer(df, cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"), names_to = "City", values_to = "Flights")
print(long.df)
## # A tibble: 20 × 4
## Airlines Status City Flights
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
al.total <- with(long.df, sum(Flights[Airlines == "ALASKA"]))
al.delayed <- with(long.df, sum(Flights[Airlines == "ALASKA" & Status == 'delayed']))
al.percent = (al.delayed / al.total) * 100
print("The percent of arrival delays of ALASKA airline is")
## [1] "The percent of arrival delays of ALASKA airline is"
print(al.percent)
## [1] 13.27152
long.df %>%
filter(Airlines == "ALASKA" & Status == "delayed") %>%
ggplot(aes(x = City, y = Flights)) +
geom_col()
aw.total <- with(long.df, sum(Flights[Airlines == "AM WEST"]))
aw.delayed <- with(long.df, sum(Flights[Airlines == "AM WEST" & Status == 'delayed']))
aw.percent = (aw.delayed / aw.total) * 100
print("The percent of arrival delays of AW WEST airline is")
## [1] "The percent of arrival delays of AW WEST airline is"
print(aw.percent)
## [1] 10.89273
long.df %>%
filter(Airlines == "AM WEST" & Status == "delayed") %>%
ggplot(aes(x = City, y = Flights)) +
geom_col()
long.df %>%
filter(Status == "delayed") %>%
ggplot(aes(x = City, y = Flights, fill = Airlines)) +
ggtitle("Arrival Delays") +
geom_bar(stat = "identity", position = "dodge", width = 0.75)
When comparing the percentage of arrival delays, we can see that even though AM WEST airline experiences a higher number of arrival delays, it is actually less likely to be delayed since it has a lower percentage. However, AM WEST airline also has a high likelihood of experiencing delays when traveling to Phoenix. On the other hand, ALASKA airline has a higher chance of experiencing delays when traveling to Seattle.
I have learned how to use pivot_longer and pivot_wider to expand or reduce the number of columns. This can make the data easier to read and help with creating data visualizations. For instance, in this assignment, I utilized pivot_longer to merge all the cities into a single column, which made it simpler to plot and analyze the data frame.