Overview

In this assignment, I will tidy and transform data using tidyr and dplyr as needed. Then, I will analyze the arrival delays between the two airlines.

# import flights data
df <- read.csv('https://raw.githubusercontent.com/yli1048/yli1048/refs/heads/607/Week%204.csv', header=TRUE)
names(df)[names(df) == "X"] <- "Airlines"
names(df)[names(df) == "X.1"] <- "Status"
glimpse(df)
## Rows: 4
## Columns: 7
## $ Airlines      <chr> "ALASKA", "ALASKA", "AM WEST", "AM WEST"
## $ Status        <chr> "on time", "delayed", "on time", "delayed"
## $ Los.Angeles   <int> 497, 62, 694, 117
## $ Phoenix       <int> 221, 12, 4840, 415
## $ San.Diego     <int> 212, 20, 383, 65
## $ San.Francisco <int> 503, 102, 320, 129
## $ Seattle       <int> 1841, 305, 201, 61

Tidy data

#Tidy data for visualization
long.df <- pivot_longer(df, cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"), names_to = "City", values_to = "Flights")
print(long.df)
## # A tibble: 20 × 4
##    Airlines Status  City          Flights
##    <chr>    <chr>   <chr>           <int>
##  1 ALASKA   on time Los.Angeles       497
##  2 ALASKA   on time Phoenix           221
##  3 ALASKA   on time San.Diego         212
##  4 ALASKA   on time San.Francisco     503
##  5 ALASKA   on time Seattle          1841
##  6 ALASKA   delayed Los.Angeles        62
##  7 ALASKA   delayed Phoenix            12
##  8 ALASKA   delayed San.Diego          20
##  9 ALASKA   delayed San.Francisco     102
## 10 ALASKA   delayed Seattle           305
## 11 AM WEST  on time Los.Angeles       694
## 12 AM WEST  on time Phoenix          4840
## 13 AM WEST  on time San.Diego         383
## 14 AM WEST  on time San.Francisco     320
## 15 AM WEST  on time Seattle           201
## 16 AM WEST  delayed Los.Angeles       117
## 17 AM WEST  delayed Phoenix           415
## 18 AM WEST  delayed San.Diego          65
## 19 AM WEST  delayed San.Francisco     129
## 20 AM WEST  delayed Seattle            61

Compare arrival delays

al.total <- with(long.df, sum(Flights[Airlines == "ALASKA"]))

al.delayed <- with(long.df, sum(Flights[Airlines == "ALASKA" & Status == 'delayed']))

al.percent = (al.delayed / al.total) * 100
print("The percent of arrival delays of ALASKA airline is")
## [1] "The percent of arrival delays of ALASKA airline is"
print(al.percent)
## [1] 13.27152
long.df %>%
  filter(Airlines == "ALASKA" & Status == "delayed") %>%
  ggplot(aes(x = City, y = Flights)) +
  geom_col()

aw.total <- with(long.df, sum(Flights[Airlines == "AM WEST"]))

aw.delayed <- with(long.df, sum(Flights[Airlines == "AM WEST" & Status == 'delayed']))

aw.percent = (aw.delayed / aw.total) * 100
print("The percent of arrival delays of AW WEST airline is")
## [1] "The percent of arrival delays of AW WEST airline is"
print(aw.percent)
## [1] 10.89273
long.df %>%
  filter(Airlines == "AM WEST" & Status == "delayed") %>%
  ggplot(aes(x = City, y = Flights)) +
  geom_col()

long.df %>%
filter(Status == "delayed") %>%
ggplot(aes(x = City, y = Flights, fill = Airlines)) + 
  ggtitle("Arrival Delays") + 
  geom_bar(stat = "identity", position = "dodge", width = 0.75)

When comparing the percentage of arrival delays, we can see that even though AM WEST airline experiences a higher number of arrival delays, it is actually less likely to be delayed since it has a lower percentage. However, AM WEST airline also has a high likelihood of experiencing delays when traveling to Phoenix. On the other hand, ALASKA airline has a higher chance of experiencing delays when traveling to Seattle.

Conclusion

I have learned how to use pivot_longer and pivot_wider to expand or reduce the number of columns. This can make the data easier to read and help with creating data visualizations. For instance, in this assignment, I utilized pivot_longer to merge all the cities into a single column, which made it simpler to plot and analyze the data frame.