This is an R Markdown document with Sean Amato’s work for the week 5 homework relating to tidying a dataset.
First, I started by importing practically an exact replica of the data from blackboard.
messy_df <- read.csv("https://raw.githubusercontent.com/samato0624/DATA607/main/HW_Week_5_SA_Table.csv")
Here I’m printing the data to inspect it and do some initial cleaning.
print(messy_df)
## X Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
# Removing the blank row.
messy_df2 <- messy_df %>%
filter(Seattle != "[0-9]")
# Adding missing values to the first column.
messy_df2[2,1] <- "ALASKA"
messy_df2[4,1] <- "AM WEST"
# Cleaning up the column names.
colnames(messy_df2) <- c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")
print(messy_df2)
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Now that some basic transformations have been completed, the next step is to switch from a wide to a long format. Below you can observe that the original table has been atomized as far as possible, with each row representing a single flight.
clean_df <- pivot_longer(
messy_df2,
cols = "Los Angeles":"Phoenix":"San Diego":"San Francisco":"Seattle",
names_to = "Destination",
values_to = "Flights"
) %>% uncount(Flights)
head(clean_df,10)
## # A tibble: 10 × 3
## Airline Status Destination
## <chr> <chr> <chr>
## 1 ALASKA on time Los Angeles
## 2 ALASKA on time Los Angeles
## 3 ALASKA on time Los Angeles
## 4 ALASKA on time Los Angeles
## 5 ALASKA on time Los Angeles
## 6 ALASKA on time Los Angeles
## 7 ALASKA on time Los Angeles
## 8 ALASKA on time Los Angeles
## 9 ALASKA on time Los Angeles
## 10 ALASKA on time Los Angeles
Now let’s perform an some exploratory data analysis to see if we can find any meaningful differences between the 2 airlines.
Based on the chart presented below, by counts alone we can tell that AM WEST has more overall arrival delays, but this doesn’t help us infer anything about the delays in different destinations.
arrival_delays <- clean_df %>%
filter(Status == "delayed")
ggplot(arrival_delays, aes(x = Airline)) + geom_bar(stat = "count", fill = "purple") +
theme_classic()
When we aggregate the delayed flights by city we find that AM WEST struggles to get people to Phoeniz on-time, while ALASKA struggles to get people to Seattle on-time. The rest of the destinations have marginal differences in comparison to Phoenix and Seattle, but this chart doesn’t help us understand anything about the total throughput that each airline is responsible for. Thus, these charts do not provide a fair comparison.
ggplot(arrival_delays, aes(x = Airline)) + geom_bar(stat = "count", fill = "steelblue") +
facet_wrap(~Destination) +
theme_classic()
Here I’m going to take clean_df and convert it back to a wider format for an evaluation of delay proportions.
Step 1. Separate the data by Status to create 2 different tables. Additionally, I concatenated the destination with the airline to create a key, with which I’m able to use later to rejoin the two tables.
clean_df_delayed <- clean_df %>%
filter(Status == "delayed") %>%
count(Airline, Status, Destination) %>%
mutate(Airline_Dest = paste(Airline, Destination))
clean_df_on_time <- clean_df %>%
filter(Status == "on time") %>%
count(Airline, Status, Destination) %>%
mutate(Airline_Dest = paste(Airline, Destination)) %>%
select(c(2,4,5))
Step 2. Join the data back together by the key (Airline_Dest) I made earlier.
clean_df2 <- left_join(clean_df_delayed, clean_df_on_time, by = "Airline_Dest")
print(clean_df2)
## # A tibble: 10 × 7
## Airline Status.x Destination n.x Airline_Dest Status.y n.y
## <chr> <chr> <chr> <int> <chr> <chr> <int>
## 1 ALASKA delayed Los Angeles 62 ALASKA Los Angeles on time 497
## 2 ALASKA delayed Phoenix 12 ALASKA Phoenix on time 221
## 3 ALASKA delayed San Diego 20 ALASKA San Diego on time 212
## 4 ALASKA delayed San Francisco 102 ALASKA San Francisco on time 503
## 5 ALASKA delayed Seattle 305 ALASKA Seattle on time 1841
## 6 AM WEST delayed Los Angeles 117 AM WEST Los Angeles on time 694
## 7 AM WEST delayed Phoenix 415 AM WEST Phoenix on time 4840
## 8 AM WEST delayed San Diego 65 AM WEST San Diego on time 383
## 9 AM WEST delayed San Francisco 129 AM WEST San Francisco on time 320
## 10 AM WEST delayed Seattle 61 AM WEST Seattle on time 201
Step 3. Change column names for house keeping
purposes and remove redundant data such as the status columns and the
key.
Step 4. Calculate proportions of delayed flights and
plot them to understand what the airlines track records look like in
terms of getting people to their destinations on time.
colnames(clean_df2) <- c("Airline", "Status_delayed", "Destination", "Delay_Count", "Airline_Dest", "Status_on_time", "On_Time_Count")
clean_df3 <- clean_df2 %>%
select(c(1,3,4,7)) %>%
mutate(Total = Delay_Count + On_Time_Count) %>%
mutate(Delay_Proportion = Delay_Count/Total)
print(clean_df3)
## # A tibble: 10 × 6
## Airline Destination Delay_Count On_Time_Count Total Delay_Proportion
## <chr> <chr> <int> <int> <int> <dbl>
## 1 ALASKA Los Angeles 62 497 559 0.111
## 2 ALASKA Phoenix 12 221 233 0.0515
## 3 ALASKA San Diego 20 212 232 0.0862
## 4 ALASKA San Francisco 102 503 605 0.169
## 5 ALASKA Seattle 305 1841 2146 0.142
## 6 AM WEST Los Angeles 117 694 811 0.144
## 7 AM WEST Phoenix 415 4840 5255 0.0790
## 8 AM WEST San Diego 65 383 448 0.145
## 9 AM WEST San Francisco 129 320 449 0.287
## 10 AM WEST Seattle 61 201 262 0.233
ggplot(clean_df3, aes(x = Airline, y = Delay_Proportion)) +
geom_bar(stat = "identity", fill = "darkgreen") +
facet_wrap(~Destination) +
theme_classic()
Conclusions: If I was to bet on one airline over another to get me to my destination on time I would go with Alaska. The proportion of flight delays through Alaska by destination is lower than AM WEST across the board and the count of delayed flights via AM WEST is ~37% greater than Alaska.
There is a caveat to this though, below I totaled the number of flights that each airline completes and for ALASKA vs AM WEST it’s 3,775 & 7,225 respectively. Based on supply and demand, if I’m looking to get a cheaper flight I would wager that AM WEST would save me money in exchange for a less reliable service. However I can’t confirm my claim since AM WEST stopped operating back in 2007.
Alaska <- clean_df3 %>%
filter(Airline == "ALASKA")
sum(Alaska$Total)
## [1] 3775
AM_West <- clean_df3 %>%
filter(Airline == "AM WEST")
sum(AM_West$Total)
## [1] 7225