R Markdown

This is an R Markdown document with Sean Amato’s work for the week 5 homework relating to tidying a dataset.


First, I started by importing practically an exact replica of the data from blackboard.

messy_df <- read.csv("https://raw.githubusercontent.com/samato0624/DATA607/main/HW_Week_5_SA_Table.csv")

Here I’m printing the data to inspect it and do some initial cleaning.

print(messy_df)
##         X  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61
# Removing the blank row.
messy_df2 <- messy_df %>% 
  filter(Seattle != "[0-9]")

# Adding missing values to the first column.
messy_df2[2,1] <- "ALASKA"
messy_df2[4,1] <- "AM WEST"

# Cleaning up the column names.
colnames(messy_df2) <- c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")

print(messy_df2)
##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Now that some basic transformations have been completed, the next step is to switch from a wide to a long format. Below you can observe that the original table has been atomized as far as possible, with each row representing a single flight.

clean_df <- pivot_longer(
  messy_df2, 
  cols = "Los Angeles":"Phoenix":"San Diego":"San Francisco":"Seattle", 
  names_to = "Destination", 
  values_to = "Flights"
) %>% uncount(Flights)

head(clean_df,10)
## # A tibble: 10 × 3
##    Airline Status  Destination
##    <chr>   <chr>   <chr>      
##  1 ALASKA  on time Los Angeles
##  2 ALASKA  on time Los Angeles
##  3 ALASKA  on time Los Angeles
##  4 ALASKA  on time Los Angeles
##  5 ALASKA  on time Los Angeles
##  6 ALASKA  on time Los Angeles
##  7 ALASKA  on time Los Angeles
##  8 ALASKA  on time Los Angeles
##  9 ALASKA  on time Los Angeles
## 10 ALASKA  on time Los Angeles

Now let’s perform an some exploratory data analysis to see if we can find any meaningful differences between the 2 airlines.

Based on the chart presented below, by counts alone we can tell that AM WEST has more overall arrival delays, but this doesn’t help us infer anything about the delays in different destinations.

arrival_delays <- clean_df %>%
  filter(Status == "delayed")

ggplot(arrival_delays, aes(x = Airline)) + geom_bar(stat = "count", fill = "purple") + 
  theme_classic()

When we aggregate the delayed flights by city we find that AM WEST struggles to get people to Phoeniz on-time, while ALASKA struggles to get people to Seattle on-time. The rest of the destinations have marginal differences in comparison to Phoenix and Seattle, but this chart doesn’t help us understand anything about the total throughput that each airline is responsible for. Thus, these charts do not provide a fair comparison.

ggplot(arrival_delays, aes(x = Airline)) + geom_bar(stat = "count", fill = "steelblue") + 
  facet_wrap(~Destination) +
  theme_classic()

Here I’m going to take clean_df and convert it back to a wider format for an evaluation of delay proportions.

Step 1. Separate the data by Status to create 2 different tables. Additionally, I concatenated the destination with the airline to create a key, with which I’m able to use later to rejoin the two tables.

clean_df_delayed <- clean_df %>%
  filter(Status == "delayed") %>%
  count(Airline, Status, Destination) %>%
  mutate(Airline_Dest = paste(Airline, Destination))

clean_df_on_time <- clean_df %>%
  filter(Status == "on time") %>%
  count(Airline, Status, Destination) %>%
  mutate(Airline_Dest = paste(Airline, Destination)) %>%
  select(c(2,4,5))

Step 2. Join the data back together by the key (Airline_Dest) I made earlier.

clean_df2 <- left_join(clean_df_delayed, clean_df_on_time, by = "Airline_Dest")

print(clean_df2)
## # A tibble: 10 × 7
##    Airline Status.x Destination     n.x Airline_Dest          Status.y   n.y
##    <chr>   <chr>    <chr>         <int> <chr>                 <chr>    <int>
##  1 ALASKA  delayed  Los Angeles      62 ALASKA Los Angeles    on time    497
##  2 ALASKA  delayed  Phoenix          12 ALASKA Phoenix        on time    221
##  3 ALASKA  delayed  San Diego        20 ALASKA San Diego      on time    212
##  4 ALASKA  delayed  San Francisco   102 ALASKA San Francisco  on time    503
##  5 ALASKA  delayed  Seattle         305 ALASKA Seattle        on time   1841
##  6 AM WEST delayed  Los Angeles     117 AM WEST Los Angeles   on time    694
##  7 AM WEST delayed  Phoenix         415 AM WEST Phoenix       on time   4840
##  8 AM WEST delayed  San Diego        65 AM WEST San Diego     on time    383
##  9 AM WEST delayed  San Francisco   129 AM WEST San Francisco on time    320
## 10 AM WEST delayed  Seattle          61 AM WEST Seattle       on time    201

Step 3. Change column names for house keeping purposes and remove redundant data such as the status columns and the key.
Step 4. Calculate proportions of delayed flights and plot them to understand what the airlines track records look like in terms of getting people to their destinations on time.

colnames(clean_df2) <- c("Airline", "Status_delayed", "Destination", "Delay_Count", "Airline_Dest", "Status_on_time", "On_Time_Count")

clean_df3 <- clean_df2 %>%
  select(c(1,3,4,7)) %>%
  mutate(Total = Delay_Count + On_Time_Count) %>%
  mutate(Delay_Proportion = Delay_Count/Total)

print(clean_df3)
## # A tibble: 10 × 6
##    Airline Destination   Delay_Count On_Time_Count Total Delay_Proportion
##    <chr>   <chr>               <int>         <int> <int>            <dbl>
##  1 ALASKA  Los Angeles            62           497   559           0.111 
##  2 ALASKA  Phoenix                12           221   233           0.0515
##  3 ALASKA  San Diego              20           212   232           0.0862
##  4 ALASKA  San Francisco         102           503   605           0.169 
##  5 ALASKA  Seattle               305          1841  2146           0.142 
##  6 AM WEST Los Angeles           117           694   811           0.144 
##  7 AM WEST Phoenix               415          4840  5255           0.0790
##  8 AM WEST San Diego              65           383   448           0.145 
##  9 AM WEST San Francisco         129           320   449           0.287 
## 10 AM WEST Seattle                61           201   262           0.233
ggplot(clean_df3, aes(x = Airline, y = Delay_Proportion)) + 
  geom_bar(stat = "identity", fill = "darkgreen") +
  facet_wrap(~Destination) +
  theme_classic()

Conclusions: If I was to bet on one airline over another to get me to my destination on time I would go with Alaska. The proportion of flight delays through Alaska by destination is lower than AM WEST across the board and the count of delayed flights via AM WEST is ~37% greater than Alaska.

There is a caveat to this though, below I totaled the number of flights that each airline completes and for ALASKA vs AM WEST it’s 3,775 & 7,225 respectively. Based on supply and demand, if I’m looking to get a cheaper flight I would wager that AM WEST would save me money in exchange for a less reliable service. However I can’t confirm my claim since AM WEST stopped operating back in 2007.

Alaska <- clean_df3 %>%
  filter(Airline == "ALASKA")
sum(Alaska$Total)
## [1] 3775
AM_West <- clean_df3 %>%
  filter(Airline == "AM WEST")
sum(AM_West$Total)
## [1] 7225