library(tidyverse)
library(openintro)
library(tidyr)
library(dplyr)
library(data.table)
library(glue)
library(gridExtra)
This lab provides practice for handling untidy data. It is sometimes easier to manually fix some issues from a dataset, but R is powerful enough to handle this process for us. This invaluable for larger datasets.
Start by loading the hosted csv. In this case, R assigned V1 and V2 to the column names without headers. This is fine as we will change them to more useful names while tidying.
There was a blank line in the original data. The function “fread” is not able to handle that and instead inserted a row filled with “NA”s. Remove this line first because it is useless and impacts other mutations.
## V1 V2 Los Angeles Phoenix San Diego San Francisco Seattle
## <char> <char> <int> <int> <int> <int> <int>
## 1: ALASKA on time 497 221 212 503 1841
## 2: delayed 62 12 20 102 305
## 3: AM WEST on time 694 4840 383 320 201
## 4: delayed 117 415 65 129 61
A major issue with this data is that it is in a wide form. This is easy to read on a spreadsheet, but does not work well with many R functions. We can start by collecting each city into a single “city” column using the “melt” function.
melt_flight_df <- untidy_flight_df %>%
melt(id.vars = c(colnames(untidy_flight_df)[1:2]), variable.name = 'city', value.name = 'total_flights')
melt_flight_df
## V1 V2 city total_flights
## <char> <char> <fctr> <int>
## 1: ALASKA on time Los Angeles 497
## 2: delayed Los Angeles 62
## 3: AM WEST on time Los Angeles 694
## 4: delayed Los Angeles 117
## 5: ALASKA on time Phoenix 221
## 6: delayed Phoenix 12
## 7: AM WEST on time Phoenix 4840
## 8: delayed Phoenix 415
## 9: ALASKA on time San Diego 212
## 10: delayed San Diego 20
## 11: AM WEST on time San Diego 383
## 12: delayed San Diego 65
## 13: ALASKA on time San Francisco 503
## 14: delayed San Francisco 102
## 15: AM WEST on time San Francisco 320
## 16: delayed San Francisco 129
## 17: ALASKA on time Seattle 1841
## 18: delayed Seattle 305
## 19: AM WEST on time Seattle 201
## 20: delayed Seattle 61
## V1 V2 city total_flights
V1 and V2 also need more meaningful names. Let’s rename them to airline and status respectively.
## airline status city total_flights
## <char> <char> <fctr> <int>
## 1: ALASKA on time Los Angeles 497
## 2: delayed Los Angeles 62
## 3: AM WEST on time Los Angeles 694
## 4: delayed Los Angeles 117
## 5: ALASKA on time Phoenix 221
## 6: delayed Phoenix 12
## 7: AM WEST on time Phoenix 4840
## 8: delayed Phoenix 415
## 9: ALASKA on time San Diego 212
## 10: delayed San Diego 20
## 11: AM WEST on time San Diego 383
## 12: delayed San Diego 65
## 13: ALASKA on time San Francisco 503
## 14: delayed San Francisco 102
## 15: AM WEST on time San Francisco 320
## 16: delayed San Francisco 129
## 17: ALASKA on time Seattle 1841
## 18: delayed Seattle 305
## 19: AM WEST on time Seattle 201
## 20: delayed Seattle 61
## airline status city total_flights
These columns now have meaningful names and the rows are based on each set of total flights. One last issue is that the airline columns is missing every other entry. This is because visually we can assume that if the on time airline was Alaska, the following delayed airline entry is also Alaska. Since R does not innately do this inference, we need to fill in those cells. The initial creation of the data sets the blanks in airline to an empty character instead of “NA”. One way to work with this is to mutate those values to “NA” so that the “fill” function can fill in those missing values with the previous one. The dataframe is finally tidy.
tidy_flight_df <- rename_flight_df %>%
mutate(airline = if_else(airline == '', NA, airline)) %>%
fill(airline)
tidy_flight_df
## airline status city total_flights
## <char> <char> <fctr> <int>
## 1: ALASKA on time Los Angeles 497
## 2: ALASKA delayed Los Angeles 62
## 3: AM WEST on time Los Angeles 694
## 4: AM WEST delayed Los Angeles 117
## 5: ALASKA on time Phoenix 221
## 6: ALASKA delayed Phoenix 12
## 7: AM WEST on time Phoenix 4840
## 8: AM WEST delayed Phoenix 415
## 9: ALASKA on time San Diego 212
## 10: ALASKA delayed San Diego 20
## 11: AM WEST on time San Diego 383
## 12: AM WEST delayed San Diego 65
## 13: ALASKA on time San Francisco 503
## 14: ALASKA delayed San Francisco 102
## 15: AM WEST on time San Francisco 320
## 16: AM WEST delayed San Francisco 129
## 17: ALASKA on time Seattle 1841
## 18: ALASKA delayed Seattle 305
## 19: AM WEST on time Seattle 201
## 20: AM WEST delayed Seattle 61
## airline status city total_flights
Now it’s time to analyze the tidied up data. We want to compare the arrival delays for the two airlines. Let’s compute the number of flights where status is delayed and the total flights. Get the percentage of delayed flights using delayed_flights / total_flights for both airlines. Compile the calculated data into a dataframe.
total_alaska_delays <- tidy_flight_df %>%
filter(airline == 'ALASKA' & status == 'delayed') %>% with(sum(total_flights))
total_alaska_flights <- tidy_flight_df %>%
filter(airline == 'ALASKA') %>% with(sum(total_flights))
total_alaska_delay_rate = total_alaska_delays / total_alaska_flights
total_am_west_delays <- tidy_flight_df %>%
filter(airline == 'AM WEST' & status == 'delayed') %>% with(sum(total_flights))
total_am_west_flights <- tidy_flight_df %>%
filter(airline == 'AM WEST') %>% with(sum(total_flights))
total_am_west_delay_rate = total_am_west_delays / total_am_west_flights
delay_rates_df <- data.frame(
airline = c('ALASKA', 'AM WEST'),
delay_rate = c(total_alaska_delay_rate, total_am_west_delay_rate)
)
delay_rates_df
## airline delay_rate
## 1 ALASKA 0.1327152
## 2 AM WEST 0.1089273
Alaska has a higher overall delay rate of about 13% than AM West with about 11%. A bar plot is a convenient way to show this since we can examine the bars side by side.
delay_rates_df %>% ggplot(aes(x = airline, y = delay_rate)) +
geom_bar(stat = 'identity') +
labs(title = 'Delay Rate by Airline')
Let’s also compare how both airlines perform in each city. Since we are reusing this code, it’s better to turn it into a function which we will call “calc_flight_rate” which accepts two filter arguments and divides their results. Loop through all combinations of cities and flights to calculate delay rates using this new function.
calc_flight_rate <- function(filter_numer, filter_denom) {
filtered_flights <- tidy_flight_df %>%
filter(eval(parse(text=filter_numer))) %>% with(sum(total_flights))
total_flights <- tidy_flight_df %>%
filter(eval(parse(text=filter_denom))) %>% with(sum(total_flights))
total_flight_rate = filtered_flights / total_flights
return (total_flight_rate)
}
city_delays_col <- c('city', 'airline', 'delay_rate')
city_delays_df <- data.frame(
matrix(
nrow = 0,
ncol = length(city_delays_col)
)
)
for (city_name in unique(tidy_flight_df$city)) {
for (airline_name in unique(tidy_flight_df$airline)) {
filter_arg_delay <- glue("airline == '{airline_name}' & status == 'delayed' & city == '{city_name}'")
filter_arg_total <- glue("airline == '{airline_name}' & city == '{city_name}'")
rate <- calc_flight_rate(filter_arg_delay, filter_arg_total)
city_delays_df <- rbind(city_delays_df, list(city_name,airline_name,rate))
}
}
colnames(city_delays_df) <- city_delays_col
city_delays_df
## city airline delay_rate
## 1 Los Angeles ALASKA 0.11091234
## 2 Los Angeles AM WEST 0.14426634
## 3 Phoenix ALASKA 0.05150215
## 4 Phoenix AM WEST 0.07897241
## 5 San Diego ALASKA 0.08620690
## 6 San Diego AM WEST 0.14508929
## 7 San Francisco ALASKA 0.16859504
## 8 San Francisco AM WEST 0.28730512
## 9 Seattle ALASKA 0.14212488
## 10 Seattle AM WEST 0.23282443
Let’s visualize the trends with another bar plot. Delay rates for each airline are shown side by side and separated by city. This plot definitively shows that for every city, AM West has a higher delay rate. This looks like a massive contradiction with the earlier data. What is going on with this data?
ggplot(city_delays_df, aes(x = city, y = delay_rate, fill = airline)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = 'Delay Rate by Airline and City')
Checking the total flights per city for both airlines adds some insight into this issue. Taking the total flights for each city and airline combination while ignoring the delay status, we get a flights per city table for both airlines. The table implies that AM West’s data are weighted toward Phoenix where it had a delay percentage of 7.9%. Alaska’s data are mostly driven by its performance in Seattle with a delay percentage of 14.2%. Despite Alaska winning the head to head lower delay percentage in every city, it is unfortunately penalized for deriving more of its delay rate from its second worst delay city (only behind San Francisco).
tidy_flight_df %>% group_by(airline, city) %>% summarise(total_flights = sum(total_flights), .groups = 'drop') %>%
arrange(desc(total_flights))
## # A tibble: 10 × 3
## airline city total_flights
## <chr> <fct> <int>
## 1 AM WEST Phoenix 5255
## 2 ALASKA Seattle 2146
## 3 AM WEST Los Angeles 811
## 4 ALASKA San Francisco 605
## 5 ALASKA Los Angeles 559
## 6 AM WEST San Francisco 449
## 7 AM WEST San Diego 448
## 8 AM WEST Seattle 262
## 9 ALASKA Phoenix 233
## 10 ALASKA San Diego 232
city_delays_df %>%
filter(airline == 'AM WEST' & city == 'Phoenix' | airline == 'ALASKA' & city == 'Seattle')
## city airline delay_rate
## 1 Phoenix AM WEST 0.07897241
## 2 Seattle ALASKA 0.14212488
The original input data was very easy to read, but was not tidy for a data frame. Having individual cities as columns was the most egregious issue. It is better to keep data in a tidy long form and adapt as needed.
The city specific analysis was especially eye opening. It was easy to declare AM West the more reliable airline because of a better overall delay rate. However, more granular data showed that Alaska consistently outperformed AM West in every single city.
Further work into solving this conundrum could include accounting for each airline having different amounts of flights in each city.