Assignment Overview

The assignment for Week 6 will consider a wide dataset describing arrival delays for two airlines across five destinations. The purpose of this assignment is to use the tidyr and dplyr packages to tidy and transform the flights datatset and perform comparative analysis of the two airlines.

Setup

The R code for this analysis can be found on GitHub here under “Week 6 - Tidying and Transforming Data”.

The flights dataset can also be found on Github as a .CSV file: Flights Dataset

The following R packages are required for this analysis:

  1. RCurl
  2. knitr
  3. tidyr
  4. dplyr
  5. ggplot2

Load the the airline flight data from GitHub

URL <- getURL("https://raw.githubusercontent.com/kfolsom98/DATA607/master/flights.csv")
flights_df <- tbl_df(read.csv(text = URL, header = TRUE))

Let’s take a look at the raw flights data:

airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Tidy the dataset using tidyr

We see that the flights dataset has five columns for each of the five destination cities. Convert this to a tidy format using gather from the tidyr.

tidy_flights_df <- gather(flights_df, city, total, Los.Angeles:Seattle) %>% 
                   mutate(city = gsub( "\\.", " ", city)) %>%    
                   arrange(airline)

#use mutate to replace any "." characters with space in file's column headers containing city names

The resulting dataframe now contains a single column for city – Los Angeles, Phoenix, San Diego, San Franciso, and Seattle. The total numbers of arrivals and delays can be found in the total column.

airline status city total
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
ALASKA on time San Diego 212
ALASKA delayed San Diego 20
ALASKA on time San Francisco 503
ALASKA delayed San Francisco 102
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
AM WEST on time San Diego 383
AM WEST delayed San Diego 65
AM WEST on time San Francisco 320
AM WEST delayed San Francisco 129
AM WEST on time Seattle 201
AM WEST delayed Seattle 61

Use dplyr to transform the dataset for analysis

With the flights dataset in a more manageable format, create an aggregate view of each airline’s total number of delays vs. arrivals.

flights_aggr_df <- tidy_flights_df %>% select(airline, status, total) %>% 
    group_by(airline, status) %>% 
    summarise(total = sum(total)) %>% 
    mutate(percent           = round(total/sum(total) * 100, 2),
           percent_formatted = paste0(round(total/sum(total) * 100, 1), "%"),
           total_flights     = sum(total)) 

flights_aggr_df <- as.data.frame(flights_aggr_df)

The aggregate numbers look like this:

airline status total percent_formatted total_flights
ALASKA delayed 501 13.3% 3775
ALASKA on time 3274 86.7% 3775
AM WEST delayed 787 10.9% 7225
AM WEST on time 6438 89.1% 7225

Visualize the airlines aggregate delays versus arrival totals using ggplot2

Use gglot2 to visualize, we see that ALASKA Airline has the higher percentage of delayed flights in aggregate. Also, we see that AM WEST has almost twice as many total flights as compared to ALASKA.

flights_aggr_df %>% filter(status == "delayed") %>%
    ggplot( aes(x=airline, y=percent, fill=airline)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
    coord_flip() + 
    ggtitle("Overall Percentage of Flights Delayed by Airline") +
    xlab("Airline") + ylab("Percentage of Flights Delayed") +
    geom_text(aes(label=paste(percent_formatted, "\n n =", total_flights)), vjust=0.5, hjust=1.1,color="black")

However, this aggregate delayed-flight percentage statistic may not represent the true picture if we consider the destination city in the analysis. Using dplyr, create the analysis above but include the destination city in the calculation.

flights_by_city_df <- tidy_flights_df %>%
group_by(airline, city) %>% 
    mutate(percent = round(total/sum(total)*100, 2),
           percent_formatted=paste0(round(total/sum(total)*100, 1), "%")) 

flights_by_city_df <- as.data.frame(flights_by_city_df)

The result looks like the sample below:

airline status city total percent_formatted
ALASKA on time Los Angeles 497 88.9%
ALASKA delayed Los Angeles 62 11.1%
ALASKA on time Phoenix 221 94.8%
ALASKA delayed Phoenix 12 5.2%
ALASKA on time San Diego 212 91.4%
ALASKA delayed San Diego 20 8.6%
ALASKA on time San Francisco 503 83.1%
ALASKA delayed San Francisco 102 16.9%
ALASKA on time Seattle 1841 85.8%
ALASKA delayed Seattle 305 14.2%

Visualize each airline’s percentage of flight delays for each of the destinations using ggplot2

flights_by_city_df %>% filter(status == "delayed") %>%
ggplot( aes(x=city, y=percent, fill=airline)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    geom_text(aes(label=percent_formatted), vjust=.5, hjust=1,position= position_dodge(width=0.9),  color="black") +
        ggtitle("Percentage of Flights Delayed by Airline by City") +
    xlab("City") + ylab("Percentage of Flights Delayed") +
        coord_flip() 

Or, alternatively, using a Cleveland Plot. The code for the Cleveland Plot below is based largely on similar plots from Winston Chang’s R Graphics Cookbook.

flights_by_city_df %>% filter(status == "delayed") %>%
ggplot( aes(x=percent, y=city)) + 
    geom_segment(aes(yend=city), xend=0, color='grey50') + 
    geom_point(size=4, aes(color=airline)) + 
    geom_text(aes(label=percent_formatted), vjust=-1, hjust=.5,color='black') +
    scale_color_brewer(palette="Set1", limits=c('ALASKA', 'AM WEST')) + 
    theme_bw() + 
     ggtitle("Percentage of Flights Delayed by Airline by City") +
    xlab("Percentage of Flights Delayed") + ylab("City") +
    theme(panel.grid.major.y = element_blank(),
          legend.position=c(1,0.55),
          legend.justification=c(1,0.5))

Based on the graphs above, we see that ALASKA Airline has a consistently lower value for percentage of delayed flights as compared to AM WEST Airline.

If we wanted to programmatically extract the airline with the lowest delayed-flight percentage, this could be done using dplyr.

So, the answer the question, “What airline should be selected for a flight to any one of these cities, assuming the decision is based on the lowest percent-delay value?”:

 flights_by_city_df %>% 
      filter(status == "delayed") %>% 
      group_by(city)  %>% 
      arrange(city, percent) %>%
      filter(row_number() == 1) %>% 
      select(airline, city, status, percent_formatted) %>% kable()
airline city status percent_formatted
ALASKA Los Angeles delayed 11.1%
ALASKA Phoenix delayed 5.2%
ALASKA San Diego delayed 8.6%
ALASKA San Francisco delayed 16.9%
ALASKA Seattle delayed 14.2%

This output is consistent with the graphs comparing each airline’s delayed-flight percentage.