DATA 607: Week 6 Assignment - Tidying and Transforming Data

Setup

The R code for this analysis can be found on GitHub here under “Week 6 - Tidying and Transforming Data”.

The flights dataset can also be found on Github as a .CSV file: Flights Dataset

The following R packages are required for this analysis:

RCurl
knitr
tidyr
dplyr
ggplot2

Load the the airline flight data from GitHub

URL <- getURL("https://raw.githubusercontent.com/kfolsom98/DATA607/master/flights.csv")
flights_df <- tbl_df(read.csv(text = URL, header = TRUE))

Let’s take a look at the raw flights data:

airline	status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Tidy the dataset using `tidyr`

We see that the flights dataset has five columns for each of the five destination cities. Convert this to a tidy format using gather from the tidyr.

tidy_flights_df <- gather(flights_df, city, total, Los.Angeles:Seattle) %>% 
                   mutate(city = gsub( "\\.", " ", city)) %>%    
                   arrange(airline)

#use mutate to replace any "." characters with space in file's column headers containing city names

The resulting dataframe now contains a single column for city – Los Angeles, Phoenix, San Diego, San Franciso, and Seattle. The total numbers of arrivals and delays can be found in the total column.

airline	status	city	total
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
ALASKA	on time	San Diego	212
ALASKA	delayed	San Diego	20
ALASKA	on time	San Francisco	503
ALASKA	delayed	San Francisco	102
ALASKA	on time	Seattle	1841
ALASKA	delayed	Seattle	305
AM WEST	on time	Los Angeles	694
AM WEST	delayed	Los Angeles	117
AM WEST	on time	Phoenix	4840
AM WEST	delayed	Phoenix	415
AM WEST	on time	San Diego	383
AM WEST	delayed	San Diego	65
AM WEST	on time	San Francisco	320
AM WEST	delayed	San Francisco	129
AM WEST	on time	Seattle	201
AM WEST	delayed	Seattle	61

Use `dplyr` to transform the dataset for analysis

With the flights dataset in a more manageable format, create an aggregate view of each airline’s total number of delays vs. arrivals.

flights_aggr_df <- tidy_flights_df %>% select(airline, status, total) %>% 
    group_by(airline, status) %>% 
    summarise(total = sum(total)) %>% 
    mutate(percent           = round(total/sum(total) * 100, 2),
           percent_formatted = paste0(round(total/sum(total) * 100, 1), "%"),
           total_flights     = sum(total)) 

flights_aggr_df <- as.data.frame(flights_aggr_df)

The aggregate numbers look like this:

airline	status	total	percent_formatted	total_flights
ALASKA	delayed	501	13.3%	3775
ALASKA	on time	3274	86.7%	3775
AM WEST	delayed	787	10.9%	7225
AM WEST	on time	6438	89.1%	7225

Visualize the airlines aggregate delays versus arrival totals using ggplot2

Use gglot2 to visualize, we see that ALASKA Airline has the higher percentage of delayed flights in aggregate. Also, we see that AM WEST has almost twice as many total flights as compared to ALASKA.

flights_aggr_df %>% filter(status == "delayed") %>%
    ggplot( aes(x=airline, y=percent, fill=airline)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
    coord_flip() + 
    ggtitle("Overall Percentage of Flights Delayed by Airline") +
    xlab("Airline") + ylab("Percentage of Flights Delayed") +
    geom_text(aes(label=paste(percent_formatted, "\n n =", total_flights)), vjust=0.5, hjust=1.1,color="black")

However, this aggregate delayed-flight percentage statistic may not represent the true picture if we consider the destination city in the analysis. Using dplyr, create the analysis above but include the destination city in the calculation.

flights_by_city_df <- tidy_flights_df %>%
group_by(airline, city) %>% 
    mutate(percent = round(total/sum(total)*100, 2),
           percent_formatted=paste0(round(total/sum(total)*100, 1), "%")) 

flights_by_city_df <- as.data.frame(flights_by_city_df)

The result looks like the sample below:

airline	status	city	total	percent_formatted
ALASKA	on time	Los Angeles	497	88.9%
ALASKA	delayed	Los Angeles	62	11.1%
ALASKA	on time	Phoenix	221	94.8%
ALASKA	delayed	Phoenix	12	5.2%
ALASKA	on time	San Diego	212	91.4%
ALASKA	delayed	San Diego	20	8.6%
ALASKA	on time	San Francisco	503	83.1%
ALASKA	delayed	San Francisco	102	16.9%
ALASKA	on time	Seattle	1841	85.8%
ALASKA	delayed	Seattle	305	14.2%

Visualize each airline’s percentage of flight delays for each of the destinations using ggplot2

flights_by_city_df %>% filter(status == "delayed") %>%
ggplot( aes(x=city, y=percent, fill=airline)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    geom_text(aes(label=percent_formatted), vjust=.5, hjust=1,position= position_dodge(width=0.9),  color="black") +
        ggtitle("Percentage of Flights Delayed by Airline by City") +
    xlab("City") + ylab("Percentage of Flights Delayed") +
        coord_flip()

Or, alternatively, using a Cleveland Plot. The code for the Cleveland Plot below is based largely on similar plots from Winston Chang’s R Graphics Cookbook.

flights_by_city_df %>% filter(status == "delayed") %>%
ggplot( aes(x=percent, y=city)) + 
    geom_segment(aes(yend=city), xend=0, color='grey50') + 
    geom_point(size=4, aes(color=airline)) + 
    geom_text(aes(label=percent_formatted), vjust=-1, hjust=.5,color='black') +
    scale_color_brewer(palette="Set1", limits=c('ALASKA', 'AM WEST')) + 
    theme_bw() + 
     ggtitle("Percentage of Flights Delayed by Airline by City") +
    xlab("Percentage of Flights Delayed") + ylab("City") +
    theme(panel.grid.major.y = element_blank(),
          legend.position=c(1,0.55),
          legend.justification=c(1,0.5))

Based on the graphs above, we see that ALASKA Airline has a consistently lower value for percentage of delayed flights as compared to AM WEST Airline.

If we wanted to programmatically extract the airline with the lowest delayed-flight percentage, this could be done using dplyr.

So, the answer the question, “What airline should be selected for a flight to any one of these cities, assuming the decision is based on the lowest percent-delay value?”:

 flights_by_city_df %>% 
      filter(status == "delayed") %>% 
      group_by(city)  %>% 
      arrange(city, percent) %>%
      filter(row_number() == 1) %>% 
      select(airline, city, status, percent_formatted) %>% kable()

airline	city	status	percent_formatted
ALASKA	Los Angeles	delayed	11.1%
ALASKA	Phoenix	delayed	5.2%
ALASKA	San Diego	delayed	8.6%
ALASKA	San Francisco	delayed	16.9%
ALASKA	Seattle	delayed	14.2%

This output is consistent with the graphs comparing each airline’s delayed-flight percentage.

DATA 607: Week 6 Assignment - Tidying and Transforming Data

Keith Folsom

March 5, 2016

Assignment Overview

Setup

Load the the airline flight data from GitHub

Tidy the dataset using `tidyr`

Use `dplyr` to transform the dataset for analysis

DATA 607: Week 6 Assignment - Tidying and Transforming Data

Keith Folsom

March 5, 2016

Assignment Overview

Setup

Load the the airline flight data from GitHub

Tidy the dataset using tidyr

Use dplyr to transform the dataset for analysis

Tidy the dataset using `tidyr`

Use `dplyr` to transform the dataset for analysis