Tidying Airline Data


Introduction

Overview

The raw airline data below describes the on time and delay departure rates of two airlines, Alaksa and America West, at four airports.

V1 V2 Los Angeles Phoenix San Diego San Francisco Seatle
ALASKA ON TIME 497 221 212 503 1841
NA DELAYED 62 12 20 102 305
AM WEST ON TIME 694 4840 383 320 201
NA DELAYED 117 415 65 129 61

This is an acceptable format for human inspection, but the wide layout and untidy format is not firendly for computational analysis of the airlines’ performance. Namely, mutiple observation of airline performance at all airports are on one row and the statuses, i.e., ON TIME and DELAYED at specific airports are described across two lines. The purpose of this assignment is to restrucutre this data in to a “tidy” format in order to carry out basic analyses of performance.

Relevence

A lot of publicly available data is formatted for human consumption. Wide formats are sometimes easier for the human eye to generalize informaiton, but it is not necessarily the case for computational anlalysis. The ability to transform such data from “wide”" to “long”“, or vice versa, is a necessary skill in analyzing the wide array of data formats available for analyses.

Data Reference and Libraries

The raw airline data is available from the this GitHub repo. Further reading of the underlying problems and solutions for cleaning data is available from Hadley Wickham’s paper, Tidy Data. In addition,

Outside of formatting packages, the libraries used for this project are:

  • data.table: Light-weight, fast development for data structure; not neccesarily needed for this data set, but recomended under best practices.
  • dtplyr: Data table back-end of dplyr for data manipulation.
  • tidyr: To produce tidy data from the dplyr pipeline.
  • ggplot2: Provides some advanced plotting utilities.

Creating Tidy Data

Fill Data

Prior to tidying our data, it would be easier to start off with data with proper column headings and values filled for the NULL observations:

airline_dt$V1 <- rep(airline_dt$V1[!is.na(airline_dt$V1)], each = 2)

colnames(airline_dt)[1:2] <- c("Airline", "Status")

knitr::kable(airline_dt)
Airline Status Los Angeles Phoenix San Diego San Francisco Seatle
ALASKA ON TIME 497 221 212 503 1841
ALASKA DELAYED 62 12 20 102 305
AM WEST ON TIME 694 4840 383 320 201
AM WEST DELAYED 117 415 65 129 61

Melting and Restructuring Data

Under best practices for a tidy data set, i.e. Hadley Wickam’s three point rule:

    1. Each variable froms a column.
    1. Each observation forms a row.
    1. Each type of observation unit forms a table.

the above data do not conform to the rules. The issues with our airline_dt data mainly vioates rules 1 and 2. Specificaly, rule 1 is violated by the airlines’ status of ON TIME or DELAYED being split over two rows for each respective airline and rule 2 is violated by the multiple departing airports being aggregated under one observation when they should be represented under unique observations. tidyr supplies utilitiies to “melt” the data structure and reform it from its “molten” state. First we will melt the respective airports and their accompanied count’s to be underneath two columns:

airline_dt <- airline_dt %>% gather(Airport, Count, -c(Airline, Status))
airline_dt <- airline_dt %>% arrange(Airline, Airport)

knitr::kable(airline_dt)
Airline Status Airport Count
ALASKA ON TIME Los Angeles 497
ALASKA DELAYED Los Angeles 62
ALASKA ON TIME Phoenix 221
ALASKA DELAYED Phoenix 12
ALASKA ON TIME San Diego 212
ALASKA DELAYED San Diego 20
ALASKA ON TIME San Francisco 503
ALASKA DELAYED San Francisco 102
ALASKA ON TIME Seatle 1841
ALASKA DELAYED Seatle 305
AM WEST ON TIME Los Angeles 694
AM WEST DELAYED Los Angeles 117
AM WEST ON TIME Phoenix 4840
AM WEST DELAYED Phoenix 415
AM WEST ON TIME San Diego 383
AM WEST DELAYED San Diego 65
AM WEST ON TIME San Francisco 320
AM WEST DELAYED San Francisco 129
AM WEST ON TIME Seatle 201
AM WEST DELAYED Seatle 61

In order to conform to rule 2 above, on time and delayed statuses of the airlines should be on under one single observation, being that an airlines performance at a given aiport should be a singular observation.

airline_dt <- airline_dt %>% spread(Status, Count)
colnames(airline_dt)[3:4] <- c("Delayed", "OnTime")

knitr::kable(airline_dt)
Airline Airport Delayed OnTime
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seatle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seatle 61 201

Relative Performance of Airlines

In order make a comparison between the respective airlines, we will compare their delay rates for all flights:

airline_overall <- summarise(group_by(airline_dt, Airline), rate = sum(Delayed)/sum(Delayed + 
    OnTime))

ggplot(airline_overall, aes(Airline, rate)) + geom_bar(aes(fill = Airline), 
    position = "dodge", stat = "identity")

From the plot above, ALASKA has an overall higher delay rate than AM WEST. We should take a look at the respective delays per airport to draw any conclusion.

airline_dt <- airline_dt %>% mutate(DelayRate = round(Delayed/(Delayed + OnTime), 
    2))

ggplot(airline_dt, aes(Airport, DelayRate)) + geom_bar(aes(fill = Airline), 
    position = "dodge", stat = "identity")

Comparing the two plots seems unusual. ALASKA has the largest aggregate delay rate relative AM WEST, but AM WEST has the largest delay rate throughout airports. The reason behind this could be the skewed effect due to specific traffic towards an airport. Let’s look at the number of flights for each respective airport and the airlines’ performance.

airport_delay <- airline_dt %>% gather(status, depart, Delayed, OnTime)

ggplot(airport_delay, aes(x = Airline, y = depart, fill = status)) + geom_bar(position = "stack", 
    stat = "identity") + facet_grid(~Airport)

From above, it is obvious that that the major hub of Phoneix for AM WEST skewed the reults against ALASKA for the overall delay rate.

Liam M. Byrne

2016-10-02