Tidying Airline Data

Introduction
Creating Tidy Data
- Fill Data
- Melting and Restructuring Data
Relative Performance of Airlines

Introduction

Overview

The raw airline data below describes the on time and delay departure rates of two airlines, Alaksa and America West, at four airports.

V1	V2	Los Angeles	Phoenix	San Diego	San Francisco	Seatle
ALASKA	ON TIME	497	221	212	503	1841
NA	DELAYED	62	12	20	102	305
AM WEST	ON TIME	694	4840	383	320	201
NA	DELAYED	117	415	65	129	61

This is an acceptable format for human inspection, but the wide layout and untidy format is not firendly for computational analysis of the airlines’ performance. Namely, mutiple observation of airline performance at all airports are on one row and the statuses, i.e., ON TIME and DELAYED at specific airports are described across two lines. The purpose of this assignment is to restrucutre this data in to a “tidy” format in order to carry out basic analyses of performance.

Relevence

A lot of publicly available data is formatted for human consumption. Wide formats are sometimes easier for the human eye to generalize informaiton, but it is not necessarily the case for computational anlalysis. The ability to transform such data from “wide”" to “long”“, or vice versa, is a necessary skill in analyzing the wide array of data formats available for analyses.

Data Reference and Libraries

The raw airline data is available from the this GitHub repo. Further reading of the underlying problems and solutions for cleaning data is available from Hadley Wickham’s paper, Tidy Data. In addition,

Outside of formatting packages, the libraries used for this project are:

data.table: Light-weight, fast development for data structure; not neccesarily needed for this data set, but recomended under best practices.
dtplyr: Data table back-end of dplyr for data manipulation.
tidyr: To produce tidy data from the dplyr pipeline.
ggplot2: Provides some advanced plotting utilities.

Creating Tidy Data

Fill Data

Prior to tidying our data, it would be easier to start off with data with proper column headings and values filled for the NULL observations:

airline_dt$V1 <- rep(airline_dt$V1[!is.na(airline_dt$V1)], each = 2)

colnames(airline_dt)[1:2] <- c("Airline", "Status")

knitr::kable(airline_dt)

Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seatle
ALASKA	ON TIME	497	221	212	503	1841
ALASKA	DELAYED	62	12	20	102	305
AM WEST	ON TIME	694	4840	383	320	201
AM WEST	DELAYED	117	415	65	129	61

Melting and Restructuring Data

Under best practices for a tidy data set, i.e. Hadley Wickam’s three point rule:

1. Each variable froms a column.
1. Each observation forms a row.
1. Each type of observation unit forms a table.

the above data do not conform to the rules. The issues with our airline_dt data mainly vioates rules 1 and 2. Specificaly, rule 1 is violated by the airlines’ status of ON TIME or DELAYED being split over two rows for each respective airline and rule 2 is violated by the multiple departing airports being aggregated under one observation when they should be represented under unique observations. tidyr supplies utilitiies to “melt” the data structure and reform it from its “molten” state. First we will melt the respective airports and their accompanied count’s to be underneath two columns:

airline_dt <- airline_dt %>% gather(Airport, Count, -c(Airline, Status))
airline_dt <- airline_dt %>% arrange(Airline, Airport)

knitr::kable(airline_dt)

Airline	Status	Airport	Count
ALASKA	ON TIME	Los Angeles	497
ALASKA	DELAYED	Los Angeles	62
ALASKA	ON TIME	Phoenix	221
ALASKA	DELAYED	Phoenix	12
ALASKA	ON TIME	San Diego	212
ALASKA	DELAYED	San Diego	20
ALASKA	ON TIME	San Francisco	503
ALASKA	DELAYED	San Francisco	102
ALASKA	ON TIME	Seatle	1841
ALASKA	DELAYED	Seatle	305
AM WEST	ON TIME	Los Angeles	694
AM WEST	DELAYED	Los Angeles	117
AM WEST	ON TIME	Phoenix	4840
AM WEST	DELAYED	Phoenix	415
AM WEST	ON TIME	San Diego	383
AM WEST	DELAYED	San Diego	65
AM WEST	ON TIME	San Francisco	320
AM WEST	DELAYED	San Francisco	129
AM WEST	ON TIME	Seatle	201
AM WEST	DELAYED	Seatle	61

In order to conform to rule 2 above, on time and delayed statuses of the airlines should be on under one single observation, being that an airlines performance at a given aiport should be a singular observation.

airline_dt <- airline_dt %>% spread(Status, Count)
colnames(airline_dt)[3:4] <- c("Delayed", "OnTime")

knitr::kable(airline_dt)

Airline	Airport	Delayed	OnTime
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seatle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seatle	61	201

Relative Performance of Airlines

In order make a comparison between the respective airlines, we will compare their delay rates for all flights:

airline_overall <- summarise(group_by(airline_dt, Airline), rate = sum(Delayed)/sum(Delayed + 
    OnTime))

ggplot(airline_overall, aes(Airline, rate)) + geom_bar(aes(fill = Airline), 
    position = "dodge", stat = "identity")

From the plot above, ALASKA has an overall higher delay rate than AM WEST. We should take a look at the respective delays per airport to draw any conclusion.

airline_dt <- airline_dt %>% mutate(DelayRate = round(Delayed/(Delayed + OnTime), 
    2))

ggplot(airline_dt, aes(Airport, DelayRate)) + geom_bar(aes(fill = Airline), 
    position = "dodge", stat = "identity")

Comparing the two plots seems unusual. ALASKA has the largest aggregate delay rate relative AM WEST, but AM WEST has the largest delay rate throughout airports. The reason behind this could be the skewed effect due to specific traffic towards an airport. Let’s look at the number of flights for each respective airport and the airlines’ performance.

airport_delay <- airline_dt %>% gather(status, depart, Delayed, OnTime)

ggplot(airport_delay, aes(x = Airline, y = depart, fill = status)) + geom_bar(position = "stack", 
    stat = "identity") + facet_grid(~Airport)

From above, it is obvious that that the major hub of Phoneix for AM WEST skewed the reults against ALASKA for the overall delay rate.