Tidying Airline Data
Introduction
Overview
The raw airline data below describes the on time and delay departure rates of two airlines, Alaksa and America West, at four airports.
| V1 | V2 | Los Angeles | Phoenix | San Diego | San Francisco | Seatle |
|---|---|---|---|---|---|---|
| ALASKA | ON TIME | 497 | 221 | 212 | 503 | 1841 |
| NA | DELAYED | 62 | 12 | 20 | 102 | 305 |
| AM WEST | ON TIME | 694 | 4840 | 383 | 320 | 201 |
| NA | DELAYED | 117 | 415 | 65 | 129 | 61 |
This is an acceptable format for human inspection, but the wide layout and untidy format is not firendly for computational analysis of the airlines’ performance. Namely, mutiple observation of airline performance at all airports are on one row and the statuses, i.e., ON TIME and DELAYED at specific airports are described across two lines. The purpose of this assignment is to restrucutre this data in to a “tidy” format in order to carry out basic analyses of performance.
Relevence
A lot of publicly available data is formatted for human consumption. Wide formats are sometimes easier for the human eye to generalize informaiton, but it is not necessarily the case for computational anlalysis. The ability to transform such data from “wide”" to “long”“, or vice versa, is a necessary skill in analyzing the wide array of data formats available for analyses.
Data Reference and Libraries
The raw airline data is available from the this GitHub repo. Further reading of the underlying problems and solutions for cleaning data is available from Hadley Wickham’s paper, Tidy Data. In addition,
Outside of formatting packages, the libraries used for this project are:
data.table: Light-weight, fast development for data structure; not neccesarily needed for this data set, but recomended under best practices.
dtplyr: Data table back-end ofdplyrfor data manipulation.tidyr: To produce tidy data from thedplyrpipeline.ggplot2: Provides some advanced plotting utilities.
Creating Tidy Data
Fill Data
Prior to tidying our data, it would be easier to start off with data with proper column headings and values filled for the NULL observations:
airline_dt$V1 <- rep(airline_dt$V1[!is.na(airline_dt$V1)], each = 2)
colnames(airline_dt)[1:2] <- c("Airline", "Status")
knitr::kable(airline_dt)| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seatle |
|---|---|---|---|---|---|---|
| ALASKA | ON TIME | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | DELAYED | 62 | 12 | 20 | 102 | 305 |
| AM WEST | ON TIME | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | DELAYED | 117 | 415 | 65 | 129 | 61 |
Melting and Restructuring Data
Under best practices for a tidy data set, i.e. Hadley Wickam’s three point rule:
- Each variable froms a column.
- Each observation forms a row.
- Each type of observation unit forms a table.
the above data do not conform to the rules. The issues with our airline_dt data mainly vioates rules 1 and 2. Specificaly, rule 1 is violated by the airlines’ status of ON TIME or DELAYED being split over two rows for each respective airline and rule 2 is violated by the multiple departing airports being aggregated under one observation when they should be represented under unique observations. tidyr supplies utilitiies to “melt” the data structure and reform it from its “molten” state. First we will melt the respective airports and their accompanied count’s to be underneath two columns:
airline_dt <- airline_dt %>% gather(Airport, Count, -c(Airline, Status))
airline_dt <- airline_dt %>% arrange(Airline, Airport)
knitr::kable(airline_dt)| Airline | Status | Airport | Count |
|---|---|---|---|
| ALASKA | ON TIME | Los Angeles | 497 |
| ALASKA | DELAYED | Los Angeles | 62 |
| ALASKA | ON TIME | Phoenix | 221 |
| ALASKA | DELAYED | Phoenix | 12 |
| ALASKA | ON TIME | San Diego | 212 |
| ALASKA | DELAYED | San Diego | 20 |
| ALASKA | ON TIME | San Francisco | 503 |
| ALASKA | DELAYED | San Francisco | 102 |
| ALASKA | ON TIME | Seatle | 1841 |
| ALASKA | DELAYED | Seatle | 305 |
| AM WEST | ON TIME | Los Angeles | 694 |
| AM WEST | DELAYED | Los Angeles | 117 |
| AM WEST | ON TIME | Phoenix | 4840 |
| AM WEST | DELAYED | Phoenix | 415 |
| AM WEST | ON TIME | San Diego | 383 |
| AM WEST | DELAYED | San Diego | 65 |
| AM WEST | ON TIME | San Francisco | 320 |
| AM WEST | DELAYED | San Francisco | 129 |
| AM WEST | ON TIME | Seatle | 201 |
| AM WEST | DELAYED | Seatle | 61 |
In order to conform to rule 2 above, on time and delayed statuses of the airlines should be on under one single observation, being that an airlines performance at a given aiport should be a singular observation.
airline_dt <- airline_dt %>% spread(Status, Count)
colnames(airline_dt)[3:4] <- c("Delayed", "OnTime")
knitr::kable(airline_dt)| Airline | Airport | Delayed | OnTime |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San Diego | 20 | 212 |
| ALASKA | San Francisco | 102 | 503 |
| ALASKA | Seatle | 305 | 1841 |
| AM WEST | Los Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San Diego | 65 | 383 |
| AM WEST | San Francisco | 129 | 320 |
| AM WEST | Seatle | 61 | 201 |
Relative Performance of Airlines
In order make a comparison between the respective airlines, we will compare their delay rates for all flights:
airline_overall <- summarise(group_by(airline_dt, Airline), rate = sum(Delayed)/sum(Delayed +
OnTime))
ggplot(airline_overall, aes(Airline, rate)) + geom_bar(aes(fill = Airline),
position = "dodge", stat = "identity")From the plot above, ALASKA has an overall higher delay rate than AM WEST. We should take a look at the respective delays per airport to draw any conclusion.
airline_dt <- airline_dt %>% mutate(DelayRate = round(Delayed/(Delayed + OnTime),
2))
ggplot(airline_dt, aes(Airport, DelayRate)) + geom_bar(aes(fill = Airline),
position = "dodge", stat = "identity")Comparing the two plots seems unusual. ALASKA has the largest aggregate delay rate relative AM WEST, but AM WEST has the largest delay rate throughout airports. The reason behind this could be the skewed effect due to specific traffic towards an airport. Let’s look at the number of flights for each respective airport and the airlines’ performance.
airport_delay <- airline_dt %>% gather(status, depart, Delayed, OnTime)
ggplot(airport_delay, aes(x = Airline, y = depart, fill = status)) + geom_bar(position = "stack",
stat = "identity") + facet_grid(~Airport)From above, it is obvious that that the major hub of Phoneix for AM WEST skewed the reults against ALASKA for the overall delay rate.