In this project, the task is to transform a table with the number of flights per airline and destination from a “wide” format to a “long” tidy data format. Specifically, the packages dplyr and tidyr are used for this purpose.
The tidy table is then used to analyze arrival delays per airline.

1.Generate data and store CSV

First we generate the assignment CSV file

# Load the required packages
library(dplyr)
library(readr)
library(tidyr)
library(knitr)
# Generate and store the CSV file

dat = rbind(c("","","Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
                c("ALASKA", "on time", "497","221","212","503","1841"),
                c("", "delayed", "62","12","20","102","305"),
                c("","","","","","",""),
                c("AM WEST", "on time", "694", "4840","383","320","201"),
                c("", "delayed", "117","415","65","129","61"))
dat = data.frame(dat)

The initial untidy data frame looks like this

dat
##        X1      X2          X3      X4        X5            X6      X7
## 1                 Los Angeles Phoenix San Diego San Francisco Seattle
## 2  ALASKA on time         497     221       212           503    1841
## 3         delayed          62      12        20           102     305
## 4                                                                    
## 5 AM WEST on time         694    4840       383           320     201
## 6         delayed         117     415        65           129      61

This data does not conform to the tidy data format, or Codd’s 3rd normal form 1

# Store the file
write_csv(dat, "path_to_directory")
rm(dat)

2.Read the CSV

Simply read the CSV file created in the first step from GitHub, and then process the header to add correct column names.

# Read the CSV

dat = read_csv("https://raw.githubusercontent.com/datafeelings/data607/master/Week_5/flights_wide.csv",trim_ws = T)

# Fix header by taking the values from the first line, then drop the header row
names(dat) = unlist(append(c("airline", "status"),dat[1,3:7]))

dat = dat[-1,]

# Remove completely empty rows only

dat = dat[rowSums(is.na(dat)) != ncol(dat),] # Source for this solution is provided at the bottom [2]

The resulting data frame looks like this at this step:

kable(dat)
airline status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

We see that the values of the possible variable “airport” form multiple columns.

3.Transpose into tidy format

The goal is to have a table in which 1

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

First, we need to fill the first column with the last non NA value. As the assignment asks to use dplyr and tidyr I’m using the following hack. Actually there is a handy function na.locf() from the zoo package that fills the last observed non-NA value in a column.

# Fill the NA values with preceding non NA values
dat = dat %>% 
  mutate(airline = ifelse(is.na(airline)==T & is.na(lag(airline))==F,lag(airline),airline)) 

Now we just gather() observations into tidy format and fix the count variable

dat = dat %>% 
  gather("airport", "flights", 3:7) %>% 
  mutate(flights = as.integer(flights))

The resulting table is displayed below. We see that now each of the variables is in its own column, which makes the analysis of data much easier.

kable(dat)
airline status airport flights
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San Diego 212
ALASKA delayed San Diego 20
AM WEST on time San Diego 383
AM WEST delayed San Diego 65
ALASKA on time San Francisco 503
ALASKA delayed San Francisco 102
AM WEST on time San Francisco 320
AM WEST delayed San Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 61

4.Analyze the delays per airline

In total, Am West made more flights than Alaska, but the bulk of the difference came from the very large number of flights to Phoenix by Am West (the airline had its hub there3). However, Alaska made many more flights to Seattle than Am West.

# Make a plot of flights per airport
library(plotly)
plot_ly(data = dat, x = airport, y = flights, group = airline, type = "bar") %>% 
  layout(title = "Total flights per airline")

In the chart below, the vertical axis represents the percentage of delayed flights, and the size of the squares shows the number of flights per destination made by each airline.

# Make a plot of share of delays

dat = dat %>% 
  group_by(airline,airport) %>% 
  mutate(total_flights = sum(flights)) %>% 
  mutate(share = round(flights / total_flights,2)*100)

plot_ly(data = filter(dat, status == "delayed"), x = airport, y = share, 
        group = airline, 
        text = paste("Total flights:", total_flights, "<br>","Delayed:", share,"%" ),
        mode = "markers+lines", 
        marker=list(size = total_flights, sizeref = 90, symbol = "square" ) ) %>% 
  layout(title = "Share (%) of delayed flights per airline and destination")

We see that Am West had a higher percentage of delayed flights than Alaska on each of the destinations, even on the ones like Seattle, where it had considerably less flights.

Surprisingly, it seems that the airport itself and not the number of flights per airline has something to do with the share of delayed flights. For example, the Phoenix airport had the least delayed flights for both airlines (even though Am West landed 23 times more planes there than Alaska), and the San Francisco airport had the most delays from both airlines.