Flights Data Tidying

We are presented with a small flights dataset in a wide format and tasked with wrangling it into a tidy format.

Data Intake

The first step is clearly to load the data into R and get a feel for the issues the data has.

url <- 'https://raw.githubusercontent.com/sbellows1/607/master/Week5/flights%20sample.csv?token=ALKCMBPVKUXMZX4S3PIRRUS6LB5Y4'
filename <- 'flights sample.csv'
if (!file.exists(filename)){
  download.file(url, filename)
}
df <- read.csv('flights sample.csv', na.strings = '')

head(df, 10)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2    <NA> delayed          62      12        20           102     305
## 3    <NA>    <NA>          NA    <NA>        NA            NA    <NA>
## 4 AM WEST on time         694   4,840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

We see that one variable is being spread across multiple columns (Airport Location) and as such we must gather that into a single column to abide by tidy data principles. Before doing this however, we should remove the 3rd row that is simply whitespace and provides no information

df <- df[-3,]

head(df)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2    <NA> delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

Now with the useless row removed we can begin to tidy the data

Changing Column Types

The clear first step necessary is to gather the location columns into a single variable. We will accomplish this with the pivot_longer function of the tidyr library. My first implementation of this failed because the columns are of different types as we can see from the tibble. Phoenix and Seattle are factor variables whereas the other Airports are integer variables. The reason for this is the commas in the large numbers for these airports, so we must remove these and change their types before proceeding.

library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

##This is a scalable implementation that will iterate over all columns and change to integers if we were 
##to receive more data

for (i in 3:ncol(df)){
  if (class(df[,i]) == 'factor'){
    df[,i] <- as.integer(gsub(',','',df[,i]))
  }
}

head(df)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

Inputting Blank Cells

Again at this point, I went to gather the location columns but realized that another issue is that the first column has blank cells that need to be filled in.

df <- df %>% fill(X, .direction = 'down')

head(df)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Gathering Airport Variable

Now that we have fixed all our other issues, we can gather the airport variable using pivot_longer. This will combine the 5 airports into one variable.

df <- df %>% pivot_longer(c('Los.Angeles', 'Phoenix', 'San.Diego', 'San.Francisco', 'Seattle'), names_to = 'airport', values_to = 'count')

head(df, 20)

## # A tibble: 20 x 4
##    X       X.1     airport       count
##    <fct>   <fct>   <chr>         <int>
##  1 ALASKA  on time Los.Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San.Diego       212
##  4 ALASKA  on time San.Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los.Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San.Diego        20
##  9 ALASKA  delayed San.Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los.Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San.Diego       383
## 14 AM WEST on time San.Francisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los.Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San.Diego        65
## 19 AM WEST delayed San.Francisco   129
## 20 AM WEST delayed Seattle          61

Clean column names and row information

Last but not least, we must provide column names to the two columns missing names and fix up the airport variable slightly.

df <- df %>% rename('airline' = 'X', 'delayed' = 'X.1')
df$airport <- gsub('\\.', ' ', df$airport)

head(df)

## # A tibble: 6 x 4
##   airline delayed airport       count
##   <fct>   <fct>   <chr>         <int>
## 1 ALASKA  on time Los Angeles     497
## 2 ALASKA  on time Phoenix         221
## 3 ALASKA  on time San Diego       212
## 4 ALASKA  on time San Francisco   503
## 5 ALASKA  on time Seattle        1841
## 6 ALASKA  delayed Los Angeles      62

Analysis

Now with a tidy dataset, we can begin analysis. First I would like to create a summary of the proportion of delayed flights for each airline and airport.

airline_delayed <- df %>% filter(delayed == 'delayed') %>% group_by(airline) %>% summarize(n = sum(count))
airline_total <- df %>% group_by(airline) %>% summarize(n = sum(count))
airport_delayed <- df %>% filter(delayed == 'delayed') %>% group_by(airport) %>% summarize(n = sum(count))
airport_total <- df %>% group_by(airport) %>% summarize(n = sum(count))

airline <- airline_delayed %>% inner_join(airline_total, by = 'airline')
airport <- airport_delayed %>% inner_join(airport_total, by = 'airport')

airline <- airline %>% mutate(delayed_prop = n.x/n.y)
airport <- airport %>% mutate(delayed_prop = n.x/n.y)

print(airline)

## # A tibble: 2 x 4
##   airline   n.x   n.y delayed_prop
##   <fct>   <int> <int>        <dbl>
## 1 ALASKA    501  3775        0.133
## 2 AM WEST   787  7225        0.109

print(airport)

## # A tibble: 5 x 4
##   airport         n.x   n.y delayed_prop
##   <chr>         <int> <int>        <dbl>
## 1 Los Angeles     179  1370       0.131 
## 2 Phoenix         427  5488       0.0778
## 3 San Diego        85   680       0.125 
## 4 San Francisco   231  1054       0.219 
## 5 Seattle         366  2408       0.152

Here we can see that on the whole, More flights from Alaska airlines are delayed than flights from American Airlines. We also see that San Francisco has the highest proportion of delayed flights and Phoenix the lowest proportion.

airline %>% ggplot(aes(airline, delayed_prop)) + geom_bar(stat = 'identity', fill = 'palegreen1') + xlab('Airline') + ylab('Proportion of Flights Delayed')

airport %>% ggplot(aes(reorder(airport, delayed_prop), delayed_prop)) + 
  geom_bar(stat = 'identity', fill = 'palegreen1') + xlab('Airport') + ylab('Proportion of Flights Delayed')

An interesting question came to mind after seeing these visualizations. Are the different airlines represented a higher or lower amount at each airport? That could potentially account for some of the differences between airline and airports. To answer this I found the number of flights from each airline at each airport.

counts <- df %>% group_by(airline, airport) %>% summarize(n = sum(count))

counts %>% ggplot(aes(reorder(airport, n), y = n, fill = airline)) + geom_bar(stat = 'identity', position = 'dodge') + xlab('Airport') + ylab('Number of Flights')

Interesting! We see that American Airlines flies the vast majority of the flights out of Phoenix, the airport with by far the least delays. Alaska Airlines flies more flights out of San Francisco and Seattle, airports that tend to have high rates of delay.

However from our analysis we cannot tell whether it is the airlines causing airports to have higher average delay proportions or the airports causing airlines to have higher delay proportions. In order to discover this, we would need to to acquire data that has causal standing.