Comparing airlines using a tidied flight delay data set

Data 607 assignment

Heather Geiger ; February 28, 2018

Preface - Comparing different ways to tidy and transform data

Inspired by this blog post (https://www.r-bloggers.com/how-to-reshape-data-in-r-tidyr-vs-reshape2/), where appropriate I will show how to perform the same operation using two different functions, either reshape2 or base R versus the tidyverse tidyr/dplyr.

My main experience is using base R, with a bit of reshape2 for simple operations like switching between wide and long. Here I typically first looked into how to perform the function using my existing knowledge, before also searching for the equivalent in the tidyverse.

Load libraries.

Load all libraries we will need.

library(stringr)
library(reshape2)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(scales)

Initial processing

Start by reading in the CSV file.

This CSV file was created by putting the information into Excel exactly as it looked in the description, then converting this sheet to CSV.

airlines <- read.csv("https://raw.githubusercontent.com/heathergeiger/Tidying_airlines_data/master/alaska_vs_am_west_airlines.csv",check.names=FALSE,stringsAsFactors=FALSE)
airlines

##                   Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 3                          NA                NA            NA        
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

Remove empty lines and make column headers more descriptive.

airlines <- airlines[which(airlines[,2] == "on time" | airlines[,2] == "delayed"),]
colnames(airlines)[1:2] <- c("Airline","Status")
airlines

##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

Right now some columns are going to be character that should be numeric, because numbers greater than 1,000 are written with a comma.

Remove the commas using str_replace from the stringr package.

Then apply as.numeric function to convert all the city columns to numeric.

for(i in 3:ncol(airlines))
{
airlines[,i] <- str_replace(airlines[,i],pattern=",",replacement="")
airlines[,i] <- as.numeric(airlines[,i])
}

airlines

##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Check the sum of one of the columns that formerly had a comma in number to be sure this worked.

sum(airlines[,4])

## [1] 5488

221 + 12 + 4840 + 415

## [1] 5488

Looks like removing the commas, then applying as.numeric, worked well.

Tidying data

Before we can convert to long format, we need to repeat the value in Airline column that is there for on time, immediately after for delayed.

airlines$Airline <- rep(airlines$Airline[which(airlines$Airline != "")],each=2)
airlines

##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

Now ready to convert to long format.

One way to convert from wide to long format is using function melt from the reshape2 package.

Another is using the gather function.

melt_result <- melt(airlines,variable.name="City",value.name="Flights",id.vars=c("Airline","Status"))
melt_result

##    Airline  Status          City Flights
## 1   Alaska on time   Los Angeles     497
## 2   Alaska delayed   Los Angeles      62
## 3  AM WEST on time   Los Angeles     694
## 4  AM WEST delayed   Los Angeles     117
## 5   Alaska on time       Phoenix     221
## 6   Alaska delayed       Phoenix      12
## 7  AM WEST on time       Phoenix    4840
## 8  AM WEST delayed       Phoenix     415
## 9   Alaska on time     San Diego     212
## 10  Alaska delayed     San Diego      20
## 11 AM WEST on time     San Diego     383
## 12 AM WEST delayed     San Diego      65
## 13  Alaska on time San Francisco     503
## 14  Alaska delayed San Francisco     102
## 15 AM WEST on time San Francisco     320
## 16 AM WEST delayed San Francisco     129
## 17  Alaska on time       Seattle    1841
## 18  Alaska delayed       Seattle     305
## 19 AM WEST on time       Seattle     201
## 20 AM WEST delayed       Seattle      61

gather_result <- gather(airlines,City,Flights,-Airline,-Status)
gather_result

##    Airline  Status          City Flights
## 1   Alaska on time   Los Angeles     497
## 2   Alaska delayed   Los Angeles      62
## 3  AM WEST on time   Los Angeles     694
## 4  AM WEST delayed   Los Angeles     117
## 5   Alaska on time       Phoenix     221
## 6   Alaska delayed       Phoenix      12
## 7  AM WEST on time       Phoenix    4840
## 8  AM WEST delayed       Phoenix     415
## 9   Alaska on time     San Diego     212
## 10  Alaska delayed     San Diego      20
## 11 AM WEST on time     San Diego     383
## 12 AM WEST delayed     San Diego      65
## 13  Alaska on time San Francisco     503
## 14  Alaska delayed San Francisco     102
## 15 AM WEST on time San Francisco     320
## 16 AM WEST delayed San Francisco     129
## 17  Alaska on time       Seattle    1841
## 18  Alaska delayed       Seattle     305
## 19 AM WEST on time       Seattle     201
## 20 AM WEST delayed       Seattle      61

These are identical. Since this assignment is supposed to be based on tidyverse, let’s use the gather result.

airlines <- gather_result

This is mostly what we want, except we really want two columns for on time vs. delayed for each instance of airline + city.

For this, we could use either the spread function from tidyr or the dcast function from reshape2.

One we show that they make the same result, reassign airlines to the result of spread.

dcast_result <- dcast(airlines,formula = Airline + City ~ Status,value.var="Flights")
dcast_result

##    Airline          City delayed on time
## 1   Alaska   Los Angeles      62     497
## 2   Alaska       Phoenix      12     221
## 3   Alaska     San Diego      20     212
## 4   Alaska San Francisco     102     503
## 5   Alaska       Seattle     305    1841
## 6  AM WEST   Los Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San Diego      65     383
## 9  AM WEST San Francisco     129     320
## 10 AM WEST       Seattle      61     201

spread_result <- spread(airlines,Status,Flights)
spread_result

##    Airline          City delayed on time
## 1   Alaska   Los Angeles      62     497
## 2   Alaska       Phoenix      12     221
## 3   Alaska     San Diego      20     212
## 4   Alaska San Francisco     102     503
## 5   Alaska       Seattle     305    1841
## 6  AM WEST   Los Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San Diego      65     383
## 9  AM WEST San Francisco     129     320
## 10 AM WEST       Seattle      61     201

airlines <- spread_result

One last thing - change “on time” to “on.time” to be more R-friendly.

colnames(airlines)[4] <- "on.time"

Transforming data

Now our data is pretty tidy, but we need to do some transformations to get anything meaningful.

Mainly for each airline in each city, we’ll want to get the rate of delays and the total number of flights.

Add this information to airlines using either transform (base R) or mutate (dplyr).

transform_result <-  transform(airlines,total.flights = on.time + delayed,delayed.rate = (delayed*100)/(on.time + delayed))
transform_result

##    Airline          City delayed on.time total.flights delayed.rate
## 1   Alaska   Los Angeles      62     497           559    11.091234
## 2   Alaska       Phoenix      12     221           233     5.150215
## 3   Alaska     San Diego      20     212           232     8.620690
## 4   Alaska San Francisco     102     503           605    16.859504
## 5   Alaska       Seattle     305    1841          2146    14.212488
## 6  AM WEST   Los Angeles     117     694           811    14.426634
## 7  AM WEST       Phoenix     415    4840          5255     7.897241
## 8  AM WEST     San Diego      65     383           448    14.508929
## 9  AM WEST San Francisco     129     320           449    28.730512
## 10 AM WEST       Seattle      61     201           262    23.282443

mutate_result <- mutate(airlines,total.flights = on.time + delayed,delayed.rate = (delayed*100)/(on.time + delayed))
mutate_result

##    Airline          City delayed on.time total.flights delayed.rate
## 1   Alaska   Los Angeles      62     497           559    11.091234
## 2   Alaska       Phoenix      12     221           233     5.150215
## 3   Alaska     San Diego      20     212           232     8.620690
## 4   Alaska San Francisco     102     503           605    16.859504
## 5   Alaska       Seattle     305    1841          2146    14.212488
## 6  AM WEST   Los Angeles     117     694           811    14.426634
## 7  AM WEST       Phoenix     415    4840          5255     7.897241
## 8  AM WEST     San Diego      65     383           448    14.508929
## 9  AM WEST San Francisco     129     320           449    28.730512
## 10 AM WEST       Seattle      61     201           262    23.282443

In this case, the syntax and result are identical.

However, the tidyverse can also be done via piping, which can be useful when we have more complex operations to perform.

Let’s show this here. Then, reassign airlines to the result of mutate.

mutate_via_pipe_result <- airlines %>% mutate(total.flights = on.time + delayed,delayed.rate = (delayed*100)/(on.time + delayed))
mutate_via_pipe_result

##    Airline          City delayed on.time total.flights delayed.rate
## 1   Alaska   Los Angeles      62     497           559    11.091234
## 2   Alaska       Phoenix      12     221           233     5.150215
## 3   Alaska     San Diego      20     212           232     8.620690
## 4   Alaska San Francisco     102     503           605    16.859504
## 5   Alaska       Seattle     305    1841          2146    14.212488
## 6  AM WEST   Los Angeles     117     694           811    14.426634
## 7  AM WEST       Phoenix     415    4840          5255     7.897241
## 8  AM WEST     San Diego      65     383           448    14.508929
## 9  AM WEST San Francisco     129     320           449    28.730512
## 10 AM WEST       Seattle      61     201           262    23.282443

airlines <- mutate_result

Analysis

Let’s make a side-by-side barplot where we plot the delay rate by city, putting the two airlines side-by-side.

ggplot(airlines,aes(x = City,y = delayed.rate,fill=Airline)) +
geom_bar(stat='identity',position='dodge') + 
ylab("Percent of flights delayed") +
theme(axis.title.x=element_blank(),axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

Based on this, it seems pretty clear that Alaska airlines performs better, having a lower delay rate in every city.

Plotting the same in base R

Just for fun, let’s try to make a similar plot in base R.

We’ll have to convert the data back to wide format first. Then, use barplot function from base R.

delayed_rates_wide <- spread(airlines[,c("Airline","City","delayed.rate")],City,delayed.rate)
rownames(delayed_rates_wide) <- delayed_rates_wide$Airline
delayed_rates_wide <- delayed_rates_wide[,2:ncol(delayed_rates_wide)]

barplot(as.matrix(delayed_rates_wide),beside=T,las=2,ylab="Percent of flights delayed",col=c("red","blue"))
legend("topleft",legend=rownames(delayed_rates_wide),col=c("red","blue"),pch=15,bty="n")

That actually wasn’t so bad. The major issue here though, is that the column names end up not fitting on the page using base R with default settings.

I’m sure there is a way to get them to fit better by fiddling with the margins, but in this case it’s easier to just use ggplot2.

Analysis using combined data across cities

In this case, the conclusion is pretty clear when looking city-by-city.

But what if we didn’t have the city-by-city data, but only aggregate data by airline?

Use the aggregate function from base R or group_by and summarize_all from dplyr to sum delayed and on time per airline.

aggregate_result <- aggregate(.~Airline,FUN=sum,data=airlines[,c("Airline","delayed","on.time")])
aggregate_result

##   Airline delayed on.time
## 1  Alaska     501    3274
## 2 AM WEST     787    6438

summarize_result <- airlines[,c("Airline","delayed","on.time")] %>% group_by(Airline) %>% summarize_all(funs(sum))
summarize_result

## # A tibble: 2 x 3
##   Airline delayed on.time
##     <chr>   <dbl>   <dbl>
## 1  Alaska     501    3274
## 2 AM WEST     787    6438

Now, get a delayed rate similar to how we did previously.

Let’s get fancy and pipe the result of summarize right into mutate, without using object summarize_result.

airlines[,c("Airline","delayed","on.time")] %>% group_by(Airline) %>% summarize_all(funs(sum))  %>% mutate(delayed.rate = delayed*100/(delayed + on.time))

## # A tibble: 2 x 4
##   Airline delayed on.time delayed.rate
##     <chr>   <dbl>   <dbl>        <dbl>
## 1  Alaska     501    3274     13.27152
## 2 AM WEST     787    6438     10.89273

Interesting! If we do not separate by city, AM WEST will appear to be better in terms of delay rate.

Let’s make an explanatory plot to show how this might occur.

Make a stacked barplot with the percent of flights coming from each city per airline.

Combined with the plot above, which shows trends of delay rates by city, we may get a better sense of how this happened.

ggplot(airlines,aes(x = Airline,y=total.flights,fill=City)) +
geom_bar(position = "fill",stat = "identity") +
scale_y_continuous(labels = percent_format()) +
ylab("Percent of flights coming from each city") +
scale_fill_manual(values=c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2")) #Use colorblind vector as suggested here: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette

We find that AM WEST has the majority of flights coming from Phoenix. And we saw from the first chart that Phoenix tends to be a lower-delay city regardless of airline.

So if we do not separate by city, AM WEST will appear to have a lower delay rate than Alaska. Whereas if we normalize by city, we come to the correct conclusion that Alaska performs better in terms of delays.