Intro

This week an exercise in tidying data. We’re provided a messy csv table. The goal is to transform it into an analyzable table where each row is a single observation and each column a single variable.


1. Data Import and Setup

In this step, we load the necessary R packages and import the data with read.csv(). With the structure command, we can see that the flight counts were properly imported as integers and that airline and status correctly are factors. Table 1 shows the initial table.

# load libraries
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(ggthemes)
library(scales)
library(pander)

# import the data
adat <- read.csv("Data/airlines.csv", header=T)
str(adat)
## 'data.frame':    5 obs. of  7 variables:
##  $ X            : Factor w/ 3 levels "","ALASKA","AM WEST": 2 1 1 3 1
##  $ X.1          : Factor w/ 3 levels "","delayed","on time": 3 2 1 3 2
##  $ Los.Angeles  : int  497 62 NA 694 117
##  $ Phoenix      : int  221 12 NA 4848 415
##  $ San.Diego    : int  212 20 NA 383 65
##  $ San.Francisco: int  503 102 NA 320 129
##  $ Seattle      : int  1841 305 NA 201 61
Table 1: Original airline data
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA NA NA NA NA
AM WEST on time 694 4848 383 320 201
delayed 117 415 65 129 61

2. First steps

The table contains empty rows and cells. Rows don’t represent single observations. The variables airline, deparature status and city are spread across multiple columns. Next steps involve removing empty cells and rows, renaming variables, and gathering data using tidyr().

2.1 Clean up empty cells, fix variable names

Delete the empty row and populate the airline name in the rows indicating “delayed” status.

adat <- adat[-3,]
adat[2,1] <- "ALASKA"
adat[4,1] <- "AM WEST"
colnames(adat)[c(1,2)] <- c("Airline", "Status")
rownames(adat) <- 1:4

pander(adat, caption = "Table 2: Blank/NA cells cleaned up")
Table 2: Blank/NA cells cleaned up
Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4848 383 320 201
AM WEST delayed 117 415 65 129 61

2.2 Gather wide data

City should be a variable, and the status counts should be the value for each observation. This is a job for tidyr’s gather() function.

adat <- adat %>%
          gather(City, Flights, 3:7)

# clean up the city names
adat$City <- str_replace(adat$City, "\\.", " ")

pander(head(adat, 10), caption = "Table 3: Tidy data, 10 rows")
Table 3: Tidy data, 10 rows
Airline Status City Flights
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4848
AM WEST delayed Phoenix 415
ALASKA on time San Diego 212
ALASKA delayed San Diego 20

3. One last step: Spread() the data

The data as of the last procedure is better; it is tidy, but it can be even tidier. We can see that there is redundancy in the Status column, where we repeat the “on time” and “delayed” values. The spread() function can get us further – to 10 rows and 4 variables instead of 20 rows and 4 variables.

adat2 <- spread(adat, Status, Flights)

pander(adat2, caption = "Table 3: Tidiest Data Set")
Table 3: Tidiest Data Set
Airline City delayed on time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4848
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 61 201

4. Analysis with dplyr()

The data is now in a suitable format: one observation in each row, one variable per column. dplyr’s pipe operator %>% allows flexibility in organizing the data and computing a variety of statistical measures that can be plotted. For this plot, we use dplyr to group by city and compare a count of delayed and on-time flights.

by_city <- adat %>%
             group_by(City, Status) %>%
             summarise(Flights=sum(Flights))

ggplot(by_city, aes(x=City, y=Flights, fill=Status)) +
        geom_bar(stat="identity", position=position_dodge()) +
        xlab(" ") + scale_y_continuous(labels=scales::comma) +
        theme_tufte()
Flight Status by City

Flight Status by City

4.1 Create new statistic with summarise()

Let’s compute each airline’s ratio of delayed flights and plot it.

adat2 <- tbl_df(adat2)
colnames(adat2)[c(3,4)] <- c("Delayed", "Timely")

delayrate <- ungroup(adat2) %>% 
                group_by(Airline, City) %>% 
                summarise(Delay.Rate = sum(Delayed)/sum(Timely))

ggplot(delayrate, aes(x=City, y=Delay.Rate, fill=Airline)) +
        geom_bar(stat="identity", position=position_dodge()) +
        xlab(" ") + scale_y_continuous(labels=scales::percent) +
        ylab("Delay Rate") +
        theme_tufte()
Percent of flight delays

Percent of flight delays


Coda

There are many more functions in the dplyr-tidyr universe. mutate() makes for easy creation of new variables, similar to the Delay Rate example above. filter() allows applicaiton of logical functions to rows. A range of window functions can be applied to each item in a column, similar base R’s apply family of functions. Following are some examples.

Make a new variable

adat2 <- mutate(adat2, Total = Delayed + Timely)

pander(head(adat2), caption = "New variable added")
New variable added
Airline City Delayed Timely Total
ALASKA Los Angeles 62 497 559
ALASKA Phoenix 12 221 233
ALASKA San Diego 20 212 232
ALASKA San Francisco 102 503 605
ALASKA Seattle 305 1841 2146
AM WEST Los Angeles 117 694 811

Filter and select results

big_flights <- ungroup(adat2) %>% 
                  filter(Airline == 'AM WEST', Total > 500) %>% 
                  select(Airline, City, Total) %>% 
                  arrange(desc(Total))

pander(big_flights, caption = "AM West cities with 500+ flights") 
AM West cities with 500+ flights
Airline City Total
AM WEST Phoenix 5263
AM WEST Los Angeles 811

Apply a window function

adat2 <- mutate(adat2, Rank = dense_rank(Total), PctRank = percent_rank(Total))
  
pander(ungroup(adat2) %>% 
        select(-PctRank) %>% 
        arrange(desc(Rank)), caption = "dplyr rank function")
dplyr rank function
Airline City Delayed Timely Total Rank
AM WEST Phoenix 415 4848 5263 10
ALASKA Seattle 305 1841 2146 9
AM WEST Los Angeles 117 694 811 8
ALASKA San Francisco 102 503 605 7
ALASKA Los Angeles 62 497 559 6
AM WEST San Francisco 129 320 449 5
AM WEST San Diego 65 383 448 4
AM WEST Seattle 61 201 262 3
ALASKA Phoenix 12 221 233 2
ALASKA San Diego 20 212 232 1