DATA 607: Week 5, Tidy Data

Intro
1. Data Import and Setup
2. First steps
3. One last step: Spread() the data
4. Analysis with dplyr()
Coda

Intro

This week an exercise in tidying data. We’re provided a messy csv table. The goal is to transform it into an analyzable table where each row is a single observation and each column a single variable.

1. Data Import and Setup

In this step, we load the necessary R packages and import the data with read.csv(). With the structure command, we can see that the flight counts were properly imported as integers and that airline and status correctly are factors. Table 1 shows the initial table.

# load libraries
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(ggthemes)
library(scales)
library(pander)

# import the data
adat <- read.csv("Data/airlines.csv", header=T)
str(adat)

## 'data.frame':    5 obs. of  7 variables:
##  $ X            : Factor w/ 3 levels "","ALASKA","AM WEST": 2 1 1 3 1
##  $ X.1          : Factor w/ 3 levels "","delayed","on time": 3 2 1 3 2
##  $ Los.Angeles  : int  497 62 NA 694 117
##  $ Phoenix      : int  221 12 NA 4848 415
##  $ San.Diego    : int  212 20 NA 383 65
##  $ San.Francisco: int  503 102 NA 320 129
##  $ Seattle      : int  1841 305 NA 201 61

Table 1: Original airline data
X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AM WEST	on time	694	4848	383	320	201
	delayed	117	415	65	129	61

2. First steps

The table contains empty rows and cells. Rows don’t represent single observations. The variables airline, deparature status and city are spread across multiple columns. Next steps involve removing empty cells and rows, renaming variables, and gathering data using tidyr().

2.1 Clean up empty cells, fix variable names

Delete the empty row and populate the airline name in the rows indicating “delayed” status.

adat <- adat[-3,]
adat[2,1] <- "ALASKA"
adat[4,1] <- "AM WEST"
colnames(adat)[c(1,2)] <- c("Airline", "Status")
rownames(adat) <- 1:4

pander(adat, caption = "Table 2: Blank/NA cells cleaned up")

Table 2: Blank/NA cells cleaned up
Airline	Status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4848	383	320	201
AM WEST	delayed	117	415	65	129	61

2.2 Gather wide data

City should be a variable, and the status counts should be the value for each observation. This is a job for tidyr’s gather() function.

adat <- adat %>%
          gather(City, Flights, 3:7)

# clean up the city names
adat$City <- str_replace(adat$City, "\\.", " ")

pander(head(adat, 10), caption = "Table 3: Tidy data, 10 rows")

Table 3: Tidy data, 10 rows
Airline	Status	City	Flights
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
AM WEST	on time	Los Angeles	694
AM WEST	delayed	Los Angeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
AM WEST	on time	Phoenix	4848
AM WEST	delayed	Phoenix	415
ALASKA	on time	San Diego	212
ALASKA	delayed	San Diego	20

3. One last step: Spread() the data

The data as of the last procedure is better; it is tidy, but it can be even tidier. We can see that there is redundancy in the Status column, where we repeat the “on time” and “delayed” values. The spread() function can get us further – to 10 rows and 4 variables instead of 20 rows and 4 variables.

adat2 <- spread(adat, Status, Flights)

pander(adat2, caption = "Table 3: Tidiest Data Set")

Table 3: Tidiest Data Set
Airline	City	delayed	on time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4848
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seattle	61	201

4. Analysis with dplyr()

The data is now in a suitable format: one observation in each row, one variable per column. dplyr’s pipe operator %>% allows flexibility in organizing the data and computing a variety of statistical measures that can be plotted. For this plot, we use dplyr to group by city and compare a count of delayed and on-time flights.

by_city <- adat %>%
             group_by(City, Status) %>%
             summarise(Flights=sum(Flights))

ggplot(by_city, aes(x=City, y=Flights, fill=Status)) +
        geom_bar(stat="identity", position=position_dodge()) +
        xlab(" ") + scale_y_continuous(labels=scales::comma) +
        theme_tufte()

Flight Status by City

4.1 Create new statistic with summarise()

Let’s compute each airline’s ratio of delayed flights and plot it.

adat2 <- tbl_df(adat2)
colnames(adat2)[c(3,4)] <- c("Delayed", "Timely")

delayrate <- ungroup(adat2) %>% 
                group_by(Airline, City) %>% 
                summarise(Delay.Rate = sum(Delayed)/sum(Timely))

ggplot(delayrate, aes(x=City, y=Delay.Rate, fill=Airline)) +
        geom_bar(stat="identity", position=position_dodge()) +
        xlab(" ") + scale_y_continuous(labels=scales::percent) +
        ylab("Delay Rate") +
        theme_tufte()

Percent of flight delays

Coda

There are many more functions in the dplyr-tidyr universe. mutate() makes for easy creation of new variables, similar to the Delay Rate example above. filter() allows applicaiton of logical functions to rows. A range of window functions can be applied to each item in a column, similar base R’s apply family of functions. Following are some examples.

Make a new variable

adat2 <- mutate(adat2, Total = Delayed + Timely)

pander(head(adat2), caption = "New variable added")

New variable added
Airline	City	Delayed	Timely	Total
ALASKA	Los Angeles	62	497	559
ALASKA	Phoenix	12	221	233
ALASKA	San Diego	20	212	232
ALASKA	San Francisco	102	503	605
ALASKA	Seattle	305	1841	2146
AM WEST	Los Angeles	117	694	811

Filter and select results

big_flights <- ungroup(adat2) %>% 
                  filter(Airline == 'AM WEST', Total > 500) %>% 
                  select(Airline, City, Total) %>% 
                  arrange(desc(Total))

pander(big_flights, caption = "AM West cities with 500+ flights")

AM West cities with 500+ flights
Airline	City	Total
AM WEST	Phoenix	5263
AM WEST	Los Angeles	811

Apply a window function

adat2 <- mutate(adat2, Rank = dense_rank(Total), PctRank = percent_rank(Total))
  
pander(ungroup(adat2) %>% 
        select(-PctRank) %>% 
        arrange(desc(Rank)), caption = "dplyr rank function")

dplyr rank function
Airline	City	Delayed	Timely	Total	Rank
AM WEST	Phoenix	415	4848	5263	10
ALASKA	Seattle	305	1841	2146	9
AM WEST	Los Angeles	117	694	811	8
ALASKA	San Francisco	102	503	605	7
ALASKA	Los Angeles	62	497	559	6
AM WEST	San Francisco	129	320	449	5
AM WEST	San Diego	65	383	448	4
AM WEST	Seattle	61	201	262	3
ALASKA	Phoenix	12	221	233	2
ALASKA	San Diego	20	212	232	1