This week an exercise in tidying data. We’re provided a messy csv table. The goal is to transform it into an analyzable table where each row is a single observation and each column a single variable.
In this step, we load the necessary R packages and import the data with read.csv(). With the structure command, we can see that the flight counts were properly imported as integers and that airline and status correctly are factors. Table 1 shows the initial table.
# load libraries
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(ggthemes)
library(scales)
library(pander)
# import the data
adat <- read.csv("Data/airlines.csv", header=T)
str(adat)
## 'data.frame': 5 obs. of 7 variables:
## $ X : Factor w/ 3 levels "","ALASKA","AM WEST": 2 1 1 3 1
## $ X.1 : Factor w/ 3 levels "","delayed","on time": 3 2 1 3 2
## $ Los.Angeles : int 497 62 NA 694 117
## $ Phoenix : int 221 12 NA 4848 415
## $ San.Diego : int 212 20 NA 383 65
## $ San.Francisco: int 503 102 NA 320 129
## $ Seattle : int 1841 305 NA 201 61
| X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| delayed | 62 | 12 | 20 | 102 | 305 | |
| NA | NA | NA | NA | NA | ||
| AM WEST | on time | 694 | 4848 | 383 | 320 | 201 |
| delayed | 117 | 415 | 65 | 129 | 61 |
The table contains empty rows and cells. Rows don’t represent single observations. The variables airline, deparature status and city are spread across multiple columns. Next steps involve removing empty cells and rows, renaming variables, and gathering data using tidyr().
Delete the empty row and populate the airline name in the rows indicating “delayed” status.
adat <- adat[-3,]
adat[2,1] <- "ALASKA"
adat[4,1] <- "AM WEST"
colnames(adat)[c(1,2)] <- c("Airline", "Status")
rownames(adat) <- 1:4
pander(adat, caption = "Table 2: Blank/NA cells cleaned up")
| Airline | Status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4848 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
City should be a variable, and the status counts should be the value for each observation. This is a job for tidyr’s gather() function.
adat <- adat %>%
gather(City, Flights, 3:7)
# clean up the city names
adat$City <- str_replace(adat$City, "\\.", " ")
pander(head(adat, 10), caption = "Table 3: Tidy data, 10 rows")
| Airline | Status | City | Flights |
|---|---|---|---|
| ALASKA | on time | Los Angeles | 497 |
| ALASKA | delayed | Los Angeles | 62 |
| AM WEST | on time | Los Angeles | 694 |
| AM WEST | delayed | Los Angeles | 117 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | delayed | Phoenix | 12 |
| AM WEST | on time | Phoenix | 4848 |
| AM WEST | delayed | Phoenix | 415 |
| ALASKA | on time | San Diego | 212 |
| ALASKA | delayed | San Diego | 20 |
The data as of the last procedure is better; it is tidy, but it can be even tidier. We can see that there is redundancy in the Status column, where we repeat the “on time” and “delayed” values. The spread() function can get us further – to 10 rows and 4 variables instead of 20 rows and 4 variables.
adat2 <- spread(adat, Status, Flights)
pander(adat2, caption = "Table 3: Tidiest Data Set")
| Airline | City | delayed | on time |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San Diego | 20 | 212 |
| ALASKA | San Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4848 |
| AM WEST | San Diego | 65 | 383 |
| AM WEST | San Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
The data is now in a suitable format: one observation in each row, one variable per column. dplyr’s pipe operator %>% allows flexibility in organizing the data and computing a variety of statistical measures that can be plotted. For this plot, we use dplyr to group by city and compare a count of delayed and on-time flights.
by_city <- adat %>%
group_by(City, Status) %>%
summarise(Flights=sum(Flights))
ggplot(by_city, aes(x=City, y=Flights, fill=Status)) +
geom_bar(stat="identity", position=position_dodge()) +
xlab(" ") + scale_y_continuous(labels=scales::comma) +
theme_tufte()
Flight Status by City
Let’s compute each airline’s ratio of delayed flights and plot it.
adat2 <- tbl_df(adat2)
colnames(adat2)[c(3,4)] <- c("Delayed", "Timely")
delayrate <- ungroup(adat2) %>%
group_by(Airline, City) %>%
summarise(Delay.Rate = sum(Delayed)/sum(Timely))
ggplot(delayrate, aes(x=City, y=Delay.Rate, fill=Airline)) +
geom_bar(stat="identity", position=position_dodge()) +
xlab(" ") + scale_y_continuous(labels=scales::percent) +
ylab("Delay Rate") +
theme_tufte()
Percent of flight delays
There are many more functions in the dplyr-tidyr universe. mutate() makes for easy creation of new variables, similar to the Delay Rate example above. filter() allows applicaiton of logical functions to rows. A range of window functions can be applied to each item in a column, similar base R’s apply family of functions. Following are some examples.
adat2 <- mutate(adat2, Total = Delayed + Timely)
pander(head(adat2), caption = "New variable added")
| Airline | City | Delayed | Timely | Total |
|---|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 | 559 |
| ALASKA | Phoenix | 12 | 221 | 233 |
| ALASKA | San Diego | 20 | 212 | 232 |
| ALASKA | San Francisco | 102 | 503 | 605 |
| ALASKA | Seattle | 305 | 1841 | 2146 |
| AM WEST | Los Angeles | 117 | 694 | 811 |
big_flights <- ungroup(adat2) %>%
filter(Airline == 'AM WEST', Total > 500) %>%
select(Airline, City, Total) %>%
arrange(desc(Total))
pander(big_flights, caption = "AM West cities with 500+ flights")
| Airline | City | Total |
|---|---|---|
| AM WEST | Phoenix | 5263 |
| AM WEST | Los Angeles | 811 |
adat2 <- mutate(adat2, Rank = dense_rank(Total), PctRank = percent_rank(Total))
pander(ungroup(adat2) %>%
select(-PctRank) %>%
arrange(desc(Rank)), caption = "dplyr rank function")
| Airline | City | Delayed | Timely | Total | Rank |
|---|---|---|---|---|---|
| AM WEST | Phoenix | 415 | 4848 | 5263 | 10 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 9 |
| AM WEST | Los Angeles | 117 | 694 | 811 | 8 |
| ALASKA | San Francisco | 102 | 503 | 605 | 7 |
| ALASKA | Los Angeles | 62 | 497 | 559 | 6 |
| AM WEST | San Francisco | 129 | 320 | 449 | 5 |
| AM WEST | San Diego | 65 | 383 | 448 | 4 |
| AM WEST | Seattle | 61 | 201 | 262 | 3 |
| ALASKA | Phoenix | 12 | 221 | 233 | 2 |
| ALASKA | San Diego | 20 | 212 | 232 | 1 |