We begin with a table that looks like this (some cities left out for brevity, dots represent blank leading cells):
| . | Los Angeles | Phoenix | San Diego | |
|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 |
| . | delayed | 62 | 12 | 20 |
| . | ||||
| AM WEST | on time | 694 | 4,840 | 383 |
| . | delayed | 117 | 415 | 65 |
We begin with the end in mind. We will want a table that looks like this, instead:
| Airline | City | On Time | Delayed |
|---|---|---|---|
| Alaska | Los Angeles | 497 | 62 |
| Alaska | Phoenix | 221 | 12 |
| Alaska | San Diego | 212 | 20 |
| Am West | Los Angeles | 694 | 117 |
| Am West | Phoenix | 4840 | 415 |
| Am West | San Diego | 383 | 65 |
This kind of table can be analyzed easily for on-time performance by city or by airline.
We assume that this is an example table and that the actual table could be wider (more cities) or longer (more airlines). The code in this R Markdown is intended to work for any length of data that follows the same format as the example given and read in.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
download.file("https://raw.githubusercontent.com/pm0kjp/IS607/df95386b98f2605b63947b41d670d20c1a9b9e94/flight_ontime_data.csv", "flight_ontime_data.csv", method="curl")
flight_data<-read.csv("flight_ontime_data.csv")
head(flight_data)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Keep in mind that this is an example. We want to avoid referencing specific rows / columns or doing manipulation that would be difficult or time consuming if the table were instead to include 40 airlines and 75 cities.
Remove any blank lines (instead of just removing the third row, I tried to do this in such a way that it would work for any length of table that continues the data in the same format):
flight_data<-filter(flight_data, X.1!="")
head(flight_data)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
Now I have an “every other row” named / nameless problem in my first column. This prevents me from using spread() effectively. I’ll solve this by first coming up with a subset of rows with names and a subset of rows without names. I do this in such a way that regardless of the number of rows, this will work:
name_rows<-seq(from = 1, to = nrow(flight_data), by=2)
nameless_rows<-seq(from = 2, to = nrow(flight_data), by=2)
Now I simply copy the corresponding names from the named to the unnamed rows.
flight_data[nameless_rows,1]<-flight_data[name_rows,1]
head(flight_data)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
As shown in my desired final format, I want city to be a variable, instead of being a value spread in the header. I’ll use gather to do this. As always, I don’t hard code in the length but refer to the actual dimension of the data frame so that this code is portable to larger datasets with the same format.
flight_data<-gather(flight_data, "city", "n", 3:ncol(flight_data))
head(flight_data)
## X X.1 city n
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
Again, I referenced my desired output table, and look at the second column (still named X.1). I want to make these into two columns that will each contain the corresponding count.
flight_data<-spread(flight_data, X.1, n)
head(flight_data)
## X city delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
We’ve got a couple of small updates to do: We need to name the first column, remove dots from the city names, and replace the space in “on time” with an underscore:
colnames(flight_data)[1]<-"airline"
flight_data$city<-gsub("\\."," ",flight_data$city)
colnames(flight_data)[4]<-"on_time"
flight_data
## airline city delayed on_time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
We are asked to do on-time performance analysis. First, let’s summarize overall on time performance (otp) across all cities and airlines:
summarise(flight_data, on_time_total=sum(on_time), delayed_total=sum(delayed), otp=on_time_total/(on_time_total+delayed_total)*100)
## on_time_total delayed_total otp
## 1 9712 1288 88.29091
We can also use group_by and summarise together to summarize overall performance by airline:
flight_data %>% group_by(airline) %>% summarise(on_time_total=sum(on_time), delayed_total=sum(delayed), otp=on_time_total/(on_time_total+delayed_total)*100)
## Source: local data frame [2 x 4]
##
## airline on_time_total delayed_total otp
## (fctr) (int) (int) (dbl)
## 1 ALASKA 3274 501 86.72848
## 2 AM WEST 6438 787 89.10727
Or, by city:
flight_data %>% group_by(city) %>% summarise(on_time_total=sum(on_time), delayed_total=sum(delayed), otp=on_time_total/(on_time_total+delayed_total)*100)
## Source: local data frame [5 x 4]
##
## city on_time_total delayed_total otp
## (chr) (int) (int) (dbl)
## 1 Los Angeles 1191 179 86.93431
## 2 Phoenix 5061 427 92.21939
## 3 San Diego 595 85 87.50000
## 4 San Francisco 823 231 78.08349
## 5 Seattle 2042 366 84.80066
Or, by both city and airline:
flight_data %>% group_by(city, airline) %>% summarise(on_time_total=sum(on_time), delayed_total=sum(delayed), otp=on_time_total/(on_time_total+delayed_total)*100)
## Source: local data frame [10 x 5]
## Groups: city [?]
##
## city airline on_time_total delayed_total otp
## (chr) (fctr) (int) (int) (dbl)
## 1 Los Angeles ALASKA 497 62 88.90877
## 2 Los Angeles AM WEST 694 117 85.57337
## 3 Phoenix ALASKA 221 12 94.84979
## 4 Phoenix AM WEST 4840 415 92.10276
## 5 San Diego ALASKA 212 20 91.37931
## 6 San Diego AM WEST 383 65 85.49107
## 7 San Francisco ALASKA 503 102 83.14050
## 8 San Francisco AM WEST 320 129 71.26949
## 9 Seattle ALASKA 1841 305 85.78751
## 10 Seattle AM WEST 201 61 76.71756
What about some data visualization? We could start really simple, with qplot. We can plot the on time performance (on time divided by all flights) on a per-city basis, with each airline having a color.
library(ggplot2)
qplot(city,on_time/(on_time+delayed), data=flight_data, color=airline)
Well, this helps us see visually that Alaska has a better on-time performance in each market, compared to Am. West, but the plot itself is unattractive. The y-axis label is ugly, the dots are small, and the plot isn’t titled. Let’s do a grouped bar plot, and update the y axis and legend:
ggplot(flight_data, aes(x=factor(city), fill=factor(airline), y=on_time/(on_time+delayed)*100)) + geom_bar(position="dodge", stat="identity") + ylab("Percent On-Time Performance") + scale_y_continuous(breaks = seq(0, 100, by = 5)) + xlab ("") + scale_fill_discrete(name="Airline")