Homework, Week 6

Assignment:
Planning:
Load up required packages:
Download CSV and read it into an R object
Basic, Initial Cleanup
Using gather to make city into a variable
Using spread to make on time / delayed into two separate columns
Final cleanup
Analysis
Visualization

Assignment:

Read the information from an untidy .CSV file into R
Use tidyr and dplyr as needed to tidy and transform the data.
Perform analysis to compare the arrival delays for the two airlines.

Planning:

Starting format:

We begin with a table that looks like this (some cities left out for brevity, dots represent blank leading cells):

.		Los Angeles	Phoenix	San Diego
ALASKA	on time	497	221	212
.	delayed	62	12	20
.
AM WEST	on time	694	4,840	383
.	delayed	117	415	65

Final format:

We begin with the end in mind. We will want a table that looks like this, instead:

Airline	City	On Time	Delayed
Alaska	Los Angeles	497	62
Alaska	Phoenix	221	12
Alaska	San Diego	212	20
Am West	Los Angeles	694	117
Am West	Phoenix	4840	415
Am West	San Diego	383	65

This kind of table can be analyzed easily for on-time performance by city or by airline.

Assumptions:

We assume that this is an example table and that the actual table could be wider (more cities) or longer (more airlines). The code in this R Markdown is intended to work for any length of data that follows the same format as the example given and read in.

Load up required packages:

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

Download CSV and read it into an R object

download.file("https://raw.githubusercontent.com/pm0kjp/IS607/df95386b98f2605b63947b41d670d20c1a9b9e94/flight_ontime_data.csv", "flight_ontime_data.csv", method="curl")
flight_data<-read.csv("flight_ontime_data.csv")
head(flight_data)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Basic, Initial Cleanup

Keep in mind that this is an example. We want to avoid referencing specific rows / columns or doing manipulation that would be difficult or time consuming if the table were instead to include 40 airlines and 75 cities.

Remove any blank lines (instead of just removing the third row, I tried to do this in such a way that it would work for any length of table that continues the data in the same format):

flight_data<-filter(flight_data, X.1!="")
head(flight_data)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61

Now I have an “every other row” named / nameless problem in my first column. This prevents me from using spread() effectively. I’ll solve this by first coming up with a subset of rows with names and a subset of rows without names. I do this in such a way that regardless of the number of rows, this will work:

name_rows<-seq(from = 1, to = nrow(flight_data), by=2)
nameless_rows<-seq(from = 2, to = nrow(flight_data), by=2)

Now I simply copy the corresponding names from the named to the unnamed rows.

flight_data[nameless_rows,1]<-flight_data[name_rows,1]
head(flight_data)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Using gather to make city into a variable

As shown in my desired final format, I want city to be a variable, instead of being a value spread in the header. I’ll use gather to do this. As always, I don’t hard code in the length but refer to the actual dimension of the data frame so that this code is portable to larger datasets with the same format.

flight_data<-gather(flight_data, "city", "n", 3:ncol(flight_data))
head(flight_data)

##         X     X.1        city   n
## 1  ALASKA on time Los.Angeles 497
## 2  ALASKA delayed Los.Angeles  62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5  ALASKA on time     Phoenix 221
## 6  ALASKA delayed     Phoenix  12

Using spread to make on time / delayed into two separate columns

Again, I referenced my desired output table, and look at the second column (still named X.1). I want to make these into two columns that will each contain the corresponding count.

flight_data<-spread(flight_data, X.1, n)
head(flight_data)

##         X          city delayed on time
## 1  ALASKA   Los.Angeles      62     497
## 2  ALASKA       Phoenix      12     221
## 3  ALASKA     San.Diego      20     212
## 4  ALASKA San.Francisco     102     503
## 5  ALASKA       Seattle     305    1841
## 6 AM WEST   Los.Angeles     117     694

Final cleanup

We’ve got a couple of small updates to do: We need to name the first column, remove dots from the city names, and replace the space in “on time” with an underscore:

colnames(flight_data)[1]<-"airline"
flight_data$city<-gsub("\\."," ",flight_data$city)
colnames(flight_data)[4]<-"on_time"
flight_data

##    airline          city delayed on_time
## 1   ALASKA   Los Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San Diego      20     212
## 4   ALASKA San Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San Diego      65     383
## 9  AM WEST San Francisco     129     320
## 10 AM WEST       Seattle      61     201

Analysis

We are asked to do on-time performance analysis. First, let’s summarize overall on time performance (otp) across all cities and airlines:

summarise(flight_data, on_time_total=sum(on_time), delayed_total=sum(delayed), otp=on_time_total/(on_time_total+delayed_total)*100)

##   on_time_total delayed_total      otp
## 1          9712          1288 88.29091

We can also use group_by and summarise together to summarize overall performance by airline:

flight_data %>% group_by(airline) %>% summarise(on_time_total=sum(on_time), delayed_total=sum(delayed), otp=on_time_total/(on_time_total+delayed_total)*100)

## Source: local data frame [2 x 4]
## 
##   airline on_time_total delayed_total      otp
##    (fctr)         (int)         (int)    (dbl)
## 1  ALASKA          3274           501 86.72848
## 2 AM WEST          6438           787 89.10727

Or, by city:

flight_data %>% group_by(city) %>% summarise(on_time_total=sum(on_time), delayed_total=sum(delayed), otp=on_time_total/(on_time_total+delayed_total)*100)

## Source: local data frame [5 x 4]
## 
##            city on_time_total delayed_total      otp
##           (chr)         (int)         (int)    (dbl)
## 1   Los Angeles          1191           179 86.93431
## 2       Phoenix          5061           427 92.21939
## 3     San Diego           595            85 87.50000
## 4 San Francisco           823           231 78.08349
## 5       Seattle          2042           366 84.80066

Or, by both city and airline:

flight_data %>% group_by(city, airline) %>% summarise(on_time_total=sum(on_time), delayed_total=sum(delayed), otp=on_time_total/(on_time_total+delayed_total)*100)

## Source: local data frame [10 x 5]
## Groups: city [?]
## 
##             city airline on_time_total delayed_total      otp
##            (chr)  (fctr)         (int)         (int)    (dbl)
## 1    Los Angeles  ALASKA           497            62 88.90877
## 2    Los Angeles AM WEST           694           117 85.57337
## 3        Phoenix  ALASKA           221            12 94.84979
## 4        Phoenix AM WEST          4840           415 92.10276
## 5      San Diego  ALASKA           212            20 91.37931
## 6      San Diego AM WEST           383            65 85.49107
## 7  San Francisco  ALASKA           503           102 83.14050
## 8  San Francisco AM WEST           320           129 71.26949
## 9        Seattle  ALASKA          1841           305 85.78751
## 10       Seattle AM WEST           201            61 76.71756

Visualization

What about some data visualization? We could start really simple, with qplot. We can plot the on time performance (on time divided by all flights) on a per-city basis, with each airline having a color.

library(ggplot2)
qplot(city,on_time/(on_time+delayed), data=flight_data, color=airline)

Well, this helps us see visually that Alaska has a better on-time performance in each market, compared to Am. West, but the plot itself is unattractive. The y-axis label is ugly, the dots are small, and the plot isn’t titled. Let’s do a grouped bar plot, and update the y axis and legend:

ggplot(flight_data, aes(x=factor(city), fill=factor(airline), y=on_time/(on_time+delayed)*100)) + geom_bar(position="dodge", stat="identity") + ylab("Percent On-Time Performance") + scale_y_continuous(breaks = seq(0, 100, by = 5)) + xlab ("") + scale_fill_discrete(name="Airline")