Import a .csv file that uses a “wide” structure (see below). We will tidy and transform this data.
The csv data pictured previously is stored in the flights dataframe.
raw <- read.csv("https://raw.githubusercontent.com/mkivenson/Data-Acquisition-and-Management/master/Assignment%204/airlines.csv", na.strings="")
raw
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Using the dplyr and zoo libraries, we tidy this data so that:
flights <- (gather(raw,"city","count","Los.Angeles","Phoenix","San.Diego","San.Francisco","Seattle"))
colnames(flights) <- c('airline','status','city','count')
flights <- flights %>% do(na.locf(flights))
flights <- flights %>% distinct()
This gives us the tidy data table you see below.
datatable(flights)
With three categorical variables and one numerical variable, it may be challenging to create a visual that depicts all three. First, I created a facet grid showing the count of occurances of each combination of categorical variables (airline, city, and delay status). Observe that AM West flights at Phoenix and Alaska flights at Seattle seem to be on time the most.
ggplot(flights, aes(city, airline, size = count, color = count)) + geom_point() + facet_grid(rows = vars(status)) + scale_color_gradient(low="light grey", high="black")
In the previous chart, we see that AM West at Phoenix has the most on time flights - however the same airline at the same city also has more delays than any other city for that airline. It is likely that Phoenix just has a more AM West and Alaska flights. This information is not very useful in determining how punctual flights at a certain airline or city are. Instead, we use dplyr to turn the counts into a proportion of delayed and on time flights for each combination of city and airline.
flightsprop <- flights %>%
group_by(city, airline, status) %>%
summarise (count) %>%
mutate(freq = count / sum(count))
datatable(flightsprop)
Using the flight proportions calculated above, we create another facet grid to show which airlines and cities have the highest proportion of on time flights. This shows us that both AM West and Alaska are both reliable in Phoenix. These charts show that Phoenix has the most flights and highest reliability for both airlines.
flightsprop_ontime <- subset(flightsprop, status == 'on time')
ggplot(flightsprop_ontime, aes(city, airline, size = freq, color = freq)) + geom_point() + facet_grid(rows = vars(status)) + scale_color_gradient(low="light blue", high="dark blue")
We create a similar table to the one above, except this facet grid shows the proportions of delayed flights for each airline and city. Although it is slightly redundant, it is helpful to review the inverse grid. We see that AM West at San Francisco is the least reliable airline/city combination, followed by AM West at Seattle.
flightsprop_delay <- subset(flightsprop, status == 'delayed')
ggplot(flightsprop_delay, aes(city, airline, size = freq, color = freq)) + geom_point() + facet_grid(rows = vars(status)) + scale_color_gradient(low="pink", high="dark red")