R program to tidy up a wide data for analysis
delay_wide=read.csv(file="https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/delays.csv", na.strings = "")
delay_wide
## X X.1 Los.Angels Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Let’s transform the data that will allow as to compare the delays of two airlines
library(tidyr)
library(dplyr)
library(reshape2)
delay <- delay_wide %>%
gather(airport,flights,3:7,na.rm=T) %>% #Convert Airport from multi column to a column value pair
fill(1) %>% #If a value is blank then fill with non blank value from previous row
spread(X.1,flights) %>% #Convert one column key value to multi column value
rename(airline=X) %>% #Rename column X to airline
melt %>%
dcast(airline + airport ~ variable,sum,margins="airport") %>% #Melt and Recast to add totals by airline
mutate(airport=gsub("\\."," ",airport),delay_pct=delayed/(delayed+`on time`))
#Let's look at transformed Data
delay
## airline airport delayed on time delay_pct
## 1 ALASKA Los Angels 62 497 0.11091234
## 2 ALASKA Phoenix 12 221 0.05150215
## 3 ALASKA San Diego 20 212 0.08620690
## 4 ALASKA San Francisco 102 503 0.16859504
## 5 ALASKA Seattle 305 1841 0.14212488
## 6 ALASKA (all) 501 3274 0.13271523
## 7 AM WEST Los Angels 117 694 0.14426634
## 8 AM WEST Phoenix 415 4840 0.07897241
## 9 AM WEST San Diego 65 383 0.14508929
## 10 AM WEST San Francisco 129 320 0.28730512
## 11 AM WEST Seattle 61 201 0.23282443
## 12 AM WEST (all) 787 6438 0.10892734
From below chart we could see that at every airport the delay percentage for “AM WEST”" is higher than ALASKA airline. But if we look at all flights (the first bar on the chart) then ALASKA has more arrival delays than AM WEST. So who is the winner?
library(ggplot2)
ggplot(delay,aes(x=airport,y=delay_pct))+geom_bar(aes(fill=airline),stat="identity",position="dodge")+geom_text(aes(label=format(100*..y.., digits=2)),stat="identity",vjust=-.5,hjust=.5)+labs(x="Airport",y="Delay",fill="Airline")