R program to tidy up a wide data for analysis

Wide Data

delay_wide=read.csv(file="https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/delays.csv", na.strings = "")
delay_wide
##         X     X.1 Los.Angels Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time        497     221       212           503    1841
## 2    <NA> delayed         62      12        20           102     305
## 3    <NA>    <NA>         NA      NA        NA            NA      NA
## 4 AM WEST on time        694    4840       383           320     201
## 5    <NA> delayed        117     415        65           129      61

Tidy up Data

Let’s transform the data that will allow as to compare the delays of two airlines

library(tidyr)
library(dplyr)
library(reshape2)


delay <- delay_wide  %>%
  gather(airport,flights,3:7,na.rm=T) %>% #Convert Airport from multi column to a column value pair
  fill(1) %>% #If a value is blank then fill with non blank value from previous row
  spread(X.1,flights) %>% #Convert one column key value to multi column value
  rename(airline=X) %>% #Rename column X to airline
  melt %>% 
  dcast(airline + airport ~ variable,sum,margins="airport") %>% #Melt and Recast to add totals by airline
  mutate(airport=gsub("\\."," ",airport),delay_pct=delayed/(delayed+`on time`))
   
#Let's look at transformed Data
delay
##    airline       airport delayed on time  delay_pct
## 1   ALASKA    Los Angels      62     497 0.11091234
## 2   ALASKA       Phoenix      12     221 0.05150215
## 3   ALASKA     San Diego      20     212 0.08620690
## 4   ALASKA San Francisco     102     503 0.16859504
## 5   ALASKA       Seattle     305    1841 0.14212488
## 6   ALASKA         (all)     501    3274 0.13271523
## 7  AM WEST    Los Angels     117     694 0.14426634
## 8  AM WEST       Phoenix     415    4840 0.07897241
## 9  AM WEST     San Diego      65     383 0.14508929
## 10 AM WEST San Francisco     129     320 0.28730512
## 11 AM WEST       Seattle      61     201 0.23282443
## 12 AM WEST         (all)     787    6438 0.10892734

Analysis of Delay

From below chart we could see that at every airport the delay percentage for “AM WEST”" is higher than ALASKA airline. But if we look at all flights (the first bar on the chart) then ALASKA has more arrival delays than AM WEST. So who is the winner?

library(ggplot2)
ggplot(delay,aes(x=airport,y=delay_pct))+geom_bar(aes(fill=airline),stat="identity",position="dodge")+geom_text(aes(label=format(100*..y.., digits=2)),stat="identity",vjust=-.5,hjust=.5)+labs(x="Airport",y="Delay",fill="Airline")