Overview: Working with Tidy Data.

This assignment will consider a wide dataset describing arrival delays for two airlines across five destinations. The purpose of this assignment is to use the tidyr and dplyr packages to tidy and transform the flights datatset and perform comparative analysis of the two airlines.

Load the following libraries

library(stringr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Read the the airline csv file from GitHub

flight_schedule <- read.csv(file="https://raw.githubusercontent.com/raghu74us/607_1/master/flt.csv", header=TRUE, sep=",")

flight_schedule
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61
str(flight_schedule)
## 'data.frame':    4 obs. of  7 variables:
##  $ X            : Factor w/ 2 levels "Alaska","AM WEST": 1 1 2 2
##  $ X.1          : Factor w/ 2 levels "delayed","on time": 2 1 2 1
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 61
#change the name of the 1st two columns
names(flight_schedule)[1:2] <- c('Airlines','Status')
flight_schedule
##   Airlines  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1   Alaska on time         497     221       212           503    1841
## 2   Alaska delayed          62      12        20           102     305
## 3  AM WEST on time         694    4840       383           320     201
## 4  AM WEST delayed         117     415        65           129      61

Tidy the dataset using tidyr

#filter -- selected rows
#arrange  -- reorder rows
#mutate -- add variables or columns
#gather --gather the non-variable columns into a two-column key-value pair(wide to short).
#spread -- opposite of gather
Tidy1<-gather(flight_schedule, "City","Flights",3:7) %>%
  filter(Airlines == 'Alaska' || Airlines == 'AM WEST' ) %>%
  spread( Status, Flights) %>% 
  mutate( City = gsub( "\\.", " ", City), 
          percent_delayed=round( ((delayed/(delayed+`on time`))*100),2)
                                                          ) %>%
  arrange(City, Airlines)
Tidy1
##    Airlines          City delayed on time percent_delayed
## 1    Alaska   Los Angeles      62     497           11.09
## 2   AM WEST   Los Angeles     117     694           14.43
## 3    Alaska       Phoenix      12     221            5.15
## 4   AM WEST       Phoenix     415    4840            7.90
## 5    Alaska     San Diego      20     212            8.62
## 6   AM WEST     San Diego      65     383           14.51
## 7    Alaska San Francisco     102     503           16.86
## 8   AM WEST San Francisco     129     320           28.73
## 9    Alaska       Seattle     305    1841           14.21
## 10  AM WEST       Seattle      61     201           23.28

Plot the Data and analyze

Tidy1 %>% ggplot( aes(x=percent_delayed, y=City)) + 
    geom_segment(aes(yend=City), xend=0, color='blue') + 
    geom_point(size=4, aes(color=Airlines)) + 
    geom_text(aes(label=percent_delayed), vjust=-1, hjust=.5,color='black') +
    scale_color_brewer(palette="Set2", limits=c('Alaska', 'AM WEST')) + 
     ggtitle("Percentage of Flights Delayed by Airline by City") +
    xlab("Percentage of Flights Delayed") + ylab("City") 

Conclusion:

Based on the data, we can see that %delay of Alaska airlines is lower compared to AM West. Also, SFO has the highest delay and phoneix has the lower percentage delay. On an average, there is a 7% difference in delay between the two airlines.