This assignment will consider a wide dataset describing arrival delays for two airlines across five destinations. The purpose of this assignment is to use the tidyr and dplyr packages to tidy and transform the flights datatset and perform comparative analysis of the two airlines.
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
flight_schedule <- read.csv(file="https://raw.githubusercontent.com/raghu74us/607_1/master/flt.csv", header=TRUE, sep=",")
flight_schedule
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
str(flight_schedule)
## 'data.frame': 4 obs. of 7 variables:
## $ X : Factor w/ 2 levels "Alaska","AM WEST": 1 1 2 2
## $ X.1 : Factor w/ 2 levels "delayed","on time": 2 1 2 1
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : int 221 12 4840 415
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : int 1841 305 201 61
#change the name of the 1st two columns
names(flight_schedule)[1:2] <- c('Airlines','Status')
flight_schedule
## Airlines Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
#filter -- selected rows
#arrange -- reorder rows
#mutate -- add variables or columns
#gather --gather the non-variable columns into a two-column key-value pair(wide to short).
#spread -- opposite of gather
Tidy1<-gather(flight_schedule, "City","Flights",3:7) %>%
filter(Airlines == 'Alaska' || Airlines == 'AM WEST' ) %>%
spread( Status, Flights) %>%
mutate( City = gsub( "\\.", " ", City),
percent_delayed=round( ((delayed/(delayed+`on time`))*100),2)
) %>%
arrange(City, Airlines)
Tidy1
## Airlines City delayed on time percent_delayed
## 1 Alaska Los Angeles 62 497 11.09
## 2 AM WEST Los Angeles 117 694 14.43
## 3 Alaska Phoenix 12 221 5.15
## 4 AM WEST Phoenix 415 4840 7.90
## 5 Alaska San Diego 20 212 8.62
## 6 AM WEST San Diego 65 383 14.51
## 7 Alaska San Francisco 102 503 16.86
## 8 AM WEST San Francisco 129 320 28.73
## 9 Alaska Seattle 305 1841 14.21
## 10 AM WEST Seattle 61 201 23.28
Tidy1 %>% ggplot( aes(x=percent_delayed, y=City)) +
geom_segment(aes(yend=City), xend=0, color='blue') +
geom_point(size=4, aes(color=Airlines)) +
geom_text(aes(label=percent_delayed), vjust=-1, hjust=.5,color='black') +
scale_color_brewer(palette="Set2", limits=c('Alaska', 'AM WEST')) +
ggtitle("Percentage of Flights Delayed by Airline by City") +
xlab("Percentage of Flights Delayed") + ylab("City")
Based on the data, we can see that %delay of Alaska airlines is lower compared to AM West. Also, SFO has the highest delay and phoneix has the lower percentage delay. On an average, there is a 7% difference in delay between the two airlines.