The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
● The URL to the .Rmd file in your GitHub repository. and
● The URL for your rpubs.com web page.
rawdata<-read.csv('https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-5/master/data.csv', stringsAsFactors = FALSE)
rawdata
The following actions are performed to the raw dataset: 1. Remove empty row; 2. Rename columns; 3. Fill down the aire line names as the name only appear once in the “Name” column; 4. Perform ‘gather’ and ‘spread’ to reshape data, changing locations from variables to values, and arrival status from values to variables. 5. Compute delay rate of each data row.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
data<-na.omit(rawdata) %>%
rename(Air.Line = X, Arrival = X.1) %>%
mutate(Air.Line = na_if(Air.Line,'')) %>%
fill(Air.Line, .direction = 'down') %>%
gather('Location','Flight.Cnt',3:7) %>%
mutate(Flight.Cnt = as.integer(str_replace(Flight.Cnt,',','')),
Arrival = str_replace(Arrival, 'on time','On.Time'),
Arrival = str_replace(Arrival, 'delayed','Delayed')) %>%
spread(Arrival, Flight.Cnt) %>%
mutate(Flight = Delayed+On.Time,
Delay.Rate = Delayed / (Delayed+On.Time))
data
Firstly we calculate the total delayed flights, total flights and overall delay rates of each airline.
data2 <- data %>%
select(Air.Line, Delayed, Flight, On.Time) %>%
group_by(Air.Line) %>%
summarize(Delayed = sum(Delayed), Flight = sum(Flight), Delay.Rate = sum(Delayed)/sum(Flight))
data2
Plot the data2
above in bar charts.
library(scales)
library(ggplot2)
library(ggpubr)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
p0 <- ggplot(data = data2,aes(x=Air.Line, y=Flight, fill=Air.Line, label=Delay.Rate))+
geom_bar(stat='identity')+
ggtitle('Total Flights by Air Line')+
xlab('Air Line')+ylab('Count of Flights')+
ylim(0,8000)+
geom_text(aes(label=Flight),vjust=-0.5, hjust=0.5, position = position_dodge(width = 1), color="black")+
theme(plot.title = element_text(hjust=0.5),
axis.text.y = element_text(angle = 90, hjust = 0.5),
legend.position = "none")+
coord_flip()
p1 <- ggplot(data = data2,aes(x=Air.Line, y=Delay.Rate, fill=Air.Line, label=Delay.Rate))+
geom_bar(stat='identity')+
ggtitle('Delay Rate by Air Line')+
xlab('')+ylab('Delay Rate')+
ylim(0,0.15)+
geom_text(aes(label=percent(Delay.Rate)),vjust=-0.5, hjust=0.5, position = position_dodge(width = 1), color="black")+
theme(plot.title = element_text(hjust=0.5),
axis.text.y = element_blank(),
legend.position = "right")+
coord_flip()
ggarrange(p0,p1,ncol=2)
Secondly, plot the data
in section #3 in bar charts as well.
p2 <- ggplot(data = data,aes(x=Air.Line, y=Flight, fill=Air.Line, label=Flight))+
geom_bar(stat='identity')+
facet_grid(~Location) +
ggtitle('Total number of Flights by Location')+
xlab('')+ylab('Count of Flights') +
ylim(0,6000)+
geom_text(aes(y=Flight),vjust=-0.5, hjust=0.4, position = position_dodge(width = 1), color="black")+
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_blank(),
legend.position = "none")
p3 <- ggplot(data = data,aes(x=Air.Line, y=Delay.Rate, fill=Air.Line, label=Delay.Rate))+
geom_bar(stat='identity')+
facet_grid(~Location)+
ggtitle('Delay Rate by Location')+
xlab('Air Line')+ylab('Delay Rate')+
ylim(0,0.4)+
geom_text(aes(label=percent(Delay.Rate)),vjust=-0.5, hjust=0.5, position = position_dodge(width = 1), color="black")+
theme(plot.title = element_text(hjust=0.5),
#axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
ggarrange(p2,p3,nrow=2)