The process of creating tidy began with uploading the .csv file I created onto github and loading the various packages required. I added na.strings=“” to place NA in the fields that would have been blank.
flightdata <- read.csv('https://raw.githubusercontent.com/jeffnieman11/Data607HW6/master/flight_delay.csv',na.strings="")
flightdata
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AMWEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
require(tidyr)
## Loading required package: tidyr
## Warning: package 'tidyr' was built under R version 3.2.3
require(zoo)
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.2.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
Using the gather function I created a column for the key “city” and another for the value “n”. I then renamed the column headers to better reflect what the data show.
flightdata1 <- gather(flightdata, "city", "n", 3:7)
colnames(flightdata1) <- c("Airline", "Timeliness", "City", "Count")
flightdata1
## Airline Timeliness City Count
## 1 ALASKA on time Los.Angeles 497
## 2 <NA> delayed Los.Angeles 62
## 3 AMWEST on time Los.Angeles 694
## 4 <NA> delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 <NA> delayed Phoenix 12
## 7 AMWEST on time Phoenix 4840
## 8 <NA> delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 <NA> delayed San.Diego 20
## 11 AMWEST on time San.Diego 383
## 12 <NA> delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 <NA> delayed San.Francisco 102
## 15 AMWEST on time San.Francisco 320
## 16 <NA> delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 <NA> delayed Seattle 305
## 19 AMWEST on time Seattle 201
## 20 <NA> delayed Seattle 61
Using the na.locf function from the zoo package to fill the NA fields with the value above it.
flightdata2 <- na.locf(flightdata1)
flightdata2
## Airline Timeliness City Count
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AMWEST on time Los.Angeles 694
## 4 AMWEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AMWEST on time Phoenix 4840
## 8 AMWEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AMWEST on time San.Diego 383
## 12 AMWEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AMWEST on time San.Francisco 320
## 16 AMWEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AMWEST on time Seattle 201
## 20 AMWEST delayed Seattle 61
Using the spread function I separated out the timeliness column into new columns separating the ontime and delayed values. I then renamed the columns and made sure the type for the delayed and ontime columns of data was numeric.
flightdata3 <- spread(flightdata2, Timeliness, Count)
colnames(flightdata3) <- c("Airline", "City", "Delayed", "Ontime")
flightdata3$Delayed <- as.numeric(as.character(flightdata3$Delayed))
flightdata3$Ontime <- as.numeric(as.character(flightdata3$Ontime))
flightdata3
## Airline City Delayed Ontime
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AMWEST Los.Angeles 117 694
## 7 AMWEST Phoenix 415 4840
## 8 AMWEST San.Diego 65 383
## 9 AMWEST San.Francisco 129 320
## 10 AMWEST Seattle 61 201
Using the mutate function I created a new column called “Rate”. This is the percentage of delayed flights for each airline by each run.
flightdata4 <- mutate(flightdata3, Rate = Delayed/(Delayed + Ontime))
flightdata4 <- arrange(flightdata4, desc(Rate))
flightdata4
## Airline City Delayed Ontime Rate
## 1 AMWEST San.Francisco 129 320 0.28730512
## 2 AMWEST Seattle 61 201 0.23282443
## 3 ALASKA San.Francisco 102 503 0.16859504
## 4 AMWEST San.Diego 65 383 0.14508929
## 5 AMWEST Los.Angeles 117 694 0.14426634
## 6 ALASKA Seattle 305 1841 0.14212488
## 7 ALASKA Los.Angeles 62 497 0.11091234
## 8 ALASKA San.Diego 20 212 0.08620690
## 9 AMWEST Phoenix 415 4840 0.07897241
## 10 ALASKA Phoenix 12 221 0.05150215
Using ggplot2 I created a bar chart that shows the delayed rate for each city by airline. It shows clearly that Amwest was more often delayed than Alaska in every city.
ggplot(flightdata4,aes(x=City, y=Rate, fill=factor(Airline))) +
geom_bar(stat="identity",position="dodge") +
scale_fill_discrete(name="Airline", breaks =c("ALASKA", "AMWEST"), labels=c("Alaska", "Amwest")) +
xlab("City") +
ylab("Delayed Rate")
Using the group_by and summarise functions I created a summary for all cities by airline. Then I created the delayed rate for each airline using the mutate function.
flightdata5 <- flightdata4 %>%
group_by(Airline) %>%
summarise(Delayed = sum(Delayed), Ontime = sum(Ontime))
flightdata5
## Source: local data frame [2 x 3]
##
## Airline Delayed Ontime
## (chr) (dbl) (dbl)
## 1 ALASKA 501 3274
## 2 AMWEST 787 6438
flightdata6 <- mutate(flightdata5, Rate = Delayed/(Delayed + Ontime))
flightdata6
## Source: local data frame [2 x 4]
##
## Airline Delayed Ontime Rate
## (chr) (dbl) (dbl) (dbl)
## 1 ALASKA 501 3274 0.1327152
## 2 AMWEST 787 6438 0.1089273
Finally once again using ggplot2 I created a bar chart showing the delay rate for all cities for each of the two airlines. Again this clearly shows that Amwest overall was delayed more often than Alaska.
ggplot(flightdata6,aes(x=Airline, y=Rate)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airline") +
ylab("Delayed Rate")
In conclusion, in every city and overall Amwest has a higher percentage of delayed flights than Alaska.