The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
# read .csv file
(flight_df<- tbl_df(read.csv(file="fly.csv",head=T, sep=",")))
## # A tibble: 4 × 7
## Company Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## <fctr> <fctr> <int> <int> <int> <int> <int>
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
# tidy data by putting city names as values in column 'city', and set up 'delayed' and 'on time' as variable, and spread into columns.
g <- flight_df %>%
gather(City, flights, 3:ncol(flight_df), -Status) %>%
spread(Status, flights)
# rename the last 2 variables
names(g)[3:4] <- c('Delayed', 'OnTime')
# compare delays using the mean, sum, min, and max data from both airlines.
g %>%
group_by(Company) %>%
summarise_each(funs(mean, sum, min, max), Delayed)
## # A tibble: 2 × 5
## Company mean sum min max
## <fctr> <dbl> <int> <int> <int>
## 1 ALASKA 100.2 501 12 305
## 2 AM WEST 157.4 787 61 415
# visualize the delayed flights by city by company
ggplot(data = g) +
geom_point(aes(x = City, y = Delayed), color = 'blue', size = 3) +
facet_grid(. ~ Company, scales="free") +
coord_flip()
Conclusion:
Overall comparison of the two airlines, ALASKA has less delays comparing to AM WEST with lower mean, totall delays, and the minimum and maximum delay records by cities.