Assignment - Tidying and Transforming Data

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)

# read .csv file
(flight_df<- tbl_df(read.csv(file="fly.csv",head=T, sep=",")))

## # A tibble: 4 × 7
##   Company  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
##    <fctr>  <fctr>       <int>   <int>     <int>         <int>   <int>
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

# tidy data by putting city names as values in column 'city', and set up 'delayed' and 'on time' as variable, and spread into columns. 
g <- flight_df %>%  
  gather(City, flights, 3:ncol(flight_df), -Status) %>%   
      spread(Status, flights)

# rename the last 2 variables
names(g)[3:4] <- c('Delayed', 'OnTime')

Perform analysis to compare the arrival delays for the two airlines.

# compare delays using the mean, sum, min, and max data from both airlines.
g %>% 
  group_by(Company) %>%
      summarise_each(funs(mean, sum, min, max), Delayed)

## # A tibble: 2 × 5
##   Company  mean   sum   min   max
##    <fctr> <dbl> <int> <int> <int>
## 1  ALASKA 100.2   501    12   305
## 2 AM WEST 157.4   787    61   415

# visualize the delayed flights by city by company
ggplot(data = g) +
  geom_point(aes(x = City, y = Delayed), color = 'blue', size = 3) +
  facet_grid(. ~ Company, scales="free") +
  coord_flip()

Conclusion:

Overall comparison of the two airlines, ALASKA has less delays comparing to AM WEST with lower mean, totall delays, and the minimum and maximum delay records by cities.

Assignment - Tidying and Transforming Data - Week 5

Ann Liu-Ferrara

March 1, 2017