Create CSV file or MySQL DB that includes all of required info.

# As we created CSV file that mirrors the briefing doc, we'll read data from that CSV file into R.

flightstatus.sourcefile <- ("G:/OneDrive/Learning/_CUNY_SPS_MSDS/2018_Spring/DATA 607/Week 5/Assignment/Numbersense.csv")
fs <- read.csv(flightstatus.sourcefile, header = T, sep= ",", stringsAsFactors = F)

Prep the table for tidying.

# We add meaningful column names where missing in the source table, using rename before we get to tidying.

fs = fs %>% rename("carrier" = "X", "status" = "X.1")

# Attempted to use filter to remove the spacing row without success, so changed tacks  calling na.omit on the status column.  Status column is best bet because:
# 1) It constitutes a variable
# 2) Carrier values for several rows are blank as a function of the source table, so they don't constitute a good test
# 3) There's a possibility that city columns might not be populated, and that those NAs actually constitued 0s

fs <- na.omit(fs, cols = status)

# We remove space in ontime status.

fs$status <- str_replace_all(fs$status, fixed(" "), "")

# We fill in  blank carrier values - next time will take generalizable approach using conditional logic, but for now we'll do a local fix.

fs$carrier[2] <- c("ALASKA")
fs$carrier[4] <- c("AM WEST")

Use a wide structure to practice tidying and transformation.

# First, we effect a wide structure using gather.

fs.tidy <- gather(fs, "city", "n", 3:7)
tbl_df(fs.tidy)
## # A tibble: 20 x 4
##    carrier status  city              n
##    <chr>   <chr>   <chr>         <int>
##  1 ALASKA  ontime  Los.Angeles     497
##  2 ALASKA  delayed Los.Angeles      62
##  3 AM WEST ontime  Los.Angeles     694
##  4 AM WEST delayed Los.Angeles     117
##  5 ALASKA  ontime  Phoenix         221
##  6 ALASKA  delayed Phoenix          12
##  7 AM WEST ontime  Phoenix        4840
##  8 AM WEST delayed Phoenix         415
##  9 ALASKA  ontime  San.Diego       212
## 10 ALASKA  delayed San.Diego        20
## 11 AM WEST ontime  San.Diego       383
## 12 AM WEST delayed San.Diego        65
## 13 ALASKA  ontime  San.Francisco   503
## 14 ALASKA  delayed San.Francisco   102
## 15 AM WEST ontime  San.Francisco   320
## 16 AM WEST delayed San.Francisco   129
## 17 ALASKA  ontime  Seattle        1841
## 18 ALASKA  delayed Seattle         305
## 19 AM WEST ontime  Seattle         201
## 20 AM WEST delayed Seattle          61

Experimenting a little with tidy-in / tidy-out.

# Spread can reverse the tidying operation, returning the previous data frame.

spread(fs.tidy, city, n)
##   carrier  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA delayed          62      12        20           102     305
## 2  ALASKA  ontime         497     221       212           503    1841
## 3 AM WEST delayed         117     415        65           129      61
## 4 AM WEST  ontime         694    4840       383           320     201
# And it can also cut the data for us in different ways

spread(fs.tidy, status, n)
##    carrier          city delayed ontime
## 1   ALASKA   Los.Angeles      62    497
## 2   ALASKA       Phoenix      12    221
## 3   ALASKA     San.Diego      20    212
## 4   ALASKA San.Francisco     102    503
## 5   ALASKA       Seattle     305   1841
## 6  AM WEST   Los.Angeles     117    694
## 7  AM WEST       Phoenix     415   4840
## 8  AM WEST     San.Diego      65    383
## 9  AM WEST San.Francisco     129    320
## 10 AM WEST       Seattle      61    201
spread(fs.tidy, carrier, n)
##     status          city ALASKA AM WEST
## 1  delayed   Los.Angeles     62     117
## 2  delayed       Phoenix     12     415
## 3  delayed     San.Diego     20      65
## 4  delayed San.Francisco    102     129
## 5  delayed       Seattle    305      61
## 6   ontime   Los.Angeles    497     694
## 7   ontime       Phoenix    221    4840
## 8   ontime     San.Diego    212     383
## 9   ontime San.Francisco    503     320
## 10  ontime       Seattle   1841     201

Perform analysis to compare arrival delays for two airlines.

# First, we'll calculcate total flights by city, irrespective of carrier.  This will give us a lay of the land.

fs.tidy %>% group_by(city) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) -> flightsbycity

# A bar chart will serve nicely to visualize this.

ggplot(flightsbycity, aes(x = reorder(city, -total), y = total)) + 
  geom_bar(stat = "identity") + 
  labs(x = NULL, y = "flights") + 
  scale_y_continuous(breaks = seq(0,6000,1000))

# Phoenix sees the highest volume overall between these two carriers.  The remaining four cities together have the same volume as Phoenix.
# Next, we'll look at carriers, diving into overall on-time performance by carrier.  We use mutate on the tidily vectorized columns, calculating total flights (total) and on-time performance (OTperf).

fs.tidy %>% spread(status, n) %>% 
  group_by(carrier) %>% 
  summarise(delayed = sum(delayed), ontime = sum(ontime)) %>% 
  mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>% 
  select(carrier, OTperf) %>% 
  arrange(desc(OTperf))
## # A tibble: 2 x 2
##   carrier OTperf
##   <chr>    <dbl>
## 1 AM WEST  0.891
## 2 ALASKA   0.867
# With the five cities pooled, the carriers are pretty close in terms of on-time performance.  AM West is slightly ahead of Alaska.  This is consistent with industry benchmarks - few carriers see better performance than 90% nationwide, and many are in the 80% decile.
# Next, we'll look at on-time performance by city, irrespective of carrier.  This will reveal how cities compare with each other.

fs.tidy %>% spread(status, n) %>% 
  group_by(city) %>% 
  summarise(delayed = sum(delayed), ontime = sum(ontime)) %>% 
  mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>% 
  select(city, OTperf) %>% 
  arrange(desc(OTperf)) -> OTPbycity

# Again, a bar chart will serve nicely to visualize this.

ggplot(OTPbycity, aes(x = reorder(city, -OTperf), y = OTperf)) + 
  geom_bar(stat = "identity") + 
  labs(x = NULL, y = "% departed on-time")  

  # scale_y_continuous(breaks = seq(0,6000,1000))

# We might think that higher volume would correlate with poorer on-time performance (i.e. due to logistical complexity or delay cascades); however, the highest trafficked city for these two carriers also has the highest ontime performance.  That said, this data does not seem representative of broader industry patterns - SFO served 55 million passengers in 2017, while Phoenix served 43 million. Between the two carriers, SFO did poorly in terms of on-time departures.  
# Given SFO's on-time perforamnce is poor, we'd like to  understand what's driving this.  Alaska is less on-time across the five cities - does SFO's performance have something to do with Alaska?  We'll total traffic and on-time performance by carrier to find out.

fs.tidy %>% spread(status, n) %>% 
  filter(city %in% c("San.Francisco")) %>% 
  group_by(carrier) %>%
  summarise(delayed = sum(delayed), ontime = sum(ontime)) %>% 
  mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>% 
  select(carrier, total, OTperf) %>% 
  arrange(desc(OTperf))
## # A tibble: 2 x 3
##   carrier total OTperf
##   <chr>   <int>  <dbl>
## 1 ALASKA    605  0.831
## 2 AM WEST   449  0.713
# While Alaska has higher traffic through SFO, it's on-time performance is actually the higher of the two carriers.  AM West has lower throughput but much lower on-time performance.
# Finally, let's look at the on-time perfomance of each carrier by city.  First, Alaska: 

fs.tidy %>% spread(status, n) %>% 
  group_by(city) %>% 
  filter(carrier == "ALASKA") %>% 
  summarise(delayed = sum(delayed), ontime = sum(ontime)) %>% 
  mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>% 
  select(city, total, OTperf) %>% 
  arrange(desc(OTperf)) -> ALASKA.OTPbycity
ALASKA.OTPbycity
## # A tibble: 5 x 3
##   city          total OTperf
##   <chr>         <int>  <dbl>
## 1 Phoenix         233  0.948
## 2 San.Diego       232  0.914
## 3 Los.Angeles     559  0.889
## 4 Seattle        2146  0.858
## 5 San.Francisco   605  0.831
# For Alaska, Seattle (a high volume city) and San Francisco (in the middle of the pack) are the poorest performers.  As Seattle is a high volume hub for Alaska, Seattle's lower performance drags down Alaska's average.
# Finally, let's look at the on-time perfomance of each carrier by city.  First, Alaska: 

fs.tidy %>% spread(status, n) %>% 
  group_by(city) %>% 
  filter(carrier == "AM WEST") %>% 
  summarise(delayed = sum(delayed), ontime = sum(ontime)) %>% 
  mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>% 
  select(city, total, OTperf) %>% 
  arrange(desc(OTperf)) -> AMWESTA.OTPbycity
AMWESTA.OTPbycity
## # A tibble: 5 x 3
##   city          total OTperf
##   <chr>         <int>  <dbl>
## 1 Phoenix        5255  0.921
## 2 Los.Angeles     811  0.856
## 3 San.Diego       448  0.855
## 4 Seattle         262  0.767
## 5 San.Francisco   449  0.713
# For AM West, Seattle (the lowest volume city) and San Francisco (middle of the pack) are the poorest performers.  As Phoenix is responsible for the highest flight volume for AM West, Phoenix's higher performance elevates AM West's average in spite of lower perforamnce at other cities.
Conclusions

Code should be in R Markdow file posted to rpubs.com Include narrative descriptions of data cleanup work, analysis, and conclusions