#Data 607: Tidying and transforming Data

initial chart

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines

Import Libraries

library(dplyr, warn.conflicts = F)
library(tidyr)
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

Loading the CSV

I copied the provided data to a csv and stored it on my github account.

flight_data <- read.csv("https://raw.githubusercontent.com/georg4re/DS607/master/data/flight-hw5-data.csv", stringsAsFactors = FALSE)
head(flight_data)
##        X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time         497     221       212           503    1841
## 2 ALASKA delayed          62      12        20           102     305
## 3                         NA      NA        NA            NA      NA
## 4 AMWEST on time         694    4840       383           320     201
## 5 AMWEST delayed         117     415        65           129      61

Initial Data Cleanup

flight_data <- flight_data %>%
  rename(airline = X, status = X.1) %>%
  fill(airline) %>%
  na.omit()
flight_data
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4  AMWEST on time         694    4840       383           320     201
## 5  AMWEST delayed         117     415        65           129      61

Transforming the Data

Flights by Airline and city

flightsbyairline <- flight_data %>%
  gather("city", "Number", 3:7) %>%
  spread("status", "Number", 3:7) %>%
  rename(on_time='on time')
## Warning in if (!is.na(fill)) {: the condition has length > 1 and only the first
## element will be used
flightsbyairline
##    airline          city delayed on_time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6   AMWEST   Los.Angeles     117     694
## 7   AMWEST       Phoenix     415    4840
## 8   AMWEST     San.Diego      65     383
## 9   AMWEST San.Francisco     129     320
## 10  AMWEST       Seattle      61     201

Perform Calculations

Now, with the data “normalized” we want to perform some calculations.

flightsbyairline <- flightsbyairline %>%
  mutate(flights = delayed + on_time, onTimeAvg = (on_time/flights)*100) %>%
  arrange(desc(onTimeAvg))
flightsbyairline
##    airline          city delayed on_time flights onTimeAvg
## 1   ALASKA       Phoenix      12     221     233  94.84979
## 2   AMWEST       Phoenix     415    4840    5255  92.10276
## 3   ALASKA     San.Diego      20     212     232  91.37931
## 4   ALASKA   Los.Angeles      62     497     559  88.90877
## 5   ALASKA       Seattle     305    1841    2146  85.78751
## 6   AMWEST   Los.Angeles     117     694     811  85.57337
## 7   AMWEST     San.Diego      65     383     448  85.49107
## 8   ALASKA San.Francisco     102     503     605  83.14050
## 9   AMWEST       Seattle      61     201     262  76.71756
## 10  AMWEST San.Francisco     129     320     449  71.26949

By calculating the Avg of on time flights and sorting the table from high to low average we can start to see that Alaska seems to have a better average per city than AMWest. Let’s see it in a histogram:

ggplot(flightsbyairline, aes(x = city, y = onTimeAvg, fill=airline)) + 
  geom_bar(stat="identity", position = position_dodge2())

Flights per City

flightsbycity2 <- flight_data %>%
  gather("city", "Number", 3:7) %>%
  spread("airline", "Number", 3:7) %>%
  mutate(flights = ALASKA + AMWEST) %>%
  select(-c(ALASKA, AMWEST)) %>%
  spread(status, flights) %>% 
  rename(on_time="on time") %>%
  mutate(flights = delayed + on_time, onTimeAvg = (on_time/flights)*100) %>%
  arrange(desc(onTimeAvg))
## Warning in if (!is.na(fill)) {: the condition has length > 1 and only the first
## element will be used
flightsbycity2
##            city delayed on_time flights onTimeAvg
## 1       Phoenix     427    5061    5488  92.21939
## 2     San.Diego      85     595     680  87.50000
## 3   Los.Angeles     179    1191    1370  86.93431
## 4       Seattle     366    2042    2408  84.80066
## 5 San.Francisco     231     823    1054  78.08349

When we group the data by city, we can see that the city with the best on time average is Phoenix. Let’s compare all cities in a graph.

ggplot(flightsbycity2, aes(x = reorder(city, -onTimeAvg), y = onTimeAvg, fill=city)) + 
  geom_bar(stat="identity", position = position_dodge2()) + 
  theme(axis.title.x = element_blank())

Conclusions

Irrespective to Airline, Phoenix is the airport with the best average of on time flights and San Francisco has the worst average of the cities compared. Alaska does better than AMWEST in each of the cities compared.

LS0tDQp0aXRsZTogIkRTNjA3LUhXNSINCmF1dGhvcjogIkdlb3JnZSBDcnV6Ig0KZGF0ZTogImByIFN5cy5EYXRlKClgIg0Kb3V0cHV0OiBvcGVuaW50cm86OmxhYl9yZXBvcnQNCi0tLQ0KI0RhdGEgNjA3OiBUaWR5aW5nIGFuZCB0cmFuc2Zvcm1pbmcgRGF0YQ0KDQohW2luaXRpYWwgY2hhcnRdKGluaXRpYWxfY2hhcnQuSlBHKQ0KDQpUaGUgY2hhcnQgYWJvdmUgZGVzY3JpYmVzIGFycml2YWwgZGVsYXlzIGZvciB0d28gYWlybGluZXMgYWNyb3NzIGZpdmUgZGVzdGluYXRpb25zLiBZb3VyIHRhc2sgaXMgdG86DQoNCjEuIENyZWF0ZSBhIC5DU1YgZmlsZSAob3Igb3B0aW9uYWxseSwgYSBNeVNRTCBkYXRhYmFzZSEpIHRoYXQgaW5jbHVkZXMgYWxsIG9mIHRoZSBpbmZvcm1hdGlvbiBhYm92ZS4NCllvdeKAmXJlIGVuY291cmFnZWQgdG8gdXNlIGEg4oCcd2lkZeKAnSBzdHJ1Y3R1cmUgc2ltaWxhciB0byBob3cgdGhlIGluZm9ybWF0aW9uIGFwcGVhcnMgYWJvdmUsIHNvDQp0aGF0IHlvdSBjYW4gcHJhY3RpY2UgdGlkeWluZyBhbmQgdHJhbnNmb3JtYXRpb25zIGFzIGRlc2NyaWJlZCBiZWxvdy4NCjIuIFJlYWQgdGhlIGluZm9ybWF0aW9uIGZyb20geW91ciAuQ1NWIGZpbGUgaW50byBSLCBhbmQgdXNlIHRpZHlyIGFuZCBkcGx5ciBhcyBuZWVkZWQgdG8gdGlkeQ0KYW5kIHRyYW5zZm9ybSB5b3VyIGRhdGEuDQozLiBQZXJmb3JtIGFuYWx5c2lzIHRvIGNvbXBhcmUgdGhlIGFycml2YWwgZGVsYXlzIGZvciB0aGUgdHdvIGFpcmxpbmVzDQoNCiMjIyBJbXBvcnQgTGlicmFyaWVzDQpgYGB7ciBsb2FkLXBhY2thZ2VzfQ0KbGlicmFyeShkcGx5ciwgd2Fybi5jb25mbGljdHMgPSBGKQ0KbGlicmFyeSh0aWR5cikNCmxpYnJhcnkoZ2dwbG90MikNCmxpYnJhcnkocmVzaGFwZTIpDQpgYGANCg0KIyMjIExvYWRpbmcgdGhlIENTVg0KDQpJIGNvcGllZCB0aGUgcHJvdmlkZWQgZGF0YSB0byBhIGNzdiBhbmQgc3RvcmVkIGl0IG9uIG15IGdpdGh1YiBhY2NvdW50Lg0KDQpgYGB7ciBjb2RlLWNodW5rLWxhYmVsfQ0KZmxpZ2h0X2RhdGEgPC0gcmVhZC5jc3YoImh0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9nZW9yZzRyZS9EUzYwNy9tYXN0ZXIvZGF0YS9mbGlnaHQtaHc1LWRhdGEuY3N2Iiwgc3RyaW5nc0FzRmFjdG9ycyA9IEZBTFNFKQ0KaGVhZChmbGlnaHRfZGF0YSkNCmBgYA0KDQojIyMgSW5pdGlhbCBEYXRhIENsZWFudXANCmBgYHtyfQ0KZmxpZ2h0X2RhdGEgPC0gZmxpZ2h0X2RhdGEgJT4lDQogIHJlbmFtZShhaXJsaW5lID0gWCwgc3RhdHVzID0gWC4xKSAlPiUNCiAgZmlsbChhaXJsaW5lKSAlPiUNCiAgbmEub21pdCgpDQpmbGlnaHRfZGF0YQ0KYGBgDQojIyMgVHJhbnNmb3JtaW5nIHRoZSBEYXRhDQoNCioqRmxpZ2h0cyBieSBBaXJsaW5lIGFuZCBjaXR5KioNCmBgYHtyfQ0KZmxpZ2h0c2J5YWlybGluZSA8LSBmbGlnaHRfZGF0YSAlPiUNCiAgZ2F0aGVyKCJjaXR5IiwgIk51bWJlciIsIDM6NykgJT4lDQogIHNwcmVhZCgic3RhdHVzIiwgIk51bWJlciIsIDM6NykgJT4lDQogIHJlbmFtZShvbl90aW1lPSdvbiB0aW1lJykNCg0KZmxpZ2h0c2J5YWlybGluZQ0KYGBgDQojIyMgUGVyZm9ybSBDYWxjdWxhdGlvbnMNCk5vdywgd2l0aCB0aGUgZGF0YSAibm9ybWFsaXplZCIgd2Ugd2FudCB0byBwZXJmb3JtIHNvbWUgY2FsY3VsYXRpb25zLg0KDQpgYGB7cn0NCg0KZmxpZ2h0c2J5YWlybGluZSA8LSBmbGlnaHRzYnlhaXJsaW5lICU+JQ0KICBtdXRhdGUoZmxpZ2h0cyA9IGRlbGF5ZWQgKyBvbl90aW1lLCBvblRpbWVBdmcgPSAob25fdGltZS9mbGlnaHRzKSoxMDApICU+JQ0KICBhcnJhbmdlKGRlc2Mob25UaW1lQXZnKSkNCmZsaWdodHNieWFpcmxpbmUNCmBgYA0KQnkgY2FsY3VsYXRpbmcgdGhlIEF2ZyBvZiBvbiB0aW1lIGZsaWdodHMgYW5kIHNvcnRpbmcgdGhlIHRhYmxlIGZyb20gaGlnaCB0byBsb3cgYXZlcmFnZSB3ZSBjYW4gc3RhcnQgdG8gc2VlIHRoYXQgQWxhc2thIHNlZW1zIHRvIGhhdmUgYSBiZXR0ZXIgYXZlcmFnZSBwZXIgY2l0eSB0aGFuIEFNV2VzdC4gTGV0J3Mgc2VlIGl0IGluIGEgaGlzdG9ncmFtOg0KDQpgYGB7cn0NCmdncGxvdChmbGlnaHRzYnlhaXJsaW5lLCBhZXMoeCA9IGNpdHksIHkgPSBvblRpbWVBdmcsIGZpbGw9YWlybGluZSkpICsgDQogIGdlb21fYmFyKHN0YXQ9ImlkZW50aXR5IiwgcG9zaXRpb24gPSBwb3NpdGlvbl9kb2RnZTIoKSkNCmBgYA0KDQoNCiMjIyBGbGlnaHRzIHBlciBDaXR5DQpgYGB7cn0NCmZsaWdodHNieWNpdHkyIDwtIGZsaWdodF9kYXRhICU+JQ0KICBnYXRoZXIoImNpdHkiLCAiTnVtYmVyIiwgMzo3KSAlPiUNCiAgc3ByZWFkKCJhaXJsaW5lIiwgIk51bWJlciIsIDM6NykgJT4lDQogIG11dGF0ZShmbGlnaHRzID0gQUxBU0tBICsgQU1XRVNUKSAlPiUNCiAgc2VsZWN0KC1jKEFMQVNLQSwgQU1XRVNUKSkgJT4lDQogIHNwcmVhZChzdGF0dXMsIGZsaWdodHMpICU+JSANCiAgcmVuYW1lKG9uX3RpbWU9Im9uIHRpbWUiKSAlPiUNCiAgbXV0YXRlKGZsaWdodHMgPSBkZWxheWVkICsgb25fdGltZSwgb25UaW1lQXZnID0gKG9uX3RpbWUvZmxpZ2h0cykqMTAwKSAlPiUNCiAgYXJyYW5nZShkZXNjKG9uVGltZUF2ZykpDQpmbGlnaHRzYnljaXR5Mg0KYGBgDQpXaGVuIHdlIGdyb3VwIHRoZSBkYXRhIGJ5IGNpdHksIHdlIGNhbiBzZWUgdGhhdCB0aGUgY2l0eSB3aXRoIHRoZSBiZXN0IG9uIHRpbWUgYXZlcmFnZSBpcyBQaG9lbml4Lg0KTGV0J3MgY29tcGFyZSBhbGwgY2l0aWVzIGluIGEgZ3JhcGguDQoNCmBgYHtyfQ0KZ2dwbG90KGZsaWdodHNieWNpdHkyLCBhZXMoeCA9IHJlb3JkZXIoY2l0eSwgLW9uVGltZUF2ZyksIHkgPSBvblRpbWVBdmcsIGZpbGw9Y2l0eSkpICsgDQogIGdlb21fYmFyKHN0YXQ9ImlkZW50aXR5IiwgcG9zaXRpb24gPSBwb3NpdGlvbl9kb2RnZTIoKSkgKyANCiAgdGhlbWUoYXhpcy50aXRsZS54ID0gZWxlbWVudF9ibGFuaygpKQ0KYGBgDQoNCiMjIyBDb25jbHVzaW9ucw0KDQpJcnJlc3BlY3RpdmUgdG8gQWlybGluZSwgUGhvZW5peCBpcyB0aGUgYWlycG9ydCB3aXRoIHRoZSBiZXN0IGF2ZXJhZ2Ugb2Ygb24gdGltZSBmbGlnaHRzIGFuZCBTYW4gRnJhbmNpc2NvIGhhcyB0aGUgd29yc3QgYXZlcmFnZSBvZiB0aGUgY2l0aWVzIGNvbXBhcmVkLiBBbGFza2EgZG9lcyBiZXR0ZXIgdGhhbiBBTVdFU1QgaW4gZWFjaCBvZiB0aGUgY2l0aWVzIGNvbXBhcmVkLg0KDQoNCi4uLg0KDQo=