Tidy Data

Reading in the csv file and editing the headers so Airline and Location are included, as well as cleaner city names.

airlines <- read.csv(file='/Users/Michele/Desktop/Airlines.csv', sep = ",", header = TRUE)
colnames(airlines) <- c("Airline", "OnTime", "Los Angeles", "Phoenix", "San Diego", "San Fransisco", "Seattle")
airlines
##   Airline  OnTime Los Angeles Phoenix San Diego San Fransisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61

Using zoo library and editted code found on stackoverflow (link below), here we are filling in the blank values with “NA” on airline’s heirarchical index and replacing it (using na.locf) with the value directly above.

https://stackoverflow.com/questions/9514504/add-missing-value-in-column-with-value-from-row-above

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
airlines$Airline[airlines$Airline == ""] <- NA
airlines$Airline <- na.locf(airlines$Airline, option="locf")
airlines_fixed <- airlines
airlines_fixed
##   Airline  OnTime Los Angeles Phoenix San Diego San Fransisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

With our cleaner dataset, we can move on to tidying it up. The biggest problem is outlined in Hadley Wickham’s paper section 3.1 stating that Column headers are values, not variable names. In our airlines dataset, the Destination Cities make up column headers, but they are also values for analysis. To fix it, we’re going to “melt”, or stack it. Columns become a row named Destination. The values become a column called Frequency. I have also decided to arrange the dataset using the OnTime column for heightened readability.

library('tidyr')
library('dplyr')
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
airlines_gathered <- gather(airlines_fixed, "Destination", "Frequency", 3:7)
tidy_airlines <- arrange(airlines_gathered, OnTime)
tidy_airlines
##    Airline  OnTime   Destination Frequency
## 1   ALASKA delayed   Los Angeles        62
## 2  AM WEST delayed   Los Angeles       117
## 3   ALASKA delayed       Phoenix        12
## 4  AM WEST delayed       Phoenix       415
## 5   ALASKA delayed     San Diego        20
## 6  AM WEST delayed     San Diego        65
## 7   ALASKA delayed San Fransisco       102
## 8  AM WEST delayed San Fransisco       129
## 9   ALASKA delayed       Seattle       305
## 10 AM WEST delayed       Seattle        61
## 11  ALASKA on time   Los Angeles       497
## 12 AM WEST on time   Los Angeles       694
## 13  ALASKA on time       Phoenix       221
## 14 AM WEST on time       Phoenix      4840
## 15  ALASKA on time     San Diego       212
## 16 AM WEST on time     San Diego       383
## 17  ALASKA on time San Fransisco       503
## 18 AM WEST on time San Fransisco       320
## 19  ALASKA on time       Seattle      1841
## 20 AM WEST on time       Seattle       201

Next, using the dplyr group_by and summarise function, we can easily determine the mean, min, max, median, and standard deviation for both airlines.

tidy_airlines %>%
  group_by(Airline) %>%
  filter(OnTime == "delayed") %>%
  summarise(mean = mean(Frequency), 
            min = min(Frequency),
            max = max(Frequency),
            median = median(Frequency),
            standard_deviation = sd(Frequency),
            sum = sum(Frequency)
            )
## # A tibble: 2 x 7
##   Airline  mean   min   max median standard_deviation   sum
##    <fctr> <dbl> <dbl> <dbl>  <int>              <dbl> <int>
## 1  ALASKA 100.2    12   305     62           120.0175   501
## 2 AM WEST 157.4    61   415    117           147.1625   787

Comparing the two airlines, we can notice that over the span of all locations, AM West has the most delays, having a larger value in every city. However, it appears as through AM West also has more flights than ALASKA. I decided to generate another column using dplyr’s mutate function, which is called Ratio. This takes the frequency of flights and divides it by the sum of flights per city. This shows a percentage of delayed or on time flights for each city.

tidy_airlines_1 <- tidy_airlines %>%
  group_by(Airline, Destination) %>%
  arrange(Airline) %>%
  mutate(FlightsPerCity = sum(Frequency),
         Ratio = Frequency/FlightsPerCity)
tidy_airlines_1
## # A tibble: 20 x 6
## # Groups:   Airline, Destination [10]
##    Airline  OnTime   Destination Frequency FlightsPerCity      Ratio
##     <fctr>  <fctr>         <chr>     <int>          <int>      <dbl>
##  1  ALASKA delayed   Los Angeles        62            559 0.11091234
##  2  ALASKA delayed       Phoenix        12            233 0.05150215
##  3  ALASKA delayed     San Diego        20            232 0.08620690
##  4  ALASKA delayed San Fransisco       102            605 0.16859504
##  5  ALASKA delayed       Seattle       305           2146 0.14212488
##  6  ALASKA on time   Los Angeles       497            559 0.88908766
##  7  ALASKA on time       Phoenix       221            233 0.94849785
##  8  ALASKA on time     San Diego       212            232 0.91379310
##  9  ALASKA on time San Fransisco       503            605 0.83140496
## 10  ALASKA on time       Seattle      1841           2146 0.85787512
## 11 AM WEST delayed   Los Angeles       117            811 0.14426634
## 12 AM WEST delayed       Phoenix       415           5255 0.07897241
## 13 AM WEST delayed     San Diego        65            448 0.14508929
## 14 AM WEST delayed San Fransisco       129            449 0.28730512
## 15 AM WEST delayed       Seattle        61            262 0.23282443
## 16 AM WEST on time   Los Angeles       694            811 0.85573366
## 17 AM WEST on time       Phoenix      4840           5255 0.92102759
## 18 AM WEST on time     San Diego       383            448 0.85491071
## 19 AM WEST on time San Fransisco       320            449 0.71269488
## 20 AM WEST on time       Seattle       201            262 0.76717557

The same summarise function is used below on the Ratio column. Now, the data tells us a different story.

tidy_airlines_1 %>%
  group_by(Airline) %>%
  filter(OnTime == "delayed") %>%
  summarise(mean = mean(Ratio), 
            min = min(Ratio),
            max = max(Ratio),
            median = median(Ratio),
            standard_deviation = sd(Ratio)
            )
## # A tibble: 2 x 6
##   Airline      mean        min       max    median standard_deviation
##    <fctr>     <dbl>      <dbl>     <dbl>     <dbl>              <dbl>
## 1  ALASKA 0.1118683 0.05150215 0.1685950 0.1109123         0.04592624
## 2 AM WEST 0.1776915 0.07897241 0.2873051 0.1450893         0.08212854

Now, it appears as though AM WEST has more delays than ALASKA Airlines. All of its percent values are higher.