Reading in the csv file and editing the headers so Airline and Location are included, as well as cleaner city names.
airlines <- read.csv(file='/Users/Michele/Desktop/Airlines.csv', sep = ",", header = TRUE)
colnames(airlines) <- c("Airline", "OnTime", "Los Angeles", "Phoenix", "San Diego", "San Fransisco", "Seattle")
airlines
## Airline OnTime Los Angeles Phoenix San Diego San Fransisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
Using zoo library and editted code found on stackoverflow (link below), here we are filling in the blank values with “NA” on airline’s heirarchical index and replacing it (using na.locf) with the value directly above.
https://stackoverflow.com/questions/9514504/add-missing-value-in-column-with-value-from-row-above
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
airlines$Airline[airlines$Airline == ""] <- NA
airlines$Airline <- na.locf(airlines$Airline, option="locf")
airlines_fixed <- airlines
airlines_fixed
## Airline OnTime Los Angeles Phoenix San Diego San Fransisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
With our cleaner dataset, we can move on to tidying it up. The biggest problem is outlined in Hadley Wickham’s paper section 3.1 stating that Column headers are values, not variable names. In our airlines dataset, the Destination Cities make up column headers, but they are also values for analysis. To fix it, we’re going to “melt”, or stack it. Columns become a row named Destination. The values become a column called Frequency. I have also decided to arrange the dataset using the OnTime column for heightened readability.
library('tidyr')
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
airlines_gathered <- gather(airlines_fixed, "Destination", "Frequency", 3:7)
tidy_airlines <- arrange(airlines_gathered, OnTime)
tidy_airlines
## Airline OnTime Destination Frequency
## 1 ALASKA delayed Los Angeles 62
## 2 AM WEST delayed Los Angeles 117
## 3 ALASKA delayed Phoenix 12
## 4 AM WEST delayed Phoenix 415
## 5 ALASKA delayed San Diego 20
## 6 AM WEST delayed San Diego 65
## 7 ALASKA delayed San Fransisco 102
## 8 AM WEST delayed San Fransisco 129
## 9 ALASKA delayed Seattle 305
## 10 AM WEST delayed Seattle 61
## 11 ALASKA on time Los Angeles 497
## 12 AM WEST on time Los Angeles 694
## 13 ALASKA on time Phoenix 221
## 14 AM WEST on time Phoenix 4840
## 15 ALASKA on time San Diego 212
## 16 AM WEST on time San Diego 383
## 17 ALASKA on time San Fransisco 503
## 18 AM WEST on time San Fransisco 320
## 19 ALASKA on time Seattle 1841
## 20 AM WEST on time Seattle 201
Next, using the dplyr group_by and summarise function, we can easily determine the mean, min, max, median, and standard deviation for both airlines.
tidy_airlines %>%
group_by(Airline) %>%
filter(OnTime == "delayed") %>%
summarise(mean = mean(Frequency),
min = min(Frequency),
max = max(Frequency),
median = median(Frequency),
standard_deviation = sd(Frequency),
sum = sum(Frequency)
)
## # A tibble: 2 x 7
## Airline mean min max median standard_deviation sum
## <fctr> <dbl> <dbl> <dbl> <int> <dbl> <int>
## 1 ALASKA 100.2 12 305 62 120.0175 501
## 2 AM WEST 157.4 61 415 117 147.1625 787
Comparing the two airlines, we can notice that over the span of all locations, AM West has the most delays, having a larger value in every city. However, it appears as through AM West also has more flights than ALASKA. I decided to generate another column using dplyr’s mutate function, which is called Ratio. This takes the frequency of flights and divides it by the sum of flights per city. This shows a percentage of delayed or on time flights for each city.
tidy_airlines_1 <- tidy_airlines %>%
group_by(Airline, Destination) %>%
arrange(Airline) %>%
mutate(FlightsPerCity = sum(Frequency),
Ratio = Frequency/FlightsPerCity)
tidy_airlines_1
## # A tibble: 20 x 6
## # Groups: Airline, Destination [10]
## Airline OnTime Destination Frequency FlightsPerCity Ratio
## <fctr> <fctr> <chr> <int> <int> <dbl>
## 1 ALASKA delayed Los Angeles 62 559 0.11091234
## 2 ALASKA delayed Phoenix 12 233 0.05150215
## 3 ALASKA delayed San Diego 20 232 0.08620690
## 4 ALASKA delayed San Fransisco 102 605 0.16859504
## 5 ALASKA delayed Seattle 305 2146 0.14212488
## 6 ALASKA on time Los Angeles 497 559 0.88908766
## 7 ALASKA on time Phoenix 221 233 0.94849785
## 8 ALASKA on time San Diego 212 232 0.91379310
## 9 ALASKA on time San Fransisco 503 605 0.83140496
## 10 ALASKA on time Seattle 1841 2146 0.85787512
## 11 AM WEST delayed Los Angeles 117 811 0.14426634
## 12 AM WEST delayed Phoenix 415 5255 0.07897241
## 13 AM WEST delayed San Diego 65 448 0.14508929
## 14 AM WEST delayed San Fransisco 129 449 0.28730512
## 15 AM WEST delayed Seattle 61 262 0.23282443
## 16 AM WEST on time Los Angeles 694 811 0.85573366
## 17 AM WEST on time Phoenix 4840 5255 0.92102759
## 18 AM WEST on time San Diego 383 448 0.85491071
## 19 AM WEST on time San Fransisco 320 449 0.71269488
## 20 AM WEST on time Seattle 201 262 0.76717557
The same summarise function is used below on the Ratio column. Now, the data tells us a different story.
tidy_airlines_1 %>%
group_by(Airline) %>%
filter(OnTime == "delayed") %>%
summarise(mean = mean(Ratio),
min = min(Ratio),
max = max(Ratio),
median = median(Ratio),
standard_deviation = sd(Ratio)
)
## # A tibble: 2 x 6
## Airline mean min max median standard_deviation
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 0.1118683 0.05150215 0.1685950 0.1109123 0.04592624
## 2 AM WEST 0.1776915 0.07897241 0.2873051 0.1450893 0.08212854
Now, it appears as though AM WEST has more delays than ALASKA Airlines. All of its percent values are higher.