I will be working with tidyr and dplyr to do most of my data tidying and transformation as per the assignment instructions.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
I wrote the data exactly as it was presented in the assignment instructions and loaded it into R using read.csv(). The data is written in a wide format as suggested. I set stringsAsFactors to False because it makes it easier to wrangle the data when it isn’t wrapped up into factors.
ArrivalsDelays = read.csv("ArrivalsDelays.csv",stringsAsFactors = F)
ArrivalsDelays
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
The data has a blank row with na values across all columns, so I removed that blank row using na.omit(). Also, column names were missing for the airline companies and whether a flight was delayed or on time, so I used rename() to rename the columns.
ArrivalsDelays = na.omit(ArrivalsDelays)
ArrivalsDelays = rename(ArrivalsDelays, Airline = X, IsOnTime = X.1)
ArrivalsDelays
## Airline IsOnTime Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
The raw table omits the airline name for the delayed flights to improve visual clarity when reading the table in a book. However, since I am doing analysis on the table, I need to reformat the Airline column to include the airline name for delayed flights. The code below subsets all unique non-blank airline names and fills in the Airline vector. The vector is then reattached to the dataframe.
Airlines = ArrivalsDelays$Airline[ArrivalsDelays$Airline != ""]
AirlineVector = c(rep(Airlines[1],2),rep(Airlines[2],2))
ArrivalsDelays$Airline = AirlineVector
ArrivalsDelays
## Airline IsOnTime Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
After filling in the blank values, removing empty rows, and renaming columns, I can finally gather the data together into a tidy format. I used the gather() function to reorganize the data into a tidy format.
ArrivalsDelays = gather(ArrivalsDelays,"City","Count",3:7)
ArrivalsDelays
## Airline IsOnTime City Count
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
I am interested in seeing which airline is more reliable to fly to each city. The way I am going to determine this is by calculating the rate of “on time” flights for each airline in each city. I will then compare them in a nice chart at the end.
The first step in creating this new metric is to group the data by Airline and by City and then summing the “on time” and “delayed” flights to create the TotalFlights metric.
The next step is to join the new dataframe onto the original one. This allows the calculation between count and TotalFlights to be easy using mutate().
The final step is to mutate() Count and TotalFlights to calculate the percentage of “on time” and “delayed” flights.
TotalFlights =
ArrivalsDelays %>%
group_by(Airline,City) %>%
summarize(TotalFlights = sum(Count))
ArrivalsDelays = left_join(ArrivalsDelays,TotalFlights, by = c("Airline","City"))
ArrivalsDelays = mutate(ArrivalsDelays, Percentage = Count/TotalFlights)
ArrivalsDelays
## Airline IsOnTime City Count TotalFlights Percentage
## 1 ALASKA on time Los.Angeles 497 559 0.88908766
## 2 ALASKA delayed Los.Angeles 62 559 0.11091234
## 3 AM WEST on time Los.Angeles 694 811 0.85573366
## 4 AM WEST delayed Los.Angeles 117 811 0.14426634
## 5 ALASKA on time Phoenix 221 233 0.94849785
## 6 ALASKA delayed Phoenix 12 233 0.05150215
## 7 AM WEST on time Phoenix 4840 5255 0.92102759
## 8 AM WEST delayed Phoenix 415 5255 0.07897241
## 9 ALASKA on time San.Diego 212 232 0.91379310
## 10 ALASKA delayed San.Diego 20 232 0.08620690
## 11 AM WEST on time San.Diego 383 448 0.85491071
## 12 AM WEST delayed San.Diego 65 448 0.14508929
## 13 ALASKA on time San.Francisco 503 605 0.83140496
## 14 ALASKA delayed San.Francisco 102 605 0.16859504
## 15 AM WEST on time San.Francisco 320 449 0.71269488
## 16 AM WEST delayed San.Francisco 129 449 0.28730512
## 17 ALASKA on time Seattle 1841 2146 0.85787512
## 18 ALASKA delayed Seattle 305 2146 0.14212488
## 19 AM WEST on time Seattle 201 262 0.76717557
## 20 AM WEST delayed Seattle 61 262 0.23282443
Below, I have written a little ggplot barchart comparing “on time” flight percentage in each city by airline. According to the plot, Alaska is more reliable across the board for every city.
ggplot(filter(ArrivalsDelays, IsOnTime == "on time")) +
geom_bar(aes(x = City, y = Percentage, fill = Airline),stat = "identity", position = "dodge", width = 0.7) +
labs(title = "On Time Flight Percentage by City")