Assignment – Tidying and Transforming Data

The chart above describes arrival delays for two airlines across five destinations.

Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines.
  4. Your code should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Read in the table from the .CSV file:

Reading in the table and removing the na’s from the empty row.

library(tidyverse)
arrivals <- read.csv("Arrivals.csv", sep = ",", stringsAsFactors = FALSE) 

arrivals_wide  <- arrivals %>% drop_na()
arrivals_wide
##   Airlines Arrivals Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1   Alaska  on time         497     221       212           503    1841
## 2           delayed          62      12        20           102     305
## 4  AM West on time          694    4840       383           320     201
## 5           delayed         117     415        65           129      61
str(arrivals_wide)
## 'data.frame':    4 obs. of  7 variables:
##  $ Airlines     : chr  "Alaska" "" "AM West" ""
##  $ Arrivals     : chr  "on time" "delayed" "on time " "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 61

Tidy the Data

Tidying the wide data by first cleaning up the Arrivals, replacing empty spaces with NA and trimming the empty space around Arrivals, followed by filling in the Airlines for each arrival. The final step is gathering the on time and delayed arrivals in each city and arranging by Airline:

arrivals_long <- arrivals_wide %>% 
    mutate(Airline = ifelse(Airlines == "", NA, Airlines)) %>%
    mutate(Arrival = str_trim(Arrivals, side="both")) %>%
    select(-Airlines, -Arrivals) %>%
    fill(Airline, .direction = "down") %>%
    gather("City", "NumFlights", Los.Angeles:Seattle) %>%
    arrange(Airline)

arrivals_long
##    Airline Arrival          City NumFlights
## 1   Alaska on time   Los.Angeles        497
## 2   Alaska delayed   Los.Angeles         62
## 3   Alaska on time       Phoenix        221
## 4   Alaska delayed       Phoenix         12
## 5   Alaska on time     San.Diego        212
## 6   Alaska delayed     San.Diego         20
## 7   Alaska on time San.Francisco        503
## 8   Alaska delayed San.Francisco        102
## 9   Alaska on time       Seattle       1841
## 10  Alaska delayed       Seattle        305
## 11 AM West on time   Los.Angeles        694
## 12 AM West delayed   Los.Angeles        117
## 13 AM West on time       Phoenix       4840
## 14 AM West delayed       Phoenix        415
## 15 AM West on time     San.Diego        383
## 16 AM West delayed     San.Diego         65
## 17 AM West on time San.Francisco        320
## 18 AM West delayed San.Francisco        129
## 19 AM West on time       Seattle        201
## 20 AM West delayed       Seattle         61

Plot the Arrivals by Airline:

Comparing the on time arrival percentages for each airline.

ggplot(arrivals_long, aes(x=Airline, y=NumFlights)) + 
    geom_bar(stat="identity", aes(fill=Arrival), position="fill") +
    ylab("Ratio of Flights") +
    coord_flip() +
    theme(legend.position = "top") 

AM West’s on time percentages are higher than Alaska’s when comparing the overall set of arrivals.

Plot the Arrivals by City:

And further comparing the arrival on time arrival percentages by city.

ggplot(arrivals_long, aes(x=City, y=NumFlights)) + 
    geom_bar(stat="identity", aes(fill=Arrival), position="fill") +
    ylab("Ratio of Flights") +
    coord_flip() +
    theme(legend.position = "top") 

Phoenix has the best on time percentage while San Francsico has the worst on time arrival percentage.