Introduction

This assignment presents the capabilities of the R packages of tidyr and dplyr in order to tidy an input CSV file and provide analysis through transformations of the tidy dataframe. The tidy nature of the data follows the definition of Tidy Data as defined by Hadley Wickham. The analysis compares the flight arrivals at five American airports from two airlines.

Tidy Input File

In an attempt to remain as true as possible to the assignment’s example input file, the raw CSV file does not contain two column headers, an empty row, and only lists the two airlines once each. In order to tidy the input data, column headers were added for the first two columns, Airline and Arrival. The arrival column header indicates the category of delayed or on-time. Next, the empty row was removed from the data. Finally, airline names were populated for each subsequent row that initially did not include the airline name. These tidy tactics allowed for full population of the input data table.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the csv file from the local directory
arrivals_raw <- read.csv(file = 'arrivals.csv', header=TRUE, check.names=FALSE, fill=TRUE, na.strings=c(""))

names(arrivals_raw)[1] <- "Airline"
names(arrivals_raw)[2] <- "Arrival"

arrivals_raw_rm <- arrivals_raw %>% filter_all(any_vars(!is.na(.)))

arrivals_raw_filled <- arrivals_raw_rm %>% fill(Airline)

Create Long Tidy Dataframe

From the full table based on the input file, the data was denormalized to meet the tidy long standards as defined by Wickham. This form of the data follows the rules in which each column is a variable, each observation forms a row, and each observational type forms a table. Because of the tidy rules, the rows/observations do contain data redundancy, but in doing so, the tidy form ensures for each analysis of the numerical data by observation. The long tidy form of the data identifies the number of cases of on-time and delayed flights by the combination of airline and destination airport.

table_tidy_long <- arrivals_raw_filled %>% 
  pivot_longer(c(`Los Angeles`, `Phoenix`, `San Diego`, `San Francisco`, `Seattle`), names_to = "Destination", values_to = "Cases")

table_tidy_long
## # A tibble: 20 x 4
##    Airline Arrival Destination   Cases
##    <fct>   <fct>   <chr>         <int>
##  1 ALASKA  on time Los Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San Diego       212
##  4 ALASKA  on time San Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San Diego        20
##  9 ALASKA  delayed San Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San Diego       383
## 14 AM WEST on time San Francisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San Diego        65
## 19 AM WEST delayed San Francisco   129
## 20 AM WEST delayed Seattle          61
# Transform tibble to dataframe
table_tidy_long_df <- tbl_df(table_tidy_long)

Filter Long Dataframe

Based on the long tidy form of the data, the data was filtered to only include west coast airports, i.e., airports located in the Pacific time zone, which excluded Phoenix. The decision to use west coast airports was arbitrary as just an exercise of using the filter functionality of the dplyr package.

(filter(table_tidy_long_df, Destination == 'Los Angeles'))
## # A tibble: 4 x 4
##   Airline Arrival Destination Cases
##   <fct>   <fct>   <chr>       <int>
## 1 ALASKA  on time Los Angeles   497
## 2 ALASKA  delayed Los Angeles    62
## 3 AM WEST on time Los Angeles   694
## 4 AM WEST delayed Los Angeles   117
(filter(table_tidy_long_df, Airline == 'ALASKA'))
## # A tibble: 10 x 4
##    Airline Arrival Destination   Cases
##    <fct>   <fct>   <chr>         <int>
##  1 ALASKA  on time Los Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San Diego       212
##  4 ALASKA  on time San Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San Diego        20
##  9 ALASKA  delayed San Francisco   102
## 10 ALASKA  delayed Seattle         305
# Filter for West Coast airpots
(filter(table_tidy_long_df, Destination %in% c('Los Angeles', 'San Francisco', 'San Diego', 'Seattle')))
## # A tibble: 16 x 4
##    Airline Arrival Destination   Cases
##    <fct>   <fct>   <chr>         <int>
##  1 ALASKA  on time Los Angeles     497
##  2 ALASKA  on time San Diego       212
##  3 ALASKA  on time San Francisco   503
##  4 ALASKA  on time Seattle        1841
##  5 ALASKA  delayed Los Angeles      62
##  6 ALASKA  delayed San Diego        20
##  7 ALASKA  delayed San Francisco   102
##  8 ALASKA  delayed Seattle         305
##  9 AM WEST on time Los Angeles     694
## 10 AM WEST on time San Diego       383
## 11 AM WEST on time San Francisco   320
## 12 AM WEST on time Seattle         201
## 13 AM WEST delayed Los Angeles     117
## 14 AM WEST delayed San Diego        65
## 15 AM WEST delayed San Francisco   129
## 16 AM WEST delayed Seattle          61

Select Long Dataframe

The select functionality was used to order the long tidy dataframe by the destination airport, followed by a combined ordering based on the destination airport and the number of cases. The final selection exercise outlined the capability of selecting consecutive columns, which produced a long tidy table without the airlines.

# Re-order by Destination
(arrange(table_tidy_long_df, desc(Destination)))
## # A tibble: 20 x 4
##    Airline Arrival Destination   Cases
##    <fct>   <fct>   <chr>         <int>
##  1 ALASKA  on time Seattle        1841
##  2 ALASKA  delayed Seattle         305
##  3 AM WEST on time Seattle         201
##  4 AM WEST delayed Seattle          61
##  5 ALASKA  on time San Francisco   503
##  6 ALASKA  delayed San Francisco   102
##  7 AM WEST on time San Francisco   320
##  8 AM WEST delayed San Francisco   129
##  9 ALASKA  on time San Diego       212
## 10 ALASKA  delayed San Diego        20
## 11 AM WEST on time San Diego       383
## 12 AM WEST delayed San Diego        65
## 13 ALASKA  on time Phoenix         221
## 14 ALASKA  delayed Phoenix          12
## 15 AM WEST on time Phoenix        4840
## 16 AM WEST delayed Phoenix         415
## 17 ALASKA  on time Los Angeles     497
## 18 ALASKA  delayed Los Angeles      62
## 19 AM WEST on time Los Angeles     694
## 20 AM WEST delayed Los Angeles     117
# Re-order by Destination and Cases
(arrange(table_tidy_long_df, Destination, Cases))
## # A tibble: 20 x 4
##    Airline Arrival Destination   Cases
##    <fct>   <fct>   <chr>         <int>
##  1 ALASKA  delayed Los Angeles      62
##  2 AM WEST delayed Los Angeles     117
##  3 ALASKA  on time Los Angeles     497
##  4 AM WEST on time Los Angeles     694
##  5 ALASKA  delayed Phoenix          12
##  6 ALASKA  on time Phoenix         221
##  7 AM WEST delayed Phoenix         415
##  8 AM WEST on time Phoenix        4840
##  9 ALASKA  delayed San Diego        20
## 10 AM WEST delayed San Diego        65
## 11 ALASKA  on time San Diego       212
## 12 AM WEST on time San Diego       383
## 13 ALASKA  delayed San Francisco   102
## 14 AM WEST delayed San Francisco   129
## 15 AM WEST on time San Francisco   320
## 16 ALASKA  on time San Francisco   503
## 17 AM WEST delayed Seattle          61
## 18 AM WEST on time Seattle         201
## 19 ALASKA  delayed Seattle         305
## 20 ALASKA  on time Seattle        1841
# Select columns by name
(select(table_tidy_long_df, Arrival:Cases))
## # A tibble: 20 x 3
##    Arrival Destination   Cases
##    <fct>   <chr>         <int>
##  1 on time Los Angeles     497
##  2 on time Phoenix         221
##  3 on time San Diego       212
##  4 on time San Francisco   503
##  5 on time Seattle        1841
##  6 delayed Los Angeles      62
##  7 delayed Phoenix          12
##  8 delayed San Diego        20
##  9 delayed San Francisco   102
## 10 delayed Seattle         305
## 11 on time Los Angeles     694
## 12 on time Phoenix        4840
## 13 on time San Diego       383
## 14 on time San Francisco   320
## 15 on time Seattle         201
## 16 delayed Los Angeles     117
## 17 delayed Phoenix         415
## 18 delayed San Diego        65
## 19 delayed San Francisco   129
## 20 delayed Seattle          61

Summarise Long Dataframe

To provide some meanginful analysis of the sample data, an initial summary was performed to count the total number of on-time and delayed flights by airline. The next summary attempt, similar to the first, calculated the count of on-time and delayed flights by destination airport. The final analysis performed grouped the data by destination airport and arrival type, then counting the instances before providing the sum, average, maximum, and minimum of each airport and arrival type pairing before excluding the Phoenix airport. The chained functions allowed for the display of the performance of flight arrivals by airport.

# Group by and Summarise work
by_airline_arrival <- group_by(table_tidy_long_df, Airline, Arrival)
summarise(by_airline_arrival, total = sum(Cases, na.rm = TRUE))
## # A tibble: 4 x 3
## # Groups:   Airline [2]
##   Airline Arrival total
##   <fct>   <fct>   <int>
## 1 ALASKA  delayed   501
## 2 ALASKA  on time  3274
## 3 AM WEST delayed   787
## 4 AM WEST on time  6438
by_dest_arrival <- group_by(table_tidy_long_df, Destination, Arrival)
summarise(by_dest_arrival, total = sum(Cases, na.rm = TRUE))
## # A tibble: 10 x 3
## # Groups:   Destination [5]
##    Destination   Arrival total
##    <chr>         <fct>   <int>
##  1 Los Angeles   delayed   179
##  2 Los Angeles   on time  1191
##  3 Phoenix       delayed   427
##  4 Phoenix       on time  5061
##  5 San Diego     delayed    85
##  6 San Diego     on time   595
##  7 San Francisco delayed   231
##  8 San Francisco on time   823
##  9 Seattle       delayed   366
## 10 Seattle       on time  2042
# Group flights by destination and arrival type.
# Summarise to compute total count, mean, maximum and minimum
# Filter to remove Phoenix airport
arrivals <- table_tidy_long_df %>% 
  group_by(Destination, Arrival) %>% 
  summarise(
    count = n(),
    total_arrivals = sum(Cases, na.rm = TRUE),
    avg = mean(Cases, na.rm = TRUE),
    max = max(Cases, na.rm = TRUE),
    min = min(Cases, na.rm = TRUE)
  ) %>% 
  filter( Destination != 'Phoenix')

arrivals
## # A tibble: 8 x 7
## # Groups:   Destination [4]
##   Destination   Arrival count total_arrivals    avg   max   min
##   <chr>         <fct>   <int>          <int>  <dbl> <int> <int>
## 1 Los Angeles   delayed     2            179   89.5   117    62
## 2 Los Angeles   on time     2           1191  596.    694   497
## 3 San Diego     delayed     2             85   42.5    65    20
## 4 San Diego     on time     2            595  298.    383   212
## 5 San Francisco delayed     2            231  116.    129   102
## 6 San Francisco on time     2            823  412.    503   320
## 7 Seattle       delayed     2            366  183     305    61
## 8 Seattle       on time     2           2042 1021    1841   201

Create Wide Tidy Dataframe

Out of curiosity, I created a wide tidy file in which the arrival types are defined as columns for each combination of airline and destination airport. I understand this doesn’t quite follow Wickham’s definition of tidy as each row captures more than one observation. I chose to create this form of a near tidy table to utilize the mutate functionality and create additional columns for analysis.

table_tidy_long <- arrivals_raw_filled %>% 
  pivot_longer(c(`Los Angeles`, `Phoenix`, `San Diego`, `San Francisco`, `Seattle`), names_to = "Destination", values_to = "Cases")

table_tidy_wide <- table_tidy_long  %>%
    pivot_wider(names_from = 'Arrival', values_from = 'Cases')

# Transform tibble to dataframe
table_tidy_wide_df <- tbl_df(table_tidy_wide)

table_tidy_wide_df
## # A tibble: 10 x 4
##    Airline Destination   `on time` delayed
##    <fct>   <chr>             <int>   <int>
##  1 ALASKA  Los Angeles         497      62
##  2 ALASKA  Phoenix             221      12
##  3 ALASKA  San Diego           212      20
##  4 ALASKA  San Francisco       503     102
##  5 ALASKA  Seattle            1841     305
##  6 AM WEST Los Angeles         694     117
##  7 AM WEST Phoenix            4840     415
##  8 AM WEST San Diego           383      65
##  9 AM WEST San Francisco       320     129
## 10 AM WEST Seattle             201      61

Mutate Wide File

Based on the wide tidy dataframe, the mutate functionality performs calculations to sum the total arrivals by airline and destination airport while also calculating the percentage of on-time flights based on the sum calculation in the same mutate statement. The resulting table displays two additional columns for the two calculations performed. The wide format tidy table allows for ease of analysis when multiple variables are captured in a single row to calculate aggregate values. Finally, the resulting analysis was ordered by the highest on-time arrival percentage by airline and destination airport.

# Create column for Total Arrivals by Airline and Destination
# Create column for percentage of on-time arrivals by airline and destination
df_with_cols <- mutate(table_tidy_wide_df,
  `Total Arrivals` = `on time` + delayed,
  `On Time Pct` = `on time` / `Total Arrivals` * 100)

df_with_cols
## # A tibble: 10 x 6
##    Airline Destination   `on time` delayed `Total Arrivals` `On Time Pct`
##    <fct>   <chr>             <int>   <int>            <int>         <dbl>
##  1 ALASKA  Los Angeles         497      62              559          88.9
##  2 ALASKA  Phoenix             221      12              233          94.8
##  3 ALASKA  San Diego           212      20              232          91.4
##  4 ALASKA  San Francisco       503     102              605          83.1
##  5 ALASKA  Seattle            1841     305             2146          85.8
##  6 AM WEST Los Angeles         694     117              811          85.6
##  7 AM WEST Phoenix            4840     415             5255          92.1
##  8 AM WEST San Diego           383      65              448          85.5
##  9 AM WEST San Francisco       320     129              449          71.3
## 10 AM WEST Seattle             201      61              262          76.7
(arrange(df_with_cols, desc(`On Time Pct`)))
## # A tibble: 10 x 6
##    Airline Destination   `on time` delayed `Total Arrivals` `On Time Pct`
##    <fct>   <chr>             <int>   <int>            <int>         <dbl>
##  1 ALASKA  Phoenix             221      12              233          94.8
##  2 AM WEST Phoenix            4840     415             5255          92.1
##  3 ALASKA  San Diego           212      20              232          91.4
##  4 ALASKA  Los Angeles         497      62              559          88.9
##  5 ALASKA  Seattle            1841     305             2146          85.8
##  6 AM WEST Los Angeles         694     117              811          85.6
##  7 AM WEST San Diego           383      65              448          85.5
##  8 ALASKA  San Francisco       503     102              605          83.1
##  9 AM WEST Seattle             201      61              262          76.7
## 10 AM WEST San Francisco       320     129              449          71.3

Conclusions

This assignment clearly demonstrated the value of tidy data. Coming from a software development background, I always attempted to normalize data for storage in relational databases. The week’s reading along with this assignment demonstrated the value of denormalizing data when appropriate in order to make data analysis easier. Also, transforming input data into a tidy form can make the data easier and simpler to read, which in turn can make the analysis tasks easier to write in R and understand when displayed. The tidyr and dplyr packages are powerful tools in the R programming language that with some practice can transform messy input data into data structures more suited for data analysis. As for the conclusion regarding the sample data, the Phoenix airport has the best on-time percentages, and the Alaska airline outperforms the American West airline.