This assignment presents the capabilities of the R packages of tidyr and dplyr in order to tidy an input CSV file and provide analysis through transformations of the tidy dataframe. The tidy nature of the data follows the definition of Tidy Data as defined by Hadley Wickham. The analysis compares the flight arrivals at five American airports from two airlines.
In an attempt to remain as true as possible to the assignment’s example input file, the raw CSV file does not contain two column headers, an empty row, and only lists the two airlines once each. In order to tidy the input data, column headers were added for the first two columns, Airline and Arrival. The arrival column header indicates the category of delayed or on-time. Next, the empty row was removed from the data. Finally, airline names were populated for each subsequent row that initially did not include the airline name. These tidy tactics allowed for full population of the input data table.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the csv file from the local directory
arrivals_raw <- read.csv(file = 'arrivals.csv', header=TRUE, check.names=FALSE, fill=TRUE, na.strings=c(""))
names(arrivals_raw)[1] <- "Airline"
names(arrivals_raw)[2] <- "Arrival"
arrivals_raw_rm <- arrivals_raw %>% filter_all(any_vars(!is.na(.)))
arrivals_raw_filled <- arrivals_raw_rm %>% fill(Airline)
From the full table based on the input file, the data was denormalized to meet the tidy long standards as defined by Wickham. This form of the data follows the rules in which each column is a variable, each observation forms a row, and each observational type forms a table. Because of the tidy rules, the rows/observations do contain data redundancy, but in doing so, the tidy form ensures for each analysis of the numerical data by observation. The long tidy form of the data identifies the number of cases of on-time and delayed flights by the combination of airline and destination airport.
table_tidy_long <- arrivals_raw_filled %>%
pivot_longer(c(`Los Angeles`, `Phoenix`, `San Diego`, `San Francisco`, `Seattle`), names_to = "Destination", values_to = "Cases")
table_tidy_long
## # A tibble: 20 x 4
## Airline Arrival Destination Cases
## <fct> <fct> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
# Transform tibble to dataframe
table_tidy_long_df <- tbl_df(table_tidy_long)
Based on the long tidy form of the data, the data was filtered to only include west coast airports, i.e., airports located in the Pacific time zone, which excluded Phoenix. The decision to use west coast airports was arbitrary as just an exercise of using the filter functionality of the dplyr package.
(filter(table_tidy_long_df, Destination == 'Los Angeles'))
## # A tibble: 4 x 4
## Airline Arrival Destination Cases
## <fct> <fct> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
(filter(table_tidy_long_df, Airline == 'ALASKA'))
## # A tibble: 10 x 4
## Airline Arrival Destination Cases
## <fct> <fct> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
# Filter for West Coast airpots
(filter(table_tidy_long_df, Destination %in% c('Los Angeles', 'San Francisco', 'San Diego', 'Seattle')))
## # A tibble: 16 x 4
## Airline Arrival Destination Cases
## <fct> <fct> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time San Diego 212
## 3 ALASKA on time San Francisco 503
## 4 ALASKA on time Seattle 1841
## 5 ALASKA delayed Los Angeles 62
## 6 ALASKA delayed San Diego 20
## 7 ALASKA delayed San Francisco 102
## 8 ALASKA delayed Seattle 305
## 9 AM WEST on time Los Angeles 694
## 10 AM WEST on time San Diego 383
## 11 AM WEST on time San Francisco 320
## 12 AM WEST on time Seattle 201
## 13 AM WEST delayed Los Angeles 117
## 14 AM WEST delayed San Diego 65
## 15 AM WEST delayed San Francisco 129
## 16 AM WEST delayed Seattle 61
The select functionality was used to order the long tidy dataframe by the destination airport, followed by a combined ordering based on the destination airport and the number of cases. The final selection exercise outlined the capability of selecting consecutive columns, which produced a long tidy table without the airlines.
# Re-order by Destination
(arrange(table_tidy_long_df, desc(Destination)))
## # A tibble: 20 x 4
## Airline Arrival Destination Cases
## <fct> <fct> <chr> <int>
## 1 ALASKA on time Seattle 1841
## 2 ALASKA delayed Seattle 305
## 3 AM WEST on time Seattle 201
## 4 AM WEST delayed Seattle 61
## 5 ALASKA on time San Francisco 503
## 6 ALASKA delayed San Francisco 102
## 7 AM WEST on time San Francisco 320
## 8 AM WEST delayed San Francisco 129
## 9 ALASKA on time San Diego 212
## 10 ALASKA delayed San Diego 20
## 11 AM WEST on time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 ALASKA on time Phoenix 221
## 14 ALASKA delayed Phoenix 12
## 15 AM WEST on time Phoenix 4840
## 16 AM WEST delayed Phoenix 415
## 17 ALASKA on time Los Angeles 497
## 18 ALASKA delayed Los Angeles 62
## 19 AM WEST on time Los Angeles 694
## 20 AM WEST delayed Los Angeles 117
# Re-order by Destination and Cases
(arrange(table_tidy_long_df, Destination, Cases))
## # A tibble: 20 x 4
## Airline Arrival Destination Cases
## <fct> <fct> <chr> <int>
## 1 ALASKA delayed Los Angeles 62
## 2 AM WEST delayed Los Angeles 117
## 3 ALASKA on time Los Angeles 497
## 4 AM WEST on time Los Angeles 694
## 5 ALASKA delayed Phoenix 12
## 6 ALASKA on time Phoenix 221
## 7 AM WEST delayed Phoenix 415
## 8 AM WEST on time Phoenix 4840
## 9 ALASKA delayed San Diego 20
## 10 AM WEST delayed San Diego 65
## 11 ALASKA on time San Diego 212
## 12 AM WEST on time San Diego 383
## 13 ALASKA delayed San Francisco 102
## 14 AM WEST delayed San Francisco 129
## 15 AM WEST on time San Francisco 320
## 16 ALASKA on time San Francisco 503
## 17 AM WEST delayed Seattle 61
## 18 AM WEST on time Seattle 201
## 19 ALASKA delayed Seattle 305
## 20 ALASKA on time Seattle 1841
# Select columns by name
(select(table_tidy_long_df, Arrival:Cases))
## # A tibble: 20 x 3
## Arrival Destination Cases
## <fct> <chr> <int>
## 1 on time Los Angeles 497
## 2 on time Phoenix 221
## 3 on time San Diego 212
## 4 on time San Francisco 503
## 5 on time Seattle 1841
## 6 delayed Los Angeles 62
## 7 delayed Phoenix 12
## 8 delayed San Diego 20
## 9 delayed San Francisco 102
## 10 delayed Seattle 305
## 11 on time Los Angeles 694
## 12 on time Phoenix 4840
## 13 on time San Diego 383
## 14 on time San Francisco 320
## 15 on time Seattle 201
## 16 delayed Los Angeles 117
## 17 delayed Phoenix 415
## 18 delayed San Diego 65
## 19 delayed San Francisco 129
## 20 delayed Seattle 61
To provide some meanginful analysis of the sample data, an initial summary was performed to count the total number of on-time and delayed flights by airline. The next summary attempt, similar to the first, calculated the count of on-time and delayed flights by destination airport. The final analysis performed grouped the data by destination airport and arrival type, then counting the instances before providing the sum, average, maximum, and minimum of each airport and arrival type pairing before excluding the Phoenix airport. The chained functions allowed for the display of the performance of flight arrivals by airport.
# Group by and Summarise work
by_airline_arrival <- group_by(table_tidy_long_df, Airline, Arrival)
summarise(by_airline_arrival, total = sum(Cases, na.rm = TRUE))
## # A tibble: 4 x 3
## # Groups: Airline [2]
## Airline Arrival total
## <fct> <fct> <int>
## 1 ALASKA delayed 501
## 2 ALASKA on time 3274
## 3 AM WEST delayed 787
## 4 AM WEST on time 6438
by_dest_arrival <- group_by(table_tidy_long_df, Destination, Arrival)
summarise(by_dest_arrival, total = sum(Cases, na.rm = TRUE))
## # A tibble: 10 x 3
## # Groups: Destination [5]
## Destination Arrival total
## <chr> <fct> <int>
## 1 Los Angeles delayed 179
## 2 Los Angeles on time 1191
## 3 Phoenix delayed 427
## 4 Phoenix on time 5061
## 5 San Diego delayed 85
## 6 San Diego on time 595
## 7 San Francisco delayed 231
## 8 San Francisco on time 823
## 9 Seattle delayed 366
## 10 Seattle on time 2042
# Group flights by destination and arrival type.
# Summarise to compute total count, mean, maximum and minimum
# Filter to remove Phoenix airport
arrivals <- table_tidy_long_df %>%
group_by(Destination, Arrival) %>%
summarise(
count = n(),
total_arrivals = sum(Cases, na.rm = TRUE),
avg = mean(Cases, na.rm = TRUE),
max = max(Cases, na.rm = TRUE),
min = min(Cases, na.rm = TRUE)
) %>%
filter( Destination != 'Phoenix')
arrivals
## # A tibble: 8 x 7
## # Groups: Destination [4]
## Destination Arrival count total_arrivals avg max min
## <chr> <fct> <int> <int> <dbl> <int> <int>
## 1 Los Angeles delayed 2 179 89.5 117 62
## 2 Los Angeles on time 2 1191 596. 694 497
## 3 San Diego delayed 2 85 42.5 65 20
## 4 San Diego on time 2 595 298. 383 212
## 5 San Francisco delayed 2 231 116. 129 102
## 6 San Francisco on time 2 823 412. 503 320
## 7 Seattle delayed 2 366 183 305 61
## 8 Seattle on time 2 2042 1021 1841 201
Out of curiosity, I created a wide tidy file in which the arrival types are defined as columns for each combination of airline and destination airport. I understand this doesn’t quite follow Wickham’s definition of tidy as each row captures more than one observation. I chose to create this form of a near tidy table to utilize the mutate functionality and create additional columns for analysis.
table_tidy_long <- arrivals_raw_filled %>%
pivot_longer(c(`Los Angeles`, `Phoenix`, `San Diego`, `San Francisco`, `Seattle`), names_to = "Destination", values_to = "Cases")
table_tidy_wide <- table_tidy_long %>%
pivot_wider(names_from = 'Arrival', values_from = 'Cases')
# Transform tibble to dataframe
table_tidy_wide_df <- tbl_df(table_tidy_wide)
table_tidy_wide_df
## # A tibble: 10 x 4
## Airline Destination `on time` delayed
## <fct> <chr> <int> <int>
## 1 ALASKA Los Angeles 497 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA San Diego 212 20
## 4 ALASKA San Francisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM WEST Los Angeles 694 117
## 7 AM WEST Phoenix 4840 415
## 8 AM WEST San Diego 383 65
## 9 AM WEST San Francisco 320 129
## 10 AM WEST Seattle 201 61
Based on the wide tidy dataframe, the mutate functionality performs calculations to sum the total arrivals by airline and destination airport while also calculating the percentage of on-time flights based on the sum calculation in the same mutate statement. The resulting table displays two additional columns for the two calculations performed. The wide format tidy table allows for ease of analysis when multiple variables are captured in a single row to calculate aggregate values. Finally, the resulting analysis was ordered by the highest on-time arrival percentage by airline and destination airport.
# Create column for Total Arrivals by Airline and Destination
# Create column for percentage of on-time arrivals by airline and destination
df_with_cols <- mutate(table_tidy_wide_df,
`Total Arrivals` = `on time` + delayed,
`On Time Pct` = `on time` / `Total Arrivals` * 100)
df_with_cols
## # A tibble: 10 x 6
## Airline Destination `on time` delayed `Total Arrivals` `On Time Pct`
## <fct> <chr> <int> <int> <int> <dbl>
## 1 ALASKA Los Angeles 497 62 559 88.9
## 2 ALASKA Phoenix 221 12 233 94.8
## 3 ALASKA San Diego 212 20 232 91.4
## 4 ALASKA San Francisco 503 102 605 83.1
## 5 ALASKA Seattle 1841 305 2146 85.8
## 6 AM WEST Los Angeles 694 117 811 85.6
## 7 AM WEST Phoenix 4840 415 5255 92.1
## 8 AM WEST San Diego 383 65 448 85.5
## 9 AM WEST San Francisco 320 129 449 71.3
## 10 AM WEST Seattle 201 61 262 76.7
(arrange(df_with_cols, desc(`On Time Pct`)))
## # A tibble: 10 x 6
## Airline Destination `on time` delayed `Total Arrivals` `On Time Pct`
## <fct> <chr> <int> <int> <int> <dbl>
## 1 ALASKA Phoenix 221 12 233 94.8
## 2 AM WEST Phoenix 4840 415 5255 92.1
## 3 ALASKA San Diego 212 20 232 91.4
## 4 ALASKA Los Angeles 497 62 559 88.9
## 5 ALASKA Seattle 1841 305 2146 85.8
## 6 AM WEST Los Angeles 694 117 811 85.6
## 7 AM WEST San Diego 383 65 448 85.5
## 8 ALASKA San Francisco 503 102 605 83.1
## 9 AM WEST Seattle 201 61 262 76.7
## 10 AM WEST San Francisco 320 129 449 71.3
This assignment clearly demonstrated the value of tidy data. Coming from a software development background, I always attempted to normalize data for storage in relational databases. The week’s reading along with this assignment demonstrated the value of denormalizing data when appropriate in order to make data analysis easier. Also, transforming input data into a tidy form can make the data easier and simpler to read, which in turn can make the analysis tasks easier to write in R and understand when displayed. The tidyr and dplyr packages are powerful tools in the R programming language that with some practice can transform messy input data into data structures more suited for data analysis. As for the conclusion regarding the sample data, the Phoenix airport has the best on-time percentages, and the Alaska airline outperforms the American West airline.