Week Five - Working with Tidy Data

1. Overview

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. Use a “wide” structure similar to how the information appears below, so that tidying and transformations can be used on the data.

# Libraries to be used
library(tidyverse)

2. Read untidy data

2.1 Read the arrival data

Read the arrival data from your .CSV file

src_file_path <- "C:\\tmp\\Airline_Arrival_Delays.csv"
untidy_flights <- read_csv(src_file_path)

2.2 Peak at the read data

Take a peak at the data read from the CSV file.

Table: Untidy data for flight arrivals
Airline	Flight Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

3 Tidy the data

Use “tidyr” and “dplyr” as needed to tidy up and transform the data

flights <- untidy_flights %>% 
  # remove blank lines
  drop_na() %>%
  # change format from wide to long to tidy up data
  pivot_longer(`Los Angeles`:`Seattle`, names_to = "destination", values_to = "count") %>%
    # rename column names
  rename(airline = Airline, flight_status = `Flight Status`)

Show a glimpse of the transformed data.

Table: Tidy data for flight arrivals
airline	flight_status	destination	count
ALASKA	on time	Los Angeles	497
ALASKA	on time	Phoenix	221
ALASKA	on time	San Diego	212
ALASKA	on time	San Francisco	503
ALASKA	on time	Seattle	1841
ALASKA	delayed	Los Angeles	62
ALASKA	delayed	Phoenix	12
ALASKA	delayed	San Diego	20
ALASKA	delayed	San Francisco	102
ALASKA	delayed	Seattle	305

4 Perform data analysis

Perform data analysis to compare the arrival delays for the two airlines.

4.1 Build list of delayed flights

delayed_flights <- flights %>%
  filter(flight_status == "delayed")

4.2 Overall delays per Airline

Which airline has the most overall delays and which one the least?

by_airline <- delayed_flights %>%
  group_by(airline) %>%
  summarise(total_count = sum(count)) %>%
  arrange(desc(total_count))

Overall, “AM West” airline has the highest number of delays compared to “ALASKA”, which has the lowest.

Table: Total delays per airline
airline	total_count
AM WEST	787
ALASKA	501

We can see the discrepancy between each airline’s delays graphically

ggplot(by_airline, aes(x = airline, y = total_count, fill = airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  ggtitle("Total Overall Flight Delays Per Airline") +
  labs(x = "Airport Destinations", y = "Total Delays") +
  theme(legend.position = "none")

4.3 Delays per destination and airline

Which airport destinations had the most and least delays?

by_destination <- delayed_flights %>%
  group_by(destination, airline) %>%
   summarise(total_count = sum(count)) %>%
   arrange(desc(total_count))

Overall, the Phoenix destination has highest number of delays (“AM West” airline) as well as the lowest number of delays (“ALASKA” airline).

Table: Total delays per destination per airline in descending order
destination	airline	total_count
Phoenix	AM WEST	415
Seattle	ALASKA	305
San Francisco	AM WEST	129
Los Angeles	AM WEST	117
San Francisco	ALASKA	102
San Diego	AM WEST	65
Los Angeles	ALASKA	62
Seattle	AM WEST	61
San Diego	ALASKA	20
Phoenix	ALASKA	12

Below is the graphical representation of delays per destination in descending order.

ggplot(by_destination, aes(x = reorder(destination, total_count), y = total_count, fill = airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  ggtitle("Total Flight Delays Per Destination Per Airline") +
  labs(x = "Airport Destinations", y = "Total Delays")

5. Conclusion

It was demonstrated that the packages “tidyr” and “dplyr” can be extremely useful to tidy up and transform the data.