1. Overview

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. Use a “wide” structure similar to how the information appears below, so that tidying and transformations can be used on the data.

# Libraries to be used
library(tidyverse)

2. Read untidy data

2.1 Read the arrival data

Read the arrival data from your .CSV file

src_file_path <- "C:\\tmp\\Airline_Arrival_Delays.csv"
untidy_flights <- read_csv(src_file_path)

2.2 Peak at the read data

Take a peak at the data read from the CSV file.

Table: Untidy data for flight arrivals
Airline Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

3 Tidy the data

Use “tidyr” and “dplyr” as needed to tidy up and transform the data

flights <- untidy_flights %>% 
  # remove blank lines
  drop_na() %>%
  # change format from wide to long to tidy up data
  pivot_longer(`Los Angeles`:`Seattle`, names_to = "destination", values_to = "count") %>%
    # rename column names
  rename(airline = Airline, flight_status = `Flight Status`)

Show a glimpse of the transformed data.

Table: Tidy data for flight arrivals
airline flight_status destination count
ALASKA on time Los Angeles 497
ALASKA on time Phoenix 221
ALASKA on time San Diego 212
ALASKA on time San Francisco 503
ALASKA on time Seattle 1841
ALASKA delayed Los Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San Diego 20
ALASKA delayed San Francisco 102
ALASKA delayed Seattle 305

4 Perform data analysis

Perform data analysis to compare the arrival delays for the two airlines.

4.1 Build list of delayed flights

delayed_flights <- flights %>%
  filter(flight_status == "delayed")

4.2 Overall delays per Airline

Which airline has the most overall delays and which one the least?

by_airline <- delayed_flights %>%
  group_by(airline) %>%
  summarise(total_count = sum(count)) %>%
  arrange(desc(total_count))

Overall, “AM West” airline has the highest number of delays compared to “ALASKA”, which has the lowest.

Table: Total delays per airline
airline total_count
AM WEST 787
ALASKA 501

We can see the discrepancy between each airline’s delays graphically

ggplot(by_airline, aes(x = airline, y = total_count, fill = airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  ggtitle("Total Overall Flight Delays Per Airline") +
  labs(x = "Airport Destinations", y = "Total Delays") +
  theme(legend.position = "none")

4.3 Delays per destination and airline

Which airport destinations had the most and least delays?

by_destination <- delayed_flights %>%
  group_by(destination, airline) %>%
   summarise(total_count = sum(count)) %>%
   arrange(desc(total_count))

Overall, the Phoenix destination has highest number of delays (“AM West” airline) as well as the lowest number of delays (“ALASKA” airline).

Table: Total delays per destination per airline in descending order
destination airline total_count
Phoenix AM WEST 415
Seattle ALASKA 305
San Francisco AM WEST 129
Los Angeles AM WEST 117
San Francisco ALASKA 102
San Diego AM WEST 65
Los Angeles ALASKA 62
Seattle AM WEST 61
San Diego ALASKA 20
Phoenix ALASKA 12

Below is the graphical representation of delays per destination in descending order.

ggplot(by_destination, aes(x = reorder(destination, total_count), y = total_count, fill = airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  ggtitle("Total Flight Delays Per Destination Per Airline") +
  labs(x = "Airport Destinations", y = "Total Delays")

5. Conclusion

It was demonstrated that the packages “tidyr” and “dplyr” can be extremely useful to tidy up and transform the data.