1. Loading Data into a Dataframe

Via Github, using the read_csv function from the tidyverse package, I will be loading data from the article: “Airlines.csv” recreated from Numbersense by Kaiser Fung, 2013. The data is in CSV format and I will be loading it into a dataframe. Additionally, since the data is recreated from a book, this data has been created for educational purposes and is not real data. Additionally, the data is stored in a private repository for use.
The data should provide us with the airline, the number of on time flights, the number of delayed flights, and for each, separated by one of five cities.
The cities are:

airlines <- read_csv("https://raw.githubusercontent.com/Aetherius-rex/DATA607RexWong/refs/heads/main/Assignment%205/Airlines.csv?token=GHSAT0AAAAAAC544NC4WHK63PBQKQKSQEVEZ6HWG3Q")
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
head(airlines)
## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61


Quick Preprocessing

The data does not have a very friendly column name for some of the data, so we will rename the columns to make it easier to work with.
First, we will rename the first column to ‘airline’ and the second column to ‘status’. This will make it easier to understand what the data is about.
Third, we will reformat the column names for the cities to make it easier to work with. We will replace spaces with underscores, and make the names lowercase for consistency both for the other columns, and with how the pivoting later will append status into columns with underscores.
Lastly we will rename status ‘on time’ to ‘on_time’ for consistency with the column names.

# Rename columns
colnames(airlines) <- c("airline", "status", "los_angeles", "phoenix", "san_diego",
    "san_francisco", "seattle")
# Reformat items of status column
airlines$status <- sub("on time", "on_time", airlines$status)
head(airlines)
## # A tibble: 5 × 7
##   airline status  los_angeles phoenix san_diego san_francisco seattle
##   <chr>   <chr>         <dbl>   <dbl>     <dbl>         <dbl>   <dbl>
## 1 ALASKA  on_time         497     221       212           503    1841
## 2 <NA>    delayed          62      12        20           102     305
## 3 <NA>    <NA>             NA      NA        NA            NA      NA
## 4 AM WEST on_time         694    4840       383           320     201
## 5 <NA>    delayed         117     415        65           129      61


Tidying the Data

The key issue with this data is its untidy nature. The data for each of the two airlines is stored in different rows, and the cities are stored in different columns. This makes it difficult to analyze the data in its current form as taking a column for a city would not return specifically on time or delayed flights for a particular airline. However, what it does do well is that it separates the data for each city onto each rows, which is a good start, but we can make this better.
Preferably, we’d want each column to not only represent a city but more specifically on time or delayed flights respectively, therefore a city will have two columns. This allows us to filter for on time or delayed flights for a particular airline and city with ease. Whereas, in its current state, we have to isolate cells of data based on three conditions: the rows for an airline, and the row for delayed or on time flights, and the column for a city. This is not ideal for for programming as it may not be robust to misspelled or malformed data.
For example, if the ‘on time’ cell is misspelled as ‘on timee’, then the program would not be able to find the data for on time flights for a particular airline and city. It may be easy to say that we can simply find the pattern to the order the data is kept, in this data set delayed is always the second row for each airline, but this may not always be the case for real data. Should a data set be malformed and have data with no consistent order to it, it would be difficult to filter for the data we want. Lastly, by designing and implementing a system more robust to either of these two problems would present a complex solution that could be easily avoided by simply reformatting the data into a tidy format.


Removing ‘NA’ Values and Imputing Missing Values

  1. Remove the third row of the airlines data as it is just a row of ‘NA’ values.
  2. Impute the airline names downwards to fill in the missing values, associating the airline with their data.
# Drop third row of airlines
airlines <- airlines[-3, ]
# Impute airline names downwards
airlines <- airlines %>%
    fill("airline")

Time to Tidy

Thanks to tidyr we can easily tidy the data with the pivot_wider function. By passing the column names that splits the key or in this case the airlines and what data we want to pivot to new columns, we can easily reformat the data into a tidy format.

# Tidy the data with pivot_wider
airlines_tidy <- airlines %>%
    pivot_wider(names_from = "status", values_from = c("los_angeles", "phoenix",
        "san_diego", "san_francisco", "seattle"))

airlines_tidy
## # A tibble: 2 × 11
##   airline los_angeles_on_time los_angeles_delayed phoenix_on_time
##   <chr>                 <dbl>               <dbl>           <dbl>
## 1 ALASKA                  497                  62             221
## 2 AM WEST                 694                 117            4840
## # ℹ 7 more variables: phoenix_delayed <dbl>, san_diego_on_time <dbl>,
## #   san_diego_delayed <dbl>, san_francisco_on_time <dbl>,
## #   san_francisco_delayed <dbl>, seattle_on_time <dbl>, seattle_delayed <dbl>

Done, Time to Visualize!

Now that we have tidied the data, we can easily filter for on time or delayed flights for a particular airline and city. This makes it easier to analyze the data and find insights that may not have been possible with the data in its original form. To experiment with this, we will do a few quick subsets and summarize the data to see if we can find any interesting insights. I will demonstrate both querying the new transformed data and querying multiple columns at once to find insights.

  1. Compare the number of flights made on time at Seattle by both airlines.
  2. Overlay the number of flights made delayed at Seattle by both airlines.
  3. Find the ratio between delayed and total flights for each airline at Seattle. To isolate probability of a flight being delayed for each airline at Seattle rather than the total number of flights made on time or delayed at Seattle.
  4. Carry out the same ratio examination across all cities to see how well either airline does overall.

1. Compare the number of flights made on time at Seattle by both airlines.


# seattle_on_time shown on bar graph in ggplot
ggplot(airlines_tidy, aes(x = airline, y = seattle_on_time, fill = airline, label = seattle_on_time)) +
    geom_bar(stat = "identity") + geom_text(size = 3, position = position_stack(vjust = 0.5)) +
    labs(title = "Number of Flights Made On Time at Seattle by Airlines", x = "Airline",
        y = "Number of Flights On Time")


Based strictly on this graph, its clear that United Airlines has more flights made on time at Seattle than American Airlines. However that only tells us part of the story, as we do not know how many flights were made delayed at Seattle by each airline.


2. Visualize the number of flights made delayed at Seattle by both airlines.


# seattle_delayed shown on bar graph in ggplot
ggplot(airlines_tidy, aes(x = airline, y = seattle_delayed, fill = airline, label = seattle_delayed)) +
    geom_bar(stat = "identity") + geom_text(size = 3, position = position_stack(vjust = 0.5)) +
    labs(title = "Number of Flights Made Delayed at Seattle by Airlines", x = "Airline",
        y = "Number of Flights Delayed")


Based strictly on this graph, its clear that United Airlines has more flights made delayed at Seattle than American Airlines. However, this is tricky to interpret as if we simply go off this value, its clear that Alaskan has much more delayed flights than AM. To be able to examine how either airline does in general, irrespective of the number of flights made, we can find the ratio of delayed to total flights for each airline at Seattle.


3. Find the ratio between delayed and total flights for each airline at Seattle and Visualize the ratio


# seattle_ratio shown on bar graph in ggplot
ggplot(airlines_tidy, aes(x = airline, y = seattle_delayed/(seattle_on_time + seattle_delayed),
    fill = airline, label = seattle_delayed/(seattle_on_time + seattle_delayed))) +
    geom_bar(stat = "identity") + geom_text(size = 3, position = position_stack(vjust = 0.5)) +
    labs(title = "Ratio of On Time to Delayed Flights at Seattle by Airlines", x = "Airline",
        y = "Ratio of Delayed to Total Flights")


This graph is the culmination of both the previous graphs, it shows the ratio of delayed flights over all flights made at Seattle by either airline. This is a good metric to see how well either airline does in general, irrespective of the number of flights made. Thankfully, the data being handled in a tidy format makes it easy to find these insights and visualize them by simply calling for the columns we want to analyze and summing them or dividing them as needed. With this graph showing overall performance, we can also apply a similar calculation to all 5 cities to see how well either airline does overall, maybe having more flights decreases the delay rate, or maybe the delay rate is consistent across all cities depending on the airline.


4. Find the ratio between delayed and total flights for each airline at each city and Visualize the ratio


Like before, getting the ratios is a simple calculation performed on data for each airline at each city, however, to visualize this data, we will need to store the data in a long format to be able to visualize it. This is because we want to compare the ratio of delayed to total flights for each airline at each city, and stack them on top of each other for a stacked bar graph. This is not possible with the current format of the data, as the data is stored in a wide format where each city is a column, and cannot be addressed simultaneously as the y-axis in a ggplot graph.

# init a new column for the ratio of delayed to total for each airline at each
# city
airlines_tidy <- airlines_tidy %>%
    mutate(los_angeles_ratio = los_angeles_delayed/(los_angeles_on_time + los_angeles_delayed),
        phoenix_ratio = phoenix_delayed/(phoenix_on_time + phoenix_delayed), san_diego_ratio = san_diego_delayed/(san_diego_on_time +
            san_diego_delayed), san_francisco_ratio = san_francisco_delayed/(san_francisco_on_time +
            san_francisco_delayed), seattle_ratio = seattle_delayed/(seattle_on_time +
            seattle_delayed))

# store as a long format for easier visualization
airlines_long <- airlines_tidy %>%
    pivot_longer(cols = c(los_angeles_ratio, phoenix_ratio, san_diego_ratio, san_francisco_ratio,
        seattle_ratio), names_to = "city", values_to = "ratio")

# visualize the data
ggplot(airlines_long, aes(x = airline, y = ratio, fill = airline, label = ratio)) +
    geom_bar(stat = "identity", colour = "black") + geom_text(size = 2, position = position_stack(vjust = 0.7),
    parse = TRUE) + geom_text(size = 3, position = position_stack(vjust = 0.3), check_overlap = TRUE,
    aes(label = city)) + labs(title = "Ratio of On Time to Delayed Flights by Airlines at Cities",
    x = "Airline", y = "Ratio of Delayed to Total Flights")


Through this graph, we can see that Alaska’s lower delay rate compared to AM is consistent across all cities. By comparing these values, to how many flights are made at each city by each airline, I cannot find a pattern based on overall flights flown that may explain Alaska’s better performance overall. However, this process of manipulating the data is only made possible by the initial efforts to tidy the data. Sure, it was necessary at this step to form a long format to visualize the data, but this is a simple step that can be done for specific analysis and is made possible by the data already being in tidy format. This is the power of tidy data, it allows for easy manipulation and analysis of data that would be difficult or impossible.


Conclusion

Through this process, we have taken untidy data, data that is meant for human consumption and not for processing, and turned it into tidy data, data formatted for manipulation and analysis. This processes has allows us to easy generate visualizations without complex data manipulation, and to also allow us to reformat the data for specific analysis that requires it. Like Hadley Wickham states in his paper on Tidy Data, the purpose of tidy data is its standard way of organizing data, and its easy for manipulation, modeling, and visualization. In the end, I believe that in this assignment, I have demonstrated the process of converting raw data into tidy data and the benefits of doing so.