Tidying and Transforming Data

Introduction

I was tasked with creating a database to hold some untidy airline data, then import that data into R, tidy the data, and analysize the performance of the airlines. The two airlines were Alaska and AM West and I was simply comparing on time/delayed data for several destinations.

Code

First, I imported the data from a MySQL database using RMySQL and keyring.

# open the required libraries
library(keyring)
library(RMySQL)

## Loading required package: DBI

library(ggplot2)
library(tidyverse)

## -- Attaching packages ------------------------------------------------ tidyverse 1.3.0 --

## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## v purrr   0.3.3

## -- Conflicts --------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# connect to MySQL and get all of the airline data from the database
con <- dbConnect(MySQL(), dbname = 'airlines', host = 'localhost', port = 3306, user = 'root', password = key_get("airlines", "root"))

sql = "SELECT * FROM flights"
airlines <- dbGetQuery(con, sql)
names(airlines) <- c("airline", "result", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")

Next, I computed the total values for on time and delayed flights for each airline and plotted this just to see an intial overall comparison.

# get the total number of delayed and on-time flights for each airline
airlines$Totals <- rowSums(airlines[3:7])

# compare airlines in a stacked proportional bar graph
ggplot(airlines, aes(x = airline, y = Totals, fill = result)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Airline",
       y = "On Time",
       fill = "Result",
       title = "Comparing On Time Rates of Airlines",
       subtitle = "AM West has a better overall on-time rate")

My next step was to tidy the data using pivot_longer to move the flight destinations to a new column called destination. I then ussed this new data frame to create two more data frames, totalFlights and onTimeFlights. I created these data frames by grouping the data by airline and destination. I then used summarise to get the total flights for each airline for each destination. I created onTimeFlights by mutating to get the on time rate for each airline and destination pairing and then filtering based on result and then selecting everything except the result and total columns.

# use pivot_longer to switch destinations from columns to rows
airlines <- airlines %>%
  select(-Totals) %>%
  pivot_longer(
    c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
    names_to = "destination",
    values_to = "count"
  )

# set up the airlines data frame
airlines <- airlines %>%
  group_by(airline, destination)

# get the total flights based on destination
totalFlights <- airlines %>%
  summarise(count = sum(count))

# filter based result (keep onTime values)
onTimeFlights <- airlines %>%
  mutate(total = sum(count)) %>%
  mutate(rate = round(count/total, digits = 2)) %>%
  filter(result == "onTime") %>%
  select(-result, -total)

Finally, I created some plots using these new data frames to compare the airlines based on destination airports. I started by ceating a scatter plot and bar graph to compare on-time rates at each airport. This made it clear that ALASKA airlines had a better on-time performance for each destination. I then decided to compare the total number of flights by each airline to each destination. This is where it became clear that AM WEST had a superior on-time performance because they sent a drasticlly high number of flights to the airport with the best on-time rate, Phoenix.

# create plots to compare the airlines performace based on destination
ggplot(onTimeFlights, aes(x = destination, y = rate, color = airline)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "Destination",
       y = "On Time Rate",
       color = "Airline") +
  coord_flip()

ggplot(onTimeFlights, aes(x = reorder(destination, -rate), y = rate, fill = airline)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Destination",
       y = "On Time",
       fill = "Airline") +
  coord_flip()

ggplot(totalFlights, aes(x = airline, y = count, fill = destination)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Spectral") +
  labs(x = "Airline",
       y = "Total Flights",
       fill = "Destination")

ggplot(totalFlights, aes(x = reorder(destination, count), y = count, fill = airline)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Destination",
       y = "Total Flights",
       fill = "Airline") +
  coord_flip()

Conclusion

When I first compared the two airlines, AM West was shown to have a better on-time rate than Alaska. However, when the data is broken down by destination airport, Alaska has a better on time rate for each individual destination.

I was curious as to how this could be so. I decided to compare total flights to each destination. IT became clear that the reason AM West has a better overall on-time rate is due to the massive number of flights it sends to Phoenix (which seems to be a good airport since ALASKA also has its highest on-time rate there). In conclusion, while AM West has the higher overall on-time rate, I would choose Alaska airlines since they still have a better on-time rate at each airport (potentially unless I was headed to Phoenix and I needed to have flexible times).

Tidying and Transforming Data

David Moste

2/25/2020

Introduction

Code

Conclusion