Overview

In this assignment, we will be tidying a dataset containing delay and ontime information for two airlines by city.

The dataset resembles the table below:

Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1,841
delayed 62 12 20 102 305
AM WEST on time 694 4,840 383 320 201
delayed 117 415 65 129 61

1. Setting up the data

Firstly, we will need to have this data available from somewhere in the format above. I decided to use GCP’s BigQuery and initialized the table with the below SQL statements:

drop table if exists data607.assignment4_data; # nolint: error.

CREATE TABLE data607.assignment4_data (
row_id int64,
column1 STRING,
column2 STRING,
`Los Angeles` INT64,
`Phoenix` INT64,
`San Diego` INT64,
`San Francisco` INT64,
`Seattle` INT64
);

INSERT INTO data607.assignment4_data 
(row_id, column1, column2, `Los Angeles`, `Phoenix`, `San Diego`, `San Francisco`, `Seattle`) # nolint: line_length_linter.
VALUES
(1,'ALASKA', 'on time',494,221,212,503,1841),
(2,NULL,'delayed',62,12,20,102,305),
(3,NULL,NULL,NULL,NULL,NULL,NULL,NULL),
(4,'AM West', 'on time',694,4840,383,320,201),
(5,NULL,'delayed',117,415,65,129,61);

A row_id was necessary as the order of the rows mattered and there was an issue where rows would be ordered by alphabetical order on the first column, disrupting the order.

2. Extracting the data from GCP

Since we are using biguery, we will need the path to our file and some parameters:

library(bigrquery)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
path_to_bq_creds <- "data607/gcp_bq_auth/gcp_bq_auth.json"

project_id <- "cuny-msds"
dataset_name <- "data607"
raw_data_tablename <- "assignment4_data"

bigrquery::bq_auth(path = path_to_bq_creds)
## ! Using an auto-discovered, cached token.
##   To suppress this message, modify your code or options to clearly consent to
##   the use of a cached token.
##   See gargle's "Non-interactive auth" vignette for more details:
##   <https://gargle.r-lib.org/articles/non-interactive-auth.html>
## ℹ The bigrquery package is using a cached token for 'rtk0211@gmail.com'.

With the set up complete, we can now interact with the database. Let’s do so to get the data from our table and order by row_id:

dataset <- bq_dataset(
  project_id, # nolint
  dataset_name
)

flight_table <- bq_dataset_query(
  dataset, # nolint
  paste("SELECT * FROM ", dataset_name, ".", raw_data_tablename)
)
## Auto-refreshing stale OAuth token.
# As a reminder, we want the initial dataframe to preserve the order from
#   the question's prompt.
flight_data <- bq_table_download(flight_table) |>
arrange(row_id)

flight_data
## # A tibble: 5 × 8
##   row_id column1 column2 `Los Angeles` Phoenix `San Diego` `San Francisco`
##    <int> <chr>   <chr>           <int>   <int>       <int>           <int>
## 1      1 ALASKA  on time           494     221         212             503
## 2      2 <NA>    delayed            62      12          20             102
## 3      3 <NA>    <NA>               NA      NA          NA              NA
## 4      4 AM West on time           694    4840         383             320
## 5      5 <NA>    delayed           117     415          65             129
## # ℹ 1 more variable: Seattle <int>

3. Tidy the data for analysis

To tidy this data, we’ll need to do a few things. Firstly, I would like to define how I prefer this data to be oriented. I would prefer that there are 4 columns:

  1. Airline
  2. City
  3. On Time
  4. Delayed

In order to get the data into this format, we will need to perform the following transformations to the dataframe:

  1. Remove any fully empty rows
  2. Establish the names of the columns
  3. Clear out any empty cells
  4. Pivot the data from a wide to a long format
  5. Pivot the data to a wide format where the flight status becomes a column

To remove the empty row, we will say that the status column must be populated. That means that if there is nothing entered in the status column then we will remove it:

flight_data <- flight_data |>
  drop_na(column2)# nolint

flight_data
## # A tibble: 4 × 8
##   row_id column1 column2 `Los Angeles` Phoenix `San Diego` `San Francisco`
##    <int> <chr>   <chr>           <int>   <int>       <int>           <int>
## 1      1 ALASKA  on time           494     221         212             503
## 2      2 <NA>    delayed            62      12          20             102
## 3      4 AM West on time           694    4840         383             320
## 4      5 <NA>    delayed           117     415          65             129
## # ℹ 1 more variable: Seattle <int>

Step 2:

colnames(flight_data) <- c(
  'row_id', # nolint
  'airline',
  'flight_status',
  'Los Angeles',
  'Phoenix',
  'San Diego',
  'San Francisco',
  'Seattle'
)

flight_data
## # A tibble: 4 × 8
##   row_id airline flight_status `Los Angeles` Phoenix `San Diego` `San Francisco`
##    <int> <chr>   <chr>                 <int>   <int>       <int>           <int>
## 1      1 ALASKA  on time                 494     221         212             503
## 2      2 <NA>    delayed                  62      12          20             102
## 3      4 AM West on time                 694    4840         383             320
## 4      5 <NA>    delayed                 117     415          65             129
## # ℹ 1 more variable: Seattle <int>

For step 3, I’m making the assumption that the airline in row 2 is ALASKA and AM West in row 5. I’m making this assumption based on looking at Phoenix. Assuming that most airline flights are on time (which sometimes doesn’t feel that way), we can eliminate row 5 as being part of ALASKA as there are 415 delayed flights in row 5 compared to the 221 on time flights for ALASKA. Then, by the process of elimination, we can safely assume that row 5 represents AM West delayed flights.

Now that we know this, we can use the fill() function to fill the missing entries and then remove the row_id:

flight_data <- flight_data |>
  fill(airline) |> # nolint
  select(!row_id)

flight_data
## # A tibble: 4 × 7
##   airline flight_status `Los Angeles` Phoenix `San Diego` `San Francisco`
##   <chr>   <chr>                 <int>   <int>       <int>           <int>
## 1 ALASKA  on time                 494     221         212             503
## 2 ALASKA  delayed                  62      12          20             102
## 3 AM West on time                 694    4840         383             320
## 4 AM West delayed                 117     415          65             129
## # ℹ 1 more variable: Seattle <int>

For step 4, we can employ the pivot_longer() function:

flight_data <- flight_data |>
  pivot_longer( # nolint
    cols = !c(airline, flight_status), # nolint
    names_to = "city",
    values_to = "observations"
)

head(flight_data)
## # A tibble: 6 × 4
##   airline flight_status city          observations
##   <chr>   <chr>         <chr>                <int>
## 1 ALASKA  on time       Los Angeles            494
## 2 ALASKA  on time       Phoenix                221
## 3 ALASKA  on time       San Diego              212
## 4 ALASKA  on time       San Francisco          503
## 5 ALASKA  on time       Seattle               1841
## 6 ALASKA  delayed       Los Angeles             62

Finally, for step 5:

flight_data <- flight_data |>
  pivot_wider( # nolint
    names_from = flight_status, # nolint
    values_from = observations
)

colnames(flight_data) <- c('airline','city','on_time','delayed')

flight_data
## # A tibble: 10 × 4
##    airline city          on_time delayed
##    <chr>   <chr>           <int>   <int>
##  1 ALASKA  Los Angeles       494      62
##  2 ALASKA  Phoenix           221      12
##  3 ALASKA  San Diego         212      20
##  4 ALASKA  San Francisco     503     102
##  5 ALASKA  Seattle          1841     305
##  6 AM West Los Angeles       694     117
##  7 AM West Phoenix          4840     415
##  8 AM West San Diego         383      65
##  9 AM West San Francisco     320     129
## 10 AM West Seattle           201      61

4. Flight Analysis

For this portion, we will compare the performance by city for each airline:

flights_summary <- flight_data |>
  mutate(total_flights = on_time + delayed) |> # nolint
  mutate(delay_pct = delayed/total_flights)

ggplot(flights_summary, aes(x = airline, y = delay_pct, fill = airline)) +
  geom_col() + # nolint
  facet_wrap(~city) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Delay Percentage") +
  xlab("")

This graph above shows pretty clearly that AM West pretty reliably has a higher percentage of delayed flights over ALASKA. For both airlines, Pheonix was the airport with the lowest delay percentage.

flights_summary |>
  filter(city == "San Francisco") # nolint
## # A tibble: 2 × 6
##   airline city          on_time delayed total_flights delay_pct
##   <chr>   <chr>           <int>   <int>         <int>     <dbl>
## 1 ALASKA  San Francisco     503     102           605     0.169
## 2 AM West San Francisco     320     129           449     0.287
.287/.169
## [1] 1.698225

Visually, it seems that San Francisco has the greatest difference in on time percentage where AM West is about has a delay rate 69.8% greater than that of ALASKA.

Looking at the total number of flights:

ggplot(flights_summary, aes(x=airline, y=total_flights, fill=airline)) +
  geom_col() + # nolint
  facet_wrap(~city) +
  geom_label(aes(label = round(100*delay_pct,0)))

Now looking at this new figure, we have the delay percentage as the text on the bars and the number of flights at each airport. From here we can see that although Phoenix had the lowest delay percentage, it sees the most flights.

Conclusion

By cleaning up some relatively unstructured data about flight information, we were able to analyze the data and determine that anyone flying AM West out of San Francisco can expect to be delayed.

We can see the delay percentage by city and compeitor. From here, we can see that AM West consistently performs worse than ALASKA. This is disheartening as they have many more flights than ALASKA.