In this assignment, we will be tidying a dataset containing delay and ontime information for two airlines by city.
The dataset resembles the table below:
| Los Angeles | Phoenix | San Diego | San Francisco | Seattle | ||
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1,841 |
| delayed | 62 | 12 | 20 | 102 | 305 | |
| AM WEST | on time | 694 | 4,840 | 383 | 320 | 201 |
| delayed | 117 | 415 | 65 | 129 | 61 |
Firstly, we will need to have this data available from somewhere in the format above. I decided to use GCP’s BigQuery and initialized the table with the below SQL statements:
drop table if exists data607.assignment4_data; # nolint: error.
CREATE TABLE data607.assignment4_data (
row_id int64,
column1 STRING,
column2 STRING,
`Los Angeles` INT64,
`Phoenix` INT64,
`San Diego` INT64,
`San Francisco` INT64,
`Seattle` INT64
);
INSERT INTO data607.assignment4_data
(row_id, column1, column2, `Los Angeles`, `Phoenix`, `San Diego`, `San Francisco`, `Seattle`) # nolint: line_length_linter.
VALUES
(1,'ALASKA', 'on time',494,221,212,503,1841),
(2,NULL,'delayed',62,12,20,102,305),
(3,NULL,NULL,NULL,NULL,NULL,NULL,NULL),
(4,'AM West', 'on time',694,4840,383,320,201),
(5,NULL,'delayed',117,415,65,129,61);
A row_id was necessary as the order of the rows mattered and there was an issue where rows would be ordered by alphabetical order on the first column, disrupting the order.
Since we are using biguery, we will need the path to our file and some parameters:
library(bigrquery)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
path_to_bq_creds <- "data607/gcp_bq_auth/gcp_bq_auth.json"
project_id <- "cuny-msds"
dataset_name <- "data607"
raw_data_tablename <- "assignment4_data"
bigrquery::bq_auth(path = path_to_bq_creds)
## ! Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to
## the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## <https://gargle.r-lib.org/articles/non-interactive-auth.html>
## ℹ The bigrquery package is using a cached token for 'rtk0211@gmail.com'.
With the set up complete, we can now interact with the database. Let’s do so to get the data from our table and order by row_id:
dataset <- bq_dataset(
project_id, # nolint
dataset_name
)
flight_table <- bq_dataset_query(
dataset, # nolint
paste("SELECT * FROM ", dataset_name, ".", raw_data_tablename)
)
## Auto-refreshing stale OAuth token.
# As a reminder, we want the initial dataframe to preserve the order from
# the question's prompt.
flight_data <- bq_table_download(flight_table) |>
arrange(row_id)
flight_data
## # A tibble: 5 × 8
## row_id column1 column2 `Los Angeles` Phoenix `San Diego` `San Francisco`
## <int> <chr> <chr> <int> <int> <int> <int>
## 1 1 ALASKA on time 494 221 212 503
## 2 2 <NA> delayed 62 12 20 102
## 3 3 <NA> <NA> NA NA NA NA
## 4 4 AM West on time 694 4840 383 320
## 5 5 <NA> delayed 117 415 65 129
## # ℹ 1 more variable: Seattle <int>
To tidy this data, we’ll need to do a few things. Firstly, I would like to define how I prefer this data to be oriented. I would prefer that there are 4 columns:
In order to get the data into this format, we will need to perform the following transformations to the dataframe:
To remove the empty row, we will say that the status column must be populated. That means that if there is nothing entered in the status column then we will remove it:
flight_data <- flight_data |>
drop_na(column2)# nolint
flight_data
## # A tibble: 4 × 8
## row_id column1 column2 `Los Angeles` Phoenix `San Diego` `San Francisco`
## <int> <chr> <chr> <int> <int> <int> <int>
## 1 1 ALASKA on time 494 221 212 503
## 2 2 <NA> delayed 62 12 20 102
## 3 4 AM West on time 694 4840 383 320
## 4 5 <NA> delayed 117 415 65 129
## # ℹ 1 more variable: Seattle <int>
Step 2:
colnames(flight_data) <- c(
'row_id', # nolint
'airline',
'flight_status',
'Los Angeles',
'Phoenix',
'San Diego',
'San Francisco',
'Seattle'
)
flight_data
## # A tibble: 4 × 8
## row_id airline flight_status `Los Angeles` Phoenix `San Diego` `San Francisco`
## <int> <chr> <chr> <int> <int> <int> <int>
## 1 1 ALASKA on time 494 221 212 503
## 2 2 <NA> delayed 62 12 20 102
## 3 4 AM West on time 694 4840 383 320
## 4 5 <NA> delayed 117 415 65 129
## # ℹ 1 more variable: Seattle <int>
For step 3, I’m making the assumption that the airline in row 2 is
ALASKA and AM West in row 5. I’m making this
assumption based on looking at Phoenix. Assuming that most airline
flights are on time (which sometimes doesn’t feel that way), we can
eliminate row 5 as being part of ALASKA as there are 415
delayed flights in row 5 compared to the 221 on time flights for
ALASKA. Then, by the process of elimination, we can safely
assume that row 5 represents AM West delayed flights.
Now that we know this, we can use the fill() function to
fill the missing entries and then remove the row_id:
flight_data <- flight_data |>
fill(airline) |> # nolint
select(!row_id)
flight_data
## # A tibble: 4 × 7
## airline flight_status `Los Angeles` Phoenix `San Diego` `San Francisco`
## <chr> <chr> <int> <int> <int> <int>
## 1 ALASKA on time 494 221 212 503
## 2 ALASKA delayed 62 12 20 102
## 3 AM West on time 694 4840 383 320
## 4 AM West delayed 117 415 65 129
## # ℹ 1 more variable: Seattle <int>
For step 4, we can employ the pivot_longer()
function:
flight_data <- flight_data |>
pivot_longer( # nolint
cols = !c(airline, flight_status), # nolint
names_to = "city",
values_to = "observations"
)
head(flight_data)
## # A tibble: 6 × 4
## airline flight_status city observations
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los Angeles 494
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
Finally, for step 5:
flight_data <- flight_data |>
pivot_wider( # nolint
names_from = flight_status, # nolint
values_from = observations
)
colnames(flight_data) <- c('airline','city','on_time','delayed')
flight_data
## # A tibble: 10 × 4
## airline city on_time delayed
## <chr> <chr> <int> <int>
## 1 ALASKA Los Angeles 494 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA San Diego 212 20
## 4 ALASKA San Francisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM West Los Angeles 694 117
## 7 AM West Phoenix 4840 415
## 8 AM West San Diego 383 65
## 9 AM West San Francisco 320 129
## 10 AM West Seattle 201 61
For this portion, we will compare the performance by city for each airline:
flights_summary <- flight_data |>
mutate(total_flights = on_time + delayed) |> # nolint
mutate(delay_pct = delayed/total_flights)
ggplot(flights_summary, aes(x = airline, y = delay_pct, fill = airline)) +
geom_col() + # nolint
facet_wrap(~city) +
scale_y_continuous(labels = scales::percent) +
ylab("Delay Percentage") +
xlab("")
This graph above shows pretty clearly that AM West
pretty reliably has a higher percentage of delayed flights over
ALASKA. For both airlines, Pheonix was the airport with the
lowest delay percentage.
flights_summary |>
filter(city == "San Francisco") # nolint
## # A tibble: 2 × 6
## airline city on_time delayed total_flights delay_pct
## <chr> <chr> <int> <int> <int> <dbl>
## 1 ALASKA San Francisco 503 102 605 0.169
## 2 AM West San Francisco 320 129 449 0.287
.287/.169
## [1] 1.698225
Visually, it seems that San Francisco has the greatest difference in
on time percentage where AM West is about has a delay rate
69.8% greater than that of ALASKA.
Looking at the total number of flights:
ggplot(flights_summary, aes(x=airline, y=total_flights, fill=airline)) +
geom_col() + # nolint
facet_wrap(~city) +
geom_label(aes(label = round(100*delay_pct,0)))
Now looking at this new figure, we have the delay percentage as the text on the bars and the number of flights at each airport. From here we can see that although Phoenix had the lowest delay percentage, it sees the most flights.
By cleaning up some relatively unstructured data about flight
information, we were able to analyze the data and determine that anyone
flying AM West out of San Francisco can expect to be
delayed.
We can see the delay percentage by city and compeitor. From here, we
can see that AM West consistently performs worse than
ALASKA. This is disheartening as they have many more
flights than ALASKA.