Libraries

I’ll first pull in the tidyverse, which contains both dplyr and tidyr for tidying data sets, as well as additional tools for analysis and visualization downstream.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Data Import

Next I’ll import my data, which I created in Google Sheets–replicating the format from the pdf–and exported as a .csv file.

df <- read.csv('week5_assignment_messy_data.csv')

df
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Tidying

First, I need to add column names to the two columns with empty headers.

colnames(df)[1] <- 'airline'
colnames(df)[2] <- 'status'

df
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Next, I’ll drop the long, fully null row between the two airlines. It is important to note that this function drops rows with any nulls, so I would use caution with a larger file where nulls may appear elsewhere.

df <- drop_na(df)

df
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61

In the airline column, the airline is only named once and implied to apply down the table (until another value is named). I’ll use the fill function in tidyr to mitigate this.

#make sure blank strings are interpreted as null

df$airline[df$airline == ''] <- NA

#fill
df <- df %>% 
  fill(airline)

df
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Because the data is in a ‘wide’ format, I need to pivot it ‘longer’. Isolating all the columns that account for cities, I can put them in one column and assign the name ‘city,’ while each number of flights will fall under ‘count.’

df <- df %>%
  pivot_longer(
    cols = seq(3,7),
    names_to = 'city',
    values_to = 'count'
  )

df
## # A tibble: 20 × 4
##    airline status  city          count
##    <chr>   <chr>   <chr>         <int>
##  1 ALASKA  on time Los.Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San.Diego       212
##  4 ALASKA  on time San.Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los.Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San.Diego        20
##  9 ALASKA  delayed San.Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los.Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San.Diego       383
## 14 AM WEST on time San.Francisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los.Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San.Diego        65
## 19 AM WEST delayed San.Francisco   129
## 20 AM WEST delayed Seattle          61

The city names in the table now have periods instead of spaces–a holdover from when they were column names, which cannot have spaces–I will use gsub to get spaces back into the values.

df$city <- gsub('\\.', ' ', df$city)

df
## # A tibble: 20 × 4
##    airline status  city          count
##    <chr>   <chr>   <chr>         <int>
##  1 ALASKA  on time Los Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San Diego       212
##  4 ALASKA  on time San Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San Diego        20
##  9 ALASKA  delayed San Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San Diego       383
## 14 AM WEST on time San Francisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San Diego        65
## 19 AM WEST delayed San Francisco   129
## 20 AM WEST delayed Seattle          61

Analysis

First, I want to make a plot demonstrating the volume of flights for each city, split by airline. Additionally, I’ll color each bar based on the split of on-time vs. delayed flights.

ggplot(df) +
    geom_col(aes(city, count, fill = status)) +
    facet_wrap(~airline) +
    coord_flip() +
    scale_fill_manual('status', 
                      values = c('on time' = "deepskyblue3", 
                                 'delayed' = "deepskyblue4"))

From this chart, we can see that Alaska Airlines had many more flights into Seattle than AM West, while AM West was massively skewed toward Phoenix.

Because delayed flights are a relatively small percentage of flights across cities, it can be difficult to tell the frequency of delays across cities visually. To account for this, I can use position = ‘fill’ as an argument of aes within geom_col() to make each bar fill the graph 100%. Regardless of volume, we can now get a sense of the percentage of arrivals into each city were delayed, split by the two airlines.

ggplot(df) +
    geom_col(aes(city, count, fill = status),
             position = 'fill') +
    facet_wrap(~airline) +
    coord_flip() + 
    scale_fill_manual('status', 
                      values = c('on time' = "deepskyblue3", 
                                 'delayed' = "deepskyblue4"))

We get several additional learnings from this chart: